classification
Title: Robotparser fails to parse some robots.txt
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: lukasz.langa Nosy List: acooke, benmezger, ezio.melotti, lukasz.langa, mher.movsisyan, orsenthil, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2013-03-12 10:58 by benmezger, last changed 2013-05-29 13:01 by orsenthil.

Files
File name      Uploaded by      Date
parser.patch   mher.movsisyan   2013-03-18 21:19
parser2.patch  mher.movsisyan   2013-03-19 07:11
Messages (15)
msg184017 - (view) Author: Ben Mezger (benmezger) Date: 2013-03-12 10:58
I am trying to parse Google's robots.txt (http://google.com/robots.txt), and can_fetch() fails when checking whether I can crawl the url /catalogs/p? (which is allowed): it returns False. See my question on Stack Overflow -> http://stackoverflow.com/questions/15344253/robotparser-doesnt-seem-to-parse-correctly

Someone answered that it has to do with the line "urllib.quote(urlparse.urlparse(urllib.unquote(url))[2])" in the robotparser module, since it removes the "?" from the end of the url.

Here is the answer I received -> http://stackoverflow.com/a/15350039/1649067
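The effect of that line can be reproduced directly. This is a sketch of the pre-fix mismatch using the Python 3 spellings of the same functions (the original report was against Python 2's urllib/urlparse):

```python
from urllib.parse import quote, unquote, urlparse

url = "http://google.com/catalogs/p?"
# The pre-fix parser normalized the URL being checked like this; index 2 of
# the urlparse result is the path, so the empty query ("?") is dropped:
print(quote(urlparse(unquote(url))[2]))  # -> /catalogs/p

# The rule path from robots.txt, however, was stored with "?" percent-encoded:
print(quote("/catalogs/p?"))             # -> /catalogs/p%3F
```

Since "/catalogs/p" does not start with "/catalogs/p%3F", the Allow rule never applied and the URL fell through to the broader Disallow rule.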
msg184525 - (view) Author: Mher Movsisyan (mher.movsisyan) Date: 2013-03-18 21:19
Attaching patch.
msg184609 - (view) Author: Mher Movsisyan (mher.movsisyan) Date: 2013-03-19 07:11
The second patch only normalizes the url. From http://www.robotstxt.org/norobots-rfc.txt it is not clear how to handle multiple rules with the same prefix.
msg184614 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-19 07:35
I left a couple of comments on rietveld.
msg185311 - (view) Author: andrew cooke (acooke) Date: 2013-03-27 00:10
what is rietveld?

and why is this marked as "easy"?  it seems to involve issues that aren't described well in the spec - solving it completely requires some kind of canonical way to describe urls with (and without) parameters.
msg185312 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-27 00:14
Rietveld is the review tool.  You can access it by clicking on the "review" link at the right of the patch.  You should have received an email as well when I made the review.
msg185313 - (view) Author: andrew cooke (acooke) Date: 2013-03-27 00:19
thanks (only subscribed to this now, so no previous email).

my guess is that google are assuming a dumb regexp so

   http://example.com/foo?

in a rule does not match

   http://example.com/foo

and also i realised that http://google.com/robots.txt doesn't contain any url with multiple parameters.  so perhaps i was wrong about needing a canonical representation (i.e. parameter ordering).
msg185314 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-03-27 02:36
Well, the code is easy.  Figuring out what the code is supposed to do turns out to be hard, but we didn't know that when we marked it as easy :)  

I want to do more research before OKing a fix for this.  (There is clearly a bug, I'm just not certain what the correct fix is.)
msg187523 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-21 20:30
Łukasz pointed out on IRC that the problem is that the current robotparser implements an outdated robots.txt standard.  He may work on fixing that.
msg187552 - (view) Author: Mher Movsisyan (mher.movsisyan) Date: 2013-04-22 10:54
Can you share a link to the new robots.txt standard? I may be able to help implement it.
msg187557 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-04-22 12:39
I haven't a clue; that was part of the research I was going to do but haven't done yet (and probably won't for now...I'll wait to see if you or Łukasz pick it up first :).

I see he didn't nosy himself on the issue yet, though, so I've done that.  Maybe he'll respond.
msg187560 - (view) Author: Ɓukasz Langa (lukasz.langa) (Python committer) Date: 2013-04-22 13:16
robotparser implements http://www.robotstxt.org/orig.html; there's even a link to this document at http://docs.python.org/3/library/urllib.robotparser.html. As Mher points out, there's a newer version of that spec in RFC form: http://www.robotstxt.org/norobots-rfc.txt. It introduces Allow, and specifies how percent-encoding should be treated and how to handle expiration.

Moreover, there is a de facto standard agreed by Google, Yahoo and Microsoft in 2008, documented by their respective blog posts:

http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html

http://www.ysearchblog.com/2008/06/03/one-standard-fits-all-robots-exclusion-protocol-for-yahoo-google-and-microsoft/

http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx

For reference, there are two third-party robots.txt parsers out there implementing these extensions:

- https://pypi.python.org/pypi/reppy
- https://pypi.python.org/pypi/robotexclusionrulesparser

We need to decide how to incorporate those new features while maintaining backwards compatibility.
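The de facto extensions agreed by those search engines add a "*" wildcard and a "$" end-of-URL anchor to rule paths, neither of which the plain prefix matching in the stdlib parser supports. A minimal sketch of how such a rule path could be translated to a regular expression (the function name and sample rules here are illustrative, not taken from any of the parsers linked above):

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt rule path using the de facto extensions
    ('*' = any run of characters, trailing '$' = end of URL) into a
    regex that, like robots.txt rules, matches from the start of the path."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

print(bool(rule_to_regex("/*.php$").match("/index.php")))   # True
print(bool(rule_to_regex("/*.php$").match("/index.php5")))  # False: '$' anchors the end
```

Without the "$" anchor the rule stays a prefix match, which is consistent with the original 1994 behavior.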
msg187561 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2013-04-22 13:29
My suggestion for this issue is to go ahead with Mher's second patch (parser2.patch). It does a simple normalization and does the right thing.

The case in question is an empty query string and the behavior of Allow and Disallow for it, and the patch addresses that. (I don't know why this *bug* was not detected earlier.)

Robotparser implements the updated spec (www.robotstxt.org/norobots-rfc.txt) - you can check the Allow string handling in both the code and the tests.

That said, if robotparser is updated further to be more compliant with the many cases the 3rd-party modules handle, +1 to that. I suggest that be taken as a different issue and not be confused with this bug.
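On Python versions that include this normalization fix, both the rule path and the checked URL are normalized the same way, so a trailing "?" no longer breaks matching. A quick check (the host and rule set are made up for illustration; note the stdlib parser applies the first matching rule, so the Allow line is listed first):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /catalogs/p?",   # trailing "?" is normalized away on both sides
    "Disallow: /catalogs",
])

print(rp.can_fetch("*", "http://example.com/catalogs/p?"))   # True after the fix
print(rp.can_fetch("*", "http://example.com/catalogs/zip"))  # False (Disallow applies)
```

Before the fix, the first call returned False because the stored rule path kept a percent-encoded "?" that the normalized URL could never match.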
msg190304 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-29 12:59
New changeset 30128355f53b by Senthil Kumaran in branch '3.3':
#17403: urllib.parse.robotparser normalizes the urls before adding to ruleline.
http://hg.python.org/cpython/rev/30128355f53b

New changeset e954d7a3bb8a by Senthil Kumaran in branch 'default':
merge from 3.3
http://hg.python.org/cpython/rev/e954d7a3bb8a

New changeset bcbad715c2ce by Senthil Kumaran in branch '2.7':
#17403: urllib.parse.robotparser normalizes the urls before adding to ruleline.
http://hg.python.org/cpython/rev/bcbad715c2ce
msg190306 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2013-05-29 13:01
This is fixed in default, 3.3, and 2.7. I will merge this change to the 3.2 code line before closing this. I shall raise a new request for updating robotparser with the other goodies.
History
Date User Action Args
2013-05-29 13:01:33  orsenthil       set     messages: + msg190306
2013-05-29 12:59:11  python-dev      set     nosy: + python-dev; messages: + msg190304
2013-04-22 13:29:16  orsenthil       set     nosy: + orsenthil; messages: + msg187561
2013-04-22 13:16:24  lukasz.langa    set     assignee: lukasz.langa; messages: + msg187560
2013-04-22 12:39:17  r.david.murray  set     nosy: + lukasz.langa; messages: + msg187557
2013-04-22 10:54:26  mher.movsisyan  set     messages: + msg187552
2013-04-21 20:31:25  r.david.murray  set     keywords: - easy
2013-04-21 20:30:35  r.david.murray  set     messages: + msg187523
2013-03-27 02:36:19  r.david.murray  set     nosy: + r.david.murray; messages: + msg185314
2013-03-27 00:19:23  acooke          set     messages: + msg185313
2013-03-27 00:14:20  ezio.melotti    set     messages: + msg185312
2013-03-27 00:10:40  acooke          set     nosy: + acooke; messages: + msg185311
2013-03-19 07:35:51  ezio.melotti    set     stage: test needed -> patch review; messages: + msg184614; versions: + Python 3.2, Python 3.3, Python 3.4
2013-03-19 07:11:22  mher.movsisyan  set     files: + parser2.patch; messages: + msg184609
2013-03-18 21:19:27  mher.movsisyan  set     files: + parser.patch; nosy: + mher.movsisyan; messages: + msg184525; keywords: + patch
2013-03-14 07:48:31  ezio.melotti    set     keywords: + easy; nosy: + ezio.melotti; stage: test needed
2013-03-12 10:58:24  benmezger       create