classification
Title: robotparser doesn't handle URLs with query strings
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: mikejs, orsenthil, skybrian
Priority: normal Keywords: patch

Created on 2009-06-23 04:25 by skybrian, last changed 2010-07-28 16:37 by orsenthil. This issue is now closed.

Files
File name  Uploaded                  Description
6325.diff  mikejs, 2010-07-27 05:18
Messages (3)
msg89622 - Author: Brian Slesinsky (skybrian) Date: 2009-06-23 04:25
If a robots.txt file contains a rule of the form:

  Disallow: /some/path?name=value

This pattern will never match a URL passed to can_fetch(), as far as I
can tell.

It's arguable whether this is a bug. The 1994 robots.txt protocol is
silent on whether to treat query strings specially and just says "any
URL that starts with this value will not be retrieved". The 1997 draft
standard talks about the path portion of a URL but doesn't give any
examples about how to treat the '?' character in a robots.txt pattern.

Google extends the protocol to allow wildcard characters in a way that
doesn't treat the '?' character specially. See:
http://www.google.com/support/webmasters/bin/answer.py?answer=40360&cbid=-1rdq1gi8f11xx&src=cb&lev=answer#3

I'll leave aside whether to implement pattern matching, but it seems
like a good idea to do something reasonable when a robots.txt pattern
contains a literal '?', and treating it as a literal character seems
simplest.

Cause: robotparser.can_fetch() contains this code, which seems to keep
only the path and strip the query string:

 url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
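To illustrate the effect of that line, here is a minimal sketch using the Python 3 spellings of the same calls (urllib.parse instead of the Python 2 urllib/urlparse modules quoted above). The helper name path_only and the example.com URL are assumptions made up for this illustration; they are not part of robotparser.

```python
from urllib.parse import quote, unquote, urlparse


def path_only(url):
    # Same expression as in the report, with Python 3 module names.
    # Index 2 of the parse result is the path component only.
    return quote(urlparse(unquote(url))[2]) or "/"


# Hypothetical URL that the Disallow rule is meant to cover.
normalized = path_only("http://example.com/some/path?name=value")

# The rule as written in robots.txt.
rule = "/some/path?name=value"

# A prefix test against the normalized URL cannot succeed, because the
# query string has already been dropped from it.
matches = normalized.startswith(rule)
```

Here normalized ends up as "/some/path", so any rule containing "?name=value" is longer than the string it is compared against.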

Also, when parsing patterns in the robots.txt file, a '?' character
seems to be automatically URL-escaped. There's nothing in a standards
doc about doing this so I think that might be a bug too.
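A small sketch of what that escaping does to the pattern, assuming the parser passes each rule path through quote() as the behavior above suggests (Python 3 module names are used here; the variable names are illustrative only):

```python
from urllib.parse import quote

# The pattern exactly as it appears in robots.txt.
rule_path = "/some/path?name=value"

# quote() keeps '/' by default but percent-encodes '?' and '='.
escaped_rule = quote(rule_path)

# A request path that still contains a literal '?' does not start
# with the escaped form of the rule.
raw_matches = "/some/path?name=value".startswith(escaped_rule)
```

So even if the query string were kept on the URL side, the two strings would only compare equal if both were escaped the same way.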

Tested with Python 2.4. I looked at the code in Subversion head and it
doesn't look like there were any changes on the trunk.
msg111687 - Author: Michael Stephens (mikejs) Date: 2010-07-27 05:18
Supplied patch matches rules with query params.
msg111831 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-07-28 16:37
I modified the patch slightly (so that it takes care of path, query, params and fragments).
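One way such a normalization could look is sketched below. This is not a copy of the committed change; it is an illustration under the assumption that the URL is rebuilt from its path, params, query and fragment (dropping scheme and host) and that rules are escaped the same way. The function name normalize_for_matching and the example.com URLs are hypothetical.

```python
from urllib.parse import quote, unquote, urlparse, urlunparse


def normalize_for_matching(url):
    # Split the URL, then rebuild it without scheme and network location
    # so that path, params, query and fragment all survive.
    parsed = urlparse(unquote(url))
    relative = urlunparse(
        ("", "", parsed.path, parsed.params, parsed.query, parsed.fragment)
    )
    # Escape the result the same way the rule is escaped below.
    return quote(relative) or "/"


# Rule from the original report, escaped like the URL.
rule = quote("/some/path?name=value")

blocked = normalize_for_matching(
    "http://example.com/some/path?name=value"
).startswith(rule)

unrelated = normalize_for_matching(
    "http://example.com/other?name=value"
).startswith(rule)
```

With both sides escaped identically, the rule from the original report matches the first URL and leaves the second one alone.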

Fixed in r83209, r83210 and r83211.

I also think that we need to extend robotparser to allow regexes in the allow and disallow patterns. (I shall open an issue in the tracker, if one is not already present.)
History
Date                 User         Action  Args
2010-07-28 16:37:55  orsenthil    set     status: open -> closed; resolution: fixed; messages: + msg111831; stage: resolved
2010-07-27 05:18:38  mikejs       set     files: + 6325.diff; nosy: + mikejs; messages: + msg111687; keywords: + patch
2010-07-10 23:30:35  BreamoreBoy  set     assignee: orsenthil; versions: + Python 3.1, Python 3.2, - Python 2.6, Python 2.5, Python 2.4; nosy: + orsenthil
2009-06-23 04:25:48  skybrian     create