classification
Title: Robotparser incorrectly applies regex
Components: Library (Lib)
process
Status: closed
Resolution: fixed
Nosy List: calvin, cmalamas, loewis
Priority: normal

Created on 2002-02-26 17:14 by cmalamas, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (5)
msg9427 - (view) Author: Costas Malamas (cmalamas) Date: 2002-02-26 17:14
Robotparser uses re to evaluate the Allow/Disallow directives, but
nowhere in the RFC is it specified that these directives can be
regular expressions. As a result, directives such as the following
are misinterpreted:
User-Agent: *
Disallow: /.

The directive (which is actually syntactically incorrect according
to the RFC) denies access to the root directory, but not the entire
site; it should pass robotparser, but it fails (see, e.g.,
http://www.pbs.org/robots.txt).

From the draft RFC (http://www.robotstxt.org/wc/norobots.html):
"The value of this field specifies a partial URL that is not to be
visited. This can be a full path, or a partial path; any URL that
starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html"

Also, the final RFC excludes * as valid in the path directive
(http://www.robotstxt.org/wc/norobots-rfc.html).
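
To illustrate the difference described above, here is a minimal sketch comparing regex-style matching with the literal prefix matching the draft describes. The helper names `regex_applies` and `prefix_applies` are hypothetical, and the regex variant only approximates how the old robotparser evaluated a rule; it is not the module's actual code.

```python
import re


def regex_applies(rule_path, url_path):
    # Hypothetical helper: treat the rule as a regular expression
    # anchored at the start of the URL path.
    return re.match(rule_path, url_path) is not None


def prefix_applies(rule_path, url_path):
    # Hypothetical helper: treat the rule as a literal prefix,
    # as the draft's "any URL that starts with this value" suggests.
    return url_path.startswith(rule_path)


if __name__ == "__main__":
    for path in ("/", "/index.html", "/.private"):
        print(path, regex_applies("/.", path), prefix_applies("/.", path))
```

Under the regex reading, the "." in "Disallow: /." matches any character, so "/index.html" is treated as disallowed; under the prefix reading only paths that literally begin with "/." are.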

Suggested fix (also fixes bug #522898):
robotparser.RuleLine.applies_to becomes:

    def applies_to(self, filename):
        if not self.path:
            self.allowance = 1
        return self.path == "*" or self.path.find(filename) == 0
msg9428 - (view) Author: Bastian Kleineidam (calvin) Date: 2002-02-27 14:11
Patch is not good:
>>> print RuleLine("/tmp", 0).applies_to("/")
1
>>>
This would make the rule "Disallow: /tmp" apply to the filename "/".

I think it should be:
return self.path=="*" or filename.startswith(self.path)
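
The two return expressions can be compared side by side in a small standalone sketch. The free functions below are hypothetical stand-ins for RuleLine.applies_to (the real method reads self.path), written only to show why the operands of the comparison matter.

```python
def applies_with_find(rule_path, filename):
    # First proposed patch: asks whether the *rule* starts with the filename.
    return rule_path == "*" or rule_path.find(filename) == 0


def applies_with_startswith(rule_path, filename):
    # Suggested correction: asks whether the *filename* starts with the rule.
    return rule_path == "*" or filename.startswith(rule_path)


if __name__ == "__main__":
    for name in ("/", "/tmp", "/tmp/file.txt", "/other"):
        print(name,
              applies_with_find("/tmp", name),
              applies_with_startswith("/tmp", name))
```

With the find-based version, "/" is wrongly covered by "Disallow: /tmp" while "/tmp/file.txt" is not covered at all; the startswith-based version reverses both results.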
msg9429 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-02-28 15:25
This has been fixed in robotparser.py 1.11.
msg9430 - (view) Author: Costas Malamas (cmalamas) Date: 2002-03-06 12:09
calvin is right; the patch was incorrect.  A better one 
(and more tested by now):

    def applies_to(self, filename):
        if not self.path:
            self.allowance = 1
        return self.path == "*" or urllib.quote(filename).startswith(self.path)
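
For readers on Python 3, where urllib.quote lives at urllib.parse.quote, the quoting step in this patch can be sketched as a standalone function. This is an illustration under assumptions, not the code that was committed: the free function `applies_to` and its arguments are hypothetical, the empty-path handling is omitted, and the rule path is assumed to be stored already percent-encoded.

```python
from urllib.parse import quote


def applies_to(rule_path, filename):
    # Percent-encode the requested path before the prefix comparison,
    # so that it is compared in the same form as an encoded rule path.
    return rule_path == "*" or quote(filename).startswith(rule_path)


if __name__ == "__main__":
    print(quote("/my docs/a.html"))
    print(applies_to("/my%20docs", "/my docs/a.html"))
    print(applies_to("/my%20docs", "/mydocs/a.html"))
```

One caveat with this approach: quote() also encodes "%", so a filename that is already percent-encoded would be encoded twice and would no longer match the rule.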
msg9431 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-03-06 12:18
Can you please review the code which is currently in CVS? I
believe it fixes your problem, as well as a number of other
problems.
History
Date                 User      Action  Args
2022-04-10 16:05:02  admin     set     github: 36164
2002-02-26 17:14:30  cmalamas  create