Author: cmalamas
Date: 2002-02-26 17:14:30
Content
Robotparser uses re to evaluate the Allow/Disallow directives: nowhere in the RFC is it specified that these directives can be regular expressions. As a result, directives such as the following are misinterpreted:
User-Agent: *
Disallow: /.

The directive (which is actually syntactically incorrect according to the RFC) denies access to the root directory, but not to the entire site; the rest of the site should pass robotparser, but it fails (e.g. http://www.pbs.org/robots.txt).
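
For illustration (the exact re call inside robotparser may differ; this is just a sketch of the two readings): treated as a regular expression, "/." matches nearly any path, because "." matches any character, whereas the literal prefix reading only covers paths that actually begin with "/.":

    import re

    path = "/."          # the Disallow value from the robots.txt snippet above
    url = "/about.html"  # a hypothetical page on the same site

    # Regex reading: "." matches any single character, so this matches and the
    # whole site looks disallowed.
    print(bool(re.match(path, url)))   # True

    # Literal prefix reading: only paths starting with "/." are covered.
    print(url.startswith(path))        # False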

From the draft RFC (http://www.robotstxt.org/wc/norobots.html):
"The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html"

Also the final RFC excludes * as valid in the path directive (http://www.robotstxt.org/wc/norobots-rfc.html).

Suggested fix (also fixes bug #522898):
robotparser.RuleLine.applies_to becomes:

    def applies_to(self, filename):
        if not self.path:
            self.allowance = 1
        return self.path == "*" or self.path.find(filename) == 0
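
Reading the quoted RFC text ("any URL that starts with this value will not be retrieved"), the test is whether the requested path begins with the rule's path. A minimal sketch of applies_to along those lines, for comparison (names mirror those above; this is an illustration, not the patch as submitted):

    def applies_to(self, filename):
        # An empty Disallow value excludes nothing, so treat the rule as an allowance.
        if not self.path:
            self.allowance = 1
        # "*" matches every path; otherwise the requested path must begin with the
        # rule's path, so "Disallow: /help" covers "/help.html" and "/help/index.html".
        return self.path == "*" or filename.startswith(self.path)

With this prefix reading, a line such as "Disallow: /." only affects paths that literally begin with "/.", so the rest of a site like pbs.org stays fetchable.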
History
Date                 User   Action  Args
2007-08-23 13:59:28  admin  link    issue523041 messages
2007-08-23 13:59:28  admin  create