Message9427
Robotparser uses the re module to evaluate the
Allow/Disallow directives, but nowhere in the RFC is
it specified that these directives can be regular
expressions. As a result, directives such as the
following are misinterpreted:
User-Agent: *
Disallow: /.
The directive (which is actually syntactically
incorrect according to the RFC) denies access to the
root directory, but not to the entire site; URLs
elsewhere on the site should pass robotparser's
check, yet they are rejected (see e.g.
http://www.pbs.org/robots.txt).
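For illustration, a minimal sketch (not robotparser's
actual code) of the failure mode: when the path is fed
to re.match, the "." matches any character, so "/."
appears to exclude almost every URL on the site:

import re

path = "/."  # the Disallow value from the robots.txt above
for url in ("/", "/index.html", "/.hidden"):
    as_regex = re.match(path, url) is not None  # regex reading
    as_prefix = url.startswith(path)            # RFC prefix reading
    print("%-12s regex=%-5s prefix=%s" % (url, as_regex, as_prefix))

Under the regex reading, /index.html is wrongly
excluded; under the prefix reading, only /.hidden is.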
From the draft RFC
(http://www.robotstxt.org/wc/norobots.html):
"The value of this field specifies a partial URL that
is not to be visited. This can be a full path, or a
partial path; any URL that starts with this value will
not be retrieved. For example, Disallow: /help
disallows both /help.html"
Also, the final RFC excludes * as valid in the path
directive (http://www.robotstxt.org/wc/norobots-rfc.html).
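To make the quoted prefix rule concrete, a few
illustrative paths (plain Python, nothing
robotparser-specific; the paths are made up):

disallow = "/help"
for url in ("/help.html", "/help/index.html", "/helpless", "/about.html"):
    # A URL is excluded iff it starts with the Disallow value,
    # compared as plain text.
    print("%-18s %s" % (url, "excluded" if url.startswith(disallow) else "allowed"))

Note that the match is purely textual, so /helpless is
excluded as well; that is the behavior the draft RFC
specifies, not a bug.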
Suggested fix (also fixes bug #522898):
robotparser.RuleLine.applies_to becomes:

def applies_to(self, filename):
    # An empty Disallow value means any URL may be fetched.
    if not self.path:
        self.allowance = 1
    # Plain string prefix match instead of a regular expression.
    return self.path == "*" or filename.startswith(self.path)
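Assuming the patch above is applied, the stdlib parser
then gives the RFC-conformant answer for a
pbs.org-style file (parse and can_fetch are the
existing robotparser API; the user agent name is made
up):

import robotparser  # urllib.robotparser in Python 3

rp = robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /."])
# Only URLs whose path starts with "/." are excluded:
print(rp.can_fetch("mybot", "http://www.pbs.org/index.html"))  # True
print(rp.can_fetch("mybot", "http://www.pbs.org/.hidden"))     # False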