Message54745
Logged In: YES
user_id=1119995
I've also discovered that robotparser can get confused by
files with BOMs (byte order marks). At minimum it should
ignore BOMs; at best, it should use them as clues to the
file's encoding. It does neither, and instead treats the BOM
as character data. That's especially problematic when the
robots.txt file consists of this:
[BOM]User-agent: *
Disallow: /
In that case, robotparser fails to recognize the string
"User-agent", so the disallow rule is ignored. That in turn
means it treats the file as empty, and all robots are
permitted everywhere, which is the exact opposite of what
the author intended. If the first line is a comment, then
robotparser doesn't get confused, regardless of whether or
not there's a BOM.
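To illustrate the failure mode, here is a minimal standalone sketch (not part of robotparser itself) showing why a UTF-8 BOM breaks the "User-agent" match, and how Python's utf-8-sig codec strips it:

```python
import codecs

# A robots.txt body as served with a UTF-8 BOM, as described above.
raw = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"

# Decoding with plain utf-8 keeps the BOM as the character U+FEFF,
# so the first line no longer starts with "User-agent".
naive = raw.decode("utf-8")
print(naive.splitlines()[0].startswith("User-agent"))  # False

# The utf-8-sig codec drops a leading BOM if one is present.
fixed = raw.decode("utf-8-sig")
print(fixed.splitlines()[0].startswith("User-agent"))  # True
```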
I created a sample robots.txt file exactly like the one
above; it contains a UTF-8 BOM. The example below uses this
file, which is on my Web site.
>>> import robotparser
>>> rp=robotparser.RobotFileParser()
>>> rp.set_url("http://semanchuk.com/philip/boneyard/robots/robots.txt.bom")
>>> rp.read()
>>> rp.can_fetch("foobot", "/") # should return False
True
>>>
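For reference, a sketch of a local workaround, written against the Python 3 urllib.robotparser API (the module was later renamed from robotparser); the idea is to fetch the bytes yourself, decode them with utf-8-sig so a leading BOM is dropped, and feed the lines to parse() instead of letting read() mishandle the raw text:

```python
from urllib.robotparser import RobotFileParser  # "robotparser" in Python 2

# BOM-prefixed robots.txt bytes, standing in for the fetched response body.
raw = b"\xef\xbb\xbfUser-agent: *\nDisallow: /\n"

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # hypothetical URL
# utf-8-sig strips a leading BOM, so "User-agent" is recognized again.
rp.parse(raw.decode("utf-8-sig").splitlines())

print(rp.can_fetch("foobot", "/"))  # False, as the rule intends
```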
My robot parser module doesn't suffer from the BOM bug
(although it doesn't use BOMs to decode the file either,
which it really ought to). As I said before, you're welcome
to steal code from it or copy it wholesale (it's GPL).
Also, I'll be happy to open a separate bug report if you
feel this should be a different issue.
Date | User | Action | Args
2007-08-23 16:11:42 | admin | link | issue1437699 messages
2007-08-23 16:11:42 | admin | create |