Author osvenskan
Date 2006-04-06.15:34:54

I've also discovered that robotparser can get confused by
files with BOMs (byte order marks). At minimum it should
ignore BOMs; at best it should use them as clues to the
file's encoding. It does neither, and instead treats the BOM
as character data. That's especially problematic when the
robots.txt file consists of this:
[BOM]User-agent: *
Disallow: /

In that case, robotparser fails to recognize the string
"User-agent", so the disallow rule is ignored; robotparser
then treats the file as empty, and all robots are permitted
everywhere, which is the exact opposite of what the author
intended. If the first line is a comment, robotparser
doesn't get confused, regardless of whether there's a BOM.
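
To illustrate what goes wrong, here is a rough sketch of the
comparison that fails. (The split-and-compare below is a
simplification of robotparser's parse loop, not its exact
code; the point is that the BOM bytes stick to the field name.)

>>> import codecs
>>> line = codecs.BOM_UTF8 + "User-agent: *"
>>> key = line.split(":", 1)[0].strip()
>>> key
'\xef\xbb\xbfUser-agent'
>>> key.lower() == "user-agent"
False

Because the key never matches, the whole record is dropped.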

I created a sample robots.txt file exactly like the one
above; it contains a UTF-8 BOM. The example below uses this
file, which is on my Web site.

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://semanchuk.com/philip/boneyard/robots/robots.txt.bom")
>>> rp.read()
>>> rp.can_fetch("foobot", "/")  # should return False
True
>>> 
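
Until this is fixed, one workaround is to fetch the file
yourself, strip a leading UTF-8 BOM, and hand the cleaned
lines to parse() instead of calling read(). A minimal sketch
(read_robots_without_bom is my name for a hypothetical
helper, not part of robotparser):

import codecs
import urllib
import robotparser

def read_robots_without_bom(url):
    # Fetch the raw bytes ourselves so we can strip a leading
    # UTF-8 BOM before the parser ever sees it.
    data = urllib.urlopen(url).read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    rp = robotparser.RobotFileParser()
    rp.set_url(url)
    rp.parse(data.splitlines())
    return rp

With the BOM stripped, can_fetch("foobot", "/") should
return False, as the file intends.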

My robot parser module doesn't suffer from the BOM bug
(although it doesn't use BOMs to decode the file either,
which it really ought to). As I said before, you're welcome
to steal code from it or copy it wholesale (it is GPL).
Also, I'll be happy to open a different bug report if you
feel this should be a separate issue.
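
For the other half, actually using the BOM as an encoding
clue rather than just skipping it, a sketch might look like
the following (decode_with_bom is a hypothetical helper;
UTF-32 BOMs are omitted for brevity):

import codecs

# Map each BOM to the encoding it implies.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data, default="utf-8"):
    # Return decoded text, using the BOM (if any) to pick the codec.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)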
