
Author osvenskan
Date 2006-03-07.16:32:03
Content
Thanks for looking at this. I have some follow-up comments.

The list at robotstxt.org is many years stale (note that
Google's bot appears only as Backrub, which was still a
server at Stanford at the time:
http://www.robotstxt.org/wc/active/html/backrub.html), but
AFAICT it is nevertheless the most current bot list on the
Web. If you look carefully, the list *does* contain a
non-ASCII entry (#76, easy to miss in that long list). That
Finnish bot is gone, but it has left a legacy in the form of
many robots.txt files that were created by automated tools
based on the robotstxt.org list. Google helps us here:
http://www.google.com/search?q=allintext%3AH%C3%A4m%C3%A4h%C3%A4kki+disallow+filetype%3Atxt

And by Googling for some common non-ASCII words and letters
I can find more like this one (look at the end of the
alphabetical list):
http://paranormal.se/robots.txt

Robots.txt files that contain non-ASCII characters are few
and far between, it seems, but they're out there.

Which leads me to a nitpicky (but important!) point about
Unicode. As you point out, the spec doesn't mention Unicode;
it says nothing at all on the topic of encodings. My
argument is that the spec's silence on encodings doesn't let
us off the hook, because the HTTP/1.0 and HTTP/1.1 specs are
very clear that ISO-8859-1, not US-ASCII, is the default
charset for text content delivered via HTTP. By my
interpretation, this means that the robots.txt examples
provided above comply with the published specs, and
therefore code that fails to interpret them does not.
There's no obvious need for robotparser to support
full-blown Unicode, just ISO-8859-1.
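
To make the ISO-8859-1 point concrete, here is a minimal sketch of the decoding step I have in mind. It is not the code from this module or from my replacement; the helper name decode_robots_txt and the DEFAULT_CHARSET constant are made up for illustration, and it assumes Python 3's urllib.robotparser rather than the robotparser module discussed in this report. It decodes the raw bytes with the charset declared in the Content-Type header, if any, and otherwise falls back to ISO-8859-1.

```python
from email.message import Message
from typing import Optional
from urllib.robotparser import RobotFileParser

# Assumption for this sketch: ISO-8859-1 is the fallback when no charset is declared.
DEFAULT_CHARSET = "iso-8859-1"


def decode_robots_txt(raw: bytes, content_type: Optional[str] = None) -> str:
    """Decode a robots.txt body, honoring a declared charset if there is one.

    Hypothetical helper, not part of any standard library module.
    """
    charset = None
    if content_type:
        # Let the email package parse the "charset=" parameter for us.
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()
    try:
        return raw.decode(charset or DEFAULT_CHARSET)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset label. ISO-8859-1 maps every byte
        # to a character, so this fallback cannot raise.
        return raw.decode(DEFAULT_CHARSET)


# Bytes as a server using ISO-8859-1 would send them (0xE4 is "a" with diaeresis).
raw = b"User-agent: H\xe4m\xe4h\xe4kki\nDisallow: /private/\n"
text = decode_robots_txt(raw)

parser = RobotFileParser()
parser.parse(text.splitlines())

# The non-ASCII agent is matched by name; an unrelated agent is not restricted.
blocked = parser.can_fetch("H\u00e4m\u00e4h\u00e4kki", "http://example.com/private/page.html")
allowed = parser.can_fetch("OtherBot", "http://example.com/private/page.html")
```

With this input I would expect blocked to be False and allowed to be True. The design choice worth noting is the fallback: because ISO-8859-1 assigns a character to every byte value, decoding with it never fails, so a parser can always get text out of a robots.txt file even when the declared charset is missing or wrong.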

You might be interested in a replacement for this module
that I've implemented. It does everything that robotparser
does and also handles non-ASCII plus a few other things. It
is GPL; you're welcome to copy it in part or lock, stock and
barrel. So far I've only tested it "in the lab" but I've
done fairly extensive unit testing and I'll soon be testing
it on real-world data. The code and docs are here:
http://semanchuk.com/philip/boneyard/rerp/

Comments & feedback would be most welcome.

History
Date User Action Args
2007-08-23 16:11:41 admin link issue1437699 messages
2007-08-23 16:11:41 admin create