Message 54742 - Python tracker



Author	osvenskan
Date	2006-03-07.16:32:03

Content

Logged In: YES
user_id=1119995

Thanks for looking at this. I have some follow-up comments.

The list at robotstxt.org is many years out of date (note
that Google's bot appears only as Backrub, which was still a
server at Stanford at the time:
http://www.robotstxt.org/wc/active/html/backrub.html), but
AFAICT it is nevertheless the most current bot list on the
Web. If you look carefully, the list *does* contain a
non-ASCII entry (#76, easy to miss in that long list). That
Finnish bot is gone, but it has left a legacy in the form of
many robots.txt files that were created by automated tools
based on the robotstxt.org list. Google helps us here:
http://www.google.com/search?q=allintext%3AH%C3%A4m%C3%A4h%C3%A4kki+disallow+filetype%3Atxt

And by Googling for some common non-ASCII words and letters
I can find more, like this one (look at the end of the
alphabetical list):
http://paranormal.se/robots.txt

Robots.txt files that contain non-ASCII characters are few
and far between, it seems, but they're out there.

Which leads me to a nitpicky (but important!) point about
Unicode. As you point out, the spec doesn't mention Unicode;
it says nothing at all on the topic of encodings. My
argument is that the spec's silence on encodings doesn't let
us off the hook, because the HTTP 1.0/1.1 specs are very
clear that ISO-8859-1, not US-ASCII, is the default for text
content delivered via HTTP. By my interpretation, this means
that the robots.txt examples provided above comply with
published specs, and therefore code that fails to interpret
them does not. There's no obvious need for robotparser to
support full-blown Unicode, just ISO-8859-1.
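
A minimal sketch of that interpretation, for illustration
only. It uses the Python 3 standard library's
urllib.robotparser, not the replacement module mentioned
below, and the sample bytes and the parse_robots helper are
hypothetical. It assumes the server sent no charset, so the
body is decoded with the HTTP default of ISO-8859-1 before
parsing:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body as it might arrive over HTTP:
# raw bytes in ISO-8859-1 (0xE4 is a-umlaut), no charset declared.
RAW_ROBOTS = (
    b"User-agent: H\xe4m\xe4h\xe4kki\n"
    b"Disallow: /private/\n"
    b"\n"
    b"User-agent: *\n"
    b"Disallow:\n"
)


def parse_robots(raw, encoding="iso-8859-1"):
    """Decode raw robots.txt bytes and feed them to RobotFileParser.

    The default encoding is an assumption based on the HTTP default
    for text content. ISO-8859-1 maps every byte value to a character,
    so this decode step cannot raise UnicodeDecodeError.
    """
    text = raw.decode(encoding)
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(RAW_ROBOTS)
bot_name = b"H\xe4m\xe4h\xe4kki".decode("iso-8859-1")

# The non-ASCII agent is matched by its own record.
blocked = not parser.can_fetch(bot_name, "/private/page.html")
# Any other agent falls through to the permissive "*" record.
allowed = parser.can_fetch("OtherBot", "/private/page.html")
```

Decoding the same bytes as US-ASCII would raise
UnicodeDecodeError on the first 0xE4 byte, which is the
failure mode under discussion.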

You might be interested in a replacement for this module
that I've implemented. It does everything that robotparser
does and also handles non-ASCII, plus a few other things. It
is GPL-licensed; you're welcome to copy it in part or lock,
stock and barrel. So far I've only tested it "in the lab",
but I've done fairly extensive unit testing and I'll soon be
testing it on real-world data. The code and docs are here:
http://semanchuk.com/philip/boneyard/rerp/

Comments & feedback would be most welcome.

History
Date	User	Action	Args
2007-08-23 16:11:41	admin	link	issue1437699 messages
2007-08-23 16:11:41	admin	create