
Author: osvenskan
Date: 2006-02-23 21:07:54

One-line summary: If the robotparser module encounters
a robots.txt file that contains non-ASCII characters
AND I pass a Unicode user agent string to can_fetch(),
that function crashes with a TypeError under Python
2.4. Under Python 2.3, the error is a UnicodeDecodeError. 

More detail:
When one calls can_fetch(MyUserAgent, url), the
robotparser module compares the UserAgent to each user
agent described in the robots.txt file. If
isinstance(MyUserAgent, str) == True, then the
comparison does not raise an error regardless of the
contents of robots.txt. However, if
isinstance(MyUserAgent, unicode) == True, then Python
implicitly tries to convert the contents of the
robots.txt file to Unicode before comparing it to
MyUserAgent. By default, Python assumes a US-ASCII
encoding when converting, so if the contents of
robots.txt aren't ASCII, the conversion fails. In other
words, this works:
MyRobotParser.can_fetch('foobot', url)
but this fails:
MyRobotParser.can_fetch(u'foobot', url)
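
To make the failure easy to reproduce, here's a minimal script along
the lines of the attached examples (the URL below is a placeholder;
any robots.txt containing non-ASCII bytes should trigger it):

import robotparser

rp = robotparser.RobotFileParser()
# Placeholder URL -- point this at any robots.txt with non-ASCII bytes.
rp.set_url('http://example.com/robots.txt')
rp.read()

print rp.can_fetch('foobot', 'http://example.com/')   # byte string: works
print rp.can_fetch(u'foobot', 'http://example.com/')  # unicode: fails as described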

I recreated this with Python 2.4.1 on FreeBSD 6 and
Python 2.3 under Darwin/OS X. I'll attach examples from
both. The URLs that I use in the attachments are from
my Web site and will remain live. They reference
robots.txt files which contain an umlaut-ed 'a' (0xe4
in iso-8859-1). They're served up using a special
.htaccess file that adds a charset to the Content-Type
header, correctly identifying the encoding used for each file.
Here's the contents of the .htaccess file:

AddCharset iso-8859-1 .iso8859-1
AddCharset utf-8 .utf8
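
If I understand Apache's AddCharset correctly, a request for the
iso-8859-1 variant then comes back with a header along these lines:

Content-Type: text/plain; charset=ISO-8859-1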


A suggested solution:
AFAICT, the construction of robots.txt is still defined
by "a consensus on 30 June 1994 on the robots mailing
list" [http://www.robotstxt.org/wc/norobots.html] and a
1996 draft proposal
[http://www.robotstxt.org/wc/norobots-rfc.html] that
has never evolved into a formal standard. Neither of
these mentions character sets or encodings, which is no
surprise considering that they date back to the days
when the Internet was poor but happy, when we considered
even ASCII a luxury and were grateful to have it.
("ASCII? We used to dream of having ASCII. We only had
one bit, and it was a zero. We lived in a shoebox in
the middle of the road..." etc.) A backwards-compatible
yet forward-looking solution would be to have the
robotparser module respect the Content-Type header sent
with robots.txt. If that header doesn't specify a
charset, robotparser should fall back to iso-8859-1
per section 3.7.1 of the HTTP 1.1 spec
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1)
which says, 'When no explicit charset parameter is
provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of
"ISO-8859-1" when received via HTTP. Data in character
sets other than "ISO-8859-1" or its subsets MUST be
labeled with an appropriate charset value.' Section
3.6.1 of the HTTP 1.0 spec says the same. Since
ISO-8859-1 is a superset of US-ASCII, robots.txt files
that are pure ASCII won't be affected by the change.
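
For what it's worth, here's a rough sketch of the idea from outside
the module, using the public parse() API (the helper name read_robots
is mine, not something robotparser provides):

import urllib
import robotparser

def read_robots(url):
    # Sketch only: fetch robots.txt ourselves, honor the charset parameter
    # of the Content-Type header, and fall back to iso-8859-1 per RFC 2616.
    f = urllib.urlopen(url)
    charset = f.info().getparam('charset') or 'iso-8859-1'
    lines = f.read().decode(charset).splitlines()
    f.close()
    rp = robotparser.RobotFileParser()
    rp.set_url(url)
    rp.parse(lines)
    return rp

With the file decoded up front, the comparisons inside can_fetch() are
unicode-to-unicode, so both str and unicode user agent strings work.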



