Issue 1437699: allow unicode arguments for robotparser.can_fetch


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/42942

classification

Title:	allow unicode arguments for robotparser.can_fetch
Type:	enhancement	Stage:	test needed
Components:	Unicode	Versions:	Python 3.2

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	georg.brandl, lemburg, osvenskan, terry.reedy
Priority:	normal	Keywords:

Created on 2006-02-23 21:07 by osvenskan, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
PythonSessionsShowingRobotParserError.txt	osvenskan, 2006-02-23 21:07	Interactive Python sessions (2.4 & 2.3) showing how to recreate

Messages (9)
msg54740 - (view)	Author: Philip Semanchuk (osvenskan) *	Date: 2006-02-23 21:07
One-line summary: If the robotparser module encounters a robots.txt file that contains non-ASCII characters AND I pass a Unicode user agent string to can_fetch(), that function crashes with a TypeError under Python 2.4. Under Python 2.3, the error is a UnicodeDecodeError.

More detail: When one calls can_fetch(MyUserAgent, url), the robotparser module compares the UserAgent to each user agent described in the robots.txt file. If isinstance(MyUserAgent, str) == True then the comparison does not raise an error regardless of the contents of robots.txt. However, if isinstance(MyUserAgent, unicode) == True, then Python implicitly tries to convert the contents of the robots.txt file to Unicode before comparing it to MyUserAgent. By default, Python assumes a US-ASCII encoding when converting, so if the contents of robots.txt aren't ASCII, the conversion fails.

In other words, this works:

MyRobotParser.can_fetch('foobot', url)

but this fails:

MyRobotParser.can_fetch(u'foobot', url)

I recreated this with Python 2.4.1 on FreeBSD 6 and Python 2.3 under Darwin/OS X. I'll attach examples from both. The URLs that I use in the attachments are from my Web site and will remain live. They reference robots.txt files which contain an umlaut-ed 'a' (0xe4 in iso-8859-1). They're served up using a special .htaccess file that adds a Content-Type header which correctly identifies the encoding used for each file. Here are the contents of the .htaccess file:

AddCharset iso-8859-1 .iso8859-1
AddCharset utf-8 .utf8

A suggested solution: AFAICT, the construction of robots.txt is still defined by "a consensus on 30 June 1994 on the robots mailing list" [http://www.robotstxt.org/wc/norobots.html] and a 1996 draft proposal [http://www.robotstxt.org/wc/norobots-rfc.html] that has never evolved into a formal standard. Neither of these mentions character sets or encodings, which is no surprise considering that they date back to the days when the Internet was poor but happy and we considered even ASCII a luxury and we were grateful to have it. ("ASCII? We used to dream of having ASCII. We only had one bit, and it was a zero. We lived in a shoebox in the middle of the road..." etc.)

A backwards-compatible yet forward-looking solution would be to have the robotparser module respect the Content-Type header sent with robots.txt. If no such header is present, robotparser should try to decode it using iso-8859-1 per section 3.7.1 of the HTTP 1.1 spec (http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1) which says, 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.' Section 3.6.1 of the HTTP 1.0 spec says the same. Since ISO-8859-1 is a superset of US-ASCII, robots.txt files that are pure ASCII won't be affected by the change.
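
The suggested solution above (honor the charset in the Content-Type header, otherwise fall back to ISO-8859-1) can be sketched roughly as follows. This is an illustrative sketch in Python 3, not the robotparser implementation; the function names `charset_from_content_type` and `decode_robots_txt` are hypothetical and do not exist in the standard library.

```python
from email.message import Message

# Default charset for text/* bodies received via HTTP, per RFC 2616 section 3.7.1.
DEFAULT_HTTP_TEXT_CHARSET = "iso-8859-1"


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or the HTTP default.

    Hypothetical helper: it reuses the standard library's header parser
    instead of splitting the header value by hand.
    """
    if not content_type:
        return DEFAULT_HTTP_TEXT_CHARSET
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or DEFAULT_HTTP_TEXT_CHARSET


def decode_robots_txt(raw_bytes, content_type=None):
    """Decode the raw bytes of a robots.txt response into text.

    Assumption: an unknown or wrong charset label falls back to ISO-8859-1,
    which can decode any byte sequence, so this function never raises.
    """
    charset = charset_from_content_type(content_type)
    try:
        return raw_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return raw_bytes.decode(DEFAULT_HTTP_TEXT_CHARSET)
```

Because every byte value is valid ISO-8859-1, the fallback always yields text that can then be compared against a Unicode user agent string without an implicit ASCII conversion.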
msg54741 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2006-03-06 03:01
To me, this is not a bug report but at best an RFE. The reported behavior is what I would expect. I read both the module doc and the referenced web page and further links. The doc does not mention Unicode as allowed and the 300 registered UserAgents at http://www.robotstxt.org/wc/active/html/index.html all have ascii names. So I recommend closing this as a bug report but will give ML a chance to respond. If switched instead to Feature Request, I would think it would need some 'in the wild' evidence of need.
msg54742 - (view)	Author: Philip Semanchuk (osvenskan) *	Date: 2006-03-07 16:32
Thanks for looking at this. I have some followup comments.

The list at robotstxt.org is many years stale (note that Google's bot is present only as Backrub which was still a server at Stanford at the time: http://www.robotstxt.org/wc/active/html/backrub.html) but nevertheless AFAICT it is the most current bot list on the Web. If you look carefully, the list does contain a non-ASCII entry (#76, easy to miss in that long list). That Finnish bot is gone but it has left a legacy in the form of many robots.txt files that were created by automated tools based on the robotstxt.org list. Google helps us here: http://www.google.com/search?q=allintext%3AH%C3%A4m%C3%A4h%C3%A4kki+disallow+filetype%3Atxt

And by Googling for some common non-ASCII words and letters I can find more like this one (look at the end of the alphabetical list): http://paranormal.se/robots.txt

Robots.txt files that contain non-ASCII are few and far between, it seems, but they're out there.

Which leads me to a nitpicky (but important!) point about Unicode. As you point out, the spec doesn't mention Unicode; it says nothing at all on the topic of encodings. My argument is that just because the spec doesn't mention encodings doesn't let us off the hook because the HTTP 1.0/1.1 specs are very clear that iso-8859-1, not US-ASCII, is the default for text content delivered via HTTP. By my interpretation, this means that the robots.txt examples provided above are compliant with published specs, therefore code that fails to interpret them does not comply. There's no obvious need for robotparser to support full-blown Unicode, just iso-8859-1.

You might be interested in a replacement for this module that I've implemented. It does everything that robotparser does and also handles non-ASCII plus a few other things. It is GPL; you're welcome to copy it in part or lock, stock and barrel. So far I've only tested it "in the lab" but I've done fairly extensive unit testing and I'll soon be testing it on real-world data. The code and docs are here: http://semanchuk.com/philip/boneyard/rerp/

Comments & feedback would be most welcome.
msg54743 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2006-03-18 10:17
Turning into a Feature Request.
msg54744 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-03-20 19:33
Reassigning to Skip: I don't use robotparser. Skip, perhaps you can have a look? (Didn't you write the robotparser?)
msg54745 - (view)	Author: Philip Semanchuk (osvenskan) *	Date: 2006-04-06 15:34
I've also discovered that robotparser can get confused by files with BOMs (byte order marks). At minimum it should ignore BOMs, at best it should use them as clues as to the file's encoding. It does neither, and instead treats the BOM as character data. That's especially problematic when the robots.txt file consists of this:

[BOM]User-agent: *
Disallow: /

In that case, robotparser fails to recognize the string "User-agent", so the disallow rule is ignored, which in turn means it treats the file as empty and all robots are permitted everywhere, which is the exact opposite of what the author intended. If the first line is a comment, then robotparser doesn't get confused regardless of whether or not there's a BOM.

I created a sample robots.txt file exactly like the one above; it contains a utf-8 BOM. The example below uses this file which is on my Web site.

>>> import robotparser
>>> rp=robotparser.RobotFileParser()
>>> rp.set_url("http://semanchuk.com/philip/boneyard/robots/robots.txt.bom")
>>> rp.read()
>>> rp.can_fetch("foobot", "/") # should return False
True
>>>

My robot parser module doesn't suffer from the BOM bug (although it doesn't use BOMs to decode the file, either, which it really ought to). As I said before, you're welcome to steal code from it or copy it wholesale (it is GPL). Also, I'll be happy to open a different bug report if you feel like this should be a separate issue.
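
The BOM handling described above (strip the BOM and use it as an encoding hint) might look something like the sketch below. It is written for Python 3 and `urllib.robotparser`; `split_bom` and `parse_robots_bytes` are hypothetical helper names, and the ISO-8859-1 fallback is an assumption borrowed from the earlier HTTP discussion, not behavior of the standard library.

```python
import codecs
from urllib.robotparser import RobotFileParser

# Longer BOMs come first so that UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(raw_bytes):
    """Return (encoding, body) with any leading BOM removed.

    The encoding is None when the data does not start with a known BOM.
    """
    for bom, encoding in _BOMS:
        if raw_bytes.startswith(bom):
            return encoding, raw_bytes[len(bom):]
    return None, raw_bytes


def parse_robots_bytes(raw_bytes, fallback="iso-8859-1"):
    """Build a RobotFileParser from raw robots.txt bytes.

    Assumption: without a BOM the bytes are decoded with `fallback`.
    """
    encoding, body = split_bom(raw_bytes)
    text = body.decode(encoding or fallback)
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The file from the report: a UTF-8 BOM followed by a blanket disallow.
sample = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"
rp = parse_robots_bytes(sample)
blocked = not rp.can_fetch("foobot", "/")
```

With the BOM removed before parsing, the "User-agent" line is recognized and the disallow rule applies, so `blocked` ends up true for the sample file.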
msg115006 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-26 16:34
No comments on this for 4 1/2 years. Is this still valid and/or is anyone still interested?
msg115022 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-08-26 18:58
While Python is 'GPL compatible', whatever that means, it cannot incorporate GPLed code in the PSF distribution. Code must be contributed under one of the two licenses in the contributor agreement. Philip, can you contribute a patch appropriate to 3.x? In 3.x, robotparser is urllib.robotparser.

Under the 'be generous in what you accept' principle, expansion of accepted names would seem to be good.

DOC PATCH NEEDED

The doc says "For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html ." That link seems not to exist. The safest link is to the site. The specific replacement is http://www.robotstxt.org/robotstxt.html .
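
For anyone preparing a 3.x patch or test, here is a small sketch of how the module is used from its 3.x location. It feeds already-decoded lines to `RobotFileParser.parse()`, so no network access or byte decoding is involved; the robots.txt content and the agent name "foobot" are made up for illustration, with the non-ASCII agent name taken from the Finnish bot mentioned earlier in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with a non-ASCII user agent name.
lines = [
    "User-agent: H\u00e4m\u00e4h\u00e4kki",
    "Disallow: /private",
]

rp = RobotFileParser()
rp.parse(lines)  # parse() takes an iterable of text lines

# In 3.x every str is Unicode, so a non-ASCII agent name is an ordinary argument.
named_agent_allowed = rp.can_fetch("H\u00e4m\u00e4h\u00e4kki", "/private/page.html")
other_agent_allowed = rp.can_fetch("foobot", "/private/page.html")
```

Here the rule applies only to the named agent, so `named_agent_allowed` is false while `other_agent_allowed` is true. This exercises only the comparison step; how `read()` decodes the fetched bytes is a separate question.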
msg121019 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-11-12 04:49
The .../orig.html link now works and was last updated in August. It has a link to .../robotstxt.html.

History
Date	User	Action	Args
2022-04-11 14:56:15	admin	set	github: 42942
2014-02-03 19:40:45	BreamoreBoy	set	nosy: - BreamoreBoy
2010-11-12 04:49:50	terry.reedy	set	messages: + msg121019
2010-08-26 18:58:50	terry.reedy	set	messages: + msg115022 stage: test needed
2010-08-26 16:34:26	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg115006 versions: + Python 3.2
2008-04-13 03:34:12	skip.montanaro	set	assignee: skip.montanaro -> nosy: - skip.montanaro
2006-02-23 21:07:54	osvenskan	create