Author nagle
Date 2007-05-04.06:11:55
Content
urllib.quote fails on Unicode input when robotparser calls it
with a Unicode URL.

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 415, in run
    pagetree = self.httpfetch() # fetch page
  File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
    if not self.owner().checkrobotaccess(self.requestedurl) : # if access disallowed by robots.txt file
  File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
    return(self.robotcheck.can_fetch(config.kuseragent, url)) # return can fetch
  File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
  File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe2'

That bit of code needs some attention.
- It still assumes character codes stop at 255 (ASCII itself stops at 127), which hasn't been true in Python for a while now.
- The initialization may not be thread-safe; the lookup table is built on first use rather than at import time.

"robotparser" was trying to check whether a URL containing a Unicode character was allowed.  Note the "KeyError: u'\xe2'".
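
Until the library is fixed, a caller can sidestep the problem by percent-encoding the path before handing the URL to the robots.txt check. The following is a rough sketch using the Python 3 urllib.parse names (in 2.5 the rough equivalent would be encoding the URL to UTF-8 bytes first); ascii_safe_url is a hypothetical helper, and it assumes UTF-8 is the encoding the server expects.

```python
# Sketch only: make the path of a URL ASCII-safe before a robots.txt check.
# ascii_safe_url is a hypothetical helper name, not a library function.
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_safe_url(url):
    """Return url with non-ASCII path characters percent-encoded as UTF-8."""
    parts = urlsplit(url)
    # Keep "/" and "%" so existing separators and escapes are left alone.
    path = quote(parts.path, safe="/%")
    # Note: a non-ASCII host name is not handled here; that needs IDNA.
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(ascii_safe_url("http://example.com/caf\xe2/page"))
# prints http://example.com/caf%C3%A2/page
```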
History
Date User Action Args
2007-08-23 14:53:34 admin link issue1712522 messages
2007-08-23 14:53:34 admin create