Message 31944 - Python tracker


Author	nagle
Date	2007-05-04.06:11:55

Content
The code in urllib.quote raises a KeyError on Unicode input when
robotparser calls it with a Unicode URL.

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 415, in run
    pagetree = self.httpfetch() # fetch page
  File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
    if not self.owner().checkrobotaccess(self.requestedurl) : # if access disallowed by robots.txt file
  File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
    return(self.robotcheck.can_fetch(config.kuseragent, url)) # return can fetch
  File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
  File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe2'

That bit of code needs some attention.
- It still assumes that character codes stop at 255, which hasn't been true in Python for a while now.
- The initialization may not be thread-safe; the lookup table is built on first use.

"robotparser" was trying to check whether a URL containing a Unicode character was allowed. Note the "KeyError: u'\xe2'".
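
One possible workaround, sketched here under assumptions and not taken from any actual urllib fix: encode the URL to UTF-8 bytes before percent-quoting, and build the lookup table eagerly at import time so that no first-use initialization is needed. The names SAFE_MAP and quote_utf8 are hypothetical, the safe-character set only roughly mirrors what urllib.quote leaves unescaped, and the sketch uses Python 3 syntax, not the Python 2.5 shown in the traceback.

```python
# Bytes left unescaped: ASCII letters, digits, "_.-" and "/".
# (Assumption: this approximates urllib.quote's default safe set.)
ALWAYS_SAFE = (b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               b"abcdefghijklmnopqrstuvwxyz"
               b"0123456789_.-/")

# Built once at import time, covering every possible byte value 0..255,
# so a lookup can never miss and there is no lazy initialization.
SAFE_MAP = {
    b: (chr(b) if b in ALWAYS_SAFE else "%{:02X}".format(b))
    for b in range(256)
}


def quote_utf8(text):
    """Percent-encode text, encoding str input to UTF-8 bytes first.

    Hypothetical helper: every lookup key is a byte value, so a
    non-ASCII character such as u'\\xe2' cannot cause a KeyError.
    """
    if isinstance(text, str):
        text = text.encode("utf-8")
    return "".join(SAFE_MAP[b] for b in text)


if __name__ == "__main__":
    # The character from the traceback, embedded in a URL path.
    print(quote_utf8("/caf\u00e2"))
```

Because the table is keyed by byte value and the input is always reduced to bytes first, the lookup is total: a character outside ASCII simply becomes several escaped bytes instead of a missing key.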

History
Date	User	Action	Args
2007-08-23 14:53:34	admin	link	issue1712522 messages
2007-08-23 14:53:34	admin	create