
Author Dubslow
Recipients Dubslow, ezio.melotti, vstinner
Date 2014-02-08.04:34:22
Message-id <1391834064.13.0.416589140445.issue20559@psf.upfronthosting.co.za>
Content
The following code raises a UnicodeEncodeError because the URL contains a non-ASCII character:

    from urllib import request, parse, error
    url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
    req = request.Request(url)
    response = request.urlopen(req)

This fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)
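
The last frame of the traceback can be reproduced without a network connection. The following is a minimal sketch, not library code: it assumes the request line is exactly the string that the print statement further down shows, and the names `request_line`, `bad_index` and `bad_char` are made up for illustration.

```python
# The request line that http.client tries to send, as printed in this report.
request_line = "GET /wiki/Antonio Vallejo-Nájera HTTP/1.1"

bad_index = None
bad_char = None
try:
    # Same operation as the failing line in putrequest().
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the zero-based index of the first unencodable character.
    bad_index = exc.start
    bad_char = request_line[exc.start]

print(bad_index, repr(bad_char))
```

The index reported by the exception is zero-based, which is why it matches the "position 27" in the error message.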

I examined the library code in question: line 958 in http/client.py, the line before the one that barfs, contains the following comment: 

# Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

    print(request)
    self._output(request.encode('ascii'))

This prints the following: 

>>> response = request.urlopen(req)
GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
Traceback (most recent call last): ...

I confirmed that the character at position 27 of the request line (counting from zero, as in the traceback) is in fact the á in the last name. Clearly either urllib or http is not properly sanitizing the URL: neither the space nor the á gets percent-encoded before the request line is encoded as ASCII. Unfortunately, I don't know the code well enough to tell which of the two modules is at fault; hopefully this report contains enough detail to make that easy to track down.
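
Until this is settled, a caller-side workaround is to percent-encode the URL before building the Request. The following is only a sketch under stated assumptions: the server is assumed to expect UTF-8 percent-encoding, `quote_url` is a hypothetical helper name (not a urllib API), and non-ASCII host names are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path and query of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep '/' and '%' so that path separators and already-encoded
    # sequences are left alone; this means a literal '%' is NOT escaped.
    path = quote(parts.path, safe="/%")
    # Keep the usual query delimiters as well.
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
safe_url = quote_url(url)
print(safe_url)
```

The resulting string contains only ASCII characters, so `request.Request(safe_url)` no longer trips over the `encode('ascii')` call in http/client.py.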
History
Date                 User     Action  Args
2014-02-08 04:34:24  Dubslow  set     recipients: + Dubslow, vstinner, ezio.melotti
2014-02-08 04:34:24  Dubslow  set     messageid: <1391834064.13.0.416589140445.issue20559@psf.upfronthosting.co.za>
2014-02-08 04:34:23  Dubslow  link    issue20559 messages
2014-02-08 04:34:22  Dubslow  create