This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: urllib/http fail to sanitize a non-ascii url
Type: behavior Stage:
Components: Library (Lib), Unicode Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Dubslow, eric.araujo, ezio.melotti, iritkatriel, martin.panter
Priority: normal Keywords:

Created on 2014-02-08 04:34 by Dubslow, last changed 2022-04-11 14:57 by admin.

Messages (5)
msg210587 - (view) Author: (Dubslow) Date: 2014-02-08 04:34
The following code will produce a UnicodeEncodeError about a character being non-ascii:

    from urllib import request, parse, error
    url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
    req = request.Request(url)
    response = request.urlopen(req)

This fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

I examined the library code in question: line 958 in http/client.py, the line before the one that barfs, contains the following comment: 

# Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

    print(request)
    self._output(request.encode('ascii'))

This prints the following: 

>>> response = request.urlopen(req)
GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
Traceback (most recent call last): ...

I confirmed that the character at position 27 (counting from zero), as mentioned in the traceback, is in fact the á in the last name. Clearly either urllib or http is not properly sanitizing the URL. Unfortunately, I don't know the library well enough to determine where the actual error is; hopefully this report contains enough detail to make that easy.
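
The failure can be reproduced without any network access, since it comes down to encoding the request line as ASCII. The following is a minimal sketch that assumes the request line printed above; the variable names are illustrative only and do not come from the library.

```python
# Reproduce the encoding failure offline, using the request line printed above.
# '\xe1' is the precomposed character "á" named in the traceback.
request_line = 'GET /wiki/Antonio Vallejo-N\xe1jera HTTP/1.1'

offending_index = None
offending_char = None
try:
    request_line.encode('ascii')
except UnicodeEncodeError as exc:
    # exc.start is the zero-based index of the first unencodable character.
    offending_index = exc.start
    offending_char = request_line[exc.start]

print(offending_index, repr(offending_char))
```

The index reported by the exception is zero-based, which is why it points at the á rather than at the N before it.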
msg210590 - (view) Author: (Dubslow) Date: 2014-02-08 05:11
Follow up -- I need to use urllib.parse.quote to safely encode a url -- though if I may be so bold, I submit that since much of the goal of Python 3 was to make unicode "just work", I the (stupid) user shouldn't have to remember to safely encode unicode urls...
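
As a sketch of the workaround mentioned above, the path and query can be percent-encoded with urllib.parse.quote before the URL is handed to urlopen. The helper name quote_url and the choice of safe characters are assumptions made for illustration, not something urllib provides or recommends.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path and query of `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep '/' and existing '%' escapes in the path; other unsafe characters
    # are encoded as UTF-8 and percent-escaped (quote's default encoding).
    path = quote(parts.path, safe='/%')
    # Assumption: '=' and '&' are meant as query delimiters and stay as-is.
    query = quote(parts.query, safe='=&%')
    # The host is left untouched: a non-ASCII host name would need IDNA
    # encoding instead, which this sketch does not attempt.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-N\xe1jera'
safe_url = quote_url(url)
print(safe_url)
```

Because '%' is kept in the safe set, passing an already-quoted URL through the helper again leaves it unchanged. The trade-off is that a literal '%' in the input is never escaped.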

A reasonable way to do it would be to change urllib/request.py line 469 (which is in OpenerDirector.open()):

    response = self._open(req, data)

would become

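    try:
        response = self._open(req, data)
    except UnicodeEncodeError:
        # The traceback above shows an *encode* error, so catch that one.
        # ':' must be kept safe too, or the 'http://' prefix gets quoted.
        req.full_url = quote(req.full_url, safe=':/%')
        response = self._open(req, data)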

That's untested of course, but hopefully it'll encourage discussion.
msg211235 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2014-02-14 18:49
Even if Python 3’s text model is based on Unicode, some data formats have their own rules.  There’s a long debate about whether URIs should be bytes or text.  It looks like, unlike web browsers, urllib/httplib don’t try to be smart with the URIs they are given but just require them to be properly formatted, i.e. not containing any spaces or any characters that should have been %-encoded.
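
To illustrate the "properly formatted" requirement, a caller could check a URL for whitespace or non-ASCII characters before passing it on. This is only a rough sketch: the helper name needs_quoting is hypothetical, and the check is much weaker than full URI validation.

```python
def needs_quoting(url):
    """Return True if `url` contains whitespace or non-ASCII characters.

    Hypothetical helper for illustration; it does not validate the URL
    against the full URI syntax.
    """
    return any(ch.isspace() or ord(ch) > 127 for ch in url)


raw_url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-N\xe1jera'
quoted_url = 'http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera'

print(needs_quoting(raw_url))
print(needs_quoting(quoted_url))
```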

Is the documentation clear about this behaviour?  If not, it would probably be simpler to improve the documentation rather than change the behaviour.
msg285717 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-01-18 11:14
See also Issue 3991 with proposals for handling non-ASCII as new features.
msg408270 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-12-10 23:44
Reproduced on 3.11.
History
Date                 User           Action  Args
2022-04-11 14:57:58  admin          set     github: 64758
2021-12-11 00:31:42  vstinner       set     nosy: - vstinner
2021-12-10 23:44:54  iritkatriel    set     nosy: + iritkatriel
                                            messages: + msg408270
                                            versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.3
2017-01-18 11:14:27  martin.panter  set     nosy: + martin.panter
                                            messages: + msg285717
2014-02-14 18:49:07  eric.araujo    set     nosy: + eric.araujo
                                            messages: + msg211235
2014-02-08 05:11:56  Dubslow        set     messages: + msg210590
2014-02-08 04:34:24  Dubslow        create