Issue 20559: urllib/http fail to sanitize a non-ascii url

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64758

classification

Title:	urllib/http fail to sanitize a non-ascii url
Type:	behavior	Stage:
Components:	Library (Lib), Unicode	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Dubslow, eric.araujo, ezio.melotti, iritkatriel, martin.panter
Priority:	normal	Keywords:

Created on 2014-02-08 04:34 by Dubslow, last changed 2022-04-11 14:57 by admin.

Messages (5)
msg210587 - (view)	Author: (Dubslow)	Date: 2014-02-08 04:34
The following code will produce a UnicodeEncodeError about a character being non-ascii:

    from urllib import request, parse, error
    url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
    req = request.Request(url)
    response = request.urlopen(req)

This fails as follows:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.3/urllib/request.py", line 469, in open
        response = self._open(req, data)
      File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
        '_open', req)
      File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
        h.request(req.get_method(), req.selector, req.data, headers)
      File "/usr/lib/python3.3/http/client.py", line 1067, in request
        self._send_request(method, url, body, headers)
      File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
        self.putrequest(method, url, **skips)
      File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
        self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

I examined the library code in question: line 958 in http/client.py, the line before the one that barfs, contains the following comment:

    # Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

    print(request)
    self._output(request.encode('ascii'))

This prints the following:

    >>> response = request.urlopen(req)
    GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
    Traceback (most recent call last):
    ...

I confirmed that the 27th character as mentioned in the traceback is in fact the á in the last name.
Clearly either urllib or http is not properly sanitizing the url -- unfortunately, I don't know the library well enough to determine where the actual error is; hopefully this report contains enough detail to make that easy.
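
The failing step can probably be reproduced without any network access. The sketch below is not the library code; it only encodes the request line printed above to ASCII, the same operation the traceback ends in, and records where the encoder stops. The variable names are illustrative.

```python
# Request line as printed by the debugging statement above.
# "\xe1" is the precomposed character U+00E1 (a with acute accent).
request_line = "GET /wiki/Antonio Vallejo-N\xe1jera HTTP/1.1"

bad_index = None
bad_char = None
try:
    # Same operation as the last frame of the traceback.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    bad_index = exc.start
    bad_char = request_line[exc.start]

print(bad_index, repr(bad_char))
```

The index reported by the exception matches the "position 27" in the traceback, which is consistent with the request line reaching the socket layer with no quoting applied.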
msg210590 - (view)	Author: (Dubslow)	Date: 2014-02-08 05:11
Follow up -- I need to use urllib.parse.quote to safely encode a url -- though if I may be so bold, I submit that since much of the goal of Python 3 was to make unicode "just work", I, the (stupid) user, shouldn't have to remember to safely encode unicode urls...

A reasonable way to do it would be to change urllib/request.py line 469 (which is in OpenerDirector.open()):

    response = self._open(req, data)

would become

    try:
        response = self._open(req, data)
    except UnicodeEncodeError as e:
        req.full_url = quote(req.full_url, safe='/%')
        response = self._open(req, data)

That's untested of course, but hopefully it'll encourage discussion.
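
As a caller-side workaround, one possible approach is to quote the URL before handing it to urllib. Note that passing a whole URL to quote() with safe='/%' would also escape the ':' after the scheme, so the sketch below splits the URL first and quotes only the path and query. The helper name quote_url_path is hypothetical, and the choice of safe characters is an assumption; non-ASCII host names (IDNA) are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode the path and query of url (hypothetical helper).

    '%' is kept in the safe set so that already-quoted input is not
    quoted twice; a stray literal '%' is therefore left untouched.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="/%=&+")
    # Scheme, host, and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = "http://en.wikipedia.org/wiki/Antonio Vallejo-N\xe1jera"
quoted = quote_url_path(url)
print(quoted)
```

The quoted result is pure ASCII, so it should pass through request.Request() and request.urlopen() without hitting the encode step that fails in the report.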
msg211235 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2014-02-14 18:49
Even if Python 3’s text model is based on Unicode, some data formats have their own rules. There’s a long debate about whether URIs should be bytes or text; it looks like urllib/httplib, unlike web browsers, don’t try to be smart with the URIs they are given but just require them to be properly formatted, i.e. containing no spaces or other characters that should have been %-encoded. Is the documentation clear about this behaviour? If not, it would probably be simpler to improve the documentation rather than change the behaviour.
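
To make the "properly formatted" requirement concrete, here is a minimal sketch of a check a caller could run before passing a URL to urllib. The function name and the exact rule (no spaces, no control characters, nothing outside ASCII) are assumptions for illustration, not something urllib or http.client exposes.

```python
def chars_needing_encoding(url):
    """Return the characters of url that would need %-encoding.

    Assumed rule: spaces, control characters, DEL, and anything
    outside the ASCII range are not allowed to appear literally.
    """
    return [ch for ch in url if ord(ch) <= 0x20 or ord(ch) >= 0x7F]


raw = "http://en.wikipedia.org/wiki/Antonio Vallejo-N\xe1jera"
encoded = "http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera"

print(chars_needing_encoding(raw))
print(chars_needing_encoding(encoded))
```

For the URL from the original report this flags the space and the á; for the %-encoded form it flags nothing.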
msg285717 - (view)	Author: Martin Panter (martin.panter) *	Date: 2017-01-18 11:14
See also Issue 3991 with proposals for handling non-ASCII as new features.
msg408270 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-12-10 23:44
Reproduced on 3.11.

History
Date	User	Action	Args
2022-04-11 14:57:58	admin	set	github: 64758
2021-12-11 00:31:42	vstinner	set	nosy: - vstinner
2021-12-10 23:44:54	iritkatriel	set	nosy: + iritkatriel messages: + msg408270 versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.3
2017-01-18 11:14:27	martin.panter	set	nosy: + martin.panter messages: + msg285717
2014-02-14 18:49:07	eric.araujo	set	nosy: + eric.araujo messages: + msg211235
2014-02-08 05:11:56	Dubslow	set	messages: + msg210590
2014-02-08 04:34:24	Dubslow	create