Issue 3991: urllib.request.urlopen does not handle non-ASCII characters

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48241

classification

Title:	urllib.request.urlopen does not handle non-ASCII characters
Type:	behavior	Stage:	patch review
Components:	Extension Modules	Versions:	Python 3.8, Python 3.7, Python 3.6, Python 3.4, Python 3.5

process

Status:	open	Resolution:
Dependencies:	9679	Superseder:
Assigned To:		Nosy List:	Graham.Oliver, a.badger, ajaksu2, ezio.melotti, janssen, martin.panter, orsenthil, r.david.murray, remi.lapeyre, thezulk, vstinner
Priority:	normal	Keywords:	easy, patch

Created on 2008-09-28 18:47 by a.badger, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description
non_ascii_path.diff	ajaksu2, 2009-02-08 21:50	Calls quote() on the request path if path.encode('ascii') fails
issue3991.diff	thezulk, 2013-02-23 17:58	
issue3991_2017-01-27.diff	thezulk, 2017-01-27 21:58	Diff against 3.7.0a0

Messages (14)
msg73982 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-09-28 18:47
Tested on python-3.0rc1 -- Linux Fedora 9

I wanted to make sure that python3.0 would handle url's in different encodings. So I created two files on an apache server which were named ½ñ.html. One of the filenames was encoded in utf-8 and the other in latin-1. Then I tried the following::

    from urllib.request import urlopen
    url = 'http://localhost/u/½ñ.html'
    urlopen(url.encode('utf-8')).read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python3.0/urllib/request.py", line 350, in open
        req.timeout = timeout
    AttributeError: 'bytes' object has no attribute 'timeout'

The same thing happens if I give None for the two optional arguments (data and timeout). Next I tried using a raw Unicode string:

    >>> urlopen(url).read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python3.0/urllib/request.py", line 359, in open
        response = self._open(req, data)
      File "/usr/lib/python3.0/urllib/request.py", line 377, in _open
        '_open', req)
      File "/usr/lib/python3.0/urllib/request.py", line 337, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.0/urllib/request.py", line 1082, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "/usr/lib/python3.0/urllib/request.py", line 1068, in do_open
        h.request(req.get_method(), req.get_selector(), req.data, headers)
      File "/usr/lib/python3.0/http/client.py", line 843, in request
        self._send_request(method, url, body, headers)
      File "/usr/lib/python3.0/http/client.py", line 860, in _send_request
        self.putrequest(method, url, **skips)
      File "/usr/lib/python3.0/http/client.py", line 751, in putrequest
        self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)

So, in python-3.0rc1, this method is badly broken.
msg74046 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-09-29 20:47
As I read RFC 2396, 1.5: "A URI is a sequence of characters from a very limited set, i.e. the letters of the basic Latin alphabet, digits, and a few special characters." 2.4: "Data must be escaped if it does not have a representation using an unreserved character; this includes data that does not correspond to a printable character of the US-ASCII coded character set, or that corresponds to any US-ASCII character that is disallowed, as explained below." So your URL string is invalid. You need to escape the characters properly. (RFC 2396 is what the HTTP RFC cites as its authority on URLs.)
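For illustration, the escaping that RFC 2396 asks for can be done on the file name from the original report with urllib.parse.quote(). This is only a sketch: it assumes the server stores the name as either UTF-8 or Latin-1 bytes, as in the reporter's two test files, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

name = "½ñ.html"

# Percent-encode the same file name under two different byte encodings.
utf8_escaped = quote(name, encoding="utf-8")
latin1_escaped = quote(name, encoding="latin-1")

print(utf8_escaped)    # %C2%BD%C3%B1.html
print(latin1_escaped)  # %BD%F1.html

# Both results are plain ASCII, so encoding them for the request line succeeds.
utf8_escaped.encode("ascii")
latin1_escaped.encode("ascii")

# Decoding with the matching encoding recovers the original name.
print(unquote(latin1_escaped, encoding="latin-1") == name)
```

The two escaped forms name different byte sequences, so which one reaches the file depends on how the server encoded the name on disk.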
msg74053 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-09-29 22:27
Possibly. This is a change from python-2.x's urlopen() which escaped the URL automatically, though. I can see the case for having the user call an escape function themselves instead of having urlopen() perform the escape for them. However, that function would need to be written. (The present parse.quote() method only quotes correctly if only the path component is passed; there's no function to take a full URL and quote it appropriately.) Without such a function, a whole lot of code bases will have to reinvent the wheel creating functions to parse the path out, run it through urllib.parse.quote() and then pass the result to urllib.urlopen().
msg74085 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-09-30 17:43
It's not immediately clear to me how an auto-quote function can be written; as you say (and as the URI spec points out), you have to take a URL apart before quoting it, and you can't parse an invalid URL, which is what the input is. Best to think of this as a difference from 2.x.
msg74088 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-09-30 18:11
The purpose of such a function would be to take something that is not a valid uri but 1) is a common way of expressing the way to get to the resource and 2) follows certain rules and turns that into something that is a valid uri. Non-ASCII strings in the path are a good example of this since there is a well defined method to encode the strings into the URL if you are given a character encoding to apply to it. My first, naive thought is that if the input can be parsed by urlparse(), then there is a very good chance that we have the ability to escape the string properly. Looking at the invalid uri that I gave, for instance, if you additionally specified an encoding for the path element there's no reason a function couldn't do the escaping. What are example inputs that you are concerned about? I'll see if I can come up with code that works with them.
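A minimal sketch of the kind of function described above, assuming the input can be split by urllib.parse.urlsplit() and that only the path needs escaping. The name quote_url_path and its encoding parameter are hypothetical and are not part of urllib.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url, encoding="utf-8"):
    """Hypothetical helper: percent-encode only the path of a parseable URL."""
    parts = urlsplit(url)
    # Only the path is escaped; scheme, host, query and fragment pass through.
    path = quote(parts.path, safe="/", encoding=encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(quote_url_path("http://localhost/u/½ñ.html"))
print(quote_url_path("http://localhost/u/½ñ.html", encoding="latin-1"))
```

The sketch leaves non-ASCII hostnames and query strings untouched, and it would escape a path that is already percent-encoded a second time, so it is a starting point rather than a general solution.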
msg74110 - (view)	Author: Bill Janssen (janssen) *	Date: 2008-10-01 03:24
I'm not concerned about any example inputs. I was just trying to explain why this isn't a bug. On the other hand, the IRI spec (RFC 3987) is another thing we might try to implement for Python.
msg74117 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-10-01 06:39
Oh, that's cool. I've been fine with this being a request for a needed function to quote and unquote full URLs rather than a bug in urlopen(). I think IRIs are a distraction here, though. The RFC for IRIs even says that specifications that call for URIs and do not mention IRIs should not take IRIs. So there's definitely a need for a function to quote a full URI.
msg81423 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-08 21:50
I think Toshio's use case is important enough to deserve a fix (patch attached) or a special-cased error message. IMO, newbies trying to fix failures from urlopen may have a hard time figuring out the maze: urlopen -> _opener -> open -> _open -> _call_chain -> http_open -> do_open (and that's before leaving urllib!).

    >>> from urllib.request import urlopen
    >>> url = 'http://localhost/ñ.html'
    >>> urlopen(url).read()
    Traceback (most recent call last):
    [...]
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in position 5: ordinal not in range(128)

If the newbie isn't completely lost by then, how about:

    >>> from urllib.parse import quote
    >>> urlopen(quote(url)).read()
    Traceback (most recent call last):
    [...]
    ValueError: unknown url type: http%3A//localhost/%C3%B1.html
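The second failure comes from quote() treating only "/" as safe by default, so the colon after the scheme is escaped as well. As a hedged workaround for simple URLs like this one (not what the attached patch does), the safe parameter can be widened:

```python
from urllib.parse import quote

url = "http://localhost/ñ.html"

# Default safe="/" escapes the colon, so the scheme is no longer recognizable.
broken = quote(url)
# Adding ":" to safe keeps "http://" intact for this simple URL.
usable = quote(url, safe=":/")

print(broken)  # http%3A//localhost/%C3%B1.html
print(usable)  # http://localhost/%C3%B1.html
```

This only holds for URLs without a query string, fragment, or non-ASCII hostname, since "?", "=", "&" and "#" would still be escaped.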
msg182785 - (view)	Author: Andreas Åkerlund (thezulk) *	Date: 2013-02-23 17:58
This is a patch against 3.2 adding urllib.parse.quote_uri. It splits the URI into 5 parts (protocol, authentication, hostname, port and path), then runs urllib.parse.quote on the path and encodes the hostname to punycode if it's not in ASCII. It's not perfect, but should be usable in most cases. I created some test cases as well.
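The approach described above might look roughly like the following. This is not the code from issue3991.diff; it is a sketch that assumes the standard library "idna" codec is acceptable for the hostname, and the name quote_iri_sketch is hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_iri_sketch(iri):
    """Hypothetical sketch: IDNA-encode the host and percent-encode the path."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    try:
        host.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII hostname: convert each label to its punycode ("xn--") form.
        host = host.encode("idna").decode("ascii")
    # Rebuild the network location from its pieces (IPv6 literals not handled).
    netloc = host
    if parts.port is not None:
        netloc = "{}:{}".format(netloc, parts.port)
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = userinfo + ":" + parts.password
        netloc = userinfo + "@" + netloc
    # "%" stays safe so existing escapes in the path are not escaped again.
    path = quote(parts.path, safe="/%")
    # Query and fragment are passed through unchanged in this sketch.
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(quote_iri_sketch("http://bücher.example/ñ.html"))
```

Rebuilding the URL with urlunsplit() has its own edge cases, such as dropping empty components, so a real implementation would need more care than this.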
msg219325 - (view)	Author: Graham Oliver (Graham.Oliver)	Date: 2014-05-29 01:21
Hello, I came across this bug when using 'ā' in a URL. To get around the problem I used the 'URL encoded' version '%C4%81' instead of 'ā'. See this page: http://www.charbase.com/0101-unicode-latin-small-letter-a-with-macron I tried using the 'puny code' for 'ā', 'xn--yda', but that didn't work.
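Both forms mentioned above can be reproduced with the standard library. The sketch below assumes the character appeared in the path of the URL, which would explain why the punycode form did not help: punycode (IDNA) applies to hostname labels, while paths use percent-encoding.

```python
from urllib.parse import quote

char = "ā"  # U+0101 LATIN SMALL LETTER A WITH MACRON

# Percent-encoded UTF-8 bytes: the form that belongs in a URL path.
path_form = quote(char)
# IDNA/punycode label: the form that belongs in a hostname.
host_form = char.encode("idna").decode("ascii")

print(path_form)  # %C4%81
print(host_form)  # xn--yda
```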
msg285721 - (view)	Author: Martin Panter (martin.panter) *	Date: 2017-01-18 11:40
Issue 9679: Focusses on encoding just the DNS name
Issue 20559: Maybe a duplicate, or opportunity for better documentation or error message as a bug fix?

Andreas’s patch just proposes a new function called quote_uri(). It would need documentation. We already have a quote() and quote_plus() function. Since it sounds like this is for IRIs (https://tools.ietf.org/html/rfc3987), would it be more appropriate to call it quote_iri()?

See revision cb09fdef19f5, especially the quote(safe=...) parameter, for how I avoided the double encoding problem.
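The double encoding problem can be shown with a path that already contains an escape sequence. This sketch is not taken from revision cb09fdef19f5; it only illustrates how the safe parameter of quote() changes the result, using a made-up path.

```python
from urllib.parse import quote

# A path that is already partly percent-encoded ("é" as %C3%A9).
already_escaped = "/caf%C3%A9/ñ.html"

# With the default safe="/", the "%" signs are themselves escaped to "%25".
double = quote(already_escaped)
# Treating "%" as safe leaves existing escapes alone and encodes only "ñ".
single = quote(already_escaped, safe="/%")

print(double)  # /caf%25C3%25A9/%C3%B1.html
print(single)  # /caf%C3%A9/%C3%B1.html
```

The trade-off is that a literal "%" in the input is then assumed to start a valid escape, which may not be true for arbitrary user input.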
msg286386 - (view)	Author: Andreas Åkerlund (thezulk) *	Date: 2017-01-27 21:58
Changed the patch after pointers from vadmium. And quote_uri is changed to quote_iri as martin.panter thought it was more appropriate.
msg286423 - (view)	Author: Martin Panter (martin.panter) *	Date: 2017-01-29 01:06
I’m not really an expert on non-ASCII URLs / IRIs. Maybe it is obvious to other people that this is a good general implementation, but for me to thoroughly review it I would need time to research the relevant RFCs, other implementations, suitability for the URL schemes listed at <https://docs.python.org/dev/library/urllib.parse.html>, security implications, etc. One problem with using urlunsplit() is it would strip empty URL components, e.g. quote_iri("http://example/file#") -> "http://example/file". See Issue 22852. This is highlighted by the file:///[. . .] → file:/[. . .] test case. FYI Martin Panter and vadmium are both just me, no need to get too excited. :) I just updated my settings for Rietveld (code review), so hopefully that is more obvious now.
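The urlunsplit() behaviour mentioned above can be seen without the patch by round-tripping the example URL through urlsplit() and urlunsplit() with their default arguments. This is only a sketch of the default behaviour as described in this message.

```python
from urllib.parse import urlsplit, urlunsplit

url = "http://example/file#"

parts = urlsplit(url)
# The empty fragment is parsed as "", indistinguishable from no fragment.
print(repr(parts.fragment))

# Reassembling therefore drops the trailing "#".
round_tripped = urlunsplit(parts)
print(round_tripped)  # http://example/file
```

Any quoting function built on this round trip inherits the same loss, so it cannot guarantee that an already valid URL comes back unchanged.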
msg286444 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-01-29 14:09
I believe the last time this subject was discussed the conclusion was that we really needed a full IRI module that conformed to the relevant RFCs, and that putting something on pypi would be one way to get there. Someone should research the existing packages. It might be that we need something simpler than what exists, but whatever we do should be informed by what exists, I think.

History
Date	User	Action	Args
2022-04-11 14:56:39	admin	set	github: 48241
2019-02-20 20:02:12	remi.lapeyre	set	nosy: + remi.lapeyre
2019-02-20 19:38:11	dfrojas	set	type: enhancement -> behavior components: + Extension Modules, - Library (Lib), Unicode versions: + Python 3.4, Python 3.5, Python 3.6, Python 3.8
2017-01-29 14:09:42	r.david.murray	set	nosy: + r.david.murray messages: + msg286444
2017-01-29 01:06:54	martin.panter	set	messages: + msg286423
2017-01-27 21:58:31	thezulk	set	files: + issue3991_2017-01-27.diff messages: + msg286386
2017-01-18 11:53:01	martin.panter	link	issue29305 superseder
2017-01-18 11:40:38	martin.panter	set	versions: + Python 3.7, - Python 3.2, Python 3.3, Python 3.4 nosy: + martin.panter messages: + msg285721 dependencies: + unicode DNS names in urllib, urlopen stage: test needed -> patch review
2014-05-29 01:21:55	Graham.Oliver	set	nosy: + Graham.Oliver messages: + msg219325
2013-02-25 18:41:05	vstinner	set	nosy: + vstinner
2013-02-23 17:58:20	thezulk	set	files: + issue3991.diff nosy: + thezulk messages: + msg182785
2012-10-02 06:01:47	ezio.melotti	set	versions: + Python 3.2, Python 3.3, Python 3.4, - Python 3.0
2009-04-22 18:47:58	ajaksu2	set	priority: normal
2009-02-12 18:33:27	ajaksu2	set	nosy: + orsenthil
2009-02-12 02:37:16	ajaksu2	set	keywords: + easy components: + Library (Lib) stage: test needed
2009-02-09 07:27:01	ezio.melotti	set	nosy: + ezio.melotti
2009-02-08 21:50:19	ajaksu2	set	files: + non_ascii_path.diff keywords: + patch messages: + msg81423 nosy: + ajaksu2
2008-10-01 06:39:58	a.badger	set	messages: + msg74117
2008-10-01 03:24:17	janssen	set	type: enhancement messages: + msg74110
2008-09-30 18:11:23	a.badger	set	messages: + msg74088
2008-09-30 17:44:00	janssen	set	messages: + msg74085
2008-09-29 22:27:44	a.badger	set	messages: + msg74053
2008-09-29 20:47:31	janssen	set	nosy: + janssen messages: + msg74046
2008-09-28 18:47:16	a.badger	create