classification
Title: unicode DNS names in urllib, urlopen
Type: enhancement Stage: patch review
Components: Library (Lib), Unicode Versions: Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: baikie, berker.peksag, christian.heimes, cvrebert, demian.brecht, flox, gdamjan, loewis, nagle, ncoghlan, orsenthil, r.david.murray, vstinner
Priority: normal Keywords: patch

Created on 2010-08-25 07:39 by loewis, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
issue9679.patch demian.brecht, 2015-03-13 22:27 review
Messages (10)
msg114884 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-25 07:39
Copy of issue 1027206; support in the socket module was provided, but this request remains:

Also other modules should support unicode hostnames.
(httplib already does) but urllib and urllib2 don't.
msg114886 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-25 07:47
From msg60564: it's not clear to me what this request really means. It could mean that Python should support IRIs, but then, I'm not sure whether this support can be in urllib, or whether a separate library would be needed.
msg114899 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-08-25 13:08
There was a discussion about IRI on python-dev in the middle of a discussion about adding a coercible bytes type, but I can't find it. I believe the conclusion was that the best solution for IRI support was a new library that implements the full IRI spec.  It is possible that we could just add IDNA support to urllib, but it isn't clear that that work would be worth it when what is really needed is full IRI support.

See also issue1500504, though my guess based on the python-dev discussion and my experience with email is that an IRI library will need to be carefully designed with the py3k bytes/string separation in mind.
msg162722 - (view) Author: John Nagle (nagle) Date: 2012-06-13 18:51
An "IRI library" is not needed to fix this problem.  It's already fixed in the sockets library and the http library.  We just need consistency in urllib2.  

urllib2 functions which take a "url" parameter should apply "encodings.idna.ToASCII" to each label of the domain name.  

urllib2 functions which return a "url" value (such as "geturl()") should apply "encodings.idna.ToUnicode" to each label of the domain name.

Note that in both cases, the conversion function must be applied to each label (field between "."s) of the domain name only.  Applying it to the entire domain name or the entire URL will not work. 

If there are future changes to domain syntax, those should go into "encodings.idna", which is the proper library for domain syntax issues.
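
The per-label conversion described above can be sketched roughly as follows. This is only an illustration, not the patch attached to this issue; the helper names `host_to_ascii` and `host_to_unicode` are hypothetical, and the sketch assumes the host name has already been separated from the rest of the URL.

```python
import encodings.idna


def host_to_ascii(host: str) -> str:
    """Hypothetical helper: apply ToASCII to each label of a host name."""
    labels = []
    for label in host.split("."):
        # ToASCII rejects empty labels (e.g. after a trailing dot),
        # so those are passed through unchanged.
        if label:
            label = encodings.idna.ToASCII(label).decode("ascii")
        labels.append(label)
    return ".".join(labels)


def host_to_unicode(host: str) -> str:
    """Hypothetical helper: apply ToUnicode to each label of a host name."""
    labels = []
    for label in host.split("."):
        if label:
            label = encodings.idna.ToUnicode(label)
        labels.append(label)
    return ".".join(labels)


print(host_to_ascii("bücher.example"))
print(host_to_unicode("xn--bcher-kva.example"))
```

The built-in "idna" codec (`host.encode("idna")`) performs the same split on dots internally, so it may be a simpler alternative where only the ASCII form is needed.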
msg162723 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-13 19:10
I doubt that unicode domain support in urllib would be of much use without full IRI support.  I would think that a domain that uses unicode is highly likely to have URLs that use unicode.

However that doesn't mean a patch along the lines you suggest would be rejected out of hand, especially if someone can provide a real web site where it would be helpful.
msg162752 - (view) Author: John Nagle (nagle) Date: 2012-06-14 05:07
The current convention is that domains go into DNS lookup as punycode, and the path, query, and fragment fields of the URL are encoded with percent-escapes.  See

http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html

Python needs to get with the program here.
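
As a rough illustration of that convention (punycode for the host, percent-escapes for the rest), a minimal sketch using only the standard library might look like this. The function name `iri_to_uri` is hypothetical, the choice of "safe" characters is an assumption, and user info and IPv6 literals are deliberately ignored; a real implementation would need to follow RFC 3987 more carefully.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Hypothetical sketch: convert a Unicode URL to an ASCII-only one."""
    parts = urlsplit(iri)

    # The "idna" codec splits the host on dots and encodes each label.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # Percent-encode non-ASCII characters as UTF-8; "%" is kept safe so
    # that escapes already present in the input are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/straße?q=grün#über"))
```

The result could then be passed to urlopen, which is essentially the extra work an application has to do by hand today.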
msg162780 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-06-14 12:49
As I said, patches to improve the situation are welcome, and if they match with current internet practices they will likely be accepted.

It is still the case that such URLs are likely to require extra work on the part of the application to deal with the other unicode parts (your linked reference reinforces that).  So, IMO it would be *better* if someone would do an IRI module.  But the fact that nobody has stepped up for that should not prevent us from improving the situation in other ways.
msg162974 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2012-06-16 14:08
The werkzeug.urls module has examples of such conversion IRI-to-URI:
https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/urls.py#L109,L205
msg237426 - (view) Author: John Nagle (nagle) Date: 2015-03-07 07:45
Three years later, I'm converting to Python 3. Did this get fixed in Python 3?
msg238060 - (view) Author: Demian Brecht (demian.brecht) * (Python triager) Date: 2015-03-13 22:27
Here's a simple patch that adds functionality matching that in http.client to urllib.request. As pointed out by John, I see no reason why urllib and http.client shouldn't have consistent handling of IDNs independent of IRIs (although IRI encoding would be a nice addition as well).
History
Date User Action Args
2022-04-11 14:57:05 admin set github: 53888
2017-01-18 11:40:38 martin.panter link issue3991 dependencies
2015-03-13 22:32:09 berker.peksag set nosy: + berker.peksag; versions: + Python 3.5, - Python 3.3, Python 3.4
2015-03-13 22:27:48 demian.brecht set stage: patch review
2015-03-13 22:27:27 demian.brecht set files: + issue9679.patch; keywords: + patch; messages: + msg238060
2015-03-13 10:18:44 demian.brecht set nosy: + demian.brecht
2015-03-07 07:45:53 nagle set messages: + msg237426
2013-07-05 23:02:39 christian.heimes set nosy: + christian.heimes; versions: + Python 3.4
2012-06-16 14:08:23 flox set messages: + msg162974
2012-06-14 12:49:59 r.david.murray set messages: + msg162780
2012-06-14 05:07:21 nagle set messages: + msg162752
2012-06-13 20:19:50 cvrebert set nosy: + cvrebert
2012-06-13 19:10:14 r.david.murray set messages: + msg162723; versions: + Python 3.3, - Python 3.2
2012-06-13 18:51:10 nagle set nosy: + nagle; messages: + msg162722
2010-08-25 13:09:05 r.david.murray set keywords: - patch, buildbot
2010-08-25 13:08:40 r.david.murray set nosy: + r.david.murray, ncoghlan; messages: + msg114899; stage: patch review -> (no value)
2010-08-25 07:47:53 loewis set messages: + msg114886
2010-08-25 07:44:42 loewis link issue1027206 superseder
2010-08-25 07:39:27 loewis create