classification
Title: IDNA2008 encoding missing
Type: security Stage: needs patch
Components: Library (Lib), SSL Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Lukasa, SamWhited, Socob, berker.peksag, christian.heimes, era, loewis, marten, r.david.murray, underrun
Priority: critical Keywords:

Created on 2013-02-27 01:32 by marten, last changed 2017-01-09 18:33 by Socob.

Files
File name Uploaded Description Edit
idna_translate.py marten, 2013-02-27 01:32
Messages (15)
msg183104 - Author: Marten Lehmann (marten) Date: 2013-02-27 01:32
Since Python 2.3 the idna encoding has been available for Internationalized Domain Names. But the current encoding doesn't follow the latest version of the spec.

There is a new IDNA2008 specification (RFCs 5890-5894). Although I'm not very deep into all the changes, I know that at least the nameprep has changed. For example, the German sharp S ('ß') isn't replaced by 'ss' any longer.

The attached file shows the difference between the expected translation and the actual translation.
msg183106 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-27 02:26
How are they handling interoperability?
msg183144 - Author: Marten Lehmann (marten) Date: 2013-02-27 12:25
At least from the GNU people, two separate projects exist for this matter:

libidn, the original IDNA translation (http://www.gnu.org/software/libidn/)
libidn2, the IDNA2008 translation (http://www.gnu.org/software/libidn/libidn2/manual/libidn2.html)

Btw.: Does Python provide a way to decode the ASCII-representation back to UTF-8?

>>> name.encode('idna')
'xn--mller-kva.com'

>>> name.encode('idna').decode('utf-8')
u'xn--mller-kva.com'

Otherwise I'd look for Python bindings of libidn2 or idnkit-2.
msg183147 - Author: Marten Lehmann (marten) Date: 2013-02-27 12:29
For the embedded Python examples, please prepend the following lines:

from __future__ import unicode_literals
name='müller.com'

So regarding interoperability: Usually you only use one implementation in your code, hopefully the latest release, but in case someone needs the old one, maybe there should be a separate encodings.idna2008 class.
msg183149 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-27 12:37
Does this mean the differences are only in the canonicalization of unicode values?  IDNA is a wire protocol, which means that an application can't know if it is being asked to decode an idna1 or idna2 string unless there's something in the protocol that tells it.  But if the differences are only on the encoding side, and an idna1 decoder will "do the right thing" with the idna2 string, then that would be interoperable.  I'll have to read the standard, but I don't have time right now :)

idna is a codec:

>>> b'xn--mller-kva.com'.decode('idna')
'müller.com'

(that's python3, it'll be a unicode string in python2, obviously).
msg183159 - Author: Marten Lehmann (marten) Date: 2013-02-27 16:39
IDNA2008 should be backwards compatible. I can try to explain it in a practical example:

DENIC was the first registry that actually used IDNA2008, at a time when not even libidn2 officially included the changes required for it. This was mainly because the German Latin Small Letter Sharp S ('ß') was treated differently from the German umlauts ('ä', 'ö', 'ü') in the original IDNA spec: it was not punycoded, because nameprep already replaced it with 'ss'. Replacing 'ß' with 'ss' is in general correct in German (e.g. if your keyboard doesn't let you enter 'ß'), but then 'ä' would have to be replaced by 'ae', 'ö' by 'oe' and 'ü' by 'ue' as well.

Punycoding 'ä', 'ö', 'ü', but not 'ß' was inconsistent, and it made it impossible to register a domain name like straße.de, because it was translated to strasse.de. Therefore DENIC supported IDNA2008 very early to allow the registration of domain names containing 'ß'.

The only thing I'm aware of in this situation is that previously straße.de was translated to strasse.de, while with IDNA2008 it is translated to xn--strae-oqa.de. So code with a hardcoded URL containing 'ß' that expects it to be translated to 'ss' would fail, because with IDNA2008 it would be translated to a different ASCII hostname. But those people could just change 'ß' to 'ss' in their code and everything would work again.
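The difference described above can be reproduced with the standard library alone. The following is only a sketch in Python 3: the helper name `to_a_label_without_mapping` is hypothetical, and it is not an IDNA2008 implementation, because it skips all of the validity checks the RFCs require. It merely shows what happens when a label is Punycode-encoded without nameprep first mapping 'ß' to 'ss'.

```python
# Sketch only: compare the stdlib IDNA2003 codec with a bare Punycode
# encoding of the same labels. The helper name is hypothetical.

def to_a_label_without_mapping(label: str) -> bytes:
    """Punycode-encode one label and add the ACE prefix, skipping nameprep.

    This is NOT IDNA2008: none of the validity checks from the
    IDNA2008 RFCs are performed here.
    """
    if label.isascii():
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")


hostname = "straße.de"

# The stdlib codec applies nameprep first, so 'ß' becomes 'ss'.
idna2003 = hostname.encode("idna")

# Encoding each label directly keeps 'ß' and yields an xn-- label.
unmapped = b".".join(to_a_label_without_mapping(p) for p in hostname.split("."))

print(idna2003)   # b'strasse.de'
print(unmapped)   # b'xn--strae-oqa.de'
```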

Conversely, people who have registered a domain name containing 'ß' in the meantime cannot access it right now using the IDN form, because it would be translated to the wrong domain name with the current Python IDNA encoding. So the current IDNA encoding should be upgraded to IDNA2008.
msg183160 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-27 16:52
That doesn't sound like interoperability to me, that sounds like backward incompatibility :(.  I hope you are right that it only affects people with hardcoded domain names, but that is still an issue.

In any case, since this is a new feature it can only go into Python3.4, however we decide to do it.
msg183199 - Author: Marten Lehmann (marten) Date: 2013-02-28 04:08
I found an interesting link about this issue:

http://www.unicode.org/faq/idn.html

I also checked a domain name of a client that ends with 'straße.de': IE, Firefox and Chrome still use IDNA2003, Opera already does IDNA2008.

In IDNA2008 a lot of characters aren't allowed any longer (like symbols or strike-through letters). But I think this doesn't have any practical relevance, because even while IDNA2003 formally allowed these characters, domain name registries did not allow registering internationalized domain names containing any of them.

Most registries restricted the allowed characters very strongly, e.g. in the .de zone you cannot use Japanese characters, only those in use within the German language. Some other registries expect you to submit a language property during the domain registration and then only special characters within that language are allowed in the domain name. Also, most registries don't allow registering a domain name that mixes different languages.

So IDNA2008 is the future and hopefully shouldn't break a lot. I don't know of any real-life use of the IDNA encoding other than DNS / URLs. I don't know how many existing modules on PyPI that work with URLs already use the current encodings.idna class. But I guess it would cause more work if they all had to change their code to use name.encode('idna2008'), or were left with an outdated encoding if unchanged, than if encodings.idna silently switched to IDNA2008 and encodings.idna2003 were added for those who really need the old one for some reason. It reminds me a bit of the range() / xrange() thing: now the special new xrange() is the default and is called just range() again. I guess in some years we'll look back on the IDNA2003/2008 transition the same way.
msg183202 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-28 04:20
Ah, excellent, that document looks like exactly what I was looking for.

Now, when someone is going to get around to working on this, I don't know.

(Note that the xrange/range change was made at the Python2/Python3 boundary, where we broke backward compatibility.  I doubt that we are ever going to do that kind of transition again, but we do have ways to phase in changes in the default behavior over time.)
msg205009 - Author: (era) Date: 2013-12-02 13:07
At least the following existing domain names are rejected by the current implementation, apparently because they are not IDNA2003-compatible.

XN----NNC9BXA1KSA.COM
XN--14-CUD4D3A.COM
XN--YGB4AR5HPA.COM
XN---14-00E9E9A.COM
XN--MGB2DAM4BK.COM
XN--6-ZHCPPA1B7A.COM
XN--3-YMCCH8IVAY.COM
XN--3-YMCLXLE2A3F.COM
XN--4-ZHCJXA0E.COM
XN--014-QQEUW.COM
XN--118-Y2EK60DC2ZB.COM

As a workaround, in the code where I needed to process these, I used a fallback to string[4:].decode('punycode'); this was in a code path where I had already lowercased the string and established that string[0:4] == 'xn--'.
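A Python 3 rendering of that workaround might look like the sketch below. The function names are hypothetical, and the approach is exactly as relaxed as described: when the stdlib codec rejects a label, the raw Punycode payload is decoded with no IDNA validation at all, so the result should not be trusted for anything security-sensitive.

```python
# Sketch only: fall back to raw Punycode decoding when the stdlib
# (IDNA2003) codec rejects a label. Function names are hypothetical.

def decode_label_relaxed(label: str) -> str:
    """Decode one DNS label, tolerating labels the 'idna' codec rejects."""
    lowered = label.lower()
    try:
        return lowered.encode("ascii").decode("idna")
    except UnicodeError:
        # The 'idna' codec refused the label. If it carries the ACE
        # prefix, decode the remainder as plain Punycode instead.
        if lowered.startswith("xn--"):
            return lowered[4:].encode("ascii").decode("punycode")
        raise


def decode_domain_relaxed(domain: str) -> str:
    """Apply the relaxed decoding label by label."""
    return ".".join(decode_label_relaxed(part) for part in domain.split("."))


print(decode_domain_relaxed("XN--MLLER-KVA.COM"))  # müller.com
```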

As a partial remedy, supporting a relaxed interpretation of the spec somehow would be useful; see also (tangentially) issue #12263.
msg205034 - Author: Marten Lehmann (marten) Date: 2013-12-02 17:14
There's a nice library called idna on PyPI that does IDNA2008: https://pypi.python.org/pypi/idna/0.1

I'd however prefer this standard encoding to be part of standard python.
msg217092 - Author: Derek Wilson (underrun) Date: 2014-04-23 22:00
It is worth noting that there do exist some domains that have been registered in the past that work with IDNA2003 but not IDNA2008.

There definitely needs to be IDNA2008 support; for my use case I need to attempt IDNA2008 and then fall back to IDNA2003.

When support for IDNA2008 is added, please retain support for IDNA2003.

I would say that ideally there would be a codec that could handle both: attempt to use IDNA2008 and on error fall back to IDNA2003. I realize this isn't "official" but it would certainly be useful.
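Until such a codec exists, the try-then-fall-back behaviour can be approximated in application code. The sketch below assumes the third-party idna package mentioned earlier in this thread is installed, and degrades to the stdlib codec alone when it is not. The function name `encode_2008_then_2003` is hypothetical.

```python
# Sketch only: try IDNA2008 via the third-party 'idna' package, then
# fall back to the stdlib IDNA2003 codec. The function name is hypothetical.

try:
    import idna  # third-party package from PyPI; assumed, not guaranteed
except ImportError:
    idna = None


def encode_2008_then_2003(hostname: str) -> bytes:
    """Return the ASCII form of hostname, preferring IDNA2008 rules."""
    if idna is not None:
        try:
            return idna.encode(hostname)
        except idna.IDNAError:
            # Not valid under IDNA2008; try the older rules below.
            pass
    return hostname.encode("idna")


print(encode_2008_then_2003("müller.com"))
```

Note that the two code paths can return different names for the same input (straße.de is the example from this thread), so a caller that cares about which rules were applied would need to record that as well.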
msg217218 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-04-26 21:20
I would propose this approach:

1. Python should implement both IDNA2008 and UTS#46, and keep IDNA2003
2. "idna" should become an alias for "idna2003".
3. The socket module and all other places that use the "idna" encoding should use "uts46" instead.
4. Pre-existing implementations of IDNA 2008 should be used as inspiration at best; Python will need a new implementation from scratch, one that puts all relevant tables into the unicodedata module if they aren't there already. This is in particular where the idna 0.1 library fails. The implementation should refer to the relevant parts of the specification, to be easily reviewable for correctness.

Contributions are welcome.
msg278493 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-10-11 14:52
I consider the lack of IDNA 2008 support a security issue for applications that perform DNS lookups and X.509 cert validation. Applications may end up connecting to the wrong machine and even validate the cert correctly.

Wrong:

>>> import socket
>>> u'straße.de'.encode('idna')
'strasse.de'
>>> socket.gethostbyname(u'straße.de'.encode('idna'))
'72.52.4.119'

Correct:
>>> import idna
>>> idna.encode(u'straße.de')
'xn--strae-oqa.de'
>>> socket.gethostbyname(idna.encode(u'straße.de'))
'81.169.145.78'
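Without any third-party package, an application can at least detect when the stdlib codec is about to map a hostname to a different name. The sketch below uses a round trip through the codec as a conservative check. The function name `idna2003_changes_name` is hypothetical, and the check may also flag harmless normalizations, so it is a warning signal rather than a fix.

```python
# Sketch only: flag hostnames that the stdlib IDNA2003 codec would map
# to a different name. The function name is hypothetical.

def idna2003_changes_name(hostname: str) -> bool:
    """Return True if encoding and decoding with 'idna' alters hostname."""
    round_tripped = hostname.encode("idna").decode("idna")
    return round_tripped != hostname.lower()


print(idna2003_changes_name("straße.de"))   # True: 'ß' was mapped to 'ss'
print(idna2003_changes_name("müller.com"))  # False: the name survives intact
```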
msg279904 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2016-11-02 08:16
I reported the issue for curl, CVE-2016-8625 https://curl.haxx.se/docs/adv_20161102K.html
History
Date User Action Args
2017-01-09 18:33:47 Socob set nosy: + Socob
2016-11-02 08:16:45 christian.heimes set messages: + msg279904
2016-10-13 14:00:56 SamWhited set nosy: + SamWhited
2016-10-12 16:15:01 Lukasa set nosy: + Lukasa
2016-10-11 14:52:46 christian.heimes set priority: high -> critical; type: enhancement -> security; messages: + msg278493
2016-09-26 14:16:19 christian.heimes set assignee: christian.heimes ->
2016-09-26 13:53:11 christian.heimes set priority: normal -> high; assignee: christian.heimes; components: + SSL; versions: + Python 3.7, - Python 3.5
2015-05-15 14:51:06 christian.heimes set nosy: + christian.heimes
2015-03-25 18:16:02 berker.peksag set nosy: + berker.peksag; versions: + Python 3.5, - Python 3.4
2014-04-26 21:20:43 loewis set nosy: + loewis; messages: + msg217218
2014-04-23 22:00:47 underrun set nosy: + underrun; messages: + msg217092
2013-12-02 17:14:07 marten set messages: + msg205034
2013-12-02 13:07:32 era set nosy: + era; messages: + msg205009
2013-02-28 04:20:16 r.david.murray set messages: + msg183202
2013-02-28 04:08:46 marten set messages: + msg183199
2013-02-27 16:52:49 r.david.murray set stage: needs patch; messages: + msg183160; versions: - Python 2.6, Python 3.1, Python 2.7, Python 3.2, Python 3.3, Python 3.5
2013-02-27 16:39:54 marten set messages: + msg183159
2013-02-27 12:37:23 r.david.murray set messages: + msg183149
2013-02-27 12:29:21 marten set messages: + msg183147
2013-02-27 12:25:21 marten set messages: + msg183144
2013-02-27 02:26:01 r.david.murray set nosy: + r.david.murray; messages: + msg183106
2013-02-27 01:32:46 marten create