Title: Encoding str to IDNA with ellipsis decomposes to empty labels
Created on 2014-03-30 17:57 by chfoo, last changed 2022-04-11 14:58 by admin.

msg215189 - (view) Author: Christopher Foo (chfoo) Date: 2014-03-30 17:57
When encoding a string with the IDNA codec, I expected it to always raise an exception if the result would contain empty labels. When I do this

    >>> 'example.c…'.encode('idna').decode('ascii')

it returns

    'example.c...'

instead of raising UnicodeError. In case it is hard to see, the original string ends with U+2026 HORIZONTAL ELLIPSIS. These strings come from web pages fetched by a web crawler.

I tested this on Python 3.4, 3.3.2, 2.7.5, 2.6.9.
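
For a crawler that wants the exception described above, one possible workaround is to check the encoded result for empty labels. This is only a sketch: `encode_hostname_strict` is a hypothetical helper name, not part of the standard library, and it assumes a single trailing dot (the root label) should still be accepted.

```python
def encode_hostname_strict(hostname):
    """Hypothetical helper: IDNA-encode, then reject empty labels."""
    encoded = hostname.encode('idna').decode('ascii')
    labels = encoded.split('.')
    # Assumption: allow one trailing dot, which denotes the root label.
    if labels and labels[-1] == '':
        labels = labels[:-1]
    # Any remaining empty label means two dots were adjacent.
    if any(label == '' for label in labels):
        raise UnicodeError('empty label after IDNA encoding: %r' % encoded)
    return encoded
```

With this check, `encode_hostname_strict('example.c\u2026')` raises UnicodeError, while an ordinary name such as 'example.com' is returned unchanged.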
msg215198 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-03-30 19:53
I believe this behavior is correct with respect to RFC 3490. In the input, the last label is "c…", which is not empty. It is passed to ToASCII, which normalizes the ellipsis to "...". If UseSTD3ASCIIRules were true, conversion would fail because the result contains "." (\x2E). However, Python chooses not to set UseSTD3ASCIIRules (and instead leaves it to the DNS server to decide whether the name is valid).

I believe this is actually a bug in the RFC, which should ban "." from the set of conversion results regardless of UseSTD3ASCIIRules. However, since this RFC is superseded, you probably won't get anybody to confirm this view.
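
The order of operations described above (split into labels first, normalize each label afterwards) can be illustrated roughly as follows. This is a simplified sketch, not the codec's actual code: `split_then_normalize` is a made-up name, it splits only on U+002E, and it applies only the NFKC step of nameprep.

```python
import unicodedata

def split_then_normalize(name):
    """Illustrative only: split on '.', then NFKC-normalize each label."""
    # At split time the last label is "c\u2026", which is not empty.
    labels = name.split('.')
    # NFKC maps U+2026 to three U+002E characters, so dots appear
    # inside a label only after the split has already happened.
    return [unicodedata.normalize('NFKC', label) for label in labels]

print(split_then_normalize('example.c\u2026'))
```

The second label comes back as "c...", which is why rejoining the labels yields a name with empty labels even though none existed when the input was split.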
msg215199 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-30 20:50
For whatever it is worth, it looks like RFC 5892 marks U+2026 as disallowed.
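
A rough way to see why, assuming the relevant rule is the RFC 5892 "Unstable" category (code points that change under NFKC plus case folding), is the check below. `is_unstable` is a hypothetical name for this sketch, and the RFC's full derivation involves more categories than this one.

```python
import unicodedata

def is_unstable(ch):
    """Sketch of the 'Unstable' test: toNFKC(toCaseFold(toNFKC(cp))) != cp."""
    folded = unicodedata.normalize('NFKC', ch).casefold()
    return unicodedata.normalize('NFKC', folded) != ch

# U+2026 has a compatibility decomposition to three full stops.
print(unicodedata.decomposition('\u2026'))
print(is_unstable('\u2026'))
```

A plain lowercase letter such as 'a' passes through this round trip unchanged, whereas U+2026 does not.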
