IDNA2008 encoding is missing #61507

Open
marten mannequin opened this issue Feb 27, 2013 · 37 comments
Labels
3.10 (only security fixes), stdlib (Python modules in the Lib dir), topic-SSL, type-feature (A feature request or enhancement)

Comments

@marten
Mannequin

marten mannequin commented Feb 27, 2013

BPO 17305
Nosy @loewis, @gpshead, @tiran, @bitdancer, @njsmith, @asvetlov, @ambv, @socketpair, @berkerpeksag, @Lukasa, @wumpus, @miss-islington, @epicfaace, @akulakov, @case
PRs
  • bpo-17305: Link to the third-party idna package. #25208
  • [3.9] bpo-17305: Link to the third-party idna package. (GH-25208) #25210
  • [3.8] bpo-17305: Link to the third-party idna package. (GH-25208) #25211
Files
  • idna_translate.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-02-27.01:32:46.318>
    labels = ['expert-SSL', 'type-feature', 'library', '3.10']
    title = 'IDNA2008 encoding is missing'
    updated_at = <Date 2021-11-04.19:39:59.773>
    user = 'https://bugs.python.org/marten'

    bugs.python.org fields:

    activity = <Date 2021-11-04.19:39:59.773>
    actor = 'case'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'SSL']
    creation = <Date 2013-02-27.01:32:46.318>
    creator = 'marten'
    dependencies = []
    files = ['29256']
    hgrepos = []
    issue_num = 17305
    keywords = ['patch']
    message_count = 35.0
    messages = ['183104', '183106', '183144', '183147', '183149', '183159', '183160', '183199', '183202', '205009', '205034', '217092', '217218', '278493', '279904', '310923', '310924', '310943', '310988', '311007', '311009', '311013', '349642', '349643', '349840', '349855', '349886', '370635', '389932', '390246', '390288', '390291', '390296', '391980', '396587']
    nosy_count = 20.0
    nosy_names = ['loewis', 'gregory.p.smith', 'christian.heimes', 'r.david.murray', 'njs', 'asvetlov', 'lukasz.langa', 'socketpair', 'underrun', 'berker.peksag', 'era', 'marten', 'Lukasa', 'wumpus', 'SamWhited', 'Socob', 'miss-islington', 'epicfaace', 'andrei.avk', 'case']
    pr_nums = ['25208', '25210', '25211']
    priority = 'high'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17305'
    versions = ['Python 3.10']

    @marten
    Mannequin Author

    marten mannequin commented Feb 27, 2013

    Since Python 2.3 the idna encoding is available for Internationalized Domain Names. But the current encoding doesn't work according to the latest version of the spec.

    There is a new IDNA2008 specification (RFCs 5890-5894). Although I'm not very deep into all the changes, I know that at least the nameprep has changed. For example, the German sharp S ('ß') isn't replaced by 'ss' any longer.

    The attached file shows the difference between the expected translation and the actual translation.
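A minimal sketch of the difference described above, assuming Python 3 and the built-in "idna" codec as it currently behaves (IDNA2003 nameprep). The IDNA2008 form quoted in the comment is the one reported later in this thread, not something the stdlib produces:

```python
# The built-in "idna" codec follows IDNA2003, so nameprep folds the
# German sharp s to "ss" before any punycode encoding takes place.
name = "straße.de"
idna2003 = name.encode("idna")
print(idna2003)  # b'strasse.de'

# Under IDNA2008 the expected A-label is xn--strae-oqa.de instead,
# which is a different hostname.
```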

    @marten marten mannequin added type-feature A feature request or enhancement stdlib Python modules in the Lib dir labels Feb 27, 2013
    @bitdancer
    Member

    How are they handling interoperability?

    @marten
    Mannequin Author

    marten mannequin commented Feb 27, 2013

    At least from the GNU people, two separate projects exist for this matter:

    libidn, the original IDNA translation (http://www.gnu.org/software/libidn/)
    libidn2, the IDNA2008 translation (http://www.gnu.org/software/libidn/libidn2/manual/libidn2.html)

    Btw.: Does Python provide a way to decode the ASCII-representation back to UTF-8?

    >>> name.encode('idna')
    'xn--mller-kva.com'
    
    >>> name.encode('idna').decode('utf-8')
    u'xn--mller-kva.com'

    Otherwise I'd look for Python bindings of libidn2 or idnkit-2.

    @marten
    Mannequin Author

    marten mannequin commented Feb 27, 2013

    For the embedded Python examples, please prepend the following lines:

    from __future__ import unicode_literals
    name='müller.com'

    So regarding interoperability: Usually you only use one implementation in your code and hopefully the latest release, but in case someone needs the old one, maybe there should be a separate encodings.idna2008 class.

    @bitdancer
    Member

    Does this mean the differences are only in the canonicalization of unicode values? IDNA is a wire protocol, which means that an application can't know if it is being asked to decode an idna1 or idna2 string unless there's something in the protocol that tells it. But if the differences are only on the encoding side, and an idna1 decoder will "do the right thing" with the idna2 string, then that would be interoperable. I'll have to read the standard, but I don't have time right now :)

    idna is a codec:

    >>> b'xn--mller-kva.com'.decode('idna')
    'müller.com'

    (that's python3, it'll be a unicode string in python2, obviously).
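For illustration, a short Python 3 round trip with the same codec. The variable names are only for this sketch. Decoding the A-label as UTF-8 merely yields the ASCII text again, while decoding it with "idna" restores the Unicode name:

```python
name = "müller.com"

ascii_form = name.encode("idna")          # bytes in A-label form
as_utf8 = ascii_form.decode("utf-8")      # still the ASCII spelling
as_unicode = ascii_form.decode("idna")    # back to the Unicode name

print(ascii_form)   # b'xn--mller-kva.com'
print(as_utf8)      # xn--mller-kva.com
print(as_unicode)   # müller.com
```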

    @marten
    Mannequin Author

    marten mannequin commented Feb 27, 2013

    IDNA2008 should be backwards compatible. I can try to explain it in a practical example:

    DENIC was the first registry that actually used IDNA2008, at a time when not even libidn2 officially included the changes required for it. This was mainly due to the fact that the German Latin Small Letter Sharp S ('ß') was treated differently from the German umlauts ('ä', 'ö', 'ü') in the original IDNA spec: it was not punycoded, because nameprep already replaced it with 'ss'. Replacing 'ß' with 'ss' is in general correct in German (e.g. if your keyboard doesn't allow you to enter 'ß'), but then 'ä' would have to be replaced by 'ae', 'ö' by 'oe' and 'ü' by 'ue' as well.

    Punycoding 'ä', 'ö', 'ü', but not 'ß' was inconsistent, and it made it impossible to register a domain name like straße.de, because it was translated to strasse.de. Therefore DENIC supported IDNA2008 very early to allow the registration of domain names containing 'ß'.

    The only thing I'm aware of in this situation is, that previously straße.de was translated to strasse.de, while with IDNA2008 it's being translated to xn--strae-oqa.de. So people that have hardcoded a URL containing 'ß' and who are expecting it to be translated to 'ss' would fail, because with IDNA2008 it would be translated to a different ASCII-hostname. But those people could just change 'ß' to 'ss' in their code and everything would work again.

    On the contrary, people that have registered a domain name containing 'ß' in the meantime couldn't access it right now by specifying the IDN version, because it would be translated to the wrong domain name with the current Python IDNA encoding. So the current IDNA-Encoding should be upgraded to IDNA2008.

    @bitdancer
    Member

    That doesn't sound like interoperability to me, that sounds like backward incompatibility :(. I hope you are right that it only affects people with hardcoded domain names, but that is still an issue.

    In any case, since this is a new feature it can only go into Python3.4, however we decide to do it.

    @marten
    Mannequin Author

    marten mannequin commented Feb 28, 2013

    I found an interesting link about this issue:

    http://www.unicode.org/faq/idn.html

    I also checked a domain name of a client that ends with 'straße.de': IE, Firefox and Chrome still use IDNA2003, Opera already does IDNA2008.

    In IDNA2008 a lot of characters aren't allowed any longer (like symbols or strike-through letters). But I think this doesn't have any practical relevance, because even while IDNA2003 formally allowed these characters, domain name registries disallowed to register internationalized domain names containing any of these characters.

    Most registries restrict the allowed characters very strongly, e.g. in the .de zone you cannot use Japanese characters, only those in use within the German language. Some other registries expect you to submit a language property during the domain registration and then only special characters within that language are allowed in the domain name. Also, most registries don't allow registering a domain name that mixes different languages.

    So IDNA2008 is the future and hopefully shouldn't break a lot. I don't know of any real life use of the IDNA encoding other than DNS / URLs. I don't know how many existing modules on PyPI that work with URLs already make use of the current encodings.idna class, but I guess it would cause more work if they all had to change their code to use name.encode('idna2008'), or ended up with an outdated encoding if left unchanged, than if we just silently switched encodings.idna to IDNA2008 and added encodings.idna2003 for those who really need the old one for some reason. Reminds me a bit of the range() / xrange() thing. Now the special new xrange() is the default and called just range() again. I guess in some years we'll look back on the IDNA2003/2008 transition the same way.

    @bitdancer
    Member

    Ah, excellent, that document looks like exactly what I was looking for.

    Now, when someone is going to get around to working on this, I don't know.

    (Note that the xrange/range change was made at the Python2/Python3 boundary, where we broke backward compatibility. I doubt that we are ever going to do that kind of transition again, but we do have ways to phase in changes in the default behavior over time.)

    @era
    Mannequin

    era mannequin commented Dec 2, 2013

    At least the following existing domain names are rejected by the current implementation, apparently because they are not IDNA2003-compatible.

    XN----NNC9BXA1KSA.COM
    XN--14-CUD4D3A.COM
    XN--YGB4AR5HPA.COM
    XN---14-00E9E9A.COM
    XN--MGB2DAM4BK.COM
    XN--6-ZHCPPA1B7A.COM
    XN--3-YMCCH8IVAY.COM
    XN--3-YMCLXLE2A3F.COM
    XN--4-ZHCJXA0E.COM
    XN--014-QQEUW.COM
    XN--118-Y2EK60DC2ZB.COM

    As a workaround, in the code where I needed to process these, I used a fallback to string[4:].decode('punycode'); this was in a code path where I had already lowercased the string and established that string[0:4] == 'xn--'.
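The workaround above might look roughly like this in Python 3. This is a sketch under assumptions: decode_label is a hypothetical helper name, and the second label is just one example of an A-label that the stdlib codec refuses because it fails the IDNA2003 round-trip check:

```python
def decode_label(label: bytes) -> str:
    """Decode one DNS label, tolerating A-labels the IDNA2003 codec rejects."""
    label = label.lower()
    try:
        return label.decode("idna")
    except UnicodeError:
        if label.startswith(b"xn--"):
            # Strip the ACE prefix and decode the raw punycode payload.
            return label[4:].decode("punycode")
        raise


print(decode_label(b"xn--mller-kva"))   # accepted by the stdlib codec
print(decode_label(b"XN--STRAE-OQA"))   # rejected by it, decoded via the fallback
```

Note that the fallback skips all validation, so the result should be treated as display text rather than as a checked hostname.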

    As a partial remedy, supporting a relaxed interpretation of the spec somehow would be useful; see also (tangentially) issue bpo-12263.

    @marten
    Mannequin Author

    marten mannequin commented Dec 2, 2013

    There's a nice library called idna on PyPI doing idna2008: https://pypi.python.org/pypi/idna/0.1

    I'd however prefer this standard encoding to be part of standard python.

    @underrun
    Mannequin

    underrun mannequin commented Apr 23, 2014

    It is worth noting that there do exist some domains that have been registered in the past that work with IDNA2003 but not IDNA2008.

    There definitely needs to be IDNA2008 support, for my use case I need to attempt IDNA2008 and then fall back to IDNA2003.

    When support for IDNA2008 is added, please retain support for IDNA2003.

    I would say that ideally there would be a codec that could handle both - attempt to use IDNA2008 and on error fallback to idna2003. I realize this isn't "official" but it would certainly be useful.
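A hedged sketch of such a "try IDNA2008, fall back to IDNA2003" helper. The function name and the idea of passing the IDNA2008 encoder in as a parameter are assumptions made for this example. The stdlib has no IDNA2008 encoder, so the caller has to supply one:

```python
from typing import Callable


def encode_hostname(name: str, idna2008_encode: Callable[[str], bytes]) -> bytes:
    """Try the supplied IDNA2008 encoder first, then the stdlib IDNA2003 codec."""
    try:
        return idna2008_encode(name)
    except UnicodeError:
        # The IDNA2008 encoder rejected the name; use the legacy rules.
        return name.encode("idna")
```

With the third-party idna package installed, one would presumably call encode_hostname(name, idna.encode); as far as I know its IDNAError derives from UnicodeError, so the except clause above would catch it. Silently falling back can resolve a name to a different host than the user intended, which is the security concern raised elsewhere in this thread.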

    @loewis
    Mannequin

    loewis mannequin commented Apr 26, 2014

    I would propose this approach:

    1. Python should implement both IDNA2008 and UTS#46, and keep IDNA2003
    2. "idna" should become an alias for "idna2003".
    3. The socket module and all other places that use the "idna" encoding should use "uts46" instead.
    4. Pre-existing implementations of IDNA 2008 should be used as inspirations at best; Python will need a new implementation from scratch, one that puts all relevant tables into the unicodedata module if they aren't there already. This is in particular where the idna 0.1 library fails. The implementation should refer to the relevant parts of the specification, to be easily reviewable for correctness.
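As a rough illustration of point 2, the name "idna2003" can already be made to resolve to the existing codec through a codec search function. This is only a sketch of the aliasing mechanism, not the proposed implementation, and _idna2003_alias is a hypothetical name:

```python
import codecs


def _idna2003_alias(name):
    """Codec search function: resolve 'idna2003' to the existing 'idna' codec."""
    if name == "idna2003":
        return codecs.lookup("idna")
    return None


codecs.register(_idna2003_alias)

print("straße.de".encode("idna2003"))   # same bytes as .encode("idna")
```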

    Contributions are welcome.

    @tiran tiran added topic-SSL 3.7 (EOL) end of life labels Sep 26, 2016
    @tiran tiran assigned tiran and unassigned tiran Sep 26, 2016
    @tiran
    Member

    tiran commented Oct 11, 2016

    I'm considering lack of IDNA 2008 a security issue for applications that perform DNS lookups and X.509 cert validation. Applications may end up connecting to the wrong machine and even validate the cert correctly.

    Wrong:

    >>> import socket
    >>> u'straße.de'.encode('idna')
    'strasse.de'
    >>> socket.gethostbyname(u'straße.de'.encode('idna'))
    '72.52.4.119'
    
    Correct:
    >>> import idna
    >>> idna.encode(u'straße.de')
    'xn--strae-oqa.de'
    >>> socket.gethostbyname(idna.encode(u'straße.de'))
    '81.169.145.78'

    @tiran tiran added type-security A security issue and removed type-feature A feature request or enhancement labels Oct 11, 2016
    @tiran
    Member

    tiran commented Nov 2, 2016

    I reported the issue for curl, CVE-2016-8625 https://curl.haxx.se/docs/adv_20161102K.html

    @wumpus
    Mannequin

    wumpus mannequin commented Jan 28, 2018

    I am avoiding Python's built-in libraries as much as possible in my aiohttp-based crawler because of this issue, but I cannot open a connection to https://xn--ho-hia.de because there is an 'IDNA does not round-trip' raise in the python 3.6 library ssl.py code.

    Happy to provide a code sample. I guess the 500-line async crawler in Guido's book was never used on German websites.

    @njsmith
    Contributor

    njsmith commented Jan 28, 2018

    Greg: That's bpo-28414. There's currently no motion towards builtin IDNA 2008 support (this bug), but I *think* in 3.7 the ssl module will be able to handle pre-encoded A-labels like that. I'm a little confused about the exact status right now but there's been lots of discussion about that specific issue and I think Christian is planning to get one of the relevant PRs merged ASAP.

    @tiran
    Member

    tiran commented Jan 28, 2018

    A fix will land in 3.7 and maybe get backported to 3.6. Stay tuned!

    @tiran
    Member

    tiran commented Jan 28, 2018

    bpo-31399 has fixed hostname matching for IDNA 2003 compatible domain names. IDNA 2008 domain names with German ß are still broken, for example:

    UnicodeError: ('IDNA does not round-trip', b'xn--knigsgchen-b4a3dun', b'xn--knigsgsschen-lcb0w')
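The round-trip failure can be reproduced step by step with a shorter label. A minimal sketch, assuming the stdlib codec's current IDNA2003 behaviour; the label used here is the 'straße' example from earlier in the thread rather than the one in the traceback:

```python
label = b"xn--strae-oqa"                  # IDNA2008 A-label for "straße"

decoded = label[4:].decode("punycode")    # raw punycode payload
reencoded = decoded.encode("idna")        # IDNA2003 encoding folds ß to "ss"

print(decoded)     # straße
print(reencoded)   # b'strasse'

# Because re-encoding does not reproduce the original label, the codec's
# decoder refuses it.
try:
    label.decode("idna")
    round_trips = True
except UnicodeError:
    round_trips = False

print(round_trips)  # False
```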

    @bitdancer
    Member

    What we need for this issue is someone volunteering to write the code. Given how long it has already been, I don't think anyone already on the core team is going to pick it up.

    @tiran
    Member

    tiran commented Jan 28, 2018

    I lack the expertise and time to implement IDNA 2008 with UTS46 codec. I considered GNU libidn2, but the library requires two more helper libraries and LGPLv3 might be an issue for us.

    @njsmith
    Contributor

    njsmith commented Jan 28, 2018

    The "obvious" solution would be to move the "idna" module into the stdlib, but someone would still have to work that out, and it's clearly not happening for 3.7.

    @njsmith njsmith added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Jan 28, 2018
    @asvetlov asvetlov changed the title IDNA2008 encoding missing IDNA2008 encoding is missing May 29, 2018
    @epicfaace epicfaace mannequin added 3.9 only security fixes and removed 3.8 only security fixes labels Aug 14, 2019
    @epicfaace
    Mannequin

    epicfaace mannequin commented Aug 14, 2019

    Why would chrome still be using IDNA 2003 to link http://straße.de to http://strasse.de?

    @tiran
    Member

    tiran commented Aug 14, 2019

    You have to ask the Chrome team.

    @tiran tiran added the 3.8 only security fixes label Aug 14, 2019
    @epicfaace
    Mannequin

    epicfaace mannequin commented Aug 16, 2019

    So is the consensus that the best way to do this is to move the "idna" library to stdlib, or implement it from scratch?

    @asvetlov
    Contributor

    There is no consensus yet, IMHO.
    There is a lack of resources for the issue.

    @tiran
    Member

    tiran commented Aug 16, 2019

    There is no consensus yet. Besides https://pypi.org/project/idna/ we could also consider wrapping libidn2 and shipping it. Both PyPI idna and libidn2 have potential licensing issues. I don't like the idea of reinventing the wheel and implementing our own idna2008 codec. It's not a trivial task.

    Once Python has a working idna2008 encoder, we need to address integration into socket, ssl, http, and asyncio module.

    @tiran
    Member

    tiran commented Jun 2, 2020

    BPO bpo-40845 is another case of IDNA 2003 / 2008 bug.

    @tiran tiran added 3.10 only security fixes type-feature A feature request or enhancement and removed 3.8 only security fixes 3.9 only security fixes type-security A security issue labels Mar 31, 2021
    @underrun
    Mannequin

    underrun mannequin commented Mar 31, 2021

    why the downgrade from security to enhancement and critical to high?

    this is a significant issue that can impact everything from phishing to TLS certificate domain validation and SNI.

    @tiran
    Member

    tiran commented Apr 5, 2021

    The issue has been waiting for contributions for 8 years now. So far nobody has shown an interest in addressing the problem and contributing an IDNA 2008 codec to Python's standard library.

    @gpshead
    Member

    gpshead commented Apr 6, 2021

    My PR merely adds a note to the docs linking to idna on pypi. Don't get excited, it doesn't implement anything. :P

    re "Once Python has a working idna2008 encoder, we need to address integration into socket, ssl, http, and asyncio module."

    ... doing that _could_ be the same can of worms the browsers all had to go through? We'd need to decide which behavior we wanted; pure? or matching what browsers do? I suspect that is equivalent to the pypi idna https://github.com/kjd/idna 's uts46=True + transitional=True mode [*] but anyone doing this work would need to figure that out for sure if we wanted to default to behaving like browsers with the transitional compatibility mode.

    That there is a need for a couple options on top of idna2008 as an encoding suggests it may not be a great fit for the Python codecs encodings system as those use a single string name. We'd need to permute the useful possible combos of flag behavior in the names. idna2003, idna2008, idna2008uts46, idna2008uts46transitional, and other combos of those if alternate combinations are deemed relevant.
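To make the naming problem concrete, here is a sketch of how option flags would have to be folded into codec names. The table and the options_for helper are hypothetical; the flag names uts46 and transitional mirror the PyPI idna options mentioned above:

```python
# Hypothetical mapping from codec names to the option flags they imply.
IDNA_CODEC_OPTIONS = {
    "idna2003": None,  # legacy codec, the IDNA2008 flags do not apply
    "idna2008": {"uts46": False, "transitional": False},
    "idna2008uts46": {"uts46": True, "transitional": False},
    "idna2008uts46transitional": {"uts46": True, "transitional": True},
}


def options_for(codec_name: str):
    """Return the flags implied by a codec name, or raise LookupError."""
    key = codec_name.lower().replace("-", "").replace("_", "")
    if key not in IDNA_CODEC_OPTIONS:
        raise LookupError(f"unknown IDNA codec name: {codec_name}")
    return IDNA_CODEC_OPTIONS[key]


print(options_for("idna2008-uts46-transitional"))
```

Every additional flag doubles the number of names that would have to be registered, whereas a function-style API can simply grow a keyword argument.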

    I worry that a browser-transitional-behavior-matching situation may change over time as TLDs decide when to change their policies. Is that an irrational fear? Browsers are well equipped to deal with this as they've got frequent updates. A PyPI package could as well.

    [*] Browser history:

    fwiw people wondering _why_ browsers like Chrome and Firefox don't "just blindly use idna2008 for everything" should go read the backwards compatibility transitional rationale and security concerns in https://bugs.chromium.org/p/chromium/issues/detail?id=61328
    and https://bugzilla.mozilla.org/show_bug.cgi?id=479520

    (caution: be ready to filter out the random internet whiners from those threads)

    @miss-islington
    Contributor

    New changeset 1d023e3 by Gregory P. Smith in branch 'master':
    bpo-17305: Link to the third-party idna package. (GH-25208)
    1d023e3

    @miss-islington
    Contributor

    New changeset c7ccb0f by Miss Islington (bot) in branch '3.9':
    bpo-17305: Link to the third-party idna package. (GH-25208)
    c7ccb0f

    @ambv
    Contributor

    ambv commented Apr 26, 2021

    New changeset 2760a67 by Miss Islington (bot) in branch '3.8':
    bpo-17305: Link to the third-party idna package. (GH-25208) (bpo-25211)
    2760a67

    @akulakov
    Contributor

    Maybe deprecate idna so that users are strongly prompted to consider the pypi idna?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @JannesAlthoff

    I think IDNA2008 with TR46 without Transitional should be used, because browsers use this, and it should replace the current idna codec. The old codecs and IDNA2008 with other options could be provided under different codec names.

    This should be done because the current behavior is a high severity vulnerability. And replacing idna2003 with TR46 without Transitional matches the behavior of most browsers.

    @JannesAlthoff

    A not so good temporary fix could be not removing the eszett, final sigma, ZWJ and ZWNJ in the stringprep function.

    This should be easy:

    newlabel.append(stringprep.map_table_b2(c))

    Would be replaced by

    newlabel.append(stringprep.map_table_b2(c) if c!="ß" and c!="ς" else c)
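A standalone sketch of that mapping step, modelled on the loop the two lines above would sit in. The names map_label and KEEP are hypothetical, and this is not the stdlib nameprep function: it covers only the mapping stage and leaves out normalization and the prohibited-character checks. ZWJ and ZWNJ are still dropped by the B.1 table here, since the suggested one-line change does not touch that part:

```python
import stringprep

KEEP = {"ß", "ς"}  # characters exempted from the B.2 case-folding table


def map_label(label: str) -> str:
    """Apply the nameprep mapping step, leaving ß and final sigma untouched."""
    out = []
    for c in label:
        if stringprep.in_table_b1(c):
            continue  # "map to nothing" characters are dropped
        out.append(c if c in KEEP else stringprep.map_table_b2(c))
    return "".join(out)


print(map_label("Straße"))                                     # straße
print("".join(stringprep.map_table_b2(c) for c in "Straße"))   # strasse
```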


    9 participants