This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: socket module calls with long host names can fail with idna codec error
Type:
Stage: needs patch
Components: Library (Lib)
Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Ben.Darnell, ablack, agnosticdev, alexmv, gregory.p.smith, joseph.hackman, midopa, r.david.murray, sdbowman
Priority: normal
Keywords:

Created on 2018-02-26 19:52 by ablack, last changed 2022-04-11 14:58 by admin.

Messages (10)
msg312947 - (view) Author: Aaron Black (ablack) Date: 2018-02-26 19:52
While working on a custom conda channel with authentication, I ran into the following UnicodeError:

Traceback (most recent call last):
  File "/Users/ablack/miniconda3/lib/python3.6/site-packages/conda/core/repodata.py", line 402, in fetch_repodata_remote_request
    timeout=timeout)
  File "/Users/ablack/miniconda3/lib/python3.6/site-packages/requests/sessions.py", line 521, in get
    return self.request('GET', url, **kwargs)
  File "/Users/ablack/miniconda3/lib/python3.6/site-packages/requests/sessions.py", line 499, in request
    prep.url, proxies, stream, verify, cert
  File "/Users/ablack/miniconda3/lib/python3.6/site-packages/requests/sessions.py", line 672, in merge_environment_settings
    env_proxies = get_environ_proxies(url, no_proxy=no_proxy)
  File "/Users/ablack/miniconda3/lib/python3.6/site-packages/requests/utils.py", line 692, in get_environ_proxies
    if should_bypass_proxies(url, no_proxy=no_proxy):
  File "/Users/ablack/miniconda3/lib/python3.6/site-packages/requests/utils.py", line 676, in should_bypass_proxies
    bypass = proxy_bypass(netloc)
  File "/Users/ablack/miniconda3/lib/python3.6/urllib/request.py", line 2612, in proxy_bypass
    return proxy_bypass_macosx_sysconf(host)
  File "/Users/ablack/miniconda3/lib/python3.6/urllib/request.py", line 2589, in proxy_bypass_macosx_sysconf
    return _proxy_bypass_macosx_sysconf(host, proxy_settings)
  File "/Users/ablack/miniconda3/lib/python3.6/urllib/request.py", line 2562, in _proxy_bypass_macosx_sysconf
    hostIP = socket.gethostbyname(hostonly)
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)

The error can be consistently reproduced when the first label of the URL hostname is 64 or more characters long, as in "0123456789012345678901234567890123456789012345678901234567890123.example.com". This wouldn't be a problem, except that the credentials don't seem to be separated from the first label of the hostname, so the entire "[user]:[secret]@XXX" section must be shorter than 64 characters. This is problematic for services that use longer API keys and expect them to be submitted over basic auth.
msg313163 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-03-02 21:32
Thanks for the report.  The behavior you see can be further isolated to socket.gethostbyname:

>>> import socket
>>> h = "0123456789012345678901234567890123456789012345678901234567890123.example.com"
>>> socket.gethostbyname(h)
Traceback (most recent call last):
  File "/usr/lib/python3.6/encodings/idna.py", line 165, in encode
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)

Other socket module calls accepting host names fail similarly, such as getaddrinfo.
msg313164 - (view) Author: Aaron Black (ablack) Date: 2018-03-02 21:46
Just to be clear, I don't know whether the socket module needs to support host name labels that are 64 characters long, so here's an example URL that is at the root of my problem and that I'm pretty sure it should support:

>>> import socket
>>> h = "username:long_api_key0123456789012345678901234567890123456789@www.example.com"
>>> socket.gethostbyname(h)
Traceback (most recent call last):
  File "/Users/ablack/miniconda3/lib/python3.6/encodings/idna.py", line 165, in encode
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)
msg313323 - (view) Author: Matt Eaton (agnosticdev) * Date: 2018-03-06 12:35
Using Ubuntu 16.04 with the 3.6.0 tag I was also able to reproduce the same error reported:

import socket

h = "0123456789012345678901234567890123456789012345678901234567890123.example.com"
socket.gethostbyname(h)

Traceback (most recent call last):
  File "/home/agnosticdev/Documents/code/python/python-dev/cpython-3_6_0/Lib/encodings/idna.py", line 165, in encode
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "host_test.py", line 8, in <module>
    socket.gethostbyname(h)
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)


It looks like the first label of the hostname being 64 characters long is the issue, in that it cannot be encoded. The call therefore falls into the UnicodeError raised in idna.py:
            # ASCII name: fast path
            labels = result.split(b'.')
            for label in labels[:-1]:
                if not (0 < len(label) < 64):
                    raise UnicodeError("label empty or too long")
            if len(labels[-1]) >= 64:
                raise UnicodeError("label too long")
            return result, len(input)

I did some work to try to resolve this, but ultimately it was not worth committing, so I wanted to report my findings.
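
As a rough illustration of the check quoted above, the following sketch probes the built-in 'idna' codec directly, without going through the socket module. The helper name `idna_encodable` is made up for this example; the only assumption is that the codec keeps rejecting labels of 64 or more characters with a UnicodeError (or a subclass of it), as in the tracebacks in this thread.

```python
def idna_encodable(host):
    """Return True if the built-in 'idna' codec accepts this host name.

    Hypothetical helper for illustration; it is not part of the stdlib.
    """
    try:
        host.encode("idna")
    except UnicodeError:
        # Raised for empty labels and for labels of 64 or more characters.
        return False
    return True


# A 63-character first label is accepted, a 64-character one is not.
ok_host = "a" * 63 + ".example.com"
bad_host = "a" * 64 + ".example.com"
print(idna_encodable(ok_host), idna_encodable(bad_host))
```

Since the failure comes from the codec, it should be reproducible on any platform without network access.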
msg372865 - (view) Author: Steve Bowman (sdbowman) Date: 2020-07-02 16:54
When will this issue be fixed?  Thanks!
msg373006 - (view) Author: Joseph Hackman (joseph.hackman) * Date: 2020-07-05 00:22
According to the DNS standard, hostnames with more than 63 characters per label (the sections between .) are not allowed [https://tools.ietf.org/html/rfc1035#section-2.3.1].

That said, enforcing that at the codec level might be the wrong choice. I threw together a quick patch moving the limits up to 250, and nothing blew up. It's unclear what the general usefulness of such a change would be, since DNS servers probably couldn't handle those requests anyway.

As for the original issue, if anybody is still doing something like that, could they provide a full example URL? I was unable to reproduce on HTTP (failed in a different place), or FTP.
msg374207 - (view) Author: Aaron Black (ablack) Date: 2020-07-24 19:35
joseph.hackman

I don't think that the 63 character limit on a label is the problem specifically, merely its application.

The crux of my issue was that credentials passed with the url in a basic-authy fashion (as some services require) count against the label length. For example, this would trigger the error:

h = "https://ablack:very_long_api_key_0123456789012345678901234567890123456789012345678901234567890123@www.example.com"

Since the first label would be treated as:
 "ablack:very_long_api_key_0123456789012345678901234567890123456789012345678901234567890123@www"

My specific issue goes away if any text up to and including an "@" in the first label section is not included in the label validation. I don't know offhand whether that information is supposed to count as part of the label per the DNS spec, though.
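
For callers hitting this, one possible workaround is to split the URL first and hand only the host part to the resolver. The sketch below uses `urllib.parse.urlsplit`; the URL and the 80-character placeholder key are invented for illustration, and this is not a claim about what conda, requests, or urllib do internally.

```python
from urllib.parse import urlsplit

# Hypothetical URL with a long placeholder API key in the userinfo part.
url = "https://ablack:" + "k" * 80 + "@www.example.com/channel"

parts = urlsplit(url)
netloc = parts.netloc      # still contains "user:secret@"
hostname = parts.hostname  # userinfo and port stripped, lowercased

# Splitting the raw netloc on "." folds the credentials into the first label.
first_label_of_netloc = netloc.split(".")[0]

print(len(first_label_of_netloc))  # well above the 63-character limit
print(hostname)                    # safe to pass to socket.gethostbyname
```

Passing `hostname` rather than `netloc` to the socket functions keeps the credentials out of the label length check and also avoids sending them to a name resolver.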
msg391990 - (view) Author: Alex Vandiver (alexmv) Date: 2021-04-26 22:04
It seems reasonable to fail on hostnames that are too long -- but it feels like the weirdness is that it is categorized as a UnicodeError, and not as, say, a ValueError.

Would a re-categorization as ValueError seem like a reasonable adjustment here?
msg393323 - (view) Author: Ben Darnell (Ben.Darnell) * Date: 2021-05-09 15:57
(I'm coming here from https://github.com/tornadoweb/tornado/pull/3010)

UnicodeError is a subclass of ValueError, so I don't see what value that change would provide. The thing that's surprising to me is that it's not a `socket.herror` (or `gaierror` for socket.getaddrinfo). I guess the docs don't formally say that `herror`/`gaierror` is the *only* possible error from these functions, but `gaierror` was the only error I was catching so the unexpected UnicodeError escaped the layer that was intended to handle it. 

I do think that in the special case of `getaddrinfo` with the `AI_NUMERICHOST` flag it should be handled differently: in that mode there is no network access necessary and it's reasonable to assume that the only possible error is a `gaierror` with `EAI_NONAME`. 

I'd like to at least see better documentation about what errors are possible from this family of functions.
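
Until the behavior or the documentation changes, a caller that only wants to handle `gaierror` could translate the codec failure itself. This is a minimal sketch under that assumption; `getaddrinfo_gaierror` is a hypothetical wrapper name, and the choice of `EAI_NONAME` is a judgment call for this example, not something the socket module does.

```python
import socket


def getaddrinfo_gaierror(host, port, *args, **kwargs):
    """Call socket.getaddrinfo, reporting idna failures as gaierror.

    Hypothetical wrapper for illustration only.
    """
    try:
        return socket.getaddrinfo(host, port, *args, **kwargs)
    except UnicodeError as exc:
        # UnicodeError is a subclass of ValueError; re-raise it as the
        # error type the calling layer already expects.
        raise socket.gaierror(
            socket.EAI_NONAME, f"invalid host name: {exc}"
        ) from exc


try:
    getaddrinfo_gaierror("a" * 64 + ".example.com", 80,
                         flags=socket.AI_NUMERICHOST)
except socket.gaierror as exc:
    print("lookup failed:", exc)
```

With `AI_NUMERICHOST` no network access should be needed, so the only error the caller sees here is a `gaierror`.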
msg411539 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-25 00:36
ablack: the basic auth username:password@ part of the string is not part of a hostname.  What code are you seeing that is trying to send that to a name resolver rather than stripping the obviously private info up through the @ sign?
History
Date                 User             Action  Args
2022-04-11 14:58:58  admin            set     github: 77139
2022-01-25 00:36:47  gregory.p.smith  set     nosy: + gregory.p.smith
                                              messages: + msg411539
2021-05-09 15:57:56  Ben.Darnell      set     nosy: + Ben.Darnell
                                              messages: + msg393323
2021-04-26 22:04:10  alexmv           set     nosy: + alexmv
                                              messages: + msg391990
2020-11-15 04:15:55  midopa           set     nosy: + midopa
2020-07-24 19:35:40  ablack           set     messages: + msg374207
2020-07-05 00:22:53  joseph.hackman   set     nosy: + joseph.hackman
                                              messages: + msg373006
2020-07-02 16:54:51  sdbowman         set     nosy: + sdbowman
                                              messages: + msg372865
2018-03-18 20:35:56  ned.deily        set     nosy: - ned.deily
2018-03-17 18:39:35  r.david.murray   set     nosy: + r.david.murray
2018-03-06 12:35:41  agnosticdev      set     nosy: + agnosticdev
                                              messages: + msg313323
2018-03-02 21:46:33  ablack           set     messages: + msg313164
2018-03-02 21:32:44  ned.deily        set     versions: + Python 3.7, Python 3.8
                                              type: crash ->
                                              nosy: + ned.deily
                                              title: Urllib proxy_bypass crashes for urls containing long basic auth strings -> socket module calls with long host names can fail with idna codec error
                                              messages: + msg313163
                                              stage: needs patch
2018-02-26 19:52:35  ablack           create