socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names #53623

Open
baikie mannequin opened this issue Jul 25, 2010 · 52 comments
Labels: extension-modules (C modules in the Modules dir), type-bug (An unexpected behavior, bug, or error)

Comments

@baikie
Mannequin

baikie mannequin commented Jul 25, 2010

BPO 9377
Nosy @malemburg, @loewis, @amauryfa, @vstinner, @ezio-melotti, @bitdancer, @zooba
Files
  • ascii-surrogateescape.diff: Decode hostnames as ASCII/surrogateescape rather than UTF-8
  • try-surrogateescape-first.diff: Accept ASCII/surrogateescape strings as hostname arguments
  • uname-surrogateescape.diff: In posix.uname(), decode nodename as ASCII/surrogateescape
  • ascii-surrogateescape-2.diff: Renamed unicode_from_hostname -> decode_hostname
  • try-surrogateescape-first-2.diff: Made various small changes
  • try-surrogateescape-first-3.diff: Fixed a couple of mistakes
  • try-surrogateescape-first-4.diff
  • try-surrogateescape-first-getnameinfo-4.diff
  • decode-strict-ascii.diff: Decode hostnames strictly as ASCII
  • hostname-bytes-apis.diff: Add name resolution APIs that return names as bytes (applies on top of decode-strict-ascii.diff)
  • return-ascii-surrogateescape-2015-06-25.diff
  • accept-ascii-surrogateescape-2015-06-25.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2010-07-25.18:33:02.681>
    labels = ['extension-modules', 'type-bug']
    title = 'socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names'
    updated_at = <Date 2016-01-28.01:05:18.521>
    user = 'https://bugs.python.org/baikie'

    bugs.python.org fields:

    activity = <Date 2016-01-28.01:05:18.521>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2010-07-25.18:33:02.681>
    creator = 'baikie'
    dependencies = []
    files = ['18195', '18196', '18259', '18272', '18273', '18609', '18616', '18617', '18674', '18676', '39812', '39813']
    hgrepos = []
    issue_num = 9377
    keywords = ['patch']
    message_count = 52.0
    messages = ['111550', '111766', '111985', '112094', '114688', '114710', '114754', '114756', '114847', '114882', '115014', '115030', '115116', '115119', '115185', '115186', '115187', '118582', '118602', '118617', '118694', '118709', '118816', '118952', '119051', '119076', '119177', '119230', '119231', '119245', '119260', '119271', '119346', '119837', '119918', '119925', '119927', '119928', '119929', '119935', '119941', '119943', '119946', '120081', '158118', '158165', '158175', '158178', '159776', '243311', '245826', '259079']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'vstinner', 'baikie', 'ezio.melotti', 'r.david.murray', 'jesterKing', 'spaun2002', 'steve.dower', 'Almad']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue9377'
    versions = ['Python 3.2']

    The functions in the socket module which return host/domain
    names, such as gethostbyaddr() and getnameinfo(), are wrappers
    around byte-oriented interfaces but return Unicode strings in
    3.x, and have not been updated to deal with undecodable byte
    sequences in the results, as discussed in PEP-383.

    Some DNS resolvers do discard hostnames not matching the
    ASCII-only RFC 1123 syntax, but checks for this may be absent or
    turned off, and non-ASCII bytes can be returned via other lookup
    facilities such as /etc/hosts.

    Currently, names are converted to str objects using
    PyUnicode_FromString(), i.e. by attempting to decode them as
    UTF-8. This can fail with UnicodeError of course, but even if it
    succeeds, any non-ASCII names returned will fail to round-trip
    correctly because most socket functions encode string arguments
    into IDNA ASCII-compatible form before using them. For example,
    with UTF-8 encoded entries

    127.0.0.2 €
    127.0.0.3 xn--lzg

    in /etc/hosts, I get:

    Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21) 
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
    Type "help", "copyright", "credits" or "license" for more
    information.
    >>> from socket import *
    >>> getnameinfo(("127.0.0.2", 0), 0)
    ('€', '0')
    >>> getaddrinfo(*_)
    [(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]

    Here, getaddrinfo() has encoded "€" to its corresponding ACE
    label "xn--lzg", which maps to a different address.
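The asymmetry can be reproduced without editing /etc/hosts. A minimal sketch, assuming only the standard "idna" codec; the byte string is the UTF-8 encoding of "€" as it would appear in the hosts file, and the variable names are illustrative:

```python
# Bytes as stored in /etc/hosts for the name "€" (UTF-8 encoded).
raw_name = b"\xe2\x82\xac"

# Output side: the socket module currently decodes returned names as UTF-8.
decoded = raw_name.decode("utf-8")

# Input side: str hostname arguments are encoded with the IDNA codec.
reencoded = decoded.encode("idna")

print(decoded)                # €
print(reencoded)              # b'xn--lzg'
print(reencoded == raw_name)  # False: the name does not round-trip
```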

    PEP-383 can't be applied as-is here, since if the name happened
    to be decodable in the file system encoding (and thus was
    returned as valid non-ASCII Unicode), the result would fail to
    round-trip correctly as shown above, but I think there is a
    solution which follows the general idea of PEP-383.

    Surrogate characters are not allowed in IDNs, since they are
    prohibited by Nameprep[1][2], so if names were instead decoded as
    ASCII with the surrogateescape error handler, strings
    representing non-ASCII names would always contain surrogate
    characters representing the non-ASCII bytes, and would therefore
    fail to encode with the IDNA codec. Thus there would be no
    ambiguity between these strings and valid IDNs. The attached
    ascii-surrogateescape.diff does this.
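The decoding step can be tried at the Python level. This is only a sketch of the idea using the built-in codecs, not the C code in the patch:

```python
# Non-ASCII bytes of a hostname (here, the UTF-8 encoding of "€").
raw_name = b"\xe2\x82\xac"

# Decode as ASCII, mapping each non-ASCII byte to a lone surrogate.
escaped = raw_name.decode("ascii", "surrogateescape")
print(ascii(escaped))  # '\udce2\udc82\udcac'

# The IDNA codec refuses strings containing surrogates, so such a
# string cannot be mistaken for an internationalized domain name.
try:
    escaped.encode("idna")
except UnicodeError as exc:
    print("rejected by the idna codec:", type(exc).__name__)

# Encoding with the same codec and handler recovers the original bytes.
print(escaped.encode("ascii", "surrogateescape") == raw_name)  # True
```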

    The returned strings could then be made to round-trip as
    arguments, by having functions that take hostname arguments
    attempt to encode them using ASCII/surrogateescape first before
    trying IDNA encoding. Since IDNA leaves ASCII names unchanged
    and surrogate characters are not allowed in IDNs, this would not
    change the interpretation of any string hostnames that are
    currently accepted. It would remove the 63-octet limit on label
    length currently imposed due to the IDNA encoding, for ASCII
    names only, but since this is imposed due to the 63-octet limit
    of the DNS, and non-IDN names may be intended for other
    resolution mechanisms, I think this is a feature, not a bug :)
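The argument conversion described above can be sketched in Python as follows. The function name `hostname_to_bytes_sketch` is hypothetical and the real converter is written in C, so treat this as an illustration of the ordering of the two encoding attempts rather than as the patch itself:

```python
def hostname_to_bytes_sketch(name):
    """Hypothetical Python rendering of the proposed hostname conversion."""
    if isinstance(name, bytes):
        # Bytes arguments are passed through untouched.
        return name
    try:
        # Step 1: plain ASCII names and surrogate-escaped bytes.
        return name.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Step 2: genuine non-ASCII text is treated as an IDN.
        return name.encode("idna")


print(hostname_to_bytes_sketch("example.org"))         # b'example.org'
print(hostname_to_bytes_sketch("\udce2\udc82\udcac"))  # b'\xe2\x82\xac'
print(hostname_to_bytes_sketch("€"))                   # b'xn--lzg'

# A 64-character ASCII label is rejected by the idna codec, but passes
# here because step 1 succeeds before IDNA is ever consulted.
print(len(hostname_to_bytes_sketch("a" * 64)))         # 64
```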

    The patch try-surrogateescape-first.diff implements the above for
    all relevant interfaces, including gethostbyaddr() and
    getnameinfo(), which do currently accept hostnames, even if the
    documentation is vague (in the standard library, socket.fqdn()
    calls gethostbyaddr() with a hostname, and the "os" module docs
    suggest calling socket.gethostbyaddr(socket.gethostname()) to get
    the fully-qualified hostname).

    The patch still allows hostnames to be passed as bytes objects,
    but to simplify the implementation, it removes support for
    bytearray (as has been done for pathnames in 3.2). Bytearrays
    are currently only accepted by the socket object methods
    (.connect(), etc.), and this is undocumented and perhaps
    unintentional - the get*() functions have never accepted them.

    One problem with the surrogateescape scheme would be with
    existing code that looks up an address and then tries to write
    the hostname to a log file or use it as part of the wire
    protocol, since the surrogate characters would fail to encode as
    ASCII or UTF-8, but the code would appear to work normally until
    it encountered a non-ASCII hostname, allowing the problem to go
    undetected.
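To illustrate the failure mode, and two ways calling code could cope with it, here is a small sketch. The choice of the "backslashreplace" handler for log output is my own suggestion, not something the patches do:

```python
# A hostname as the patched functions would return it.
name = b"\xe2\x82\xac".decode("ascii", "surrogateescape")

# Strict encoding, as a log file or wire protocol would typically use, fails.
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 encoding fails")

# Lossless: recover the bytes that the resolver actually returned.
print(name.encode("utf-8", "surrogateescape"))  # b'\xe2\x82\xac'

# Printable: escape the surrogates so the name can be logged safely.
print(name.encode("ascii", "backslashreplace").decode("ascii"))  # \udce2\udc82\udcac
```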

    On the other hand, such code is probably broken as things stand,
    given that the address lookup functions can already raise
    UnicodeError, undocumented, in the same situation. Also, protocol definitions
    often specify some variant of the RFC 1123 syntax for hostnames
    (thus making non-ASCII bytes illegal), so code that checked for
    this prior to encoding the name would probably be OK, but such
    code is more likely the exception than the rule.

    An alternative approach might be to return all hostnames as bytes
    objects, thus breaking everything immediately and obviously...

    [1] http://tools.ietf.org/html/rfc3491#section-5
    [2] http://tools.ietf.org/html/rfc3454#appendix-C.5

    @baikie baikie mannequin added extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Jul 25, 2010
    @vstinner
    Member

    I like the idea of using PEP-383 for hostnames, but I don't understand the relation to IDNA (maybe because I don't know this encoding).

    +this leaves IDNA ASCII-compatible encodings in ASCII
    +form, but converts any non-ASCII bytes in the hostname to the Unicode
    +lone surrogate codes U+DC80...U+DCFF.

    What is an "IDNA ASCII-compatible encoding"?

    --

    ascii-surrogateescape.diff:

    • I don't like unicode_from_hostname() name: "decode_hostname()" would be better.
    • It doesn't patch the doc and so cannot be applied alone. It doesn't matter, it's better to apply both patches at the same time. But thanks for splitting them, it makes them easier to review :-)

    try-surrogateescape-first.diff:

    • hostname_to_bytes() should be called "encode_hostname()"
    • if (!PyErr_ExceptionMatches(PyExc_UnicodeError)): you should catch UnicodeEncodeError here
    • "if this is not possible, :exc:`UnicodeError` is raised.": is it a UnicodeEncodeError?
    • use PyUnicode_AsEncodedString() instead of PyUnicode_AsEncodedObject(): it's faster for ASCII and ensures that the result is a bytes object (so you don't need to re-check the type)

    @baikie
    Mannequin Author

    baikie mannequin commented Jul 29, 2010

    "Leaving IDNA ASCII-compatible encodings in ASCII form" is just preserving the existing behaviour (not doing IDNA decoding). See

    http://tools.ietf.org/html/rfc3490

    and the docs for codecs -> encodings.idna ("xn--lzg" in the example is the ASCII-compatible encoding of "€", so if you look up that IP address, "xn--lzg" is returned with or without the patch).

    I'll look into your other comments. In the meantime, I've got one more patch, as the decoding of the nodename field in os.uname() also needs to be changed to match the other hostname-returning functions. This patch changes it to ASCII/surrogateescape, with the usual PEP-383 decoding for the other fields.

    @baikie
    Mannequin Author

    baikie mannequin commented Jul 30, 2010

    OK, here are new versions of the original patches.

    I've tweaked the docs to make clear that ASCII-compatible
    encodings actually *are* ASCII, and point to an explanation as
    soon as they're mentioned.

    You're right that PyUnicode_AsEncodedString() is the preferable
    interface for the argument converter (I think I got
    PyUnicode_AsEncodedObject() from an old version of
    PyUnicode_FSConverter() :/), but for the ASCII step I've just
    short-circuited it and used PyUnicode_EncodeASCII() directly,
    since the converter has already checked that the object is of
    Unicode type. For the IDNA step, PyUnicode_AsEncodedString()
    should result in a less confusing error message if the codec
    returns some non-bytes object one day.

    However, the PyBytes_Check isn't to check up on the codec, but to
    check for a bytes argument, which the converter also supports.
    For that reason, I think encode_hostname would be a misleading
    name, but I've renamed it hostname_converter after the example of
    PyUnicode_FSConverter, and renamed unicode_from_hostname to
    decode_hostname.

    I've also made the converter check for UnicodeEncodeError in the
    ASCII step, but the end result really is UnicodeError if the IDNA
    step fails, because the "idna" codec does not use
    UnicodeEncodeError or UnicodeDecodeError. Complain about that if
    you wish :)

    I think the example I gave in the previous comment was also
    confusing, so just to be clear...

    In /etc/hosts (in UTF-8 encoding):

    127.0.0.2 €
    127.0.0.3 xn--lzg

    Without patches:

    >>> from socket import *
    >>> getnameinfo(("127.0.0.3", 0), 0)
    ('xn--lzg', '0')
    >>> getnameinfo(("127.0.0.2", 0), 0)
    ('€', '0')
    >>> getaddrinfo(*_)
    [(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
    >>> '€'.encode("idna")
    b'xn--lzg'

    With patches:

    >>> from socket import *
    >>> getnameinfo(("127.0.0.3", 0), 0)
    ('xn--lzg', '0')
    >>> getnameinfo(("127.0.0.2", 0), 0)
    ('\udce2\udc82\udcac', '0')
    >>> getaddrinfo(*_)
    [(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
    >>> '\udce2\udc82\udcac'.encode("idna")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
        result.extend(ToASCII(label))
      File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
        label = nameprep(label)
      File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
        raise UnicodeError("Invalid character %r" % c)
    UnicodeError: Invalid character '\udce2'

    The exception at the end demonstrates why surrogateescape strings
    don't get confused with IDNs.

    @baikie
    Mannequin Author

    baikie mannequin commented Aug 22, 2010

    I noticed that try-surrogateescape-first.diff missed out one of
    the string references that needed to be changed to point to the
    bytes object, and also used PyBytes_AS_STRING() in an unlocked
    section. This version fixes these things by taking the generally
    safer approach of setting the original char * variable to the
    hostname immediately after using hostname_converter().

    @loewis
    Mannequin

    loewis mannequin commented Aug 22, 2010

    Is this patch in response to an actual problem, or a theoretical problem?
    If "actual problem": what was the specific application, and what was the specific host name?

    If theoretical, I recommend closing it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".
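An application following this advice might wrap the lookup roughly as below. This is a sketch only; `reverse_name_or_none` is a hypothetical helper, not an existing socket function:

```python
import socket


def reverse_name_or_none(address):
    """Return the reverse-resolved name for address, or None if unusable."""
    try:
        return socket.gethostbyaddr(address)[0]
    except UnicodeError:
        # The name could not be decoded: treat it as "no reverse name available".
        return None
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return None


print(reverse_name_or_none("127.0.0.1"))
```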

    @baikie
    Mannequin Author

    baikie mannequin commented Aug 23, 2010

    > Is this patch in response to an actual problem, or a theoretical problem?
    > If "actual problem": what was the specific application, and what was the specific host name?

    It's about environments, not applications - the local network may
    be configured with non-ASCII bytes in hostnames (either in the
    local DNS *or* a different lookup mechanism - I mentioned
    /etc/hosts as a simple example), or someone might deliberately
    connect from a garbage hostname as a denial of service attack
    against a server which tries to look it up with gethostbyaddr()
    or whatever (this may require a "non-strict" resolver library, as
    noted above).

    > If theoretical, I recommend closing it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".

    There are two points here. One is that the decoding can fail; I
    do think that programmers will find this surprising, and the fact
    that Python refuses to return what was actually received is a
    regression compared to 2.x.

    The other is that the encoding and decoding are not symmetric -
    hostnames are being decoded with UTF-8 but encoded with IDNA.
    That means that when a decoded hostname contains a non-ASCII
    character which is not prohibited by IDNA/Nameprep, that string
    will, when used in a subsequent call, not refer to the hostname
    that was actually received, because it will be re-encoded using a
    different codec.

    Attaching a refreshed version of try-surrogateescape-first.diff.
    I've separated out the change to getnameinfo() as it may be
    superfluous (issue bpo-1027206).

    @loewis
    Mannequin

    loewis mannequin commented Aug 23, 2010

    >> Is this patch in response to an actual problem, or a theoretical problem?
    >> If "actual problem": what was the specific application, and what was the specific host name?

    > It's about environments, not applications

    Still, my question remains. Is it a theoretical problem (i.e. one
    of your imagination), or a real one (i.e. one you observed in real
    life, without explicitly triggering it)? If real: what was the
    specific environment, and what was the specific host name?

    > There are two points here. One is that the decoding can fail; I
    > do think that programmers will find this surprising, and the fact
    > that Python refuses to return what was actually received is a
    > regression compared to 2.x.

    True. However, I think this is an acceptable regression,
    assuming the problem is merely theoretical. It is ok if
    an operation fails that you will never run into in real life.

    > That means that when a decoded hostname contains a non-ASCII
    > character which is not prohibited by IDNA/Nameprep, that string
    > will, when used in a subsequent call, not refer to the hostname
    > that was actually received, because it will be re-encoded using a
    > different codec.

    Again, I fail to see the problem in this. It won't happen in
    real life. However, if you are worried that this could be abused,
    I think it should decode host names as ASCII, not as UTF-8.
    Then it will be symmetric again (IIUC).

    @loewis loewis mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 23, 2010
    @baikie
    Mannequin Author

    baikie mannequin commented Aug 24, 2010

    >> It's about environments, not applications

    > Still, my question remains. Is it a theoretical problem (i.e. one
    > of your imagination), or a real one (i.e. one you observed in real
    > life, without explicitly triggering it)? If real: what was the
    > specific environment, and what was the specific host name?

    Yes, I did reproduce the problem on my own system (Ubuntu 8.04).
    No, it is not from a real application, nor do I know anyone with
    their network configured like this (except possibly Dan "djbdns"
    Bernstein: http://cr.yp.to/djbdns/idn.html ).

    I reported this bug to save anyone who *is* in such an
    environment from crashing applications and erroneous name
    resolution.

    >> That means that when a decoded hostname contains a non-ASCII
    >> character which is not prohibited by IDNA/Nameprep, that string
    >> will, when used in a subsequent call, not refer to the hostname
    >> that was actually received, because it will be re-encoded using a
    >> different codec.

    > Again, I fail to see the problem in this. It won't happen in
    > real life. However, if you are worried that this could be abused,
    > I think it should decode host names as ASCII, not as UTF-8.
    > Then it will be symmetric again (IIUC).

    That would be an improvement. The idea of the patches I posted
    is to combine this with the existing surrogateescape mechanism,
    which handles situations like this perfectly well. I don't see
    how getting a UnicodeError is better than getting a string with
    some lone surrogates in it. Indeed, my understanding of PEP-383
    is that it is better to get the lone surrogates.

    @baikie baikie mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 24, 2010
    @loewis
    Mannequin

    loewis mannequin commented Aug 25, 2010

    > That would be an improvement. The idea of the patches I posted
    > is to combine this with the existing surrogateescape mechanism,
    > which handles situations like this perfectly well.

    The surrogateescape mechanism is a very hackish approach, and
    violates the principle that errors should never pass silently.
    However, it solves a real problem - people do run into the problem
    with file names every day. With this problem, I'd say "if it hurts,
    don't do it, then".

    @loewis loewis mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 25, 2010
    @baikie
    Mannequin Author

    baikie mannequin commented Aug 26, 2010

    > The surrogateescape mechanism is a very hackish approach, and
    > violates the principle that errors should never pass silently.

    I don't see how a name resolution API returning non-ASCII bytes
    would indicate an error. If the host table contains a non-ASCII
    byte sequence for a host, then that is the host's name - it works
    just as well as an ASCII name, both forwards and backwards.

    What is hackish is representing char * data as a Unicode string
    when there is no native Unicode API to feed it to - there is no
    issue here such as file names being bytes on Unix and Unicode on
    Windows, so the clean thing to do would be to return a bytes
    object. I suggested the surrogateescape mechanism in order to
    retain backwards compatibility.

    > However, it solves a real problem - people do run into the problem
    > with file names every day. With this problem, I'd say "if it hurts,
    > don't do it, then".

    But to be more explicit, that's like saying "if it hurts, get
    your sysadmin to reconfigure the company network".

    @baikie baikie mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 26, 2010
    @loewis
    Mannequin

    loewis mannequin commented Aug 26, 2010

    > I don't see how a name resolution API returning non-ASCII bytes
    > would indicate an error.

    It's in violation of RFC 952 (slightly relaxed by RFC 1123).

    > But to be more explicit, that's like saying "if it hurts, get
    > your sysadmin to reconfigure the company network".

    Which I consider perfectly reasonable. The sysadmin should have
    known (and, in practice, *always* knows) not to do that in the first
    place (the larger the company, the more cautious the sysadmin).

    @loewis loewis mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 26, 2010
    @baikie
    Mannequin Author

    baikie mannequin commented Aug 27, 2010

    >> I don't see how a name resolution API returning non-ASCII bytes
    >> would indicate an error.

    > It's in violation of RFC 952 (slightly relaxed by RFC 1123).

    That's bad if it's on the public Internet, but it's not an
    error. The OS is returning the name by which it knows the host.

    If you look at POSIX, you'll see that what getaddrinfo() and
    getnameinfo() look up and return is referred to as a "node name",
    which can be an address string or a "descriptive name", and that
    if used with Internet address families, descriptive names
    "include" host names. It doesn't say that the string can only be
    an address string or a hostname (RFC 1123 compliant or
    otherwise).

    >> But to be more explicit, that's like saying "if it hurts, get
    >> your sysadmin to reconfigure the company network".

    > Which I consider perfectly reasonable. The sysadmin should have
    > known (and, in practice, *always* knows) not to do that in the first
    > place (the larger the company, the more cautious the sysadmin).

    It's not reasonable when addressed to a customer who might go
    elsewhere. And I still don't see a technical reason for making
    such a demand. Python 2.x seems to work just fine using 8-bit
    strings.

    @baikie baikie mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 27, 2010
    @loewis
    Mannequin

    loewis mannequin commented Aug 27, 2010

    > It's not reasonable when addressed to a customer who might go
    > elsewhere.

    I remain -1 on this change, until such a customer actually shows
    up at a Python developer.

    @loewis loewis mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Aug 27, 2010
    @baikie
    Mannequin Author

    baikie mannequin commented Aug 29, 2010

    OK, I still think this issue should be addressed, but here is a patch for the part we agree on: that decoding should not return any Unicode characters except ASCII.
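In Python terms, the behaviour this patch aims for is roughly the following. This is a sketch with a hypothetical helper name; the actual change is in the C code of the socket module:

```python
def decode_hostname_strict(raw):
    """Hypothetical helper: accept only pure-ASCII names."""
    return raw.decode("ascii")


# An ASCII-compatible encoding is plain ASCII and decodes unchanged.
print(decode_hostname_strict(b"xn--lzg"))  # xn--lzg

# Any non-ASCII byte now raises instead of yielding non-ASCII text.
try:
    decode_hostname_strict(b"\xe2\x82\xac")
except UnicodeDecodeError:
    print("non-ASCII name rejected")
```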

    @baikie
    Mannequin Author

    baikie mannequin commented Aug 29, 2010

    The rest of the issue could also be straightforwardly addressed by adding bytes versions of the name lookup APIs. Attaching a patch which does that (applies on top of decode-strict-ascii.diff).

    @baikie
    Mannequin Author

    baikie mannequin commented Aug 29, 2010

    Oops, forgot to refresh the last change into that patch. This should fix it.

    @jesterKing
    Mannequin

    jesterKing mannequin commented Oct 13, 2010

    platform.system() fails with UnicodeEncodeError on systems whose computer name contains non-ASCII characters. The implementation of platform.system() at some point calls socket.gethostname() (see http://www.pasteall.org/16215 for a stack trace of such usage).

    Many of our Blender users are not native English speakers, and they set up their machines as they please, whether that goes against the RFCs or not.

    This currently breaks code that uses platform.system() to check which system it is running on. The paste above is from a user who has named his computer Nötkötti.

    It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably in that release.

    @baikie
    Mannequin Author

    baikie mannequin commented Oct 13, 2010

    > platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ascii characters. The implementation of platform.system() uses at some point socket.gethostname() ( see http://www.pasteall.org/16215 for a stacktrace of such usage)

    This trace is from a Windows system, where the platform module
    uses gethostname() in its cross-platform uname() function, which
    platform.system() and various other functions in the module rely
    on. On a Unix system, platform.uname() depends on os.uname()
    working, meaning that these functions still fail when the
    hostname cannot be decoded, as it is part of os.uname()'s return
    value.

    Given that os.uname() is a primary source of information about
    the platform on Unix systems, this sort of collateral damage from
    an undecodable hostname is likely to occur in more places.

    > It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.

    If you'd like to try the surrogateescape patches, they ought to
    fix this. The relevant patches are ascii-surrogateescape-2.diff,
    try-surrogateescape-first-4.diff and uname-surrogateescape.diff.

    @baikie baikie mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Oct 13, 2010
    @loewis
    Mannequin

    loewis mannequin commented Oct 14, 2010

    The failure of platform.uname is an independent bug. IMO, it shouldn't use socket.gethostname on Windows, but instead look at the COMPUTERNAME environment variable or call the GetComputerName API function. This is closer to what uname() does on Unix (i.e. retrieve the local machine name independently of DNS).
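A rough sketch of the environment-variable variant, assuming a hypothetical helper name `local_machine_name`; this is not how the platform module is actually written:

```python
import os
import sys


def local_machine_name():
    """Return the local machine name without consulting the resolver."""
    if sys.platform == "win32":
        # Set by Windows itself; may be None in a stripped-down environment.
        return os.environ.get("COMPUTERNAME")
    # On Unix, uname() reports the node name held by the kernel.
    return os.uname().nodename


print(local_machine_name())
```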

    I have created bpo-10097 for this bug.

    @baikie baikie mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Oct 20, 2010
    @loewis
    Mannequin

    loewis mannequin commented Oct 21, 2010

    > Sorry, I didn't mean how Windows constructs the result for the
    > "A" interface - I was talking about Python code being able to map
    > the result from the Unicode interface to the form used in the
    > protocol (e.g. DNS). I believe the proposal is to use the DNS
    > name

    I disagree with the proposal - it should return whatever
    name gethostname from winsock.dll returns (which I expect
    to be the netbios name).

    > so since the DNS is byte oriented, I would have thought
    > that the Unicode "DNS name" result would always have a bytes
    > equivalent that the DNS resolver code would use - perhaps its
    > UTF-8 encoding?

    No no no. When Microsoft calls it the DNS name, they don't actually
    mean that it has to do anything with DNS. In particular, it's not
    byte-oriented.

    @loewis loewis mannequin changed the title socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names Oct 21, 2010
    @malemburg
    Member

    Martin v. Löwis wrote:

    Martin v. Löwis <martin@v.loewis.de> added the comment:

    >> Sorry, I didn't mean how Windows constructs the result for the
    >> "A" interface - I was talking about Python code being able to map
    >> the result from the Unicode interface to the form used in the
    >> protocol (e.g. DNS). I believe the proposal is to use the DNS
    >> name

    > I disagree with the proposal - it should return whatever
    > name gethostname from winsock.dll returns (which I expect
    > to be the netbios name).

    >> so since the DNS is byte oriented, I would have thought
    >> that the Unicode "DNS name" result would always have a bytes
    >> equivalent that the DNS resolver code would use - perhaps its
    >> UTF-8 encoding?

    > No no no. When Microsoft calls it the DNS name, they don't actually
    > mean that it has to do anything with DNS. In particular, it's not
    > byte-oriented.

    Just to clarify: I was proposing to use the
    GetComputerNameExW() win32 API with ComputerNamePhysicalDnsHostname,
    which returns Unicode without needing any roundtrip via bytes
    and the issues associated with this.
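For reference, calling that API from Python could look roughly like this. It is an untested ctypes sketch: the helper name is hypothetical, the constant 5 is my reading of ComputerNamePhysicalDnsHostname in the COMPUTER_NAME_FORMAT enumeration, and on non-Windows platforms the function simply returns None:

```python
import ctypes
import sys

# Assumed value of ComputerNamePhysicalDnsHostname in COMPUTER_NAME_FORMAT.
COMPUTER_NAME_PHYSICAL_DNS_HOSTNAME = 5


def physical_dns_hostname():
    """Return the Unicode host name from GetComputerNameExW, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    size = ctypes.c_ulong(0)
    # First call with a NULL buffer: fails, but reports the required size.
    kernel32.GetComputerNameExW(
        COMPUTER_NAME_PHYSICAL_DNS_HOSTNAME, None, ctypes.byref(size))
    buf = ctypes.create_unicode_buffer(size.value)
    # Second call fills the buffer with a wide (UTF-16) string.
    if not kernel32.GetComputerNameExW(
            COMPUTER_NAME_PHYSICAL_DNS_HOSTNAME, buf, ctypes.byref(size)):
        raise ctypes.WinError()
    return buf.value


print(physical_dns_hostname())
```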

    I don't understand why Martin insists that the MS "DNS name"
    doesn't have anything to do with DNS... the fully qualified
    DNS name of a machine is determined as hostname.domainname,
    just like you would expect in DNS.

    http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx
    http://msdn.microsoft.com/en-us/library/ms724224(v=VS.85).aspx

    As I said earlier: NetBIOS is being phased out in favor of
    DNS. MS is using a convention which mandates that NetBIOS names
    match DNS names. The only difference between the two is that
    NetBIOS names have a length limitation:

    http://msdn.microsoft.com/en-us/library/ms724931(v=VS.85).aspx

    Perhaps Martin could clarify why he insists on using the
    ANSI WinSock interface gethostname instead.

    PS: WinSock provides many other Unicode APIs for socket
    module interfaces as well, so at least for that platform,
    we could use those to resolve uncertainties about the
    encoding used in name resolution.

    On other platforms, I guess we'll just have to do some trial
    and error to see what works and what not. E.g. on Linux it is
    possible to set the hostname to a non-ASCII value, but then
    the resolver returns an error, so it's not very practical:

    # hostname l\303\266wis
    # hostname
    löwis
    # hostname -f
    hostname: Resolver Error 0 (no error)

    Using the IDNA version doesn't help either:

    # hostname xn--lwis-5qa
    # hostname
    xn--lwis-5qa
    # hostname -f
    hostname: Resolver Error 0 (no error)

    Python2 happily returns the host name, but fails to return
    a fully qualified domain name:

    >>> socket.gethostname()
    'l\xc3\xb6wis'
    >>> socket.getfqdn()
    'l\xc3\xb6wis'

    and

    >>> socket.gethostname()
    'xn--lwis-5qa'
    >>> socket.getfqdn()
    'xn--lwis-5qa'

    Just for comparison:

    # hostname newton
    # hostname
    newton
    # hostname -f
    newton.egenix.internal

    and

    >>> socket.gethostname()
    'newton'
    >>> socket.getfqdn()
    'newton.egenix.internal'

    So at least on Linux, using non-ASCII hostnames doesn't really
    appear to be an option at this time.

    @baikie
    Mannequin Author

    baikie mannequin commented Oct 21, 2010

    > On other platforms, I guess we'll just have to do some trial
    > and error to see what works and what not. E.g. on Linux it is
    > possible to set the hostname to a non-ASCII value, but then
    > the resolver returns an error, so it's not very practical:
    >
    > # hostname l\303\266wis
    > # hostname
    > löwis
    > # hostname -f
    > hostname: Resolver Error 0 (no error)
    >
    > Using the IDNA version doesn't help either:
    >
    > # hostname xn--lwis-5qa
    > # hostname
    > xn--lwis-5qa
    > # hostname -f
    > hostname: Resolver Error 0 (no error)

    I think what's happening here is simply that you're setting
    the hostname to something which doesn't exist in the relevant
    name databases - the man page for Linux's hostname(1) says that
    "The FQDN is the name gethostbyname(2) returns for the host name
    returned by gethostname(2).". If the computer's usual name is
    "newton", that may be why it works and the others don't.

    It works for me if I add "127.0.0.9 löwis.egenix.com löwis" to
    /etc/hosts and then set the hostname to "löwis" (all UTF-8):
    hostname -f prints "löwis.egenix.com", and Python 2's
    socket.getfqdn() returns the corresponding bytes; non-UTF-8 names
    work too. (Note that the FQDN must appear before the bare
    hostname in the /etc/hosts entry, and I used the address
    127.0.0.9 simply to avoid a collision with existing entries - by
    default, Ubuntu assigns the FQDN to 127.0.1.1.)

    @bitdancer
    Member

    Looks like we have our first customer (bpo-10223).

    @loewis
    Mannequin

    loewis mannequin commented Oct 29, 2010

    I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset). After the reboot, I found

    • COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
    • GetComputerNameEx(5) is "π3141"
    • socket.gethostname of Python 2.5 returns "p3141".

    So my theory of how this all fits together is this:

    1. it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.

    2. gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).

    3. the NetBIOS name is generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.

    In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.

    Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".
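
    Two pieces of the theory above can be checked in plain Python. The sketch below is illustrative only: the helper names are hypothetical, the 15-character limit is the Win32 MAX_COMPUTERNAME_LENGTH constant, and the uppercase-and-truncate rule is the guess from point 3, not documented behaviour. Python's strict cp1252 codec does not perform the "π" to "p" substitution that gethostname showed, so the sketch only confirms that "π" has no cp1252 encoding.

```python
# Win32 MAX_COMPUTERNAME_LENGTH: maximum length of a NetBIOS computer name.
MAX_COMPUTERNAME_LENGTH = 15


def in_code_page(name, code_page="cp1252"):
    """Return True if every character of name exists in the given code page."""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


def guess_netbios_name(ansi_hostname):
    """Apply the guessed rule: uppercase the ANSI name, then truncate it."""
    return ansi_hostname.upper()[:MAX_COMPUTERNAME_LENGTH]
```

    Here in_code_page("π3141") is false while in_code_page("e2718") is true, and guess_netbios_name("p3141") gives "P3141", which matches the COMPUTERNAME value observed after the reboot.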

    @loewis
    Mannequin

    loewis mannequin commented Oct 29, 2010

    r85934 now uses GetComputerNameExW on Windows.

    @malemburg
    Member

    Martin v. Löwis wrote:

    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    > I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset). After the reboot, I found
    >
    > • COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
    > • GetComputerNameEx(5) is "π3141"
    > • socket.gethostname of Python 2.5 returns "p3141".
    >
    > So my theory of how this all fits together is this:
    >
    > 1. it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.

    The MS docs mention that setting the DNS name will adjust the NetBIOS name
    as well (with the NetBIOS name being converted to upper case and truncated,
    if the DNS name is too long).

    They don't mention anything about the NetBIOS name encoding.

    > 2. gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).
    >
    > 3. the NetBIOS name is generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.
    >
    > In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.
    >
    > Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".

    The DNS name of the Windows machine is the combination of the DNS host
    name and the DNS domain that you setup on the machine. I think the
    misunderstanding is that you assume this combination will
    somehow appear as known DNS name of the machine via some
    DNS server on the network - that's not the case.

    Of course, it's not particularly useful to set the DNS name to
    something that other machines cannot find out via a DNS query.

    FWIW, you can do the same on a Linux box, i.e. setup the host name
    and domain to some completely bogus values. And as David pointed out,
    without also updating the /etc/hosts on the Linux, you always get the
    resolver error with hostname -f I mentioned earlier on (which does
    a DNS lookup), so there's no real connection to the DNS system on
    Linux either.

    @malemburg
    Member

    Martin v. Löwis wrote:

    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    > r85934 now uses GetComputerNameExW on Windows.

    Thanks, Martin.

    Here's a similar discussion of the Windows approach (used in bzr):

    https://bugs.launchpad.net/bzr/+bug/256550/comments/6

    This is what Solaris uses:

    http://developers.sun.com/dev/gadc/faq/locale.html#get-set

    (they require conversion to ASCII and using IDNA for non-ASCII
    names)

    I found this RFC draft on the topic:
    http://tools.ietf.org/html/draft-josefsson-getaddrinfo-idn-00
    which suggests that there is no standard for the encoding
    used by the socket host name APIs yet.

    ASCII, UTF-8 and IDNA are happily mixed and matched.

    @loewis
    Mannequin

    loewis mannequin commented Oct 29, 2010

    The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends to use IDNA), then this will work fine with the current code.

    The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case the socket.getaddrinfo uses IDNA, anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).

    If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.
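
    A minimal sketch of that decoding rule, for illustration only: the function names are hypothetical, and the encoding parameter defaults to sys.getfilesystemencoding() as suggested above. The surrogateescape error handler maps each undecodable byte to a lone surrogate, so the original bytes can always be recovered.

```python
import sys


def decode_hostname_bytes(raw, encoding=None):
    """Decode a raw hostname with the given (or file system) encoding.

    Undecodable bytes become lone surrogates instead of raising an error.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return raw.decode(encoding, "surrogateescape")


def encode_hostname_str(name, encoding=None):
    """Inverse of decode_hostname_bytes: turn escaped surrogates back into bytes."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return name.encode(encoding, "surrogateescape")
```

    With encoding="utf-8", the bytes b"l\xc3\xb6wis" decode to "löwis", while a name in some other encoding still decodes without error and encodes back to the same bytes. On Unix, os.fsdecode and os.fsencode apply the same encoding and error handler.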

    @malemburg
    Member

    Martin v. Löwis wrote:

    > Martin v. Löwis <martin@v.loewis.de> added the comment:
    >
    > The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends to use IDNA), then this will work fine with the current code.
    >
    > The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case the socket.getaddrinfo uses IDNA, anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).
    >
    > If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.

    Wouldn't it be better to also attempt to decode the name using IDNA
    in case the name starts with the IDNA prefix ?

    This would then also cover the Solaris case.

    @loewis
    Mannequin

    loewis mannequin commented Oct 29, 2010

    > The DNS name of the Windows machine is the combination of the DNS host
    > name and the DNS domain that you setup on the machine. I think the
    > misunderstanding is that you assume this combination will
    > somehow appear as known DNS name of the machine via some
    > DNS server on the network - that's not the case.

    I don't assume that - I merely point out that it clearly has no
    relationship to the DNS (unless you explicitly make it that way).
    So, I wonder why they call it the DNS name - they could have just
    as well called it the "LDAP name", or the "NIS name". In either case,
    setting the name would have no impact on the respective naming
    infrastructure.

    > FWIW, you can do the same on a Linux box, i.e. setup the host name
    > and domain to some completely bogus values. And as David pointed out,
    > without also updating the /etc/hosts on the Linux, you always get the
    > resolver error with hostname -f I mentioned earlier on (which does
    > a DNS lookup), so there's no real connection to the DNS system on
    > Linux either.

    Yes, but Linux (rightly) calls it the "hostname", not the "DNS name".

    @loewis
    Mannequin

    loewis mannequin commented Oct 29, 2010

    > Wouldn't it be better to also attempt to decode the name using IDNA
    > in case the name starts with the IDNA prefix ?

    Perhaps better - but incompatible. I don't see a way to have the
    resolver functions automatically decode IDNA, without potentially
    breaking existing applications that specifically look for the
    IDNA prefix (say).
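
    For comparison, an application that does want the decoded form could opt in explicitly instead of relying on the resolver functions. The following is a hedged sketch of that idea using Python's built-in "idna" codec; the helper name maybe_decode_idna is hypothetical and is not part of the socket module or of any patch on this issue.

```python
def maybe_decode_idna(name):
    """Decode IDNA (ACE) labels in name to Unicode; leave other names alone.

    Names without an "xn--" label, and names the codec cannot decode,
    are returned unchanged.
    """
    labels = name.split(".")
    if not any(label.lower().startswith("xn--") for label in labels):
        return name
    try:
        return name.encode("ascii").decode("idna")
    except UnicodeError:
        return name


# Build an ACE name with the codec itself rather than hard-coding one.
ace_name = "löwis".encode("idna").decode("ascii")
```

    maybe_decode_idna(ace_name) gives back "löwis", while a plain name such as "newton.egenix.internal" passes through untouched, so code that looks for the prefix keeps seeing the undecoded result unless it calls the helper.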

    @amauryfa
    Member

    The code in socketmodule.c currently compiles with suspicious warnings:

    socketmodule.c(3108) : warning C4047: 'function' : 'LPSTR' differs in levels of indirection from 'int'
    socketmodule.c(3108) : warning C4024: 'GetComputerNameA' : different types for formal and actual parameter 1
    socketmodule.c(3109) : warning C4133: 'function' : incompatible types - from 'Py_UNICODE *' to 'LPDWORD'
    socketmodule.c(3110) : warning C4020: 'GetComputerNameA' : too many actual parameters

    Was GetComputerName() used instead of GetComputerNameExW()?

    @baikie
    Mannequin Author

    baikie mannequin commented Oct 31, 2010

    > FWIW, you can do the same on a Linux box, i.e. setup the host name
    > and domain to some completely bogus values. And as David pointed out,
    > without also updating the /etc/hosts on the Linux, you always get the
    > resolver error with hostname -f I mentioned earlier on (which does
    > a DNS lookup), so there's no real connection to the DNS system on
    > Linux either.

    Just to clarify here: there isn't anything special about
    /etc/hosts; it's handled by a pluggable module which performs
    hostname lookups in it alongside a similar module for the DNS.
    glibc's Name Service Switch combines the views provided by the
    various modules into a single byte-oriented namespace for
    hostnames according to the settings in /etc/nsswitch.conf (this
    namespace allows non-ASCII bytes, as the /etc/hosts examples
    demonstrate).

    http://www.kernel.org/doc/man-pages/online/pages/man5/nsswitch.conf.5.html
    http://www.gnu.org/software/libc/manual/html_node/Name-Service-Switch.html

    It's an extensible system, so people can write their own modules
    to handle whatever name services they have to deal with, and
    configure hostname lookup to query them before, after or instead
    of the DNS. A hostname that is not resolvable in the DNS may be
    resolvable in one of these.

    @spaun2002
    Mannequin

    spaun2002 mannequin commented Apr 12, 2012

    I ran into this issue on my own PC. On a Russian version of Windows the default PC name is ИВАН-ПК (C8 C2 C0 CD 2D CF CA in hex), and gethostbyaddr (CRT) returns it in exactly this form (encoded with the system locale code page cp1251, not UTF-8). So when the function PyUnicode_FromString is called, it expects its argument to be a UTF-8 encoded string and raises an error.
    A lot of third-party modules use gethostbyaddr or getfqdn (which uses gethostbyaddr), so I can't just switch to a function that returns names as bytes. Surrogate-escaped names are also not acceptable, because the name mentioned above becomes ????-??

    @amauryfa
    Member

    Nick, which version of Python are you using? And which function are you running exactly?
    It seems that a4fd3dc74299 fixed the issue, this was included with 3.2.

    @spaun2002
    Mannequin

    spaun2002 mannequin commented Apr 12, 2012

    Originally I tried 3.2.2 (32bit), but I've just checked 3.2.3 and got the same.
    The code to reproduce it is simple:

    from socket import gethostbyaddr
    a = gethostbyaddr('127.0.0.1')
    leads to:
    Traceback (most recent call last):
      File "C:\Users\user\test\test.py", line 13, in <module>
        a = gethostbyaddr('127.0.0.1')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte

    Or more complex sample:

    def main():
        import http.server
        port = 80
        handlerClass = http.server.SimpleHTTPRequestHandler
        srv = http.server.HTTPServer(("", port), handlerClass )
        srv.serve_forever()
    if __name__ == "__main__":
        main()

    Attempting to connect to the server leads to:

    ----------------------------------------

    Exception happened during processing of request from ('127.0.0.1', 1156)
    Traceback (most recent call last):
      File "C:\Python32\lib\socketserver.py", line 284, in _handle_request_noblock
        self.process_request(request, client_address)
      File "C:\Python32\lib\socketserver.py", line 310, in process_request
        self.finish_request(request, client_address)
      File "C:\Python32\lib\socketserver.py", line 323, in finish_request
        self.RequestHandlerClass(request, client_address, self)
      File "C:\Python32\lib\socketserver.py", line 637, in __init__
        self.handle()
      File "C:\Python32\lib\http\server.py", line 396, in handle
        self.handle_one_request()
      File "C:\Python32\lib\http\server.py", line 384, in handle_one_request
        method()
      File "C:\Python32\lib\http\server.py", line 657, in do_GET
        f = self.send_head()
      File "C:\Python32\lib\http\server.py", line 701, in send_head
        self.send_response(200)
      File "C:\Python32\lib\http\server.py", line 438, in send_response
        self.log_request(code)
      File "C:\Python32\lib\http\server.py", line 483, in log_request
        self.requestline, str(code), str(size))
      File "C:\Python32\lib\http\server.py", line 517, in log_message
        (self.address_string(),
      File "C:\Python32\lib\http\server.py", line 559, in address_string
        return socket.getfqdn(host)
      File "C:\Python32\lib\socket.py", line 355, in getfqdn
        hostname, aliases, ipaddrs = gethostbyaddr(name)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte

    P.S. My PC name is "USER-ПК"
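
    The decoding failure in both tracebacks can be reproduced without a Windows machine. The sketch below assumes, as described earlier in this thread, that the resolver hands back the name encoded in cp1251 and that the result is then decoded as UTF-8; the variable names are illustrative.

```python
# The PC name as the resolver is assumed to return it: cp1251-encoded bytes.
raw = "USER-ПК".encode("cp1251")  # b"USER-\xcf\xca"

error = None
try:
    raw.decode("utf-8")  # what the traceback shows happening in gethostbyaddr
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the code page that actually produced the bytes succeeds.
recovered = raw.decode("cp1251")
```

    Here error.start is 5 and error.reason is "invalid continuation byte", the same position and reason as in the tracebacks above, while recovered is the original name.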

    @vstinner
    Member

    a4fd3dc74299 only fixed socket.gethostname(), not socket.gethostbyaddr().

    @loewis
    Mannequin

    loewis mannequin commented May 2, 2012

    For Windows versions that support it, we could use GetNameInfoW, available on XPSP2+, W2k3+ and Vista+.

    The questions then are: what to do about gethostbyaddr, and what to do about the general case?

    Since the problem appears to be specific to Windows, it might be appropriate to find a solution to just the Windows case, and ignore the general issue. For gethostbyaddr, decoding would then use CP_ACP.
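
    In Python terms, decoding with CP_ACP corresponds to the "mbcs" codec, which exists only on Windows. The sketch below is an assumption-laden illustration, not the proposed C change: the helper name is hypothetical, and the code page is a parameter so that the example can name cp1251 explicitly and run on any platform.

```python
def decode_ansi_hostname(raw, code_page="mbcs"):
    """Decode resolver output using the ANSI code page.

    "mbcs" means "the active ANSI code page" and is only available on
    Windows; pass an explicit code page name elsewhere.
    """
    return raw.decode(code_page)


# The default Russian PC name mentioned earlier in this thread, as cp1251 bytes.
russian_name = bytes([0xC8, 0xC2, 0xC0, 0xCD, 0x2D, 0xCF, 0xCA])
```

    decode_ansi_hostname(russian_name, "cp1251") returns "ИВАН-ПК", whereas decoding the same bytes as UTF-8 fails.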

    @Almad
    Mannequin

    Almad mannequin commented May 16, 2015

    I'd add that this bug is very practical and can render a lot of software unusable, noisy or confusing on Windows, including Django (I discovered this bug when mentoring at Django Girls).

    The simple way to reproduce it is to take any Windows installation and set the regional settings to something non-English (I've used Czech). You can verify the setting with "import locale; locale.getpreferredencoding()", which should then display a different encoding ("cp1250" in my case).

    Then set the "name" (= hostname, in Windows settings) of the computer to anything containing a non-ASCII character (like "Didejo-noťas").

    As Windows apparently encodes the hostname using its default encoding, it fails with

      File "C:\Python34\lib\wsgiref\simple_server.py", line 50, in server_bind
        HTTPServer.server_bind(self)
      File "C:\Python34\lib\http\server.py", line 135, in server_bind
        self.server_name = socket.getfqdn(host)
      File "C:\Python34\lib\socket.py", line 463, in getfqdn
        hostname, aliases, ipaddrs = gethostbyaddr(name)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 9: invalid start byte
    

    @baikie
    Mannequin Author

    baikie mannequin commented Jun 25, 2015

    I've updated the ASCII/surrogateescape patches in line with
    various changes to Python since I posted them.

    return-ascii-surrogateescape-2015-06-25.diff incorporates the
    ascii-surrogateescape and uname-surrogateescape patches, and
    accept-ascii-surrogateescape-2015-06-25.diff corresponds to the
    try-surrogateescape-first patch. Neither patch touches
    gethostname() on Windows.

    Python's existing code now has a fast path for ASCII-only strings
    which passes them through unchanged (Unicode -> ASCII), so in
    order not to slow down processing of valid IDNs, the latter patch
    now effectively tries encodings in the order

    ASCII/strict (existing code, fast path)
    IDNA/strict (existing code)
    ASCII/surrogateescape (added by patch)

    rather than the previous

    ASCII/surrogateescape
    IDNA/strict

    This doesn't change the behaviour of the patch, since IDNA always
    rejects strings containing surrogate codes, and either rejects
    ASCII-only strings (e.g. when a label is longer than 63
    characters) or passes them through unchanged.
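
    The two directions described above can be approximated in pure Python. This is a sketch under stated assumptions: the function names are hypothetical, and the real patches operate in C inside socketmodule.c, so details such as label-length checks on the ASCII fast path may differ.

```python
def decode_hostname_result(raw):
    """Return side: decode a name from the C library as ASCII/surrogateescape."""
    return raw.decode("ascii", "surrogateescape")


def encode_hostname_arg(name):
    """Accept side: try the encodings in the order listed above."""
    try:
        return name.encode("ascii")            # ASCII/strict (fast path)
    except UnicodeEncodeError:
        pass
    try:
        return name.encode("idna")             # IDNA/strict
    except UnicodeError:
        pass
    # ASCII/surrogateescape: only escaped bytes from an earlier result get
    # through here; anything else still raises UnicodeEncodeError.
    return name.encode("ascii", "surrogateescape")
```

    A name returned by decode_hostname_result can be passed straight back to encode_hostname_arg and yields the original bytes, while genuine Unicode names such as "löwis" still take the IDNA path.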

    These patches would at least allow getfqdn() to work in Almad's
    example, but in that case the host also appears to be addressable
    by the IDNA equivalent ("xn--didejo-noas-1ic") of its Unicode
    hostname (I haven't checked as I'm not a Windows user, but I
    presume the UnicodeDecodeError came from gethost_common() in
    socketmodule.c and hence the name lookup was successful), so it
    would certainly be more helpful to return Unicode for non-ASCII
    gethostbyaddr() results there, if they were guaranteed to map to
    real IDNA hostnames in Windows environments.

    (That isn't guaranteed in Unix environments of course, which is
    why I'm still suggesting ASCII/surrogateescape for the general
    case.)

    @vstinner
    Member

    FYI I created the issue bpo-26227 to change the encoding used to decode hostnames on Windows. UTF-8 doesn't seem to be the right encoding: it fails on non-ASCII hostnames. I propose to use the ANSI code page.

    Sorry, I didn't read this issue, but it looks like IDNA isn't the right encoding to decode hostnames *on Windows*.
