socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names #53623

baikie · 2010-07-25T18:33:03Z

BPO	9377
Nosy	@malemburg, @loewis, @amauryfa, @vstinner, @ezio-melotti, @bitdancer, @zooba
Files
ascii-surrogateescape.diff: Decode hostnames as ASCII/surrogateescape rather than UTF-8
try-surrogateescape-first.diff: Accept ASCII/surrogateescape strings as hostname arguments
uname-surrogateescape.diff: In posix.uname(), decode nodename as ASCII/surrogateescape
ascii-surrogateescape-2.diff: Renamed unicode_from_hostname -> decode_hostname
try-surrogateescape-first-2.diff: Made various small changes
try-surrogateescape-first-3.diff: Fixed a couple of mistakes
try-surrogateescape-first-4.diff
try-surrogateescape-first-getnameinfo-4.diff
decode-strict-ascii.diff: Decode hostnames strictly as ASCII
hostname-bytes-apis.diff: Add name resolution APIs that return names as bytes (applies on top of decode-strict-ascii.diff)
return-ascii-surrogateescape-2015-06-25.diff
accept-ascii-surrogateescape-2015-06-25.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2010-07-25.18:33:02.681>
labels = ['extension-modules', 'type-bug']
title = 'socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names'
updated_at = <Date 2016-01-28.01:05:18.521>
user = 'https://bugs.python.org/baikie'

bugs.python.org fields:

activity = <Date 2016-01-28.01:05:18.521>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation = <Date 2010-07-25.18:33:02.681>
creator = 'baikie'
dependencies = []
files = ['18195', '18196', '18259', '18272', '18273', '18609', '18616', '18617', '18674', '18676', '39812', '39813']
hgrepos = []
issue_num = 9377
keywords = ['patch']
message_count = 52.0
messages = ['111550', '111766', '111985', '112094', '114688', '114710', '114754', '114756', '114847', '114882', '115014', '115030', '115116', '115119', '115185', '115186', '115187', '118582', '118602', '118617', '118694', '118709', '118816', '118952', '119051', '119076', '119177', '119230', '119231', '119245', '119260', '119271', '119346', '119837', '119918', '119925', '119927', '119928', '119929', '119935', '119941', '119943', '119946', '120081', '158118', '158165', '158175', '158178', '159776', '243311', '245826', '259079']
nosy_count = 11.0
nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'vstinner', 'baikie', 'ezio.melotti', 'r.david.murray', 'jesterKing', 'spaun2002', 'steve.dower', 'Almad']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue9377'
versions = ['Python 3.2']

baikie · 2010-07-25T18:32:56Z

The functions in the socket module which return host/domain
names, such as gethostbyaddr() and getnameinfo(), are wrappers
around byte-oriented interfaces but return Unicode strings in
3.x, and have not been updated to deal with undecodable byte
sequences in the results, as discussed in PEP-383.

Some DNS resolvers do discard hostnames not matching the
ASCII-only RFC 1123 syntax, but checks for this may be absent or
turned off, and non-ASCII bytes can be returned via other lookup
facilities such as /etc/hosts.

Currently, names are converted to str objects using
PyUnicode_FromString(), i.e. by attempting to decode them as
UTF-8. This can fail with UnicodeError of course, but even if it
succeeds, any non-ASCII names returned will fail to round-trip
correctly because most socket functions encode string arguments
into IDNA ASCII-compatible form before using them. For example,
with UTF-8 encoded entries

127.0.0.2 €
127.0.0.3 xn--lzg

in /etc/hosts, I get:

Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> from socket import *
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]

Here, getaddrinfo() has encoded "€" to its corresponding ACE
label "xn--lzg", which maps to a different address.
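A minimal sketch of the asymmetry described above, using only the built-in codecs and no resolver calls. It assumes the name is stored as UTF-8 bytes, as in the /etc/hosts example; the variable names are illustrative only.

```python
# Bytes as they would appear in a UTF-8 encoded /etc/hosts entry for "€".
raw_name = "€".encode("utf-8")

# Decoding applied to returned names (UTF-8, as PyUnicode_FromString() does).
returned = raw_name.decode("utf-8")

# Encoding applied to str hostname arguments (IDNA).
sent_back = returned.encode("idna")

print(raw_name)    # b'\xe2\x82\xac'
print(sent_back)   # b'xn--lzg'
```

The bytes passed back to the resolver differ from the bytes it originally returned, which is why the second lookup lands on 127.0.0.3.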

PEP-383 can't be applied as-is here, since if the name happened
to be decodable in the file system encoding (and thus was
returned as valid non-ASCII Unicode), the result would fail to
round-trip correctly as shown above, but I think there is a
solution which follows the general idea of PEP-383.

Surrogate characters are not allowed in IDNs, since they are
prohibited by Nameprep[1][2], so if names were instead decoded as
ASCII with the surrogateescape error handler, strings
representing non-ASCII names would always contain surrogate
characters representing the non-ASCII bytes, and would therefore
fail to encode with the IDNA codec. Thus there would be no
ambiguity between these strings and valid IDNs. The attached
ascii-surrogateescape.diff does this.
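As a sketch of the decoding side only (pure Python, not the C code in the patch), the following shows how ASCII/surrogateescape maps each non-ASCII byte to a lone surrogate, and that the resulting string is rejected by the "idna" codec. The variable names are illustrative.

```python
raw_name = b"\xe2\x82\xac"  # UTF-8 bytes of "€", treated as opaque bytes

# Each non-ASCII byte 0xNN becomes the lone surrogate U+DCNN.
escaped = raw_name.decode("ascii", "surrogateescape")

# Nameprep prohibits surrogates, so IDNA encoding fails.
try:
    escaped.encode("idna")
    idna_accepts = True
except UnicodeError:
    idna_accepts = False

# The original bytes can be recovered exactly.
restored = escaped.encode("ascii", "surrogateescape")
```

Because the escaped string can never be a valid IDN, it cannot be mistaken for one when it is later passed as an argument.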

The returned strings could then be made to round-trip as
arguments, by having functions that take hostname arguments
attempt to encode them using ASCII/surrogateescape first before
trying IDNA encoding. Since IDNA leaves ASCII names unchanged
and surrogate characters are not allowed in IDNs, this would not
change the interpretation of any string hostnames that are
currently accepted. It would remove the 63-octet limit on label
length currently imposed due to the IDNA encoding, for ASCII
names only, but since this is imposed due to the 63-octet limit
of the DNS, and non-IDN names may be intended for other
resolution mechanisms, I think this is a feature, not a bug :)
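The argument-handling order described above could be sketched in Python roughly as follows. This is an illustration, not the patch itself (which is C code in the socket module); the function name `hostname_to_bytes_sketch` is hypothetical.

```python
def hostname_to_bytes_sketch(name):
    """Hypothetical helper: convert a hostname argument to bytes."""
    if isinstance(name, bytes):
        return name  # bytes are passed through unchanged
    try:
        # Step 1: plain ASCII names and surrogate-escaped names.
        return name.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Step 2: genuine non-ASCII text is treated as an IDN.
        return name.encode("idna")
```

With this order, `"example.com"` stays `b"example.com"`, `"€"` still becomes `b"xn--lzg"`, and `"\udce2\udc82\udcac"` round-trips to `b"\xe2\x82\xac"`. An ASCII label longer than 63 characters is accepted in step 1, whereas the "idna" codec on its own would reject it.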

The patch try-surrogateescape-first.diff implements the above for
all relevant interfaces, including gethostbyaddr() and
getnameinfo(), which do currently accept hostnames, even if the
documentation is vague (in the standard library, socket.fqdn()
calls gethostbyaddr() with a hostname, and the "os" module docs
suggest calling socket.gethostbyaddr(socket.gethostname()) to get
the fully-qualified hostname).

The patch still allows hostnames to be passed as bytes objects,
but to simplify the implementation, it removes support for
bytearray (as has been done for pathnames in 3.2). Bytearrays
are currently only accepted by the socket object methods
(.connect(), etc.), and this is undocumented and perhaps
unintentional - the get*() functions have never accepted them.

One problem with the surrogateescape scheme would be with
existing code that looks up an address and then tries to write
the hostname to a log file or use it as part of the wire
protocol, since the surrogate characters would fail to encode as
ASCII or UTF-8, but the code would appear to work normally until
it encountered a non-ASCII hostname, allowing the problem to go
undetected.
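To illustrate the logging concern, here is a small sketch. It assumes a name that came back surrogate-escaped; using the `backslashreplace` error handler is one possible way to make such a name printable and is not something the patches do.

```python
# A surrogate-escaped name, as the patched functions would return it.
name = b"\xe2\x82\xac.example".decode("ascii", "surrogateescape")

# Strict UTF-8 encoding (as a log file or wire protocol might use) fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# One defensive option: escape the surrogates before writing.
loggable = name.encode("ascii", "backslashreplace").decode("ascii")
print(loggable)
```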

On the other hand, such code is probably broken as things stand,
given that the address lookup functions can undocumentedly raise
UnicodeError in the same situation. Also, protocol definitions
often specify some variant of the RFC 1123 syntax for hostnames
(thus making non-ASCII bytes illegal), so code that checked for
this prior to encoding the name would probably be OK, but it's
more likely the exception than the rule.

An alternative approach might be to return all hostnames as bytes
objects, thus breaking everything immediately and obviously...

[1] http://tools.ietf.org/html/rfc3491#section-5
[2] http://tools.ietf.org/html/rfc3454#appendix-C.5

vstinner · 2010-07-28T02:44:42Z

I like the idea of using the PEP-383 for hostnames, but I don't understand the relation with IDNA (maybe because I don't know this encoding).

+this leaves IDNA ASCII-compatible encodings in ASCII
+form, but converts any non-ASCII bytes in the hostname to the Unicode
+lone surrogate codes U+DC80...U+DCFF.

What is an "IDNA ASCII-compatible encoding"?

--

ascii-surrogateescape.diff:

I don't like unicode_from_hostname() name: "decode_hostname()" would be better.
It doesn't patch the doc and so cannot be applied alone. That doesn't matter; it's better to apply both patches at the same time. But thanks for splitting them, it makes them easier to review :-)

try-surrogateescape-first.diff:

hostname_to_bytes() should be called "encode_hostname()"
if (!PyErr_ExceptionMatches(PyExc_UnicodeError)): you should catch UnicodeEncodeError here
"if this is not possible, :exc:`UnicodeError` is raised.": is it a UnicodeEncodeError?
use PyUnicode_AsEncodedString() instead of PyUnicode_AsEncodedObject(): it's faster for ASCII and ensures that the result is a bytes object (so you don't need to re-check the type)

baikie · 2010-07-29T18:28:18Z

"Leaving IDNA ASCII-compatible encodings in ASCII form" is just preserving the existing behaviour (not doing IDNA decoding). See

http://tools.ietf.org/html/rfc3490

and the docs for codecs -> encodings.idna ("xn--lzg" in the example is the ASCII-compatible encoding of "€", so if you look up that IP address, "xn--lzg" is returned with or without the patch).

I'll look into your other comments. In the meantime, I've got one more patch, as the decoding of the nodename field in os.uname() also needs to be changed to match the other hostname-returning functions. This patch changes it to ASCII/surrogateescape, with the usual PEP-383 decoding for the other fields.

baikie · 2010-07-30T18:11:42Z

OK, here are new versions of the original patches.

I've tweaked the docs to make clear that ASCII-compatible
encodings actually *are* ASCII, and point to an explanation as
soon as they're mentioned.

You're right that PyUnicode_AsEncodedString() is the preferable
interface for the argument converter (I think I got
PyUnicode_AsEncodedObject() from an old version of
PyUnicode_FSConverter() :/), but for the ASCII step I've just
short-circuited it and used PyUnicode_EncodeASCII() directly,
since the converter has already checked that the object is of
Unicode type. For the IDNA step, PyUnicode_AsEncodedString()
should result in a less confusing error message if the codec
returns some non-bytes object one day.

However, the PyBytes_Check isn't to check up on the codec, but to
check for a bytes argument, which the converter also supports.
For that reason, I think encode_hostname would be a misleading
name, but I've renamed it hostname_converter after the example of
PyUnicode_FSConverter, and renamed unicode_from_hostname to
decode_hostname.

I've also made the converter check for UnicodeEncodeError in the
ASCII step, but the end result really is UnicodeError if the IDNA
step fails, because the "idna" codec does not use
UnicodeEncodeError or UnicodeDecodeError. Complain about that if
you wish :)

I think the example I gave in the previous comment was also
confusing, so just to be clear...

In /etc/hosts (in UTF-8 encoding):

127.0.0.2 €
127.0.0.3 xn--lzg

Without patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'

With patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File
  "/home/david/python-patches/python-3/Lib/encodings/idna.py",
  line 167, in encode
    result.extend(ToASCII(label))
  File
  "/home/david/python-patches/python-3/Lib/encodings/idna.py",
  line 76, in ToASCII
    label = nameprep(label)
  File
  "/home/david/python-patches/python-3/Lib/encodings/idna.py",
  line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'

The exception at the end demonstrates why surrogateescape strings
don't get confused with IDNs.

baikie · 2010-08-22T18:27:31Z

I noticed that try-surrogateescape-first.diff missed out one of
the string references that needed to be changed to point to the
bytes object, and also used PyBytes_AS_STRING() in an unlocked
section. This version fixes these things by taking the generally
safer approach of setting the original char * variable to the
hostname immediately after using hostname_converter().

loewis · 2010-08-22T22:03:25Z

Is this patch in response to an actual problem, or a theoretical problem?
If "actual problem": what was the specific application, and what was the specific host name?

If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".
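One way an application might follow this advice is sketched below. The helper name `reverse_name` and its `lookup` parameter are assumptions made for illustration (the parameter only exists so the function can be exercised without a resolver).

```python
import socket

def reverse_name(ip, lookup=socket.gethostbyaddr):
    """Hypothetical helper: return the reverse name for ip, or None."""
    try:
        return lookup(ip)[0]
    except (socket.herror, socket.gaierror, UnicodeError):
        # Treat an undecodable name the same as "no reverse name available".
        return None
```

For example, `reverse_name("127.0.0.2")` would return None instead of propagating a UnicodeError if the name returned for that address could not be decoded.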

baikie · 2010-08-23T22:48:14Z

> Is this patch in response to an actual problem, or a theoretical problem?
> If "actual problem": what was the specific application, and what was the specific host name?

It's about environments, not applications - the local network may
be configured with non-ASCII bytes in hostnames (either in the
local DNS *or* a different lookup mechanism - I mentioned
/etc/hosts as a simple example), or someone might deliberately
connect from a garbage hostname as a denial of service attack
against a server which tries to look it up with gethostbyaddr()
or whatever (this may require a "non-strict" resolver library, as
noted above).

> If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".

There are two points here. One is that the decoding can fail; I
do think that programmers will find this surprising, and the fact
that Python refuses to return what was actually received is a
regression compared to 2.x.

The other is that the encoding and decoding are not symmetric -
hostnames are being decoded with UTF-8 but encoded with IDNA.
That means that when a decoded hostname contains a non-ASCII
character which is not prohibited by IDNA/Nameprep, that string
will, when used in a subsequent call, not refer to the hostname
that was actually received, because it will be re-encoded using a
different codec.

Attaching a refreshed version of try-surrogateescape-first.diff.
I've separated out the change to getnameinfo() as it may be
superfluous (issue bpo-1027206).

loewis · 2010-08-23T23:07:05Z

>> Is this patch in response to an actual problem, or a theoretical problem?
>> If "actual problem": what was the specific application, and what was the specific host name?

> It's about environments, not applications

Still, my question remains. Is it a theoretical problem (i.e. one
of your imagination), or a real one (i.e. one you observed in real
life, without explicitly triggering it)? If real: what was the
specific environment, and what was the specific host name?

> There are two points here. One is that the decoding can fail; I
> do think that programmers will find this surprising, and the fact
> that Python refuses to return what was actually received is a
> regression compared to 2.x.

True. However, I think this is an acceptable regression,
assuming the problem is merely theoretical. It is ok if
an operation fails that you will never run into in real life.

> That means that when a decoded hostname contains a non-ASCII
> character which is not prohibited by IDNA/Nameprep, that string
> will, when used in a subsequent call, not refer to the hostname
> that was actually received, because it will be re-encoded using a
> different codec.

Again, I fail to see the problem in this. It won't happen in
real life. However, if you worried that this could be abused,
I think it should decode host names as ASCII, not as UTF-8.
Then it will be symmetric again (IIUC).

baikie · 2010-08-24T22:59:29Z

>> It's about environments, not applications

> Still, my question remains. Is it a theoretical problem (i.e. one
> of your imagination), or a real one (i.e. one you observed in real
> life, without explicitly triggering it)? If real: what was the
> specific environment, and what was the specific host name?

Yes, I did reproduce the problem on my own system (Ubuntu 8.04).
No, it is not from a real application, nor do I know anyone with
their network configured like this (except possibly Dan "djbdns"
Bernstein: http://cr.yp.to/djbdns/idn.html ).

I reported this bug to save anyone who *is* in such an
environment from crashing applications and erroneous name
resolution.

>> That means that when a decoded hostname contains a non-ASCII
>> character which is not prohibited by IDNA/Nameprep, that string
>> will, when used in a subsequent call, not refer to the hostname
>> that was actually received, because it will be re-encoded using a
>> different codec.

> Again, I fail to see the problem in this. It won't happen in
> real life. However, if you worried that this could be abused,
> I think it should decode host names as ASCII, not as UTF-8.
> Then it will be symmetric again (IIUC).

That would be an improvement. The idea of the patches I posted
is to combine this with the existing surrogateescape mechanism,
which handles situations like this perfectly well. I don't see
how getting a UnicodeError is better than getting a string with
some lone surrogates in it. In fact, it was my understanding of
PEP-383 that it is in fact better to get the lone surrogates.

loewis · 2010-08-25T05:53:48Z

> That would be an improvement. The idea of the patches I posted
> is to combine this with the existing surrogateescape mechanism,
> which handles situations like this perfectly well.

The surrogateescape mechanism is a very hackish approach, and
violates the principle that errors should never pass silently.
However, it solves a real problem - people do run into the problem
with file names every day. With this problem, I'd say "if it hurts,
don't do it, then".

baikie · 2010-08-26T18:04:05Z

> The surrogateescape mechanism is a very hackish approach, and
> violates the principle that errors should never pass silently.

I don't see how a name resolution API returning non-ASCII bytes
would indicate an error. If the host table contains a non-ASCII
byte sequence for a host, then that is the host's name - it works
just as well as an ASCII name, both forwards and backwards.

What is hackish is representing char * data as a Unicode string
when there is no native Unicode API to feed it to - there is no
issue here such as file names being bytes on Unix and Unicode on
Windows, so the clean thing to do would be to return a bytes
object. I suggested the surrogateescape mechanism in order to
retain backwards compatibility.

> However, it solves a real problem - people do run into the problem
> with file names every day. With this problem, I'd say "if it hurts,
> don't do it, then".

But to be more explicit, that's like saying "if it hurts, get
your sysadmin to reconfigure the company network".

loewis · 2010-08-26T21:36:30Z

> I don't see how a name resolution API returning non-ASCII bytes
> would indicate an error.

It's in violation of RFC 952 (slightly relaxed by RFC 1123).

> But to be more explicit, that's like saying "if it hurts, get
> your sysadmin to reconfigure the company network".

Which I consider perfectly reasonable. The sysadmin should have
known (and, in practice, *always* knows) not to do that in the first
place (the larger the company, the more cautious the sysadmin).

baikie · 2010-08-27T19:13:04Z

>> I don't see how a name resolution API returning non-ASCII bytes
>> would indicate an error.

> It's in violation of RFC 952 (slightly relaxed by RFC 1123).

That's bad if it's on the public Internet, but it's not an
error. The OS is returning the name by which it knows the host.

If you look at POSIX, you'll see that what getaddrinfo() and
getnameinfo() look up and return is referred to as a "node name",
which can be an address string or a "descriptive name", and that
if used with Internet address families, descriptive names
"include" host names. It doesn't say that the string can only be
an address string or a hostname (RFC 1123 compliant or
otherwise).

>> But to be more explicit, that's like saying "if it hurts, get
>> your sysadmin to reconfigure the company network".

> Which I consider perfectly reasonable. The sysadmin should have
> known (and, in practice, *always* knows) not to do that in the first
> place (the larger the company, the more cautious the sysadmin).

It's not reasonable when addressed to a customer who might go
elsewhere. And I still don't see a technical reason for making
such a demand. Python 2.x seems to work just fine using 8-bit
strings.

loewis · 2010-08-27T19:20:30Z

> It's not reasonable when addressed to a customer who might go
> elsewhere.

I remain -1 on this change, until such a customer actually shows
up at a Python developer.

baikie · 2010-08-29T18:44:55Z

OK, I still think this issue should be addressed, but here is a patch for the part we agree on: that decoding should not return any Unicode characters except ASCII.
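For comparison with the surrogateescape approach, strict ASCII decoding can be sketched as below. The helper name `decode_hostname_strict` is hypothetical and only mirrors the idea of decode-strict-ascii.diff in Python; the actual patch is C code.

```python
def decode_hostname_strict(raw):
    """Hypothetical helper: decode a returned name, rejecting non-ASCII bytes."""
    return raw.decode("ascii")  # raises UnicodeDecodeError on any byte >= 0x80

ace = decode_hostname_strict(b"xn--lzg")   # ASCII-compatible encodings pass through

try:
    decode_hostname_strict(b"\xe2\x82\xac")
    non_ascii_ok = True
except UnicodeDecodeError:
    non_ascii_ok = False
```

Names that decode successfully this way are pure ASCII, so the IDNA encoding applied to arguments leaves them unchanged and the round trip is symmetric.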

baikie · 2010-08-29T18:47:12Z

The rest of the issue could also be straightforwardly addressed by adding bytes versions of the name lookup APIs. Attaching a patch which does that (applies on top of decode-strict-ascii.diff).

baikie · 2010-08-29T19:01:35Z

Oops, forgot to refresh the last change into that patch. This should fix it.

jesterKing · 2010-10-13T20:50:28Z

platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ascii characters. The implementation of platform.system() uses at some point socket.gethostname() ( see http://www.pasteall.org/16215 for a stacktrace of such usage)

A lot of our Blender users are not native English speakers, and they set up their machines as they please, against RFCs or not.

This currently breaks some code that uses platform.system() to check the system it's run on. The paste from above is from a user who has named his computer Nötkötti.

It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.

baikie · 2010-10-13T23:38:21Z

> platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ascii characters. The implementation of platform.system() uses at some point socket.gethostname() ( see http://www.pasteall.org/16215 for a stacktrace of such usage)

This trace is from a Windows system, where the platform module
uses gethostname() in its cross-platform uname() function, which
platform.system() and various other functions in the module rely
on. On a Unix system, platform.uname() depends on os.uname()
working, meaning that these functions still fail when the
hostname cannot be decoded, as it is part of os.uname()'s return
value.

Given that os.uname() is a primary source of information about
the platform on Unix systems, this sort of collateral damage from
an undecodable hostname is likely to occur in more places.

> It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.

If you'd like to try the surrogateescape patches, they ought to
fix this. The relevant patches are ascii-surrogateescape-2.diff,
try-surrogateescape-first-4.diff and uname-surrogateescape.diff.

loewis · 2010-10-14T06:12:46Z

The failure of platform.uname is an independent bug. IMO, it shouldn't use socket.gethostname on Windows, but instead look at the COMPUTERNAME environment variable or call the GetComputerName API function. This is closer to what uname() does on Unix (i.e. retrieve the local machine name independent of DNS).

I have created bpo-10097 for this bug.

loewis · 2010-10-21T05:09:50Z

> Sorry, I didn't mean how Windows constructs the result for the
> "A" interface - I was talking about Python code being able to map
> the result from the Unicode interface to the form used in the
> protocol (e.g. DNS). I believe the proposal is to use the DNS
> name

I disagree with the proposal - it should return whatever
name gethostname from winsock.dll returns (which I expect
to be the netbios name).

> so since the DNS is byte oriented, I would have thought
> that the Unicode "DNS name" result would always have a bytes
> equivalent that the DNS resolver code would use - perhaps its
> UTF-8 encoding?

No no no. When Microsoft calls it the DNS name, they don't actually
mean that it has to do anything with DNS. In particular, it's not
byte-oriented.

malemburg · 2010-10-21T09:24:44Z

Martin v. Löwis wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
>> Sorry, I didn't mean how Windows constructs the result for the
>> "A" interface - I was talking about Python code being able to map
>> the result from the Unicode interface to the form used in the
>> protocol (e.g. DNS). I believe the proposal is to use the DNS
>> name
>
> I disagree with the proposal - it should return whatever
> name gethostname from winsock.dll returns (which I expect
> to be the netbios name).
>
>> so since the DNS is byte oriented, I would have thought
>> that the Unicode "DNS name" result would always have a bytes
>> equivalent that the DNS resolver code would use - perhaps its
>> UTF-8 encoding?
>
> No no no. When Microsoft calls it the DNS name, they don't actually
> mean that it has to do anything with DNS. In particular, it's not
> byte-oriented.

Just to clarify: I was proposing to use the
GetComputerNameExW() win32 API with ComputerNamePhysicalDnsHostname,
which returns Unicode without needing any roundtrip via bytes
and the issues associated with this.

I don't understand why Martin insists that the MS "DNS name"
doesn't have anything to do with DNS... the fully qualified
DNS name of a machine is determined as hostname.domainname,
just like you would expect in DNS.

http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms724224(v=VS.85).aspx

As I said earlier: NetBIOS is being phased out in favor of
DNS. MS is using a convention which mandates that NetBIOS names
match DNS names. The only difference between the two is that
NetBIOS names have a length limitation:

http://msdn.microsoft.com/en-us/library/ms724931(v=VS.85).aspx

Perhaps Martin could clarify why he insists on using the
ANSI WinSock interface gethostname instead.

PS: WinSock provides many other Unicode APIs for socket
module interfaces as well, so at least for that platform,
we could use those to resolve uncertainties about the
encoding used in name resolution.

On other platforms, I guess we'll just have to do some trial
and error to see what works and what not. E.g. on Linux it is
possible to set the hostname to a non-ASCII value, but then
the resolver returns an error, so it's not very practical:

# hostname l\303\266wis
# hostname
löwis
# hostname -f
hostname: Resolver Error 0 (no error)

Using the IDNA version doesn't help either:

# hostname xn--lwis-5qa
# hostname
xn--lwis-5qa
# hostname -f
hostname: Resolver Error 0 (no error)

Python2 happily returns the host name, but fails to return
a fully qualified domain name:

>>> socket.gethostname()
'l\xc3\xb6wis'
>>> socket.getfqdn()
'l\xc3\xb6wis'

and

>>> socket.gethostname()
'xn--lwis-5qa'
>>> socket.getfqdn()
'xn--lwis-5qa'

Just for comparison:

# hostname newton
# hostname
newton
# hostname -f
newton.egenix.internal

and

>>> socket.gethostname()
'newton'
>>> socket.getfqdn()
'newton.egenix.internal'

So at least on Linux, using non-ASCII hostnames doesn't really
appear to be an option at this time.

baikie · 2010-10-21T22:37:23Z

> On other platforms, I guess we'll just have to do some trial
> and error to see what works and what not. E.g. on Linux it is
> possible to set the hostname to a non-ASCII value, but then
> the resolver returns an error, so it's not very practical:
>
> # hostname l\303\266wis
> # hostname
> löwis
> # hostname -f
> hostname: Resolver Error 0 (no error)
>
> Using the IDNA version doesn't help either:
>
> # hostname xn--lwis-5qa
> # hostname
> xn--lwis-5qa
> # hostname -f
> hostname: Resolver Error 0 (no error)

I think what's happening here is simply that you're setting
the hostname to something which doesn't exist in the relevant
name databases - the man page for Linux's hostname(1) says that
"The FQDN is the name gethostbyname(2) returns for the host name
returned by gethostname(2).". If the computer's usual name is
"newton", that may be why it works and the others don't.

It works for me if I add "127.0.0.9 löwis.egenix.com löwis" to
/etc/hosts and then set the hostname to "löwis" (all UTF-8):
hostname -f prints "löwis.egenix.com", and Python 2's
socket.getfqdn() returns the corresponding bytes; non-UTF-8 names
work too. (Note that the FQDN must appear before the bare
hostname in the /etc/hosts entry, and I used the address
127.0.0.9 simply to avoid a collision with existing entries - by
default, Ubuntu assigns the FQDN to 127.0.1.1.)

bitdancer · 2010-10-29T02:00:12Z

Looks like we have our first customer (bpo-10223).

loewis · 2010-10-29T17:44:51Z

I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset). After the reboot, I found

COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
GetComputerNameEx(5) is "π3141"
socket.gethostname of Python 2.5 returns "p3141".

So my theory of how this all fits together is this:

it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.
gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).
the NetBIOS name is generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.

In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.

Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".
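As a side note on the code page point, a quick check with Python's own cp1252 codec shows that "π" has no representation there. This is only a sketch: Python's codec does not apply the Windows "best fit" conversion that apparently produced "p3141" in the experiment, so the lossy result below differs from what gethostname returned.

```python
name = "π3141"

# Strict encoding fails because U+03C0 is not in windows-1252.
try:
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

# Python's "replace" handler substitutes "?", unlike the Windows conversion.
lossy = name.encode("cp1252", "replace")
```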

loewis · 2010-10-29T18:22:41Z

r85934 now uses GetComputerNameExW on Windows.

malemburg · 2010-10-29T18:33:14Z

Martin v. Löwis wrote:

Martin v. Löwis <martin@v.loewis.de> added the comment:

I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset. After the reboot, I found

COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)

GetComputerNameEx(5) is "π3141"

socket.gethostname of Python 2.5 returns "p3141".

So my theory of how this all fits together is this:

it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.

The MS docs mention that setting the DNS name will adjust the NetBIO name
as well (with the NetBIOS name being converted to upper case and truncated,
if the DNS name is too long).

They don't mention anything about the NetBIOS name encoding.

gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).

the NetBIOS name is an generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.

In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.

Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".

The DNS name of the Windows machine is the combination of the DNS host
name and the DNS domain that you setup on the machine. I think the
misunderstanding is that you assume this combination will
somehow appear as known DNS name of the machine via some
DNS server on the network - that's not the case.

Of course, it's not particularly useful to set the DNS name to
something that other machines cannot find out via a DNS query.

FWIW, you can do the same on a Linux box, i.e. setup the host name
and domain to some completely bogus values. And as David pointed out,
without also updating the /etc/hosts on the Linux, you always get the
resolver error with hostname -f I mentioned earlier on (which does
a DNS lookup), so there's no real connection to the DNS system on
Linux either.

malemburg · 2010-10-29T19:04:44Z

Martin v. Löwis wrote:

Martin v. Löwis <martin@v.loewis.de> added the comment:

r85934 now uses GetComputerNameExW on Windows.

Thanks, Martin.

Here's a similar discussion of the Windows approach (used in bzr):

https://bugs.launchpad.net/bzr/+bug/256550/comments/6

This is what Solaris uses:

http://developers.sun.com/dev/gadc/faq/locale.html#get-set

(they require conversion to ASCII and using IDNA for non-ASCII
names)

I found this RFC draft on the topic:
http://tools.ietf.org/html/draft-josefsson-getaddrinfo-idn-00
which suggests that there is no standard for the encoding
used by the socket host name APIs yet.

ASCII, UTF-8 and IDNA are happily mixed and matched.

loewis · 2010-10-29T19:31:47Z

The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends to use IDNA), then this will work fine with the current code.

The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case the socket.getaddrinfo uses IDNA, anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).

If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.
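The Unix suggestion above can be sketched roughly as follows. This is only a pure-Python illustration of the idea, not the change to socketmodule.c itself; the helper names decode_hostname_fs and encode_hostname_fs are hypothetical and do not exist in the socket module.

```python
import sys


def decode_hostname_fs(raw: bytes) -> str:
    """Hypothetical helper: decode raw hostname bytes with the file system
    encoding, using surrogateescape so undecodable bytes are preserved."""
    return raw.decode(sys.getfilesystemencoding(), "surrogateescape")


def encode_hostname_fs(name: str) -> bytes:
    """Hypothetical inverse: turn a decoded name back into the original bytes."""
    return name.encode(sys.getfilesystemencoding(), "surrogateescape")


# Latin-1 bytes for "l\xf6wis"; these are not valid UTF-8.
raw = b"l\xf6wis"
name = decode_hostname_fs(raw)

# Whatever the file system encoding is, the original bytes are recoverable.
roundtrip_ok = encode_hostname_fs(name) == raw

# Pure-ASCII host names are unaffected.
ascii_unchanged = decode_hostname_fs(b"example") == "example"
```

On a system that restricts host names to ASCII this behaves like a plain ASCII decode; elsewhere, bytes that do not fit the encoding come back as lone surrogates instead of raising.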

malemburg · 2010-10-29T20:09:04Z

Martin v. Löwis wrote:

Martin v. Löwis <martin@v.loewis.de> added the comment:

The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends to use IDNA), then this will work fine with the current code.

The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case the socket.getaddrinfo uses IDNA, anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).

If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.

Wouldn't it be better to also attempt to decode the name using IDNA
in case the name starts with the IDNA prefix ?

This would then also cover the Solaris case.

loewis · 2010-10-29T22:25:22Z

The DNS name of the Windows machine is the combination of the DNS host
name and the DNS domain that you setup on the machine. I think the
misunderstanding is that you assume this combination will
somehow appear as known DNS name of the machine via some
DNS server on the network - that's not the case.

I don't assume that - I merely point out that it clearly has no
relationship to the DNS (unless you explicitly make it that way).
So, I wonder why they call it the DNS name - they could just
as well have called it the "LDAP name", or the "NIS name". In either case,
setting the name would have no impact on the respective naming
infrastructure.

FWIW, you can do the same on a Linux box, i.e. setup the host name
and domain to some completely bogus values. And as David pointed out,
without also updating the /etc/hosts on the Linux, you always get the
resolver error with hostname -f I mentioned earlier on (which does
a DNS lookup), so there's no real connection to the DNS system on
Linux either.

Yes, but Linux (rightly) calls it the "hostname", not the "DNS name".

loewis · 2010-10-29T22:26:53Z

Wouldn't it be better to also attempt to decode the name using IDNA
in case the name starts with the IDNA prefix ?

Perhaps better - but incompatible. I don't see a way to have the
resolver functions automatically decode IDNA, without potentially
breaking existing applications that specifically look for the
IDNA prefix (say).
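An application that does want the Unicode form can opt in explicitly instead of relying on the resolver functions. A minimal sketch, assuming Python's built-in "idna" codec is acceptable for the purpose; hostname_to_unicode is a hypothetical helper, not part of the socket module:

```python
def hostname_to_unicode(name: str) -> str:
    """Hypothetical helper: undo IDNA only when a label carries the ACE
    prefix "xn--"; otherwise return the name untouched."""
    labels = name.split(".")
    if not any(label.lower().startswith("xn--") for label in labels):
        return name
    try:
        # The built-in "idna" codec applies ToUnicode to each label.
        return name.encode("ascii").decode("idna")
    except UnicodeError:
        # Not valid IDNA after all: keep what the resolver returned.
        return name


decoded = hostname_to_unicode("xn--lwis-5qa")
plain = hostname_to_unicode("example.org")
```

Because the caller asks for the conversion, code that looks for the "xn--" prefix in resolver results keeps working.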

amauryfa · 2010-10-29T22:29:03Z

The code in socketmodule.c currently compiles with suspicious warnings:

socketmodule.c(3108) : warning C4047: 'function' : 'LPSTR' differs in levels of indirection from 'int'
socketmodule.c(3108) : warning C4024: 'GetComputerNameA' : different types for formal and actual parameter 1
socketmodule.c(3109) : warning C4133: 'function' : incompatible types - from 'Py_UNICODE *' to 'LPDWORD'
socketmodule.c(3110) : warning C4020: 'GetComputerNameA' : too many actual parameters

Was GetComputerName() used instead of GetComputerNameExW()?

baikie · 2010-10-31T19:34:29Z

FWIW, you can do the same on a Linux box, i.e. setup the host name
and domain to some completely bogus values. And as David pointed out,
without also updating the /etc/hosts on the Linux, you always get the
resolver error with hostname -f I mentioned earlier on (which does
a DNS lookup), so there's no real connection to the DNS system on
Linux either.

Just to clarify here: there isn't anything special about
/etc/hosts; it's handled by a pluggable module which performs
hostname lookups in it alongside a similar module for the DNS.
glibc's Name Service Switch combines the views provided by the
various modules into a single byte-oriented namespace for
hostnames according to the settings in /etc/nsswitch.conf (this
namespace allows non-ASCII bytes, as the /etc/hosts examples
demonstrate).

http://www.kernel.org/doc/man-pages/online/pages/man5/nsswitch.conf.5.html
http://www.gnu.org/software/libc/manual/html_node/Name-Service-Switch.html

It's an extensible system, so people can write their own modules
to handle whatever name services they have to deal with, and
configure hostname lookup to query them before, after or instead
of the DNS. A hostname that is not resolvable in the DNS may be
resolvable in one of these.

spaun2002 · 2012-04-12T10:08:04Z

I ran into this issue on my own PC. On a Russian version of Windows the default PC name is ИВАН-ПК (C8 C2 C0 CD 2D CF CA in hex), and gethostbyaddr (CRT) returns it in exactly this form (encoded with the system locale's cp1251, not UTF-8). So when PyUnicode_FromString is called, it expects its argument to be a UTF-8 encoded string and throws an error.
A lot of 3rd party modules use gethostbyaddr or getfqdn (which uses gethostbyaddr), so I can't just use a function that returns names as bytes. Surrogate names are also not acceptable, because the name mentioned above becomes ????-??

amauryfa · 2012-04-12T19:44:48Z

Nick, which version of Python are you using? And which function are you running exactly?
It seems that a4fd3dc74299 fixed the issue, this was included with 3.2.

spaun2002 · 2012-04-12T21:52:14Z

Originally I tried 3.2.2 (32bit), but I've just checked 3.2.3 and got the same result.
The code to reproduce it is simple:

from socket import gethostbyaddr
a = gethostbyaddr('127.0.0.1')

leads to:
Traceback (most recent call last):
  File "C:\Users\user\test\test.py", line 13, in <module>
    a = gethostbyaddr('127.0.0.1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte

Or a more complex sample:

def main():
    import http.server
    port = 80
    handlerClass = http.server.SimpleHTTPRequestHandler
    srv = http.server.HTTPServer(("", port), handlerClass )
    srv.serve_forever()
if __name__ == "__main__":
    main()

Attempting to connect to the server leads to:

----------------------------------------

Exception happened during processing of request from ('127.0.0.1', 1156)
Traceback (most recent call last):
  File "C:\Python32\lib\socketserver.py", line 284, in _handle_request_noblock
    self.process_request(request, client_address)
  File "C:\Python32\lib\socketserver.py", line 310, in process_request
    self.finish_request(request, client_address)
  File "C:\Python32\lib\socketserver.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "C:\Python32\lib\socketserver.py", line 637, in __init__
    self.handle()
  File "C:\Python32\lib\http\server.py", line 396, in handle
    self.handle_one_request()
  File "C:\Python32\lib\http\server.py", line 384, in handle_one_request
    method()
  File "C:\Python32\lib\http\server.py", line 657, in do_GET
    f = self.send_head()
  File "C:\Python32\lib\http\server.py", line 701, in send_head
    self.send_response(200)
  File "C:\Python32\lib\http\server.py", line 438, in send_response
    self.log_request(code)
  File "C:\Python32\lib\http\server.py", line 483, in log_request
    self.requestline, str(code), str(size))
  File "C:\Python32\lib\http\server.py", line 517, in log_message
    (self.address_string(),
  File "C:\Python32\lib\http\server.py", line 559, in address_string
    return socket.getfqdn(host)
  File "C:\Python32\lib\socket.py", line 355, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte

P.S. My PC name is "USER-ПК"
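The failure can be reproduced without a Windows machine by decoding the same bytes by hand. This is a small sketch that assumes, as described above, that the name arrives encoded in cp1251; the variable names are illustrative only.

```python
# "USER-ПК" as cp1251 bytes, i.e. what the C API is reported to hand back.
raw = b"USER-\xcf\xca"

# Strict UTF-8 decoding fails on the first Cyrillic byte.
try:
    raw.decode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start          # index of the offending byte
    bad_byte = raw[exc.start]

# Decoding with the ANSI code page in use recovers the real name.
as_cp1251 = raw.decode("cp1251")

# ASCII/surrogateescape never fails and is reversible, but the non-ASCII
# part becomes lone surrogates rather than readable text.
escaped = raw.decode("ascii", "surrogateescape")
restored = escaped.encode("ascii", "surrogateescape")
```

Here failed_at is 5 and bad_byte is 0xcf, matching the traceback, while the cp1251 decode yields the expected name.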

vstinner · 2012-04-12T21:55:14Z

a4fd3dc74299 only fixed socket.gethostname(), not socket.gethostbyaddr().

loewis · 2012-05-02T06:25:34Z

For Windows versions that support it, we could use GetNameInfoW, available on XPSP2+, W2k3+ and Vista+.

The questions then are: what to do about gethostbyaddr, and what to do about the general case?

Since the problem appears to be specific to Windows, it might be appropriate to find a solution to just the Windows case, and ignore the general issue. For gethostbyaddr, decoding would then use CP_ACP.

Almad · 2015-05-16T12:00:58Z

I'd add that this bug is very practical and can render a lot of software unusable/noisy/confusing on Windows, including Django (I discovered this bug when mentoring at Django Girls).

The simple way to reproduce it is to take any Windows machine and set its regional settings to non-English (I've used Czech). You can verify that using "import locale; locale.getpreferredencoding()", which should report the corresponding code page ("cp1250" in my case).

Then, set the "name" (= hostname, in Windows settings) of the computer to anything containing a non-ASCII character (like "Didejo-noťas").

As Windows apparently encodes the hostname using its default encoding, it fails with

  File "C:\Python34\lib\wsgiref\simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "C:\Python34\lib\http\server.py", line 135, in server_bind
    self.server_name = socket.getfqdn(host)
  File "C:\Python34\lib\socket.py", line 463, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 9: invalid start byte

baikie · 2015-06-25T20:28:14Z

I've updated the ASCII/surrogateescape patches in line with
various changes to Python since I posted them.

return-ascii-surrogateescape-2015-06-25.diff incorporates the
ascii-surrogateescape and uname-surrogateescape patches, and
accept-ascii-surrogateescape-2015-06-25.diff corresponds to the
try-surrogateescape-first patch. Neither patch touches
gethostname() on Windows.

Python's existing code now has a fast path for ASCII-only strings
which passes them through unchanged (Unicode -> ASCII), so in
order not to slow down processing of valid IDNs, the latter patch
now effectively tries encodings in the order

ASCII/strict (existing code, fast path)
IDNA/strict (existing code)
ASCII/surrogateescape (added by patch)

rather than the previous

ASCII/surrogateescape
IDNA/strict

This doesn't change the behaviour of the patch, since IDNA always
rejects strings containing surrogate codes, and either rejects
ASCII-only strings (e.g. when a label is longer than 63
characters) or passes them through unchanged.
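The lookup order described above can be approximated in pure Python. This is only a sketch of the ordering, not the patch itself (which lives in socketmodule.c); encode_hostname_arg is a hypothetical name.

```python
def encode_hostname_arg(name: str) -> bytes:
    """Hypothetical helper mirroring the order
    ASCII/strict -> IDNA/strict -> ASCII/surrogateescape."""
    # 1. Fast path: plain ASCII names pass through unchanged.
    try:
        return name.encode("ascii")
    except UnicodeEncodeError:
        pass
    # 2. Real Unicode names are converted with IDNA. The codec rejects
    #    strings that contain surrogate code points.
    try:
        return name.encode("idna")
    except UnicodeError:
        pass
    # 3. Names that came from ASCII/surrogateescape decoding are turned
    #    back into the original bytes. Anything else still raises.
    return name.encode("ascii", "surrogateescape")


ascii_name = encode_hostname_arg("example.org")
idn_name = encode_hostname_arg("l\u00f6wis")
escaped_name = encode_hostname_arg("l\udcc3\udcb6wis")
```

A surrogate-escaped string can never be mistaken for a valid IDN, which is why trying IDNA before the surrogateescape step does not change the result.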

These patches would at least allow getfqdn() to work in Almad's
example, but in that case the host also appears to be addressable
by the IDNA equivalent ("xn--didejo-noas-1ic") of its Unicode
hostname (I haven't checked as I'm not a Windows user, but I
presume the UnicodeDecodeError came from gethost_common() in
socketmodule.c and hence the name lookup was successful), so it
would certainly be more helpful to return Unicode for non-ASCII
gethostbyaddr() results there, if they were guaranteed to map to
real IDNA hostnames in Windows environments.

(That isn't guaranteed in Unix environments of course, which is
why I'm still suggesting ASCII/surrogateescape for the general
case.)

vstinner · 2016-01-28T01:05:18Z

FYI I created the issue bpo-26227 to change the encoding used to decode hostnames on Windows. UTF-8 doesn't seem to be the right encoding; it fails on non-ASCII hostnames. I propose to use the ANSI code page.

Sorry, I didn't read this issue, but it looks like IDNA isn't the right encoding to decode hostnames *on Windows*.

baikie mannequin added the extension-modules (C modules in the Modules dir) and type-bug (An unexpected behavior, bug, or error) labels Jul 25, 2010


ezio-melotti transferred this issue from another repository Apr 10, 2022

esev mentioned this issue Feb 4, 2024

Wemo integration fails to initialize: UnicodeDecodeError home-assistant/core#109619

Open

esev mentioned this issue Mar 9, 2024

UnicodeDecodeError prevents home-assistant integration from initializing pywemo/pywemo#716

Open
