classification
Title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
Type: behavior
Stage:
Components: Extension Modules
Versions: Python 3.2
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Almad, amaury.forgeotdarc, baikie, ezio.melotti, jesterKing, lemburg, loewis, r.david.murray, spaun2002, steve.dower, vstinner
Priority: normal
Keywords: patch

Created on 2010-07-25 18:33 by baikie, last changed 2016-01-28 01:05 by vstinner.

Files
File name Uploaded Description
ascii-surrogateescape.diff baikie, 2010-07-25 18:32 Decode hostnames as ASCII/surrogateescape rather than UTF-8
try-surrogateescape-first.diff baikie, 2010-07-25 18:33 Accept ASCII/surrogateescape strings as hostname arguments
uname-surrogateescape.diff baikie, 2010-07-29 18:28 In posix.uname(), decode nodename as ASCII/surrogateescape
ascii-surrogateescape-2.diff baikie, 2010-07-30 18:11 Renamed unicode_from_hostname -> decode_hostname
try-surrogateescape-first-2.diff baikie, 2010-07-30 18:14 Made various small changes
try-surrogateescape-first-3.diff baikie, 2010-08-22 18:27 Fixed a couple of mistakes
try-surrogateescape-first-4.diff baikie, 2010-08-23 22:48
try-surrogateescape-first-getnameinfo-4.diff baikie, 2010-08-23 22:48
decode-strict-ascii.diff baikie, 2010-08-29 18:44 Decode hostnames strictly as ASCII
hostname-bytes-apis.diff baikie, 2010-08-29 19:01 Add name resolution APIs that return names as bytes (applies on top of decode-strict-ascii.diff)
return-ascii-surrogateescape-2015-06-25.diff baikie, 2015-06-25 20:28
accept-ascii-surrogateescape-2015-06-25.diff baikie, 2015-06-25 20:28
Messages (52)
msg111550 - (view) Author: David Watson (baikie) Date: 2010-07-25 18:32
The functions in the socket module which return host/domain
names, such as gethostbyaddr() and getnameinfo(), are wrappers
around byte-oriented interfaces but return Unicode strings in
3.x, and have not been updated to deal with undecodable byte
sequences in the results, as discussed in PEP 383.

Some DNS resolvers do discard hostnames not matching the
ASCII-only RFC 1123 syntax, but checks for this may be absent or
turned off, and non-ASCII bytes can be returned via other lookup
facilities such as /etc/hosts.

Currently, names are converted to str objects using
PyUnicode_FromString(), i.e. by attempting to decode them as
UTF-8.  This can fail with UnicodeError of course, but even if it
succeeds, any non-ASCII names returned will fail to round-trip
correctly because most socket functions encode string arguments
into IDNA ASCII-compatible form before using them.  For example,
with UTF-8 encoded entries

127.0.0.2       €
127.0.0.3       xn--lzg

in /etc/hosts, I get:

Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import *
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]

Here, getaddrinfo() has encoded "€" to its corresponding ACE
label "xn--lzg", which maps to a different address.

PEP 383 can't be applied as-is here, since if the name happened
to be decodable in the file system encoding (and thus was
returned as valid non-ASCII Unicode), the result would fail to
round-trip correctly as shown above, but I think there is a
solution which follows the general idea of PEP 383.

Surrogate characters are not allowed in IDNs, since they are
prohibited by Nameprep[1][2], so if names were instead decoded as
ASCII with the surrogateescape error handler, strings
representing non-ASCII names would always contain surrogate
characters representing the non-ASCII bytes, and would therefore
fail to encode with the IDNA codec.  Thus there would be no
ambiguity between these strings and valid IDNs.  The attached
ascii-surrogateescape.diff does this.
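
The decoding idea above can be tried with the standard codecs alone. This is a minimal sketch, not the patch itself: `raw` stands in for bytes a resolver might return, and the variable names are illustrative only.

```python
# Bytes as a resolver might return them: the UTF-8 encoding of U+20AC (euro sign).
raw = "\u20ac".encode("utf-8")

# Decode as ASCII; surrogateescape maps each non-ASCII byte 0xNN to U+DCNN.
name = raw.decode("ascii", "surrogateescape")

# Nameprep prohibits surrogates, so the IDNA codec rejects the string.
try:
    name.encode("idna")
    idna_accepted = True
except UnicodeError:
    idna_accepted = False

# Encoding with the same handler recovers the original bytes exactly.
restored = name.encode("ascii", "surrogateescape")
```

Because the IDNA codec refuses the string, a surrogate-escaped name cannot be mistaken for a valid IDN.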

The returned strings could then be made to round-trip as
arguments, by having functions that take hostname arguments
attempt to encode them using ASCII/surrogateescape first before
trying IDNA encoding.  Since IDNA leaves ASCII names unchanged
and surrogate characters are not allowed in IDNs, this would not
change the interpretation of any string hostnames that are
currently accepted.  It would remove the 63-octet limit on label
length currently imposed due to the IDNA encoding, for ASCII
names only, but since this is imposed due to the 63-octet limit
of the DNS, and non-IDN names may be intended for other
resolution mechanisms, I think this is a feature, not a bug :)
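
The argument-side behaviour described above can be sketched in pure Python. The function name `hostname_to_bytes` and the `examples` dictionary are hypothetical and exist only for illustration; the real patch does this in C inside the socket module.

```python
def hostname_to_bytes(name):
    """Hypothetical converter: bytes pass through, str tries ASCII first."""
    if isinstance(name, bytes):
        return name
    try:
        # Pure-ASCII names and surrogate-escaped bytes are handled here.
        return name.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Any other non-ASCII string is treated as an IDN, as before.
        return name.encode("idna")


examples = {
    "plain": hostname_to_bytes("example.com"),
    "escaped": hostname_to_bytes("\udce2\udc82\udcac"),
    "idn": hostname_to_bytes("\u20ac"),
    # A 70-character ASCII label is passed through; the IDNA step is never reached.
    "long_label": hostname_to_bytes("a" * 70),
}
```

Strings that are accepted today keep their meaning in this sketch; only strings containing lone surrogates, which IDNA rejects anyway, gain a new interpretation.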

The patch try-surrogateescape-first.diff implements the above for
all relevant interfaces, including gethostbyaddr() and
getnameinfo(), which do currently accept hostnames, even if the
documentation is vague (in the standard library, socket.getfqdn()
calls gethostbyaddr() with a hostname, and the "os" module docs
suggest calling socket.gethostbyaddr(socket.gethostname()) to get
the fully-qualified hostname).

The patch still allows hostnames to be passed as bytes objects,
but to simplify the implementation, it removes support for
bytearray (as has been done for pathnames in 3.2).  Bytearrays
are currently only accepted by the socket object methods
(.connect(), etc.), and this is undocumented and perhaps
unintentional - the get*() functions have never accepted them.

One problem with the surrogateescape scheme would be with
existing code that looks up an address and then tries to write
the hostname to a log file or use it as part of the wire
protocol, since the surrogate characters would fail to encode as
ASCII or UTF-8, but the code would appear to work normally until
it encountered a non-ASCII hostname, allowing the problem to go
undetected.

On the other hand, such code is probably broken as things stand,
given that the address lookup functions can undocumentedly raise
UnicodeError in the same situation.  Also, protocol definitions
often specify some variant of the RFC 1123 syntax for hostnames
(thus making non-ASCII bytes illegal), so code that checked for
this prior to encoding the name would probably be OK, but it's
more likely the exception than the rule.
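
To illustrate the logging concern, here is a small sketch assuming a name that was decoded with ASCII/surrogateescape. The `backslashreplace` step is just one possible workaround for display, not something the patches do.

```python
# A surrogate-escaped name, as the proposed decoding would return it.
name = b"\xe2\x82\xac".decode("ascii", "surrogateescape")

# Strict UTF-8 encoding (e.g. when writing to a log file) fails on lone surrogates.
try:
    name.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False

# One possible workaround for logs: escape everything that is not ASCII.
printable = name.encode("ascii", "backslashreplace").decode("ascii")
```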

An alternative approach might be to return all hostnames as bytes
objects, thus breaking everything immediately and obviously...


[1] http://tools.ietf.org/html/rfc3491#section-5
[2] http://tools.ietf.org/html/rfc3454#appendix-C.5
msg111766 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-28 02:44
I like the idea of using PEP 383 for hostnames, but I don't understand the relation to IDNA (maybe because I don't know this encoding).

+this leaves IDNA ASCII-compatible encodings in ASCII
+form, but converts any non-ASCII bytes in the hostname to the Unicode
+lone surrogate codes U+DC80...U+DCFF.

What is an "IDNA ASCII-compatible encoding"?

--

ascii-surrogateescape.diff: 
 - I don't like unicode_from_hostname() name: "decode_hostname()" would be better.
 - It doesn't patch the doc and so cannot be applied alone. It doesn't matter, it's better to apply both patches at the same time. But thanks for splitting them, it makes them easier to review :-)

try-surrogateescape-first.diff:
 - hostname_to_bytes() should be called "encode_hostname()"
 - if (!PyErr_ExceptionMatches(PyExc_UnicodeError)):  you should catch UnicodeEncodeError here
 - "if this is not possible, :exc:`UnicodeError` is raised.": is it a UnicodeEncodeError?
 - use PyUnicode_AsEncodedString() instead of PyUnicode_AsEncodedObject(): it's faster for ASCII and ensures that the result is a bytes object (so you don't need to re-check the type)
msg111985 - (view) Author: David Watson (baikie) Date: 2010-07-29 18:28
"Leaving IDNA ASCII-compatible encodings in ASCII form" is just preserving the existing behaviour (not doing IDNA decoding).  See

http://tools.ietf.org/html/rfc3490

and the docs for codecs -> encodings.idna ("xn--lzg" in the example is the ASCII-compatible encoding of "€", so if you look up that IP address, "xn--lzg" is returned with or without the patch).

I'll look into your other comments.  In the meantime, I've got one more patch, as the decoding of the nodename field in os.uname() also needs to be changed to match the other hostname-returning functions.  This patch changes it to ASCII/surrogateescape, with the usual PEP 383 decoding for the other fields.
msg112094 - (view) Author: David Watson (baikie) Date: 2010-07-30 18:11
OK, here are new versions of the original patches.

I've tweaked the docs to make clear that ASCII-compatible
encodings actually *are* ASCII, and point to an explanation as
soon as they're mentioned.

You're right that PyUnicode_AsEncodedString() is the preferable
interface for the argument converter (I think I got
PyUnicode_AsEncodedObject() from an old version of
PyUnicode_FSConverter() :/), but for the ASCII step I've just
short-circuited it and used PyUnicode_EncodeASCII() directly,
since the converter has already checked that the object is of
Unicode type.  For the IDNA step, PyUnicode_AsEncodedString()
should result in a less confusing error message if the codec
returns some non-bytes object one day.

However, the PyBytes_Check isn't to check up on the codec, but to
check for a bytes argument, which the converter also supports.
For that reason, I think encode_hostname would be a misleading
name, but I've renamed it hostname_converter after the example of
PyUnicode_FSConverter, and renamed unicode_from_hostname to
decode_hostname.

I've also made the converter check for UnicodeEncodeError in the
ASCII step, but the end result really is UnicodeError if the IDNA
step fails, because the "idna" codec does not use
UnicodeEncodeError or UnicodeDecodeError.  Complain about that if
you wish :)
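
For Python callers, the practical consequence can be sketched as follows: catching the base class UnicodeError covers the IDNA step whichever exception class the codec raises. The helper name `try_idna` is hypothetical.

```python
def try_idna(label):
    """Return the IDNA encoding of label, or None if the codec rejects it."""
    try:
        return label.encode("idna")
    except UnicodeError:  # also catches UnicodeEncodeError, which is a subclass
        return None


ok = try_idna("example.com")
bad = try_idna("\udce2")
```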


I think the example I gave in the previous comment was also
confusing, so just to be clear...

In /etc/hosts (in UTF-8 encoding):

127.0.0.2       €
127.0.0.3       xn--lzg


Without patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'


With patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
    result.extend(ToASCII(label))
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'


The exception at the end demonstrates why surrogateescape strings
don't get confused with IDNs.
msg114688 - (view) Author: David Watson (baikie) Date: 2010-08-22 18:27
I noticed that try-surrogateescape-first.diff missed out one of
the string references that needed to be changed to point to the
bytes object, and also used PyBytes_AS_STRING() in an unlocked
section.  This version fixes these things by taking the generally
safer approach of setting the original char * variable to the
hostname immediately after using hostname_converter().
msg114710 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-22 22:03
Is this patch in response to an actual problem, or a theoretical problem?
If "actual problem": what was the specific application, and what was the specific host name?

If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".
msg114754 - (view) Author: David Watson (baikie) Date: 2010-08-23 22:48
> Is this patch in response to an actual problem, or a theoretical problem?
> If "actual problem": what was the specific application, and what was the specific host name?

It's about environments, not applications - the local network may
be configured with non-ASCII bytes in hostnames (either in the
local DNS *or* a different lookup mechanism - I mentioned
/etc/hosts as a simple example), or someone might deliberately
connect from a garbage hostname as a denial of service attack
against a server which tries to look it up with gethostbyaddr()
or whatever (this may require a "non-strict" resolver library, as
noted above).

> If theoretical, I recommend to close it as "won't fix". I find it perfectly reasonable if Python's socket module gives an error if the hostname can't be clearly decoded. Applications that run into it as a result of gethostbyaddr should treat that as "no reverse name available".

There are two points here.  One is that the decoding can fail; I
do think that programmers will find this surprising, and the fact
that Python refuses to return what was actually received is a
regression compared to 2.x.

The other is that the encoding and decoding are not symmetric -
hostnames are being decoded with UTF-8 but encoded with IDNA.
That means that when a decoded hostname contains a non-ASCII
character which is not prohibited by IDNA/Nameprep, that string
will, when used in a subsequent call, not refer to the hostname
that was actually received, because it will be re-encoded using a
different codec.
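
The asymmetry can be reproduced without any name lookup. This is only an illustration using the codecs directly; the variable names are made up.

```python
received = "\u20ac".encode("utf-8")   # bytes for the euro sign, as in /etc/hosts
decoded = received.decode("utf-8")    # what the lookup functions currently return
resent = decoded.encode("idna")       # what a subsequent call passes to the resolver
```

Here `resent` is the ACE form `b'xn--lzg'`, which is not the byte sequence that was received.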

Attaching a refreshed version of try-surrogateescape-first.diff.
I've separated out the change to getnameinfo() as it may be
superfluous (issue #1027206).
msg114756 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-23 23:07
>> Is this patch in response to an actual problem, or a theoretical problem?
>> If "actual problem": what was the specific application, and what was the specific host name?
> 
> It's about environments, not applications

Still, my question remains. Is it a theoretical problem (i.e. one
of your imagination), or a real one (i.e. one you observed in real
life, without explicitly triggering it)? If real: what was the
specific environment, and what was the specific host name?

> There are two points here.  One is that the decoding can fail; I
> do think that programmers will find this surprising, and the fact
> that Python refuses to return what was actually received is a
> regression compared to 2.x.

True. However, I think this is an acceptable regression,
assuming the problem is merely theoretical. It is ok if
an operation fails that you will never run into in real life.

> That means that when a decoded hostname contains a non-ASCII
> character which is not prohibited by IDNA/Nameprep, that string
> will, when used in a subsequent call, not refer to the hostname
> that was actually received, because it will be re-encoded using a
> different codec.

Again, I fail to see the problem in this. It won't happen in
real life. However, if you worried that this could be abused,
I think it should decode host names as ASCII, not as UTF-8.
Then it will be symmetric again (IIUC).
msg114847 - (view) Author: David Watson (baikie) Date: 2010-08-24 22:59
> > It's about environments, not applications
> 
> Still, my question remains. Is it a theoretical problem (i.e. one
> of your imagination), or a real one (i.e. one you observed in real
> life, without explicitly triggering it)? If real: what was the
> specific environment, and what was the specific host name?

Yes, I did reproduce the problem on my own system (Ubuntu 8.04).
No, it is not from a real application, nor do I know anyone with
their network configured like this (except possibly Dan "djbdns"
Bernstein: http://cr.yp.to/djbdns/idn.html ).

I reported this bug to save anyone who *is* in such an
environment from crashing applications and erroneous name
resolution.

> > That means that when a decoded hostname contains a non-ASCII
> > character which is not prohibited by IDNA/Nameprep, that string
> > will, when used in a subsequent call, not refer to the hostname
> > that was actually received, because it will be re-encoded using a
> > different codec.
> 
> Again, I fail to see the problem in this. It won't happen in
> real life. However, if you worried that this could be abused,
> I think it should decode host names as ASCII, not as UTF-8.
> Then it will be symmetric again (IIUC).

That would be an improvement.  The idea of the patches I posted
is to combine this with the existing surrogateescape mechanism,
which handles situations like this perfectly well.  I don't see
how getting a UnicodeError is better than getting a string with
some lone surrogates in it.  In fact, it was my understanding of
PEP 383 that it is in fact better to get the lone surrogates.
msg114882 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-25 05:53
> That would be an improvement.  The idea of the patches I posted
> is to combine this with the existing surrogateescape mechanism,
> which handles situations like this perfectly well.

The surrogateescape mechanism is a very hackish approach, and
violates the principle that errors should never pass silently.
However, it solves a real problem - people do run into the problem
with file names every day. With this problem, I'd say "if it hurts,
don't do it, then".
msg115014 - (view) Author: David Watson (baikie) Date: 2010-08-26 18:04
> The surrogateescape mechanism is a very hackish approach, and
> violates the principle that errors should never pass silently.

I don't see how a name resolution API returning non-ASCII bytes
would indicate an error.  If the host table contains a non-ASCII
byte sequence for a host, then that is the host's name - it works
just as well as an ASCII name, both forwards and backwards.

What is hackish is representing char * data as a Unicode string
when there is no native Unicode API to feed it to - there is no
issue here such as file names being bytes on Unix and Unicode on
Windows, so the clean thing to do would be to return a bytes
object.  I suggested the surrogateescape mechanism in order to
retain backwards compatibility.

> However, it solves a real problem - people do run into the problem
> with file names every day. With this problem, I'd say "if it hurts,
> don't do it, then".

But to be more explicit, that's like saying "if it hurts, get
your sysadmin to reconfigure the company network".
msg115030 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-26 21:36
> I don't see how a name resolution API returning non-ASCII bytes
> would indicate an error.

It's in violation of RFC 952 (slightly relaxed by RFC 1123).

> But to be more explicit, that's like saying "if it hurts, get
> your sysadmin to reconfigure the company network".

Which I consider perfectly reasonable. The sysadmin should have
known (and, in practice, *always* knows) not to do that in the first
place (the larger the company, the more cautious the sysadmin).
msg115116 - (view) Author: David Watson (baikie) Date: 2010-08-27 19:13
> > I don't see how a name resolution API returning non-ASCII bytes
> > would indicate an error.
> 
> It's in violation of RFC 952 (slightly relaxed by RFC 1123).

That's bad if it's on the public Internet, but it's not an
error.  The OS is returning the name by which it knows the host.

If you look at POSIX, you'll see that what getaddrinfo() and
getnameinfo() look up and return is referred to as a "node name",
which can be an address string or a "descriptive name", and that
if used with Internet address families, descriptive names
"include" host names.  It doesn't say that the string can only be
an address string or a hostname (RFC 1123 compliant or
otherwise).

> > But to be more explicit, that's like saying "if it hurts, get
> > your sysadmin to reconfigure the company network".
> 
> Which I consider perfectly reasonable. The sysadmin should have
> known (and, in practice, *always* knows) not to do that in the first
> place (the larger the company, the more cautious the sysadmin).

It's not reasonable when addressed to a customer who might go
elsewhere.  And I still don't see a technical reason for making
such a demand.  Python 2.x seems to work just fine using 8-bit
strings.
msg115119 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-08-27 19:20
> It's not reasonable when addressed to a customer who might go
> elsewhere.

I remain -1 on this change, until such a customer actually shows
up at a Python developer.
msg115185 - (view) Author: David Watson (baikie) Date: 2010-08-29 18:44
OK, I still think this issue should be addressed, but here is a patch for the part we agree on: that decoding should not return any Unicode characters except ASCII.
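
As a rough Python equivalent of strict decoding (the helper name `decode_strict` is hypothetical, and the actual change is in C):

```python
def decode_strict(raw):
    """Hypothetical helper: accept only ASCII host names."""
    return raw.decode("ascii")


ascii_name = decode_strict(b"xn--lzg")

# Non-ASCII bytes are refused rather than being guessed at.
try:
    decode_strict(b"\xe2\x82\xac")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# For ASCII names, re-encoding with IDNA returns the same bytes.
symmetric = ascii_name.encode("idna") == b"xn--lzg"
```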
msg115186 - (view) Author: David Watson (baikie) Date: 2010-08-29 18:47
The rest of the issue could also be straightforwardly addressed by adding bytes versions of the name lookup APIs.  Attaching a patch which does that (applies on top of decode-strict-ascii.diff).
msg115187 - (view) Author: David Watson (baikie) Date: 2010-08-29 19:01
Oops, forgot to refresh the last change into that patch.  This should fix it.
msg118582 - (view) Author: Nathan Letwory (jesterKing) Date: 2010-10-13 20:50
platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ascii characters. The implementation of platform.system() uses at some point socket.gethostname() ( see http://www.pasteall.org/16215 for a stacktrace of such usage)

Many of our Blender users are not native English speakers, and they set up their machines as they please, against RFCs or not.

This currently breaks some code that uses platform.system() to check the system it's running on. The paste above is from a user who has named his computer Nötkötti.

It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.
msg118602 - (view) Author: David Watson (baikie) Date: 2010-10-13 23:38
> platform.system() fails with UnicodeEncodeError on systems that have their computer name set to a name containing non-ascii characters. The implementation of platform.system() uses at some point socket.gethostname() ( see http://www.pasteall.org/16215 for a stacktrace of such usage)

This trace is from a Windows system, where the platform module
uses gethostname() in its cross-platform uname() function, which
platform.system() and various other functions in the module rely
on.  On a Unix system, platform.uname() depends on os.uname()
working, meaning that these functions still fail when the
hostname cannot be decoded, as it is part of os.uname()'s return
value.

Given that os.uname() is a primary source of information about
the platform on Unix systems, this sort of collateral damage from
an undecodable hostname is likely to occur in more places.

> It would be more than great if this error could be fixed. If another 3.1 release is planned, preferably for that.

If you'd like to try the surrogateescape patches, they ought to
fix this.  The relevant patches are ascii-surrogateescape-2.diff,
try-surrogateescape-first-4.diff and uname-surrogateescape.diff.
msg118617 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-14 06:12
The failure of platform.uname is an independent bug. IMO, it shouldn't use socket.gethostname on Windows, but instead look at the COMPUTERNAME environment variable or call the GetComputerName API function. This is closer to what uname() does on Unix (i.e. retrieve the local machine name independent of DNS).

I have created issue10097 for this bug.
msg118694 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-14 17:15
As a further note: I think socket.gethostname() is a special case, since this is just about a local setting (i.e. not related to DNS). We should then assume that it is encoded in the locale encoding (in particular, that it is encoded in mbcs on Windows).
msg118709 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-14 18:29
Regarding fixing the issue at hand on Windows, I think Python should use the corresponding win32 API for getting the hostname: GetComputerNameEx().

It supports Unicode, so the encoding issue doesn't arise.

See  http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx for details.

This also solves the platform.uname() issue mentioned here, since the uname() emulation for Windows relies on socket.gethostname() to determine the node name.

FWIW: glibc does the reverse...

       The GNU C library implements gethostname() as a library function that calls
       uname(2) and copies up to len bytes from the returned nodename field into
       name.
msg118816 - (view) Author: David Watson (baikie) Date: 2010-10-15 18:03
> As a further note: I think socket.gethostname() is a special case, since this is just about a local setting (i.e. not related to DNS).

But the hostname *is* commonly intended to be looked up in the
DNS or whatever name resolution mechanisms are used locally -
socket.getfqdn(), for instance, works by looking up the result
using gethostbyaddr() (actually the C function getaddrinfo(),
followed by gethostbyaddr()).  So I don't see the rationale for
treating it differently from the results of gethostbyaddr(),
getnameinfo(), etc.

POSIX says of the name lookup functions that "in many cases" they
are implemented by the Domain Name System, not that they always
are, so a name intended for lookup need not be ASCII-only either.

> We should then assume that it is encoded in the locale encoding (in particular, that it is encoded in mbcs on Windows).

I can see the point of returning the characters that were
intended, but code that looked up the returned name would then
have to be changed to re-encode it to bytes to avoid the
round-tripping issue when non-ASCII characters are returned.
msg118952 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-17 17:38
On 15.10.2010 20:03, David Watson wrote:
> 
> David Watson <baikie@users.sourceforge.net> added the comment:
> 
>> As a further note: I think socket.gethostname() is a special case, since this is just about a local setting (i.e. not related to DNS).
> 
> But the hostname *is* commonly intended to be looked up in the
> DNS or whatever name resolution mechanisms are used locally -
> socket.getfqdn(), for instance, works by looking up the result
> using gethostbyaddr() (actually the C function getaddrinfo(),
> followed by gethostbyaddr()).  So I don't see the rationale for
> treating it differently from the results of gethostbyaddr(),
> getnameinfo(), etc.

The result from gethostname likely comes out of machine-local
configuration. It may have non-ASCII in it, which is then likely
encoded in the local encoding. When looking it up in DNS, IDNA
should be applied.

OTOH, output from gethostbyaddr likely comes out of the DNS itself.
Guessing what encoding it may have is futile - other than guessing
that it really ought to be ASCII.

> I can see the point of returning the characters that were
> intended, but code that looked up the returned name would then
> have to be changed to re-encode it to bytes to avoid the
> round-tripping issue when non-ASCII characters are returned.

Python's socket module is clearly focused on the internet, and
intends to support that well. So if you pass a non-ASCII
string, it will have to encode that using IDNA. If that's
not what you want to get, tough luck.
msg119051 - (view) Author: David Watson (baikie) Date: 2010-10-18 18:11
> The result from gethostname likely comes out of machine-local
> configuration. It may have non-ASCII in it, which is then likely
> encoded in the local encoding. When looking it up in DNS, IDNA
> should be applied.

I would have thought that someone who intended a Unicode hostname
to be looked up in its IDNA form would have encoded it using
IDNA, rather than an 8-bit encoding - how many C programs would
transcode the name that way, rather than just passing the char *
from one interface to another?

In fact, I would think that non-ASCII bytes in a hostname most
probably indicated that a name resolution mechanism other than
the DNS was in use, and that the byte string should be passed
unaltered just as a typical C program would.

> OTOH, output from gethostbyaddr likely comes out of the DNS itself.
> Guessing what encoding it may have is futile - other than guessing
> that it really ought to be ASCII.

Sure, but that doesn't mean the result can't be made to
round-trip if it turns out not to be ASCII.  The guess that it
will be ASCII is, after all, still a guess (as is the guess that
it comes from the DNS).

> Python's socket module is clearly focused on the internet, and
> intends to support that well. So if you pass a non-ASCII
> string, it will have to encode that using IDNA. If that's
> not what you want to get, tough luck.

I don't object to that, but it does force a choice between
decoding an 8-bit name for display (e.g. by using the locale
encoding), and decoding it to round-trip automatically (e.g. by
using ASCII/surrogateescape, with support on the encoding side).

Using PyUnicode_DecodeFSDefault() for the hostname or other
returned names (thus decoding them for display) would make this
issue solvable with programmer intervention - for instance,
"socket.gethostbyaddr(socket.gethostname())" could be replaced by
"socket.gethostbyaddr(os.fsencode(socket.gethostname()))", but
programmers might well neglect to do this, given that no encoding
was needed in Python 2.
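
A sketch of that round trip, assuming a POSIX system where os.fsdecode() applies the surrogateescape handler; the byte string is an invented example of an 8-bit host name.

```python
import os

raw = b"n\xf6tk\xf6tti"         # Latin-1 bytes, standing in for an 8-bit host name
shown = os.fsdecode(raw)         # decoded for display (PEP 383 rules on POSIX)
recovered = os.fsencode(shown)   # explicit re-encode before passing to a lookup
```

The point is only that the original bytes remain recoverable; whether `shown` displays the intended characters depends on the locale encoding.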

Also, even displaying a non-ASCII name decoded according to the
locale creates potential for confusion, as when the user types
the same characters into a Python program for lookup (again,
barring programmer intervention), they will not represent the
same byte sequence as the characters the user sees on the screen
(as they will instead represent their IDNA ASCII-compatible
equivalent).

So overall, I do think it is better to decode names for automatic
round-tripping rather than for display, but my main concern is
simply that it should be possible to recover the original bytes
so that round-tripping is at least possible.
PyUnicode_DecodeFSDefault() would accomplish that much at least.
msg119076 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-18 20:37
> I would have thought that someone who intended a Unicode hostname
> to be looked up in its IDNA form would have encoded it using
> IDNA, rather than an 8-bit encoding - how many C programs would
> transcode the name that way, rather than just passing the char *
> from one interface to another?

Well, Python is not C. In Python, you would pass a str, and
expect it to work, which means it will get automatically encoded
with IDNA.

> In fact, I would think that non-ASCII bytes in a hostname most
> probably indicated that a name resolution mechanism other than
> the DNS was in use, and that the byte string should be passed
> unaltered just as a typical C program would.

I'm not talking about byte strings, but character strings.

> I don't object to that, but it does force a choice between
> decoding an 8-bit name for display (e.g. by using the locale
> encoding), and decoding it to round-trip automatically (e.g. by
> using ASCII/surrogateescape, with support on the encoding side).

In the face of ambiguity, refuse the temptation to guess.

> So overall, I do think it is better to decode names for automatic
> round-tripping rather than for display, but my main concern is
> simply that it should be possible to recover the original bytes
> so that round-tripping is at least possible.

Marc-Andre wants gethostname to use the Wide API on Windows, which,
in theory, allows for cases where round-tripping to bytes is
impossible.
msg119177 - (view) Author: David Watson (baikie) Date: 2010-10-19 23:15
> > In fact, I would think that non-ASCII bytes in a hostname most
> > probably indicated that a name resolution mechanism other than
> > the DNS was in use, and that the byte string should be passed
> > unaltered just as a typical C program would.
> 
> I'm not talking about byte strings, but character strings.

I mean that passing the str object from socket.gethostname() to
the Python lookup function ought to result in the same byte
string being passed to the C lookup function as was returned by
the C gethostname() function (or else that the programmer must
re-encode the str to ensure that that result is obtained).

> > I don't object to that, but it does force a choice between
> > decoding an 8-bit name for display (e.g. by using the locale
> > encoding), and decoding it to round-trip automatically (e.g. by
> > using ASCII/surrogateescape, with support on the encoding side).
> 
> In the face of ambiguity, refuse the temptation to guess.

Yes, I would interpret that to mean not using the locale encoding
for data obtained from the network.  That's another reason why
the ASCII/surrogateescape scheme appeals to me more.

> Well, Python is not C. In Python, you would pass a str, and
> expect it to work, which means it will get automatically encoded
> with IDNA.

I think there might be a misunderstanding here - I've never
proposed changing the interpretation of Unicode characters in
hostname arguments.  The ASCII/surrogateescape scheme I suggested
only changes the interpretation of unpaired surrogate codes, as
they do not occur in IDNs or any other genuine Unicode data; all
IDNs, including those solely consisting of ASCII characters,
would be encoded to the same byte sequence as before.

ASCII/surrogateescape decoding could also be used without support
on the encoding side - that would satisfy the requirement to
"refuse the temptation to guess", would allow the original bytes
to be recovered, and would mean that attempting to look up a
non-ASCII result in str form would raise an exception rather than
looking up the wrong name.
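
A minimal sketch of that behaviour using only the standard codecs. The byte string is a hypothetical lookup result, and idna_rejects() is an illustrative helper, not part of any patch attached here:

```python
def idna_rejects(name: str) -> bool:
    """Return True if the stdlib IDNA codec refuses to encode name."""
    try:
        name.encode("idna")
    except UnicodeError:
        return True
    return False

# Hypothetical non-ASCII, non-UTF-8 result from a C lookup function.
raw = b"l\xf6wis"

# ASCII/surrogateescape decoding keeps the undecodable byte as U+DCF6.
name = raw.decode("ascii", "surrogateescape")
assert name == "l\udcf6wis"

# The original bytes can always be recovered explicitly...
assert name.encode("ascii", "surrogateescape") == raw

# ...but passing the str to a lookup that encodes with IDNA would fail
# instead of silently looking up a different name.
print(idna_rejects(name))

# Genuine hostnames are encoded exactly as before.
assert "newton".encode("idna") == b"newton"
assert "löwis".encode("idna") == b"xn--lwis-5qa"
```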

> Marc-Andre wants gethostname to use the Wide API on Windows, which,
> in theory, allows for cases where round-tripping to bytes is
> impossible.

Well, the name resolution APIs wrapped by Python are all
byte-oriented, so if the computer name were to have no bytes
equivalent then it wouldn't be possible to resolve it anyway, and
an exception rightly ought to be raised at some point in the process
of trying to do so.
msg119230 - (view) Author: David Watson (baikie) Date: 2010-10-20 19:37
I was looking at the MSDN pages linked to above, and these two
pages seemed to suggest that Unicode characters appearing in DNS
names represented UTF-8 sequences, and that Windows allowed such
non-ASCII byte sequences in the DNS by default:

http://msdn.microsoft.com/en-us/library/ms724220%28v=VS.85%29.aspx
http://msdn.microsoft.com/en-us/library/ms682032%28v=VS.85%29.aspx

(See the discussion of DNS_ERROR_NON_RFC_NAME in the latter.)
Can anyone confirm if this is the case?

The BSD-style gethostname() function can't be returning UTF-8,
though, or else the "Nötkötti" example above would have been
decoded successfully, given that Python currently uses
PyUnicode_FromString().

Also, if GetComputerNameEx() only offers a choice of DNS names or
NetBIOS names, and both are byte-oriented underneath (that was my
reading of the "Computer Names" page), then presumably there
shouldn't be a problem with mapping the result to a bytes
equivalent when necessary?
msg119231 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-20 19:49
> Also, if GetComputerNameEx() only offers a choice of DNS names or
> NetBIOS names, and both are byte-oriented underneath (that was my
> reading of the "Computer Names" page), then presumably there
> shouldn't be a problem with mapping the result to a bytes
> equivalent when necessary?

They aren't byte-oriented underneath. It depends on whether you use
GetComputerNameA or GetComputerNameW whether you get bytes or Unicode.
If bytes, they are converted as if by WideCharToMultiByte using
CP_ACP, which in turn will introduce question marks and the like
for unconvertible characters.
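
The lossy conversion can be imitated in Python with the "replace" error handler. This is only an analogy: Python's cp1252 codec is assumed here as a stand-in for CP_ACP, and Windows itself may pick a best-fit character rather than a question mark.

```python
# A name containing a character that windows-1252 cannot represent.
name = "\N{GREEK SMALL LETTER PI}3141"

# Converting to the ANSI code page loses the character...
lossy = name.encode("cp1252", errors="replace")
print(lossy)

# ...so decoding the bytes again does not give back the original name.
assert lossy.decode("cp1252") != name
```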
msg119245 - (view) Author: David Watson (baikie) Date: 2010-10-20 23:42
> > Also, if GetComputerNameEx() only offers a choice of DNS names or
> > NetBIOS names, and both are byte-oriented underneath (that was my
> > reading of the "Computer Names" page), then presumably there
> > shouldn't be a problem with mapping the result to a bytes
> > equivalent when necessary?
> 
> They aren't byte-oriented underneath. It depends on whether you use
> GetComputerNameA or GetComputerNameW whether you get bytes or Unicode.
> If bytes, they are converted as if by WideCharToMultiByte using
> CP_ACP, which in turn will introduce question marks and the like
> for unconvertible characters.

Sorry, I didn't mean how Windows constructs the result for the
"A" interface - I was talking about Python code being able to map
the result from the Unicode interface to the form used in the
protocol (e.g. DNS).  I believe the proposal is to use the DNS
name, so since the DNS is byte oriented, I would have thought
that the Unicode "DNS name" result would always have a bytes
equivalent that the DNS resolver code would use - perhaps its
UTF-8 encoding?
msg119260 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-21 05:09
> Sorry, I didn't mean how Windows constructs the result for the
> "A" interface - I was talking about Python code being able to map
> the result from the Unicode interface to the form used in the
> protocol (e.g. DNS).  I believe the proposal is to use the DNS
> name

I disagree with the proposal - it should return whatever
name gethostname from winsock.dll returns (which I expect
to be the netbios name).

> so since the DNS is byte oriented, I would have thought
> that the Unicode "DNS name" result would always have a bytes
> equivalent that the DNS resolver code would use - perhaps its
> UTF-8 encoding?

No no no. When Microsoft calls it the DNS name, they don't actually
mean that it has to do anything with DNS. In particular, it's not
byte-oriented.
msg119271 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-21 09:24
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> Sorry, I didn't mean how Windows constructs the result for the
>> "A" interface - I was talking about Python code being able to map
>> the result from the Unicode interface to the form used in the
>> protocol (e.g. DNS).  I believe the proposal is to use the DNS
>> name
> 
> I disagree with the proposal - it should return whatever
> name gethostname from winsock.dll returns (which I expect
> to be the netbios name).
> 
>> so since the DNS is byte oriented, I would have thought
>> that the Unicode "DNS name" result would always have a bytes
>> equivalent that the DNS resolver code would use - perhaps its
>> UTF-8 encoding?
> 
> No no no. When Microsoft calls it the DNS name, they don't actually
> mean that it has to do anything with DNS. In particular, it's not
> byte-oriented.

Just to clarify: I was proposing to use the
GetComputerNameExW() win32 API with ComputerNamePhysicalDnsHostname,
which returns Unicode without needing any roundtrip via bytes
and the issues associated with this.

I don't understand why Martin insists that the MS "DNS name"
doesn't have anything to do with DNS... the fully qualified
DNS name of a machine is determined as hostname.domainname,
just like you would expect in DNS.

http://msdn.microsoft.com/en-us/library/ms724301(v=VS.85).aspx
http://msdn.microsoft.com/en-us/library/ms724224(v=VS.85).aspx

As I said earlier: NetBIOS is being phased out in favor of
DNS. MS is using a convention which mandates that NetBIOS names
match DNS names. The only difference between the two is that
NetBIOS names have a length limitation:

http://msdn.microsoft.com/en-us/library/ms724931(v=VS.85).aspx

Perhaps Martin could clarify why he insists on using the
ANSI WinSock interface gethostname instead.

PS: WinSock provides many other Unicode APIs for socket
module interfaces as well, so at least for that platform,
we could use those to resolve uncertainties about the
encoding used in name resolution.

On other platforms, I guess we'll just have to do some trial
and error to see what works and what not. E.g. on Linux it is
possible to set the hostname to a non-ASCII value, but then
the resolver returns an error, so it's not very practical:

# hostname l\303\266wis
# hostname
löwis
# hostname -f
hostname: Resolver Error 0 (no error)

Using the IDNA version doesn't help either:

# hostname xn--lwis-5qa
# hostname
xn--lwis-5qa
# hostname -f
hostname: Resolver Error 0 (no error)

Python2 happily returns the host name, but fails to return
a fully qualified domain name:

>>> socket.gethostname()
'l\xc3\xb6wis'
>>> socket.getfqdn()
'l\xc3\xb6wis'

and

>>> socket.gethostname()
'xn--lwis-5qa'
>>> socket.getfqdn()
'xn--lwis-5qa'

Just for comparison:

# hostname newton
# hostname
newton
# hostname -f
newton.egenix.internal

and

>>> socket.gethostname()
'newton'
>>> socket.getfqdn()
'newton.egenix.internal'

So at least on Linux, using non-ASCII hostnames doesn't really
appear to be an option at this time.
msg119346 - (view) Author: David Watson (baikie) Date: 2010-10-21 22:37
> On other platforms, I guess we'll just have to do some trial
> and error to see what works and what not. E.g. on Linux it is
> possible to set the hostname to a non-ASCII value, but then
> the resolver returns an error, so it's not very practical:
> 
> # hostname l\303\266wis
> # hostname
> löwis
> # hostname -f
> hostname: Resolver Error 0 (no error)
> 
> Using the IDNA version doesn't help either:
> 
> # hostname xn--lwis-5qa
> # hostname
> xn--lwis-5qa
> # hostname -f
> hostname: Resolver Error 0 (no error)

I think what's happening here is simply that you're setting
the hostname to something which doesn't exist in the relevant
name databases - the man page for Linux's hostname(1) says that
"The FQDN is the name gethostbyname(2) returns for the host name
returned by gethostname(2).".  If the computer's usual name is
"newton", that may be why it works and the others don't.

It works for me if I add "127.0.0.9 löwis.egenix.com löwis" to
/etc/hosts and then set the hostname to "löwis" (all UTF-8):
hostname -f prints "löwis.egenix.com", and Python 2's
socket.getfqdn() returns the corresponding bytes; non-UTF-8 names
work too.  (Note that the FQDN must appear before the bare
hostname in the /etc/hosts entry, and I used the address
127.0.0.9 simply to avoid a collision with existing entries - by
default, Ubuntu assigns the FQDN to 127.0.1.1.)
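
To illustrate why the order matters, here is a small sketch of how one hosts entry splits into an address, a canonical name and aliases, following the layout documented in hosts(5). parse_hosts_line() is a hypothetical helper written for this example; it is not how glibc implements the lookup.

```python
def parse_hosts_line(line: bytes):
    """Split one /etc/hosts entry into (address, canonical_name, aliases)."""
    # Drop any trailing comment, then split on whitespace.
    fields = line.split(b"#", 1)[0].split()
    if len(fields) < 2:
        return None  # blank line, comment, or address without a name
    address, canonical, *aliases = fields
    # Names stay as bytes: the file is not required to be ASCII or UTF-8.
    return address.decode("ascii"), canonical, aliases

entry = "127.0.0.9 löwis.egenix.com löwis".encode("utf-8")
address, canonical, aliases = parse_hosts_line(entry)

# The first name on the line is the canonical one, which is why the FQDN
# has to come before the bare hostname.
print(address, canonical, aliases)
```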
msg119837 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-29 02:00
Looks like we have our first customer (issue 10223).
msg119918 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-29 17:44
I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset).  After the reboot, I found:

- COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
- GetComputerNameEx(5) is "π3141"
- socket.gethostname of Python 2.5 returns "p3141".

So my theory of how this all fits together is this:

1. it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.

2. gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).

3. the NetBIOS name is generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.

In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.

Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".
msg119925 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-29 18:22
r85934 now uses GetComputerNameExW on Windows.
msg119927 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-29 18:33
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> I just did an experiment on Windows 7. I used SetComputerNameEx to set the NetBIOS name (4) to "e2718", and the DNS name (5) to "π3141"; then I rebooted. This is on a system with windows-1252 as its ANSI code page (i.e. u"π"==u"\N{GREEK SMALL LETTER PI}" is not in the ANSI charset).  After the reboot, I found:
> 
> - COMPUTERNAME is "P3141", and so is the result of GetComputerNameEx(4)
> - GetComputerNameEx(5) is "π3141"
> - socket.gethostname of Python 2.5 returns "p3141".
> 
> So my theory of how this all fits together is this:
> 
> 1. it's not really possible to completely decouple the DNS name and the NetBIOS name. Setting the DNS name also modifies the NetBIOS name; I suspect that the reverse is also true.

The MS docs mention that setting the DNS name will adjust the NetBIOS name
as well (with the NetBIOS name being converted to upper case and truncated,
if the DNS name is too long).

They don't mention anything about the NetBIOS name encoding.

> 2. gethostname returns the ANSI version of the DNS name (which happens to convert the GREEK SMALL LETTER PI to a LATIN SMALL LETTER P).
> 
> 3. the NetBIOS name is generally an uppercase version of the gethostname result. There may be rules in case the gethostname result contains characters illegal in NetBIOS.
> 
> In summary, I (now) think it's fine to return the Unicode version of the DNS name from gethostname on Windows.
> 
> Re msg119271: the name "π3141" really has nothing to do with the DNS on my system. It doesn't occur in any DNS zone, nor could it possibly. It's unclear to me why Microsoft calls it the "DNS name".

The DNS name of the Windows machine is the combination of the DNS host
name and the DNS domain that you set up on the machine. I think the
misunderstanding is that you assume this combination will
somehow appear as a known DNS name of the machine via some
DNS server on the network - that's not the case.

Of course, it's not particularly useful to set the DNS name to
something that other machines cannot find out via a DNS query.

FWIW, you can do the same on a Linux box, i.e. set up the host name
and domain to some completely bogus values. And as David pointed out,
without also updating /etc/hosts on the Linux box, you always get the
resolver error with hostname -f I mentioned earlier on (which does
a DNS lookup), so there's no real connection to the DNS system on
Linux either.
msg119928 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-29 19:04
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> r85934 now uses GetComputerNameExW on Windows.

Thanks, Martin.

Here's a similar discussion of the Windows approach (used in bzr):

https://bugs.launchpad.net/bzr/+bug/256550/comments/6

This is what Solaris uses:

http://developers.sun.com/dev/gadc/faq/locale.html#get-set

(they require conversion to ASCII and using IDNA for non-ASCII
names)

I found this RFC draft on the topic:
http://tools.ietf.org/html/draft-josefsson-getaddrinfo-idn-00
which suggests that there is no standard for the encoding
used by the socket host name APIs yet.

ASCII, UTF-8 and IDNA are happily mixed and matched.
msg119929 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-29 19:31
The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends to use IDNA), then this will work fine with the current code.

The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case the socket.getaddrinfo uses IDNA, anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).

If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.
msg119935 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-29 20:09
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> The Solaris case then is already supported, with no change required: if Solaris bans non-ASCII in the network configuration (or, rather, recommends to use IDNA), then this will work fine with the current code.
> 
> The Josefsson AI_IDN flag is irrelevant to Python, IMO: it treats byte names as locale-encoded, and converts them with IDNA. Python 3 users really should use Unicode strings in the first place for non-ASCII data, in which case the socket.getaddrinfo uses IDNA, anyway. However, it can't hurt to expose this flag if the underlying C library supports it. AI_CANONIDN might be interesting to implement, but I'd rather wait whether this finds RFC approval. In any case, undoing IDNA is orthogonal to this issue (which is about non-ASCII data returned from the socket API).
> 
> If anything needs to be done on Unix, I think that the gethostname result should be decoded using the file system encoding; I then don't mind using surrogate escape there for good measure. This won't hurt systems that restrict host names to ASCII, and may do some good for systems that don't.

Wouldn't it be better to also attempt to decode the name using IDNA
in case the name starts with the IDNA prefix ?

This would then also cover the Solaris case.
msg119941 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-29 22:25
> The DNS name of the Windows machine is the combination of the DNS host
> name and the DNS domain that you set up on the machine. I think the
> misunderstanding is that you assume this combination will
> somehow appear as a known DNS name of the machine via some
> DNS server on the network - that's not the case.

I don't assume that - I merely point out that it clearly has no
relationship to the DNS (unless you explicitly make it that way).
So, I wonder why they call it the DNS name - they could have just
as well called it the "LDAP name", or the "NIS name". In either case,
setting the name would have no impact on the respective naming
infrastructure.

> FWIW, you can do the same on a Linux box, i.e. set up the host name
> and domain to some completely bogus values. And as David pointed out,
> without also updating /etc/hosts on the Linux box, you always get the
> resolver error with hostname -f I mentioned earlier on (which does
> a DNS lookup), so there's no real connection to the DNS system on
> Linux either.

Yes, but Linux (rightly) calls it the "hostname", not the "DNS name".
msg119943 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-29 22:26
> Wouldn't it be better to also attempt to decode the name using IDNA
> in case the name starts with the IDNA prefix ?

Perhaps better - but incompatible. I don't see a way to have the
resolver functions automatically decode IDNA, without potentially
breaking existing applications that specifically look for the
IDNA prefix (say).
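
Applications that do want the Unicode form can opt in explicitly, which leaves the resolver results untouched. A minimal sketch, where display_name() is a hypothetical application-level helper and not a proposed socket API:

```python
def display_name(hostname: str) -> str:
    """Return a human-readable form of a hostname that uses ACE labels."""
    labels = hostname.split(".")
    # Only touch names that actually carry the IDNA ("xn--") prefix.
    if not any(label.lower().startswith("xn--") for label in labels):
        return hostname
    try:
        return hostname.encode("ascii").decode("idna")
    except UnicodeError:
        # Not valid IDNA after all: show the name as it was returned.
        return hostname

print(display_name("xn--lwis-5qa"))
print(display_name("newton.egenix.internal"))
```

Code that looks for the prefix itself keeps working, because the value returned by the lookup functions is never changed.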
msg119946 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-29 22:29
The code in socketmodule.c currently compiles with suspicious warnings:

socketmodule.c(3108) : warning C4047: 'function' : 'LPSTR' differs in levels of indirection from 'int'
socketmodule.c(3108) : warning C4024: 'GetComputerNameA' : different types for formal and actual parameter 1
socketmodule.c(3109) : warning C4133: 'function' : incompatible types - from 'Py_UNICODE *' to 'LPDWORD'
socketmodule.c(3110) : warning C4020: 'GetComputerNameA' : too many actual parameters

Was GetComputerName() used instead of GetComputerNameExW()?
msg120081 - (view) Author: David Watson (baikie) Date: 2010-10-31 19:34
> FWIW, you can do the same on a Linux box, i.e. set up the host name
> and domain to some completely bogus values. And as David pointed out,
> without also updating /etc/hosts on the Linux box, you always get the
> resolver error with hostname -f I mentioned earlier on (which does
> a DNS lookup), so there's no real connection to the DNS system on
> Linux either.

Just to clarify here: there isn't anything special about
/etc/hosts; it's handled by a pluggable module which performs
hostname lookups in it alongside a similar module for the DNS.
glibc's Name Service Switch combines the views provided by the
various modules into a single byte-oriented namespace for
hostnames according to the settings in /etc/nsswitch.conf (this
namespace allows non-ASCII bytes, as the /etc/hosts examples
demonstrate).

http://www.kernel.org/doc/man-pages/online/pages/man5/nsswitch.conf.5.html
http://www.gnu.org/software/libc/manual/html_node/Name-Service-Switch.html

It's an extensible system, so people can write their own modules
to handle whatever name services they have to deal with, and
configure hostname lookup to query them before, after or instead
of the DNS.  A hostname that is not resolvable in the DNS may be
resolvable in one of these.
msg158118 - (view) Author: Nick (spaun2002) Date: 2012-04-12 10:08
I ran into this issue on my own PC. On a Russian version of Windows the default PC name is ИВАН-ПК (C8 C2 C0 CD 2D CF CA in hex), and gethostbyaddr (CRT) returns it in exactly this form (encoded with the system locale code page cp1251, not UTF-8). So when PyUnicode_FromString is called, it expects its argument to be a UTF-8 encoded string and raises an error.
A lot of third-party modules use gethostbyaddr or getfqdn (which uses gethostbyaddr), so I can't just use a function that returns names as bytes. Surrogate names are also not acceptable, because the name mentioned above becomes ????-??
msg158165 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-04-12 19:44
Nick, which version of Python are you using? And which function are you running exactly?
It seems that a4fd3dc74299 fixed the issue; this was included in 3.2.
msg158175 - (view) Author: Nick (spaun2002) Date: 2012-04-12 21:52
Originally I tried 3.2.2 (32-bit), but I've just checked 3.2.3 and got the same result.
The code to reproduce it is simple:

from socket import gethostbyaddr
a = gethostbyaddr('127.0.0.1')

leads to:
Traceback (most recent call last):
  File "C:\Users\user\test\test.py", line 13, in <module>
    a = gethostbyaddr('127.0.0.1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte

Or a more complex sample:

def main():
    import http.server
    port = 80
    handlerClass = http.server.SimpleHTTPRequestHandler
    srv = http.server.HTTPServer(("", port), handlerClass )
    srv.serve_forever()
if __name__ == "__main__":
    main()

Attempting to connect to the server leads to:

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 1156)
Traceback (most recent call last):
  File "C:\Python32\lib\socketserver.py", line 284, in _handle_request_noblock
    self.process_request(request, client_address)
  File "C:\Python32\lib\socketserver.py", line 310, in process_request
    self.finish_request(request, client_address)
  File "C:\Python32\lib\socketserver.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "C:\Python32\lib\socketserver.py", line 637, in __init__
    self.handle()
  File "C:\Python32\lib\http\server.py", line 396, in handle
    self.handle_one_request()
  File "C:\Python32\lib\http\server.py", line 384, in handle_one_request
    method()
  File "C:\Python32\lib\http\server.py", line 657, in do_GET
    f = self.send_head()
  File "C:\Python32\lib\http\server.py", line 701, in send_head
    self.send_response(200)
  File "C:\Python32\lib\http\server.py", line 438, in send_response
    self.log_request(code)
  File "C:\Python32\lib\http\server.py", line 483, in log_request
    self.requestline, str(code), str(size))
  File "C:\Python32\lib\http\server.py", line 517, in log_message
    (self.address_string(),
  File "C:\Python32\lib\http\server.py", line 559, in address_string
    return socket.getfqdn(host)
  File "C:\Python32\lib\socket.py", line 355, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 5: invalid continuation byte
----------------------------------------

P.S. My PC name is "USER-ПК"
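
The decoding error itself can be reproduced without any network call, assuming (as the hex dump in msg158118 suggests) that the C function hands back the name encoded in cp1251:

```python
# The computer name as cp1251 bytes, as the C library is assumed to return it.
raw = "USER-ПК".encode("cp1251")
print(raw)

# Decoding as UTF-8, which is what PyUnicode_FromString() does, fails at
# the first Cyrillic byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# Decoding with the code page the bytes were written in works.
print(raw.decode("cp1251"))

# ASCII/surrogateescape never fails, but the result is not displayable.
print(ascii(raw.decode("ascii", "surrogateescape")))
```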
msg158178 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-12 21:55
a4fd3dc74299 only fixed socket.gethostname(), not socket.gethostbyaddr().
msg159776 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-05-02 06:25
For Windows versions that support it, we could use GetNameInfoW, available on XPSP2+, W2k3+ and Vista+.

The questions then are: what to do about gethostbyaddr, and what to do about the general case?

Since the problem appears to be specific to Windows, it might be appropriate to find a solution to just the Windows case, and ignore the general issue. For gethostbyaddr, decoding would then use CP_ACP.
msg243311 - (view) Author: Almad (Almad) Date: 2015-05-16 12:00
I'd add that this bug is very practical and can render a lot of software unusable/noisy/confusing on Windows, including Django (I discovered this bug when mentoring at Django Girls).

The simple way to reproduce it is to take any Windows installation and set its regional settings to a non-English locale (I've used Czech). You can verify this using "import locale; locale.getpreferredencoding()", which should then display the matching code page ("cp1250" in my case).

Then, set the "name" (= hostname, in Windows settings) of the computer to anything containing a non-ASCII character (like "Didejo-noťas").

As Windows apparently encodes the hostname using its default encoding, decoding fails with:

```
  File "C:\Python34\lib\wsgiref\simple_server.py", line 50, in server_bind
    HTTPServer.server_bind(self)
  File "C:\Python34\lib\http\server.py", line 135, in server_bind
    self.server_name = socket.getfqdn(host)
  File "C:\Python34\lib\socket.py", line 463, in getfqdn
    hostname, aliases, ipaddrs = gethostbyaddr(name)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 9: invalid
start byte
```
msg245826 - (view) Author: David Watson (baikie) Date: 2015-06-25 20:28
I've updated the ASCII/surrogateescape patches in line with
various changes to Python since I posted them.

return-ascii-surrogateescape-2015-06-25.diff incorporates the
ascii-surrogateescape and uname-surrogateescape patches, and
accept-ascii-surrogateescape-2015-06-25.diff corresponds to the
try-surrogateescape-first patch.  Neither patch touches
gethostname() on Windows.

Python's existing code now has a fast path for ASCII-only strings
which passes them through unchanged (Unicode -> ASCII), so in
order not to slow down processing of valid IDNs, the latter patch
now effectively tries encodings in the order

   ASCII/strict (existing code, fast path)
   IDNA/strict (existing code)
   ASCII/surrogateescape (added by patch)

rather than the previous

   ASCII/surrogateescape
   IDNA/strict

This doesn't change the behaviour of the patch, since IDNA always
rejects strings containing surrogate codes, and either rejects
ASCII-only strings (e.g. when a label is longer than 63
characters) or passes them through unchanged.
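
The order of attempts can be sketched in pure Python as follows. This is only an illustration of the fallback order described above, with a hypothetical function name; the actual patch works in C inside socketmodule.c.

```python
def encode_hostname(name: str) -> bytes:
    """Encode a hostname argument, trying the encodings in the order above."""
    # 1. ASCII/strict: fast path, ASCII-only names pass through unchanged.
    try:
        return name.encode("ascii")
    except UnicodeEncodeError:
        pass
    # 2. IDNA/strict: genuine Unicode names become their ACE form.
    try:
        return name.encode("idna")
    except UnicodeError:
        pass
    # 3. ASCII/surrogateescape: lone surrogates map back to the original
    #    bytes; anything else still raises UnicodeEncodeError.
    return name.encode("ascii", "surrogateescape")

print(encode_hostname("newton"))
print(encode_hostname("löwis"))
print(encode_hostname("l\udcf6wis"))
```

Because IDNA rejects surrogate codes, the third step is only reached for strings that could not have been valid IDNs in the first place.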

These patches would at least allow getfqdn() to work in Almad's
example, but in that case the host also appears to be addressable
by the IDNA equivalent ("xn--didejo-noas-1ic") of its Unicode
hostname (I haven't checked as I'm not a Windows user, but I
presume the UnicodeDecodeError came from gethost_common() in
socketmodule.c and hence the name lookup was successful), so it
would certainly be more helpful to return Unicode for non-ASCII
gethostbyaddr() results there, if they were guaranteed to map to
real IDNA hostnames in Windows environments.

(That isn't guaranteed in Unix environments of course, which is
why I'm still suggesting ASCII/surrogateescape for the general
case.)
msg259079 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-28 01:05
FYI, I created issue #26227 to change the encoding used to decode hostnames on Windows. UTF-8 doesn't seem to be the right encoding: it fails on non-ASCII hostnames. I propose to use the ANSI code page.

Sorry, I didn't read this issue, but it looks like IDNA isn't the right encoding to decode hostnames *on Windows*.
History
Date User Action Args
2016-01-28 01:05:18vstinnersetmessages: + msg259079
2015-06-25 20:28:14baikiesetfiles: + return-ascii-surrogateescape-2015-06-25.diff, accept-ascii-surrogateescape-2015-06-25.diff

messages: + msg245826
2015-05-16 23:11:41ned.deilysetnosy: + steve.dower
2015-05-16 12:00:58Almadsetnosy: + Almad
messages: + msg243311
2012-05-02 06:25:33loewissetmessages: + msg159776
2012-04-12 21:55:13vstinnersetmessages: + msg158178
2012-04-12 21:52:13spaun2002setmessages: + msg158175
2012-04-12 19:44:47amaury.forgeotdarcsetmessages: + msg158165
2012-04-12 10:08:03spaun2002setnosy: + spaun2002
messages: + msg158118
2010-10-31 19:34:30baikiesetmessages: + msg120081
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
2010-10-29 22:29:03amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg119946
2010-10-29 22:26:53loewissetmessages: + msg119943
2010-10-29 22:25:22loewissetmessages: + msg119941
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
2010-10-29 20:09:04lemburgsetmessages: + msg119935
2010-10-29 19:31:47loewissetmessages: + msg119929
2010-10-29 19:04:44lemburgsetmessages: + msg119928
2010-10-29 18:33:15lemburgsetmessages: + msg119927
2010-10-29 18:22:40loewissetmessages: + msg119925
2010-10-29 17:44:51loewissetmessages: + msg119918
2010-10-29 02:00:12r.david.murraysetnosy: + r.david.murray
messages: + msg119837
2010-10-29 01:50:01r.david.murraylinkissue10223 superseder
2010-10-21 22:37:23baikiesetmessages: + msg119346
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
2010-10-21 09:24:45lemburgsetmessages: + msg119271
2010-10-21 05:09:52loewissetmessages: + msg119260
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
2010-10-20 23:42:55baikiesetmessages: + msg119245
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
2010-10-20 19:49:48loewissetmessages: + msg119231
title: socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names -> socket, PEP 383: Mishandling of non-ASCII bytes in host/domain names
2010-10-20 19:37:22 baikie set messages: + msg119230
2010-10-19 23:15:10 baikie set messages: + msg119177
2010-10-18 20:37:06 loewis set messages: + msg119076
2010-10-18 18:11:46 baikie set messages: + msg119051
2010-10-17 17:38:50 loewis set messages: + msg118952
2010-10-15 18:03:43 baikie set messages: + msg118816
2010-10-14 18:29:03 lemburg set messages: + msg118709
2010-10-14 17:15:02 loewis set messages: + msg118694
2010-10-14 06:12:46 loewis set messages: + msg118617
2010-10-13 23:38:21 baikie set messages: + msg118602
2010-10-13 20:50:27 jesterKing set nosy: + jesterKing; messages: + msg118582
2010-08-29 19:01:35 baikie set files: + hostname-bytes-apis.diff; messages: + msg115187
2010-08-29 18:59:35 baikie set files: - hostname-bytes-apis.diff
2010-08-29 18:47:14 baikie set files: + hostname-bytes-apis.diff; messages: + msg115186
2010-08-29 18:44:57 baikie set files: + decode-strict-ascii.diff; messages: + msg115185
2010-08-27 19:20:30 loewis set messages: + msg115119
2010-08-27 19:13:04 baikie set messages: + msg115116
2010-08-26 21:36:29 loewis set messages: + msg115030
2010-08-26 18:04:05 baikie set messages: + msg115014
2010-08-25 05:53:49 loewis set messages: + msg114882
2010-08-24 22:59:30 baikie set messages: + msg114847
2010-08-23 23:07:05 loewis set messages: + msg114756
2010-08-23 22:48:15 baikie set files: + try-surrogateescape-first-4.diff, try-surrogateescape-first-getnameinfo-4.diff; messages: + msg114754
2010-08-22 22:03:25 loewis set messages: + msg114710
2010-08-22 18:27:37 baikie set files: + try-surrogateescape-first-3.diff; messages: + msg114688
2010-07-30 18:14:44 baikie set files: + try-surrogateescape-first-2.diff
2010-07-30 18:11:44 baikie set files: + ascii-surrogateescape-2.diff; messages: + msg112094
2010-07-29 18:28:19 baikie set files: + uname-surrogateescape.diff; messages: + msg111985
2010-07-28 02:44:42 vstinner set messages: + msg111766
2010-07-26 11:31:08 eric.araujo set nosy: + lemburg, loewis, vstinner, ezio.melotti
2010-07-25 18:33:54 baikie set files: + try-surrogateescape-first.diff
2010-07-25 18:33:02 baikie create