Message 119051 - Python tracker

Author	baikie
Recipients	baikie, ezio.melotti, jesterKing, lemburg, loewis, vstinner
Date	2010-10-18.18:11:45
Message-id	<20101018181136.GA3631@dbwatson.ukfsn.org>
In-reply-to	<4CBB34A8.4020600@v.loewis.de>

Content

> The result from gethostname likely comes out of machine-local
> configuration. It may have non-ASCII in it, which is then likely
> encoded in the local encoding. When looking it up in DNS, IDNA
> should be applied.

I would have thought that someone who intended a Unicode hostname
to be looked up in its IDNA form would have encoded it using
IDNA, rather than an 8-bit encoding - how many C programs would
transcode the name that way, rather than just passing the char *
from one interface to another?
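
To make the distinction concrete, here is a minimal sketch comparing the
IDNA form of a name with an 8-bit encoding of the same name.  The label
"bücher" and the choice of ISO-8859-1 as the local encoding are
assumptions picked purely for illustration; nothing here touches the
network.

```python
# Hypothetical non-ASCII host label, chosen only for illustration.
label = "b\u00fccher"

# The ASCII-compatible form produced by Python's built-in "idna" codec.
idna_form = label.encode("idna")

# What machine-local configuration might hold instead, assuming
# (hypothetically) an ISO-8859-1 locale: raw 8-bit bytes.
local_form = label.encode("iso-8859-1")

print(idna_form)   # pure ASCII, starts with the "xn--" prefix
print(local_form)  # contains a byte outside the ASCII range
```

The two byte strings differ, so a program that passes the char * through
unchanged and one that transcodes to IDNA would look up different names.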

In fact, I would think that non-ASCII bytes in a hostname most
probably indicated that a name resolution mechanism other than
the DNS was in use, and that the byte string should be passed
unaltered just as a typical C program would.

> OTOH, output from gethostbyaddr likely comes out of the DNS itself.
> Guessing what encoding it may have is futile - other than guessing
> that it really ought to be ASCII.

Sure, but that doesn't mean the result can't be made to
round-trip if it turns out not to be ASCII.  The guess that it
will be ASCII is, after all, still a guess (as is the guess that
it comes from the DNS).

> Python's socket module is clearly focused on the internet, and
> intends to support that well. So if you pass a non-ASCII
> string, it will have to encode that using IDNA. If that's
> not what you want to get, tough luck.

I don't object to that, but it does force a choice between
decoding an 8-bit name for display (e.g. by using the locale
encoding), and decoding it to round-trip automatically (e.g. by
using ASCII/surrogateescape, with support on the encoding side).
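
The two options can be sketched as follows.  The byte string and the
ISO-8859-1 "locale" are assumptions for illustration; the second half
uses Python's standard "surrogateescape" error handler.

```python
# Hypothetical 8-bit hostname bytes as a resolver might return them.
raw = b"b\xfccher"

# Option 1: decode for display, guessing a locale encoding
# (ISO-8859-1 is an assumption here).  Readable, but the original
# bytes are only recoverable if the same guess is made again.
display = raw.decode("iso-8859-1")

# Option 2: decode for round-tripping.  Non-ASCII bytes become lone
# surrogates, which encode back to exactly the original bytes.
escaped = raw.decode("ascii", "surrogateescape")
recovered = escaped.encode("ascii", "surrogateescape")

print(repr(display))
print(repr(escaped))
print(recovered == raw)
```

Note that the string from option 1 cannot be encoded with
ASCII/surrogateescape at all, which is why the choice has to be made at
decoding time.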

Using PyUnicode_DecodeFSDefault() for the hostname or other
returned names (thus decoding them for display) would make this
issue solvable with programmer intervention - for instance,
"socket.gethostbyaddr(socket.gethostname())" could be replaced by
"socket.gethostbyaddr(os.fsencode(socket.gethostname()))", but
programmers might well neglect to do this, given that no encoding
was needed in Python 2.
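
As a rough sketch of that manual step, os.fsdecode() and os.fsencode()
are the Python-level counterparts of decoding and encoding with the
filesystem encoding.  The byte string below is hypothetical, the example
assumes a POSIX system (where the surrogateescape handler is used), and
no socket call is actually made.

```python
import os

# Hypothetical raw hostname bytes; on a real system these would come
# from the C-level gethostname() call.
raw = b"b\xfccher"

# Roughly what decoding with PyUnicode_DecodeFSDefault() would give
# the caller (assumption: POSIX, so undecodable bytes are escaped).
name = os.fsdecode(raw)

# The programmer intervention described above: turn the str back into
# the original bytes before handing it to another lookup function.
restored = os.fsencode(name)

print(repr(name))
print(restored == raw)
```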

Also, even displaying a non-ASCII name decoded according to the
locale creates potential for confusion: when the user types the
same characters into a Python program for lookup (again, barring
programmer intervention), they will not represent the same byte
sequence as the characters the user sees on the screen, because
they will instead be encoded to their IDNA ASCII-compatible
equivalent.

So overall, I do think it is better to decode names for automatic
round-tripping rather than for display, but my main concern is
simply that it should be possible to recover the original bytes
so that round-tripping is at least possible.
PyUnicode_DecodeFSDefault() would accomplish that much at least.

History
Date	User	Action	Args
2010-10-18 18:11:47	baikie	set	recipients: + baikie, lemburg, loewis, vstinner, ezio.melotti, jesterKing
2010-10-18 18:11:46	baikie	link	issue9377 messages
2010-10-18 18:11:45	baikie	create