Message 119177 - Python tracker

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author	baikie
Recipients	baikie, ezio.melotti, jesterKing, lemburg, loewis, vstinner
Date	2010-10-19.23:15:08
SpamBayes Score	2.2377489e-11
Marked as misclassified	No
Message-id	<20101019231500.GA4627@dbwatson.ukfsn.org>
In-reply-to	<4CBCAFF1.9050105@v.loewis.de>

Content

> > In fact, I would think that non-ASCII bytes in a hostname most
> > probably indicated that a name resolution mechanism other than
> > the DNS was in use, and that the byte string should be passed
> > unaltered just as a typical C program would.
> 
> I'm not talking about byte strings, but character strings.

I mean that passing the str object from socket.gethostname() to
the Python lookup function ought to result in the same byte
string being passed to the C lookup function as was returned by
the C gethostname() function (or else that the programmer must
re-encode the str to ensure that that result is obtained).
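As a rough illustration of that round trip, the sketch below decodes a made-up 8-bit hostname with ASCII/surrogateescape and encodes it back. The byte string is an assumed example, not output from a real gethostname() call, and the snippet only exercises the codec, not the socket module.

```python
# Assumed example: raw bytes such as the C gethostname() might return
# on a system whose hostname contains a non-ASCII byte.
raw = b"h\xf6st"

# Decoding with the surrogateescape handler maps each undecodable
# byte 0xNN to the lone surrogate U+DCNN instead of guessing a charset.
name = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("ascii", "surrogateescape")

print(repr(name))
print(round_tripped == raw)
```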

> > I don't object to that, but it does force a choice between
> > decoding an 8-bit name for display (e.g. by using the locale
> > encoding), and decoding it to round-trip automatically (e.g. by
> > using ASCII/surrogateescape, with support on the encoding side).
> 
> In the face of ambiguity, refuse the temptation to guess.

Yes, I would interpret that to mean not using the locale encoding
for data obtained from the network.  That's another reason why
the ASCII/surrogateescape scheme appeals to me more.

> Well, Python is not C. In Python, you would pass a str, and
> expect it to work, which means it will get automatically encoded
> with IDNA.

I think there might be a misunderstanding here - I've never
proposed changing the interpretation of Unicode characters in
hostname arguments.  The ASCII/surrogateescape scheme I suggested
only changes the interpretation of unpaired surrogate codes, as
they do not occur in IDNs or any other genuine Unicode data; all
IDNs, including those solely consisting of ASCII characters,
would be encoded to the same byte sequence as before.
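A minimal sketch of what encoding-side support could look like, assuming a hypothetical helper named encode_hostname (it is not part of the socket module, and the actual patch may differ): names containing escaped bytes are restored with surrogateescape, and everything else goes through the existing IDNA codec unchanged.

```python
def encode_hostname(name: str) -> bytes:
    """Hypothetical helper: encode a hostname argument to bytes.

    Lone surrogates in U+DC80..U+DCFF are what surrogateescape
    produces for undecodable bytes; they never appear in a real IDN.
    """
    if any("\udc80" <= ch <= "\udcff" for ch in name):
        # Restore the original 8-bit bytes. A name mixing escaped bytes
        # with other non-ASCII characters raises UnicodeEncodeError.
        return name.encode("ascii", "surrogateescape")
    # Ordinary ASCII names and IDNs are encoded exactly as before.
    return name.encode("idna")


print(encode_hostname("example.com"))
print(encode_hostname("b\u00fccher.example"))
print(encode_hostname("h\udcf6st"))
```

Only strings that contain escaped bytes take the new path, so existing callers passing ASCII names or IDNs would see no change in the bytes handed to the C lookup function.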

ASCII/surrogateescape decoding could also be used without support
on the encoding side - that would satisfy the requirement to
"refuse the temptation to guess", would allow the original bytes
to be recovered, and would mean that attempting to look up a
non-ASCII result in str form would raise an exception rather than
looking up the wrong name.
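The decode-only variant can be sketched the same way, again with an assumed byte string. Here the IDNA codec stands in for the encoding step that the lookup functions apply to str arguments; it rejects the lone surrogate, and the caller can still recover the original bytes explicitly.

```python
# Assumed example of a non-ASCII name returned in str form.
name = b"h\xf6st".decode("ascii", "surrogateescape")

# Without encoding-side support, the escaped byte is refused rather
# than being silently turned into some other name.
try:
    name.encode("idna")
except UnicodeError:
    rejected = True
else:
    rejected = False

# The programmer can still get the original bytes back on request.
recovered = name.encode("ascii", "surrogateescape")

print(rejected)
print(recovered)
```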

> Marc-Andre wants gethostname to use the Wide API on Windows, which,
> in theory, allows for cases where round-tripping to bytes is
> impossible.

Well, the name resolution APIs wrapped by Python are all
byte-oriented, so if the computer name were to have no bytes
equivalent then it wouldn't be possible to resolve it anyway, and
an exception rightly ought to be raised at some point in the process
of trying to do so.

History
Date	User	Action	Args
2010-10-19 23:15:12	baikie	set	recipients: + baikie, lemburg, loewis, vstinner, ezio.melotti, jesterKing
2010-10-19 23:15:09	baikie	link	issue9377 messages
2010-10-19 23:15:08	baikie	create