Message 111550 - Python tracker

Author	baikie
Recipients	baikie
Date	2010-07-25.18:32:56
Message-id	<1280082784.57.0.628810673592.issue9377@psf.upfronthosting.co.za>

Content

The functions in the socket module which return host/domain
names, such as gethostbyaddr() and getnameinfo(), are wrappers
around byte-oriented interfaces but return Unicode strings in
3.x, and have not been updated to deal with undecodable byte
sequences in the results, as discussed in PEP 383.

Some DNS resolvers do discard hostnames not matching the
ASCII-only RFC 1123 syntax, but checks for this may be absent or
turned off, and non-ASCII bytes can be returned via other lookup
facilities such as /etc/hosts.

Currently, names are converted to str objects using
PyUnicode_FromString(), i.e. by attempting to decode them as
UTF-8.  This can fail with UnicodeError of course, but even if it
succeeds, any non-ASCII names returned will fail to round-trip
correctly because most socket functions encode string arguments
into IDNA ASCII-compatible form before using them.  For example,
with UTF-8 encoded entries

127.0.0.2       €
127.0.0.3       xn--lzg

in /etc/hosts, I get:

Python 3.1.2 (r312:79147, Mar 23 2010, 19:02:21) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from socket import *
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]

Here, getaddrinfo() has encoded "€" to its corresponding ACE
label "xn--lzg", which maps to a different address.
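
The mismatch can be reproduced without editing /etc/hosts, using only
the two codecs involved.  This is a minimal sketch, assuming the bytes
from the example above and Python's built-in "idna" codec; the variable
names are illustrative only.

```python
# Bytes as they appear in the UTF-8 encoded /etc/hosts entry for 127.0.0.2.
raw = "€".encode("utf-8")

# What the lookup functions currently do: decode the result as UTF-8.
name = raw.decode("utf-8")

# What most socket functions do with a str argument: encode it as IDNA.
ace = name.encode("idna")

print(ace)         # the ACE form, b'xn--lzg'
print(ace == raw)  # False: the name passed back in is not the name that came out
```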

PEP 383 can't be applied as-is here, since if the name happened
to be decodable in the file system encoding (and thus was
returned as valid non-ASCII Unicode), the result would fail to
round-trip correctly as shown above, but I think there is a
solution which follows the general idea of PEP 383.

Surrogate characters are not allowed in IDNs, since they are
prohibited by Nameprep[1][2], so if names were instead decoded as
ASCII with the surrogateescape error handler, strings
representing non-ASCII names would always contain surrogate
characters representing the non-ASCII bytes, and would therefore
fail to encode with the IDNA codec.  Thus there would be no
ambiguity between these strings and valid IDNs.  The attached
ascii-surrogateescape.diff does this.
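
The effect of the proposed decoding can be illustrated in pure Python.
This is only a sketch of the idea, not the C code in the attached patch;
it assumes the same bytes as in the /etc/hosts example.

```python
raw = b"\xe2\x82\xac"  # UTF-8 bytes of "€", as in the /etc/hosts example

# Each non-ASCII byte 0xNN becomes the lone surrogate U+DCNN.
name = raw.decode("ascii", "surrogateescape")
print(ascii(name))  # '\udce2\udc82\udcac'

# Nameprep prohibits surrogates, so the IDNA codec rejects the string
# instead of silently mapping it to some other name.
try:
    name.encode("idna")
    idna_rejected = False
except UnicodeError:
    idna_rejected = True
print(idna_rejected)

# The original bytes remain recoverable from the str.
print(name.encode("ascii", "surrogateescape") == raw)
```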

The returned strings could then be made to round-trip as
arguments, by having functions that take hostname arguments
attempt to encode them using ASCII/surrogateescape first, falling
back to IDNA encoding only if that fails.  Since IDNA leaves ASCII
names unchanged and surrogate characters are not allowed in IDNs,
this would not change the interpretation of any string hostnames
that are currently accepted.  It would, for ASCII names only,
remove the 63-octet limit on label length that the IDNA encoding
currently imposes.  But since that limit exists only because of
the 63-octet label limit of the DNS, and non-IDN names may be
intended for other resolution mechanisms, I think this is a
feature, not a bug :)
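
In Python terms, the argument handling described above might look
roughly like the following.  This is a hedged sketch: encode_hostname()
is a hypothetical helper invented for illustration, and the real change
is made in C in the attached patch.

```python
def encode_hostname(name):
    """Hypothetical helper: turn a hostname argument into resolver bytes."""
    if isinstance(name, bytes):
        return name
    try:
        # Succeeds for pure ASCII names and for names carrying escaped bytes.
        return name.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # Genuine non-ASCII text: treat it as an IDN, as happens today.
        return name.encode("idna")


escaped = b"\xe2\x82\xac".decode("ascii", "surrogateescape")
print(encode_hostname(escaped))        # b'\xe2\x82\xac' (round-trips)
print(encode_hostname("€"))            # b'xn--lzg' (still an IDN)
print(encode_hostname("example.org"))  # b'example.org' (unchanged)

# A 64-character ASCII label is rejected by the IDNA codec but passes here.
long_name = "a" * 64 + ".example"
try:
    long_name.encode("idna")
    idna_accepts_long_label = True
except UnicodeError:
    idna_accepts_long_label = False
print(idna_accepts_long_label, len(encode_hostname(long_name)))
```

Trying ASCII/surrogateescape first is safe because any string it accepts
is either plain ASCII, which IDNA would pass through unchanged, or
contains surrogates, which IDNA would reject anyway.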

The patch try-surrogateescape-first.diff implements the above for
all relevant interfaces, including gethostbyaddr() and
getnameinfo(), which do currently accept hostnames, even if the
documentation is vague (in the standard library,
socket.getfqdn() calls gethostbyaddr() with a hostname, and the
"os" module docs suggest calling
socket.gethostbyaddr(socket.gethostname()) to get the
fully-qualified hostname).

The patch still allows hostnames to be passed as bytes objects,
but to simplify the implementation, it removes support for
bytearray (as has been done for pathnames in 3.2).  Bytearrays
are currently only accepted by the socket object methods
(.connect(), etc.), and this is undocumented and perhaps
unintentional - the get*() functions have never accepted them.

One problem with the surrogateescape scheme would be with
existing code that looks up an address and then tries to write
the hostname to a log file or use it as part of the wire
protocol, since the surrogate characters would fail to encode as
ASCII or UTF-8, but the code would appear to work normally until
it encountered a non-ASCII hostname, allowing the problem to go
undetected.

On the other hand, such code is probably broken as things stand,
given that the address lookup functions can already raise
UnicodeError in the same situation, which is not documented.
Also, protocol definitions often specify some variant of the
RFC 1123 syntax for hostnames (thus making non-ASCII bytes
illegal), so code that checked for this prior to encoding the
name would probably be OK, but such code is more likely the
exception than the rule.
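
To make the failure mode concrete, here is a small sketch of what such
code would run into, along with two ways it could cope.  The hostname
bytes are made up for illustration, and the choice of error handlers is
only a suggestion, not part of either patch.

```python
# A made-up non-ASCII hostname, decoded the way the first patch would.
name = b"host-\xe2\x82\xac".decode("ascii", "surrogateescape")

# Naive code that writes the name to a UTF-8 log or onto the wire fails,
# but only once a non-ASCII hostname actually turns up.
try:
    name.encode("utf-8")
    utf8_ok = True
except UnicodeEncodeError:
    utf8_ok = False
print(utf8_ok)

# Recovering the original bytes, e.g. for a byte-oriented wire protocol.
wire = name.encode("ascii", "surrogateescape")
print(wire)

# Producing a printable, lossless form for a text log.
loggable = name.encode("ascii", "backslashreplace").decode("ascii")
print(loggable)
```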

An alternative approach might be to return all hostnames as bytes
objects, thus breaking everything immediately and obviously...


[1] http://tools.ietf.org/html/rfc3491#section-5
[2] http://tools.ietf.org/html/rfc3454#appendix-C.5

History
Date	User	Action	Args
2010-07-25 18:33:04	baikie	set	recipients: + baikie
2010-07-25 18:33:04	baikie	set	messageid: <1280082784.57.0.628810673592.issue9377@psf.upfronthosting.co.za>
2010-07-25 18:33:02	baikie	link	issue9377 messages
2010-07-25 18:32:57	baikie	create