Message 112094 - Python tracker


Author	baikie
Recipients	baikie, ezio.melotti, lemburg, loewis, vstinner
Date	2010-07-30.18:11:42
Message-id	<1280513506.25.0.428915442604.issue9377@psf.upfronthosting.co.za>

Content

OK, here are new versions of the original patches.

I've tweaked the docs to make clear that ASCII-compatible
encodings actually *are* ASCII, and point to an explanation as
soon as they're mentioned.

You're right that PyUnicode_AsEncodedString() is the preferable
interface for the argument converter (I think I got
PyUnicode_AsEncodedObject() from an old version of
PyUnicode_FSConverter() :/), but for the ASCII step I've just
short-circuited it and used PyUnicode_EncodeASCII() directly,
since the converter has already checked that the object is of
Unicode type.  For the IDNA step, PyUnicode_AsEncodedString()
should result in a less confusing error message if the codec
returns some non-bytes object one day.

However, the PyBytes_Check isn't there to check up on the codec,
but to check for a bytes argument, which the converter also
supports.  For that reason, I think encode_hostname would be a
misleading name, so I've instead renamed it hostname_converter,
after the example of PyUnicode_FSConverter, and renamed
unicode_from_hostname to decode_hostname.

I've also made the converter check for UnicodeEncodeError in the
ASCII step, but the end result really is UnicodeError if the IDNA
step fails, because the "idna" codec does not use
UnicodeEncodeError or UnicodeDecodeError.  Complain about that if
you wish :)
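
For readers who don't want to dig through the C, the converter
logic described above can be sketched roughly in Python.  This is
only an illustration, not the patch itself: the *_sketch names
are hypothetical, and the use of the "surrogateescape" error
handler in the ASCII step is an assumption inferred from the
example session in this message.

```python
def hostname_converter_sketch(host):
    """Return the bytes that would be handed to the resolver (illustrative only)."""
    # A bytes argument is passed through untouched; this is the case the
    # PyBytes_Check in the C code exists for.
    if isinstance(host, bytes):
        return host
    if not isinstance(host, str):
        raise TypeError("host must be str or bytes, not %s" % type(host).__name__)
    try:
        # ASCII step.  Assumption: surrogateescape is used here, so that names
        # produced by the decoding direction map back to their raw bytes.
        return host.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # IDNA step.  A failure here surfaces as UnicodeError (or a subclass).
        return host.encode("idna")


def decode_hostname_sketch(raw):
    """Decode a resolver-supplied name; non-ASCII bytes become lone surrogates."""
    return raw.decode("ascii", "surrogateescape")


if __name__ == "__main__":
    print(hostname_converter_sketch(b"localhost"))     # bytes pass through
    print(hostname_converter_sketch("xn--lzg"))        # plain ASCII step
    print(hostname_converter_sketch("\u20ac"))         # euro sign, falls back to IDNA
    print(ascii(decode_hostname_sketch(b"\xe2\x82\xac")))
```

Trying ASCII first means that a name which is already plain
ASCII never goes near the "idna" codec, and a str is only
IDNA-encoded when it contains real non-ASCII characters.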


I think the example I gave in the previous comment was also
confusing, so just to be clear...

In /etc/hosts (in UTF-8 encoding):

127.0.0.2       €
127.0.0.3       xn--lzg


Without patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'


With patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
    result.extend(ToASCII(label))
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'


The exception at the end demonstrates why surrogateescape strings
don't get confused with IDNs.
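
The same point can be checked without touching /etc/hosts or the
socket module, using only the standard codecs.  This is a small
sketch; the idna_rejects helper is a hypothetical name used just
for this illustration.

```python
# The raw bytes of the non-ASCII /etc/hosts entry (the euro sign in UTF-8).
raw = "\u20ac".encode("utf-8")

# How such a name looks once decoded with ASCII/surrogateescape.
escaped = raw.decode("ascii", "surrogateescape")

# Encoding with the same error handler recovers the original bytes.
round_trip = escaped.encode("ascii", "surrogateescape")


def idna_rejects(name):
    """Return True if the "idna" codec refuses to encode `name`."""
    try:
        name.encode("idna")
    except UnicodeError:
        return True
    return False


if __name__ == "__main__":
    print(ascii(escaped))            # lone surrogates, one per original byte
    print(round_trip == raw)
    print(idna_rejects(escaped))     # surrogates are not valid in an IDN
    print(idna_rejects("\u20ac"))    # the real euro sign is a valid IDN label
```

Since the lone surrogates can be encoded back to bytes but can
never pass through the "idna" codec, a surrogateescape name
cannot silently turn into some other host's IDN.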
