Author baikie
Recipients baikie, ezio.melotti, lemburg, loewis, vstinner
Date 2010-07-30.18:11:42
Message-id <1280513506.25.0.428915442604.issue9377@psf.upfronthosting.co.za>
In-reply-to
Content
OK, here are new versions of the original patches.

I've tweaked the docs to make clear that ASCII-compatible
encodings actually *are* ASCII, and point to an explanation as
soon as they're mentioned.

You're right that PyUnicode_AsEncodedString() is the preferable
interface for the argument converter (I think I got
PyUnicode_AsEncodedObject() from an old version of
PyUnicode_FSConverter() :/), but for the ASCII step I've just
short-circuited it and used PyUnicode_EncodeASCII() directly,
since the converter has already checked that the object is of
Unicode type.  For the IDNA step, PyUnicode_AsEncodedString()
should result in a less confusing error message if the codec
returns some non-bytes object one day.

However, the PyBytes_Check isn't there to check up on the codec,
but to check for a bytes argument, which the converter also
supports.  For that reason, I think encode_hostname would be a
misleading name, so instead I've renamed it hostname_converter
after the example of PyUnicode_FSConverter, and renamed
unicode_from_hostname to decode_hostname.

I've also made the converter check for UnicodeEncodeError in the
ASCII step, but the end result really is UnicodeError if the IDNA
step fails, because the "idna" codec does not use
UnicodeEncodeError or UnicodeDecodeError.  Complain about that if
you wish :)
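
As a rough illustration of the converter logic described above,
here is a minimal pure-Python sketch.  The real converter is C
code in the patch, so this is only an analogy.  It assumes the
ASCII step uses the surrogateescape error handler (PEP 383),
which this message does not state outright; the TypeError wording
is made up for the sketch.

```python
def hostname_converter(obj):
    """Hypothetical Python analogue of the C argument converter."""
    # A bytes argument is passed through untouched.
    if isinstance(obj, bytes):
        return obj
    if not isinstance(obj, str):
        # Wording is an assumption, not taken from the patch.
        raise TypeError("hostname must be str or bytes")
    try:
        # ASCII step (assumed to use surrogateescape, see above).
        return obj.encode("ascii", "surrogateescape")
    except UnicodeEncodeError:
        # IDNA step: a failure here surfaces as UnicodeError.
        return obj.encode("idna")
```

With this sketch, "example.com" comes back as plain ASCII bytes,
'€' falls through to the IDNA step, and a string that mixes '€'
with an escaped byte such as '\udce2' fails in both steps and
ends in UnicodeError.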


I think the example I gave in the previous comment was also
confusing, so just to be clear...

In /etc/hosts (in UTF-8 encoding):

127.0.0.2       €
127.0.0.3       xn--lzg


Without patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('€', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
>>> '€'.encode("idna")
b'xn--lzg'


With patches:

>>> from socket import *
>>> getnameinfo(("127.0.0.3", 0), 0)
('xn--lzg', '0')
>>> getnameinfo(("127.0.0.2", 0), 0)
('\udce2\udc82\udcac', '0')
>>> getaddrinfo(*_)
[(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
>>> '\udce2\udc82\udcac'.encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
    result.extend(ToASCII(label))
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
    label = nameprep(label)
  File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
    raise UnicodeError("Invalid character %r" % c)
UnicodeError: Invalid character '\udce2'


The exception at the end demonstrates why surrogateescape strings
don't get confused with IDNs.
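
The decoding side can be sketched the same way.  This is a
minimal Python stand-in for decode_hostname, assuming it decodes
as ASCII with the surrogateescape handler (which is what the
'\udce2\udc82\udcac' result above suggests); the function body
is illustrative, not the patch's C code.

```python
def decode_hostname(raw):
    """Hypothetical Python stand-in for the C decode_hostname."""
    # Non-ASCII bytes become lone surrogates U+DC80..U+DCFF.
    return raw.decode("ascii", "surrogateescape")

# The UTF-8 bytes of the /etc/hosts entry for 127.0.0.2.
raw = "\u20ac".encode("utf-8")
name = decode_hostname(raw)

# The escaped string round-trips to the original bytes.
round_trip = name.encode("ascii", "surrogateescape")
```

Here name is '\udce2\udc82\udcac' and round_trip equals raw, so
the lookup goes back to the same host entry, while an ordinary
ASCII name such as b"xn--lzg" decodes unchanged.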
History
Date                 User    Action  Args
2010-07-30 18:11:46  baikie  set     recipients: + baikie, lemburg, loewis, vstinner, ezio.melotti
2010-07-30 18:11:46  baikie  set     messageid: <1280513506.25.0.428915442604.issue9377@psf.upfronthosting.co.za>
2010-07-30 18:11:44  baikie  link    issue9377 messages
2010-07-30 18:11:43  baikie  create