Message 199371 - Python tracker

Author	vstinner
Recipients	ezio.melotti, gvanrossum, kennyluck, lemburg, loewis, pitrou, serhiy.storchaka, tchrist, vstinner
Date	2013-10-10.08:30:26
Message-id	<1381393826.56.0.626717830803.issue12892@psf.upfronthosting.co.za>
In-reply-to

Content

I tested utf_16_32_surrogates_4.patch: surrogateescape does not work as expected with the decoder.

>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'ignore')
'[]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'replace')
'[�]'
>>> b'[\x00\x80\xdc]\x00'.decode('utf-16-le', 'surrogateescape')
'[\udc80\udcdc\uffff'

=> I expected '[\udc80\udcdc]'.

With the encoder, surrogateescape does not work either:

>>> '[\uDC80]'.encode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\udc80' in position 1: surrogates not allowed

According to PEP 383, I expect that data.decode(encoding, 'surrogateescape') never fails, and that data.decode(encoding, 'surrogateescape').encode(encoding, 'surrogateescape') gives back data.
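
The round-trip property can be sketched as a small check. This is only an illustration, assuming UTF-8 rather than the UTF-16/UTF-32 codecs touched by the patch (their behaviour depends on the patch version); the helper name roundtrips and the sample inputs are made up for the example.

```python
def roundtrips(data: bytes, encoding: str = "utf-8") -> bool:
    """Check the PEP 383 round-trip property for one byte string."""
    # Decoding must not raise: undecodable bytes become lone surrogates.
    text = data.decode(encoding, "surrogateescape")
    # Encoding must turn those surrogates back into the original bytes.
    return text.encode(encoding, "surrogateescape") == data


# Illustrative inputs: valid UTF-8, invalid bytes, and a truncated sequence.
samples = [b"abc", b"caf\xc3\xa9", b"\xff\xfe", b"[\x80]", b"abc\xc3"]
results = {s: roundtrips(s) for s in samples}
print(results)
```

With UTF-8, every sample should round-trip; the same helper can be pointed at 'utf-16-le' to see whether a given build satisfies the property there.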

--

With UTF-16, there is a corner case:

>>> b'[\x00\x00'.decode('utf-16-le', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/haypo/prog/python/default/Lib/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 2: truncated data
>>> b'[\x00\x80'.decode('utf-16-le', 'surrogateescape')
'[\udc80'

The incomplete sequence b'\x00' raises a decoding error, whereas b'\x80' does not. Should we extend PEP 383 to bytes in the range [0; 127]? Or should we keep this behaviour?

Sorry, this question is unrelated to this issue.
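
For reference, the asymmetry between b'\x00' and b'\x80' follows from the mapping that PEP 383 defines: only bytes 0x80..0xFF are escaped, to U+DC80..U+DCFF. A minimal sketch of that mapping, where escape_byte is a hypothetical helper written for this example (not a codecs API), cross-checked against the ASCII codec:

```python
def escape_byte(value: int) -> str:
    """Hypothetical helper: map one undecodable byte as PEP 383 describes."""
    if not 0x80 <= value <= 0xFF:
        # Bytes below 128 have no escape code point in PEP 383.
        raise ValueError("surrogateescape only covers bytes 0x80..0xFF")
    return chr(0xDC00 + value)


# Cross-check against the ASCII codec, where every byte >= 0x80 is undecodable.
matches = all(
    bytes([b]).decode("ascii", "surrogateescape") == escape_byte(b)
    for b in range(0x80, 0x100)
)
print(matches)
# ascii() avoids writing a lone surrogate to stdout.
print(ascii(escape_byte(0x80)))
```

A trailing b'\x00' has no such code point to map to, which is why the handler cannot hide the truncated-data error in that case.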

History
Date	User	Action	Args
2013-10-10 08:30:26	vstinner	set	recipients: + vstinner, lemburg, gvanrossum, loewis, pitrou, ezio.melotti, tchrist, kennyluck, serhiy.storchaka
2013-10-10 08:30:26	vstinner	set	messageid: <1381393826.56.0.626717830803.issue12892@psf.upfronthosting.co.za>
2013-10-10 08:30:26	vstinner	link	issue12892 messages
2013-10-10 08:30:26	vstinner	create