Issue 27971: utf-16 decoding can't handle lone surrogates

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/72158

classification

Title:	utf-16 decoding can't handle lone surrogates
Type:		Stage:
Components:	Unicode	Versions:	Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	eryksun, ezio.melotti, lazka, terry.reedy, vstinner, xiang.zhang
Priority:	normal	Keywords:

Created on 2016-09-06 09:59 by lazka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (14)
msg274546 - (view)	Author: Christoph Reiter (lazka) *	Date: 2016-09-06 09:59
Using Python 2.7.12

>>> u"\ud83d".encode("utf-16-le")
'=\xd8'
>>> u"\ud83d".encode("utf-16-le").decode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
msg274548 - (view)	Author: Christoph Reiter (lazka) *	Date: 2016-09-06 10:23
Same problem on 3.3.6. But works on 3.4.5. So I guess this was fixed but not backported.
msg274555 - (view)	Author: Xiang Zhang (xiang.zhang) *	Date: 2016-09-06 13:43
With the latest build, even encode will fail:

Python 3.6.0a4+ (default:dad4c42869f6, Sep 6 2016, 21:41:38)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\ud83d".encode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
msg274556 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-09-06 14:10
Probably Python 2's UTF-16 decoder should be as broken as the encoder, which will match the broken behavior of the UTF-8 and UTF-32 codecs:

>>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
u'\ud83d\uda12'
>>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
u'\ud83d\uda12'

Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:

>>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
b'=\xd8\x12\xda'
>>> _.decode('utf-16le', 'surrogatepass')
'\ud83d\uda12'
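The contrast between the strict and 'surrogatepass' behavior can be sketched as a small script. This is a minimal illustration assuming Python 3.4 or later; the variable names (`lone`, `strict_ok`, `raw`, `restored`) are chosen here for the example and do not come from the thread.

```python
# Sketch, assuming Python 3.4+: the strict UTF-16 codec rejects a lone
# surrogate, while the 'surrogatepass' error handler round-trips it.
lone = "\ud83d"  # a high surrogate with no low surrogate after it

try:
    lone.encode("utf-16-le")  # default error handler is 'strict'
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With 'surrogatepass' the surrogate is written as a plain 16-bit code unit.
raw = lone.encode("utf-16-le", "surrogatepass")

# The same handler is needed on the way back in.
restored = raw.decode("utf-16-le", "surrogatepass")

print(strict_ok, raw, restored == lone)
```

Both the encode and the decode call need the handler; passing it to only one side leaves the other side strict.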
msg274558 - (view)	Author: Christoph Reiter (lazka) *	Date: 2016-09-06 15:06
On Tue, Sep 6, 2016 at 3:43 PM, Xiang Zhang <report@bugs.python.org> wrote:
>
> Xiang Zhang added the comment:
>
> With the latest build, even encode will fail:

With Python 3 you have to use the "surrogatepass" error handler. I assumed this was the default in Python 2 since it worked with other codecs.
msg274560 - (view)	Author: Christoph Reiter (lazka) *	Date: 2016-09-06 15:10
On Tue, Sep 6, 2016 at 4:10 PM, Eryk Sun <report@bugs.python.org> wrote:
> Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:

To add some context: I was writing tests for Windows paths containing surrogates (e.g. os.listdir can return them).
msg274565 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-06 16:35
UTF codecs must not encode surrogate characters:
http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Python 3 is right. Sadly, it's too late to fix Python 2.
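For code that wants to find out ahead of time whether a strict UTF codec would reject a string, a simple check is possible. The following is a sketch for Python 3 only; `contains_surrogates` is a hypothetical helper name, not a standard library function.

```python
def contains_surrogates(text):
    """Return True if text holds any code point in U+D800..U+DFFF.

    Hypothetical helper: in a Python 3 str, such code points are exactly
    the ones the strict UTF-8/16/32 encoders refuse to encode.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


print(contains_surrogates("\ud83d"))       # lone high surrogate
print(contains_surrogates("a\U0001F600"))  # an astral character is one code point
```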
msg274593 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-09-06 18:34
Victor, it seems the only option here (other than closing this as won't fix) is to modify the UTF-16 decoder in 2.7 to allow lone surrogates, which would be consistent with the UTF-8 and UTF-32 decoders. While it's too late to enforce strict compliance in 2.7, it shouldn't hurt to expand the domain of acceptable encodings. Then if surrogates are always passed in 2.7, a silently ignored "surrogatepass" handler could be added for compatibility with 3.x code.
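What "allowing lone surrogates" means for a UTF-16 decoder can be sketched in pure Python. This is only an illustration written for Python 3, not the C implementation of the 2.7 codec and not a proposed patch; `decode_utf16le_permissive` is a hypothetical name. It combines valid high/low pairs and passes any unpaired surrogate through unchanged.

```python
import struct


def decode_utf16le_permissive(data):
    """Decode UTF-16-LE bytes, keeping unpaired surrogates as-is.

    Hypothetical helper for illustration only.
    """
    if len(data) % 2:
        raise ValueError("truncated UTF-16 data")
    # Split the input into little-endian 16-bit code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid surrogate pair: combine into one astral code point.
            low = units[i + 1]
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        else:
            # BMP character or unpaired surrogate: pass through unchanged.
            out.append(chr(u))
            i += 1
    return "".join(out)
```

On Python 3.4+ the result should agree with `data.decode("utf-16-le", "surrogatepass")` for inputs of even length, which is the behavior a silently accepted "surrogatepass" argument in 2.7 would have needed to match.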
msg274620 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-06 20:30
I dislike the idea of changing the behaviour in a minor release :-/
msg275406 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2016-09-09 19:40
Unless the 2.7 docs specify that the utf codecs should violate the standard with respect to lone surrogates, I think this should definitely be closed (as 'not a bug').
msg275483 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-09-09 22:56
Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug. It's misbehavior, and it either will be or won't be fixed. From Victor's response it's looking like the latter.
msg275495 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-09 23:38
> Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug.

I didn't say that it's not a bug. I said that it's not possible to modify a codec at this point in Python 2.7 without taking a risk of breaking applications relying on the current behaviour. Even in Python 3, we don't make such changes in minor releases, but only in major releases.
msg275522 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-09-10 01:07
I wasn't trying to put words in your mouth, Victor. I was replying to Terry (msg275406).
msg275590 - (view)	Author: Christoph Reiter (lazka) *	Date: 2016-09-10 07:29
Closing as wontfix if there are concerns regarding compatibility seems fine to me. Thanks for looking into this. I've also found a workaround for my usecase in the meantime: https://github.com/lazka/senf/commit/b7dadb05a29db5f0d74f659971b0a86d5e579028

History
Date	User	Action	Args
2022-04-11 14:58:35	admin	set	github: 72158
2016-12-06 09:51:53	lazka	set	status: open -> closed resolution: wont fix
2016-09-10 07:29:35	lazka	set	messages: + msg275590
2016-09-10 01:07:27	eryksun	set	messages: + msg275522
2016-09-09 23:38:09	vstinner	set	messages: + msg275495
2016-09-09 22:56:50	eryksun	set	messages: + msg275483
2016-09-09 19:40:24	terry.reedy	set	nosy: + terry.reedy messages: + msg275406
2016-09-06 20:30:38	vstinner	set	messages: + msg274620
2016-09-06 18:34:13	eryksun	set	messages: + msg274593
2016-09-06 16:35:45	vstinner	set	messages: + msg274565
2016-09-06 15:10:57	lazka	set	messages: + msg274560
2016-09-06 15:06:12	lazka	set	messages: + msg274558
2016-09-06 14:10:11	eryksun	set	nosy: + eryksun messages: + msg274556
2016-09-06 13:43:16	xiang.zhang	set	nosy: + xiang.zhang messages: + msg274555
2016-09-06 10:23:52	lazka	set	messages: + msg274548
2016-09-06 09:59:43	lazka	create