
classification
Title: utf-16 decoding can't handle lone surrogates
Components: Unicode
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Nosy List: eryksun, ezio.melotti, lazka, terry.reedy, vstinner, xiang.zhang
Priority: normal

Created on 2016-09-06 09:59 by lazka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (14)
msg274546 - (view) Author: Christoph Reiter (lazka) * Date: 2016-09-06 09:59
Using Python 2.7.12

>>> u"\ud83d".encode("utf-16-le")
'=\xd8'
>>> u"\ud83d".encode("utf-16-le").decode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
>>>
msg274548 - (view) Author: Christoph Reiter (lazka) * Date: 2016-09-06 10:23
Same problem on 3.3.6. But works on 3.4.5. So I guess this was fixed but not backported.
msg274555 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-09-06 13:43
With the latest build, even encode will fail:

Python 3.6.0a4+ (default:dad4c42869f6, Sep  6 2016, 21:41:38) 
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\ud83d".encode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
msg274556 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-06 14:10
Probably Python 2's UTF-16 decoder should be as broken as the encoder, which will match the broken behavior of the UTF-8 and UTF-32 codecs:

    >>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
    u'\ud83d\uda12'
    >>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
    u'\ud83d\uda12'

Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:

    >>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
    b'=\xd8\x12\xda'
    >>> _.decode('utf-16le', 'surrogatepass')
    '\ud83d\uda12'
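
The "surrogateescape" trick mentioned above can be illustrated with a minimal sketch, assuming Python 3. The sample bytes are made up for illustration and do not come from this report. Each undecodable byte is mapped to a lone low surrogate in the range U+DC80 to U+DCFF, so the original bytes can be restored when encoding with the same handler:

```python
# Sketch for Python 3 only. The input bytes are an assumed example.
raw = b"caf\xe9"  # Latin-1 encoded text, not valid UTF-8

# The invalid byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding with the same handler turns U+DCE9 back into the byte 0xE9.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)
```

Note that "surrogateescape" and "surrogatepass" are different handlers: the first round-trips arbitrary bytes, the second encodes the surrogate code point itself.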
msg274558 - (view) Author: Christoph Reiter (lazka) * Date: 2016-09-06 15:06
On Tue, Sep 6, 2016 at 3:43 PM, Xiang Zhang <report@bugs.python.org> wrote:
>
> Xiang Zhang added the comment:
>
> With the latest build, even encode will fail:

With Python 3 you have to use the "surrogatepass" error handler. I
assumed this was the default in Python 2 since it worked with other
codecs.
msg274560 - (view) Author: Christoph Reiter (lazka) * Date: 2016-09-06 15:10
On Tue, Sep 6, 2016 at 4:10 PM, Eryk Sun <report@bugs.python.org> wrote:
> Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:

To add some context: I was writing tests for Windows paths containing
surrogates (e.g. os.listdir can return them).
msg274565 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-06 16:35
UTF codecs must not encode surrogate characters:
http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates

Python 3 is right, sadly it's too late to fix Python 2.
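
A minimal sketch of the strict behaviour described here, assuming Python 3.4 or later (where the UTF-16 and UTF-32 codecs also accept "surrogatepass"). The helper names are hypothetical; only the built-in `str.encode` and `bytes.decode` calls are real API:

```python
LONE = "\ud83d"  # the lone high surrogate from the original report
CODECS = ("utf-8", "utf-16-le", "utf-32-le")


def strict_encode_fails(text, codec):
    """Return True if the default (strict) handler refuses to encode text."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


def strict_decode_fails(data, codec):
    """Return True if the default (strict) handler refuses to decode data."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return True
    return False


report = {}
for codec in CODECS:
    # Build the byte form of the lone surrogate via "surrogatepass",
    # then check that strict mode rejects it in both directions.
    raw = LONE.encode(codec, "surrogatepass")
    report[codec] = (strict_encode_fails(LONE, codec),
                     strict_decode_fails(raw, codec))

for codec, (enc_rejected, dec_rejected) in report.items():
    print(codec, "encode rejected:", enc_rejected,
          "decode rejected:", dec_rejected)
```

On such an interpreter all three codecs should reject the lone surrogate in both directions, unlike the mixed behaviour of Python 2.7 shown earlier in this thread.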
msg274593 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-06 18:34
Victor, it seems the only option here (other than closing this as won't fix) is to modify the UTF-16 decoder in 2.7 to allow lone surrogates, which would be consistent with the UTF-8 and UTF-32 decoders. While it's too late to enforce strict compliance in 2.7, it shouldn't hurt to expand the domain of acceptable encodings. Then if surrogates are always passed in 2.7, a silently ignored "surrogatepass" handler could be added for compatibility with 3.x code.
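
To make "allow lone surrogates" concrete, here is a hedged sketch of a lenient UTF-16-LE decoder written by hand in Python 3. It is not the proposed change to the 2.7 codec and not the workaround linked later in this thread; the function name `decode_utf16le_lenient` is hypothetical. Valid surrogate pairs are combined, and any unpaired surrogate is passed through as-is:

```python
import struct


def decode_utf16le_lenient(data):
    """Decode UTF-16-LE bytes, passing lone surrogates through unchanged."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must contain an even number of bytes")
    # Split the input into 16-bit little-endian code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid pair: combine into one supplementary code point.
            code = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(code))
            i += 2
        else:
            # BMP character or unpaired surrogate: keep the code unit.
            chars.append(chr(unit))
            i += 1
    return "".join(chars)


print(ascii(decode_utf16le_lenient(b"=\xd8")))
```

For the inputs used in this thread, the result should agree with `bytes.decode('utf-16-le', 'surrogatepass')` on Python 3.4+.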
msg274620 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-06 20:30
I dislike the idea of changing the behaviour in a minor release :-/
msg275406 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-09-09 19:40
Unless the 2.7 docs specify that the utf codecs should violate the standard with respect to lone surrogates, I think this should definitely be closed (as 'not a bug').
msg275483 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-09 22:56
Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug. It's misbehavior, and it either will be or won't be fixed. From Victor's response it's looking like the latter.
msg275495 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-09 23:38
> Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug.

I didn't say that it's not a bug. I said that it's not possible to
modify a codec at this point in Python 2.7 without taking a risk of
breaking applications relying on the current behaviour. Even in Python
3, we don't make such changes in minor releases, but only in major
releases.
msg275522 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-09-10 01:07
I wasn't trying to put words in your mouth, Victor. I was replying to Terry (msg275406).
msg275590 - (view) Author: Christoph Reiter (lazka) * Date: 2016-09-10 07:29
Closing as wontfix if there are concerns regarding compatibility seems fine to me.

Thanks for looking into this.

I've also found a workaround for my use case in the meantime: https://github.com/lazka/senf/commit/b7dadb05a29db5f0d74f659971b0a86d5e579028
History
Date                 User         Action  Args
2022-04-11 14:58:35  admin        set     github: 72158
2016-12-06 09:51:53  lazka        set     status: open -> closed; resolution: wont fix
2016-09-10 07:29:35  lazka        set     messages: + msg275590
2016-09-10 01:07:27  eryksun      set     messages: + msg275522
2016-09-09 23:38:09  vstinner     set     messages: + msg275495
2016-09-09 22:56:50  eryksun      set     messages: + msg275483
2016-09-09 19:40:24  terry.reedy  set     nosy: + terry.reedy; messages: + msg275406
2016-09-06 20:30:38  vstinner     set     messages: + msg274620
2016-09-06 18:34:13  eryksun      set     messages: + msg274593
2016-09-06 16:35:45  vstinner     set     messages: + msg274565
2016-09-06 15:10:57  lazka        set     messages: + msg274560
2016-09-06 15:06:12  lazka        set     messages: + msg274558
2016-09-06 14:10:11  eryksun      set     nosy: + eryksun; messages: + msg274556
2016-09-06 13:43:16  xiang.zhang  set     nosy: + xiang.zhang; messages: + msg274555
2016-09-06 10:23:52  lazka        set     messages: + msg274548
2016-09-06 09:59:43  lazka        create