msg274546 - (view) |
Author: Christoph Reiter (lazka) * |
Date: 2016-09-06 09:59 |
Using Python 2.7.12
>>> u"\ud83d".encode("utf-16-le")
'=\xd8'
>>> u"\ud83d".encode("utf-16-le").decode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data
>>>
|
msg274548 - (view) |
Author: Christoph Reiter (lazka) * |
Date: 2016-09-06 10:23 |
Same problem on 3.3.6, but it works on 3.4.5, so I guess this was fixed but not backported.
|
msg274555 - (view) |
Author: Xiang Zhang (xiang.zhang) * |
Date: 2016-09-06 13:43 |
With the latest build, even encode will fail:
Python 3.6.0a4+ (default:dad4c42869f6, Sep 6 2016, 21:41:38)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\ud83d".encode("utf-16-le")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
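For code that has to cope with the strict Python 3 behaviour shown above, a minimal sketch of guarding the encode call might look like this. The helper name `try_encode_utf16le` is hypothetical and only for illustration; it assumes Python 3.4 or later.

```python
def try_encode_utf16le(text):
    """Return the UTF-16-LE bytes for text, or None if strict encoding fails.

    Hypothetical helper: on Python 3.4+ the default "strict" error handler
    rejects lone surrogates with UnicodeEncodeError.
    """
    try:
        return text.encode("utf-16-le")
    except UnicodeEncodeError:
        # Raised for lone surrogates such as "\ud83d".
        return None


ok = try_encode_utf16le("A")            # ordinary text encodes normally
rejected = try_encode_utf16le("\ud83d")  # lone surrogate is refused
```

Returning None is just one possible policy; a caller could equally re-raise or fall back to another error handler.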
|
msg274556 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-09-06 14:10 |
Probably Python 2's UTF-16 decoder should be as broken as the encoder, which will match the broken behavior of the UTF-8 and UTF-32 codecs:
>>> u'\ud83d\uda12'.encode('utf-8').decode('utf-8')
u'\ud83d\uda12'
>>> u'\ud83d\uda12'.encode('utf-32-le').decode('utf-32-le')
u'\ud83d\uda12'
Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:
>>> u'\ud83d\uda12'.encode('utf-16le', 'surrogatepass')
b'=\xd8\x12\xda'
>>> _.decode('utf-16le', 'surrogatepass')
'\ud83d\uda12'
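The same round trip can be checked across the three UTF codecs discussed here. This is a small sketch assuming Python 3.4+; the function name `roundtrip` is made up for the example.

```python
def roundtrip(text, encoding):
    """Encode and decode text with the 'surrogatepass' error handler."""
    data = text.encode(encoding, "surrogatepass")
    return data.decode(encoding, "surrogatepass")


sample = "\ud83d\uda12"  # the two lone surrogates used in the examples above
results = {enc: roundtrip(sample, enc)
           for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Every codec should hand back the original string unchanged.
all_equal = all(value == sample for value in results.values())
```

Note that 'surrogatepass' has to be passed on both sides; the strict decoder rejects the bytes that the lenient encoder produced.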
|
msg274558 - (view) |
Author: Christoph Reiter (lazka) * |
Date: 2016-09-06 15:06 |
On Tue, Sep 6, 2016 at 3:43 PM, Xiang Zhang <report@bugs.python.org> wrote:
>
> Xiang Zhang added the comment:
>
> With the latest build, even encode will fail:
With Python 3 you have to use the "surrogatepass" error handler. I
assumed this was the default in Python 2 since it worked with other
codecs.
|
msg274560 - (view) |
Author: Christoph Reiter (lazka) * |
Date: 2016-09-06 15:10 |
On Tue, Sep 6, 2016 at 4:10 PM, Eryk Sun <report@bugs.python.org> wrote:
> Lone surrogate codes aren't valid Unicode. In Python 3 they get used internally for tricks like the "surrogateescape" error handler. In Python 3.4+, the 'surrogatepass' error handler allows encoding and decoding lone surrogates:
To add some context: I was writing tests for Windows paths containing
surrogates (e.g. os.listdir can return them).
|
msg274565 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2016-09-06 16:35 |
UTF codecs must not encode surrogate characters:
http://unicodebook.readthedocs.io/issues.html#non-strict-utf-8-decoder-overlong-byte-sequences-and-surrogates
Python 3 is right; sadly, it's too late to fix Python 2.
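Since a strict UTF codec rejects any code point in the surrogate range U+D800 to U+DFFF, a caller can check for them before encoding. A minimal sketch for Python 3 follows; `find_surrogates` is a hypothetical name, not a standard library function.

```python
def find_surrogates(text):
    """Return the indices of surrogate code points (U+D800..U+DFFF) in text.

    In a Python 3 str, astral characters are single code points, so any
    hit here is something a strict UTF codec will refuse to encode.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


positions = find_surrogates("a\ud83db")   # surrogate at index 1
clean = find_surrogates("\U0001F600")     # a real astral character: no hits
```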
|
msg274593 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-09-06 18:34 |
Victor, it seems the only option here (other than closing this as won't fix) is to modify the UTF-16 decoder in 2.7 to allow lone surrogates, which would be consistent with the UTF-8 and UTF-32 decoders. While it's too late to enforce strict compliance in 2.7, it shouldn't hurt to expand the domain of acceptable encodings. Then if surrogates are always passed in 2.7, a silently ignored "surrogatepass" handler could be added for compatibility with 3.x code.
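To illustrate what "allowing lone surrogates" means for a UTF-16 decoder, here is a pure-Python sketch written for Python 3. It is not the change proposed for 2.7 and not the workaround linked later in this thread; `decode_utf16le_lenient` is a hypothetical name, and the logic simply pairs surrogates where possible and passes unpaired ones through.

```python
import struct


def decode_utf16le_lenient(data):
    """Decode UTF-16-LE bytes, passing lone surrogates through unchanged."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Split the input into 16-bit little-endian code units.
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A valid pair: combine into one supplementary code point.
            code = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(code))
            i += 2
        else:
            # BMP character or lone surrogate: keep the code unit as-is.
            chars.append(chr(unit))
            i += 1
    return "".join(chars)
```

On Python 3.4+ the built-in `data.decode("utf-16-le", "surrogatepass")` covers the same ground; the manual loop only makes the pairing rule explicit.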
|
msg274620 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2016-09-06 20:30 |
I dislike the idea of changing the behaviour in a minor release :-/
|
msg275406 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2016-09-09 19:40 |
Unless the 2.7 docs specify that the utf codecs should violate the standard with respect to lone surrogates, I think this should definitely be closed (as 'not a bug').
|
msg275483 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-09-09 22:56 |
Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug. It's misbehavior, and it either will be or won't be fixed. From Victor's response it's looking like the latter.
|
msg275495 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2016-09-09 23:38 |
> Considering the UTF-16 codec isn't self-consistent, it's a stretch to say it's not a bug.
I didn't say that it's not a bug. I said that it's not possible to
modify a codec at this point in Python 2.7 without taking a risk of
breaking applications relying on the current behaviour. Even in Python
3, we don't make such changes in minor releases, but only in major
releases.
|
msg275522 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2016-09-10 01:07 |
I wasn't trying to put words in your mouth, Victor. I was replying to Terry (msg275406).
|
msg275590 - (view) |
Author: Christoph Reiter (lazka) * |
Date: 2016-09-10 07:29 |
Closing as wontfix if there are concerns regarding compatibility seems fine to me.
Thanks for looking into this.
I've also found a workaround for my use case in the meantime: https://github.com/lazka/senf/commit/b7dadb05a29db5f0d74f659971b0a86d5e579028
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:35 | admin | set | github: 72158 |
2016-12-06 09:51:53 | lazka | set | status: open -> closed resolution: wont fix |
2016-09-06 09:59:43 | lazka | create | |