Issue12892
Created on 2011-09-04 10:49 by ezio.melotti, last changed 2012-11-04 20:46 by serhiy.storchaka.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| utf_16_32_surrogates.patch | haypo, 2011-11-29 21:13 | review | ||
| utf-16&32_reject_surrogates.patch | kennyluck, 2012-01-30 07:29 | rejects surrogate and calls the error handlers | review | |
| surrogatepass_for_utf-16&32.patch | kennyluck, 2012-02-01 00:36 | surrogatepass erro handler for utf-16/32 | ||
| Messages (9) | |||
|---|---|---|---|
| msg143490 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2011-09-04 10:49 | |
From Chapter 03 of the Unicode Standard 6[0], D91: """ • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5. • Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed. """ I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and encode a non-BMP character using a valid surrogate pair, but it should reject lone surrogates both during encoding and decoding. On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to U+10FFFF (including (lone) surrogates), but it's not able to decode lone surrogates (not sure if this is by design or if it just fails because it expects another (missing) surrogate). ---------------------------------------------- From Chapter 03 of the Unicode Standard 6[0], D90: """ • UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value. • Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed. """ I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both during encoding and during decoding. On Python 3, the utf-32 codec can encode and decode all the codepoints from U+0000 to U+10FFFF (including surrogates). ---------------------------------------------- I think that: * this should be fixed in 3.3; * it's a bug, so the fix /might/ be backported to 3.2. Hoverver it's also a fairly big change in behavior, so it might be better to leave it for 3.3 only; * it's better to leave 2.7 alone, even the utf-8 codec is broken there; * the surrogatepass error handler should work with the utf-16 and utf-32 codecs too. Note that this has been already reported in #3672, but eventually only the utf-8 codec was fixed. [0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf |
|||
| msg143495 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2011-09-04 12:46 | |
Ezio Melotti wrote: > > New submission from Ezio Melotti <ezio.melotti@gmail.com>: > >>From Chapter 03 of the Unicode Standard 6[0], D91: > """ > • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5. > • Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed. > """ > I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and encode a non-BMP character using a valid surrogate pair, but it should reject lone surrogates both during encoding and decoding. > > On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to U+10FFFF (including (lone) surrogates), but it's not able to decode lone surrogates (not sure if this is by design or if it just fails because it expects another (missing) surrogate). > > ---------------------------------------------- > >>From Chapter 03 of the Unicode Standard 6[0], D90: > """ > • UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value. > • Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed. > """ > I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both during encoding and during decoding. > > On Python 3, the utf-32 codec can encode and decode all the codepoints from U+0000 to U+10FFFF (including surrogates). > > ---------------------------------------------- > > I think that: > * this should be fixed in 3.3; > * it's a bug, so the fix /might/ be backported to 3.2. Hoverver it's also a fairly big change in behavior, so it might be better to leave it for 3.3 only; > * it's better to leave 2.7 alone, even the utf-8 codec is broken there; > * the surrogatepass error handler should work with the utf-16 and utf-32 codecs too. > > > Note that this has been already reported in #3672, but eventually only the utf-8 codec was fixed. All UTF codecs should reject lone surrogates in strict error mode, but let them pass using the surrogatepass error handler (the UTF-8 codec already does) and apply the usual error handling for ignore and replace. |
|||
| msg148603 - (view) | Author: STINNER Victor (haypo) * ![]() |
Date: 2011-11-29 20:42 | |
Python 3.3 has a strange behaviour:
>>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
>>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
I would expect text.decode(encoding).encode(encoding)==text or an encode or decode error.
So I agree that the encoder should reject lone surogates.
|
|||
| msg148613 - (view) | Author: STINNER Victor (haypo) * ![]() |
Date: 2011-11-29 21:13 | |
Patch rejecting surrogates in UTF-16 and UTF-32 encoders. I don't think that Python 2.7 and 3.2 should be changed in a minor version. |
|||
| msg148614 - (view) | Author: STINNER Victor (haypo) * ![]() |
Date: 2011-11-29 21:15 | |
Hum, my patch doesn't call the error handler. |
|||
| msg152310 - (view) | Author: Kang-Hao (Kenny) Lu (kennyluck) | Date: 2012-01-30 07:29 | |
Attached patch does the following beyond what the patch from haypo does:
* call the error handler
* reject 0xd800~0xdfff when decoding utf-32
The followings are on my TODO list, although this patch doesn't depend on any of these and can be reviewed and landed separately:
* make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
* fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
* make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed)
> All UTF codecs should reject lone surrogates in strict error mode,
Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that tests "\udc80" to be encoded b"+3IA-" (. Given that UTF-7 is not really part of the Unicode Standard and it is more like a "data encoding" than a "text encoding" to me, I am not sure it's a good idea.
> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.
For 'replace', the patch now emits b"\x00?" instead of b"?" so that UTF-16 stream doesn't get corrupted. It is not "usual" and not matching
# Implements the ``replace`` error handling: malformed data is replaced
# with a suitable replacement character such as ``'?'`` in bytestrings
# and ``'\ufffd'`` in Unicode strings.
in the documentation. What do we do? Are there other encodings that are not ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think it would be better to use encoded U+fffd whenever possible and fall back to '?'. What do you think?
Some other self comments on my patch:
* In the STORECHAR macro for utf-16 and utf-32, I change all instances of "ch & 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know if this is actually better or if this makes any difference at all.
* The code for utf-16 and utf-32 are duplicates of the uft-8 one. That one's complexity comes from issue #8092 . Not sure if there are ways to simplify these. For example, are there suitable functions there so that we don't need to check integer overflow at these places?
|
|||
| msg152316 - (view) | Author: Ezio Melotti (ezio.melotti) * ![]() |
Date: 2012-01-30 08:51 | |
Thanks for the patch!
> * fix an error in the error handler for utf-16-le. (In, Python3.2
> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00"
> instead of "A" for some reason)
This should probably be done on a separate patch that will be applied to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will go in 3.3 only. (Note that lot of Unicode-related code changed between 3.2 and 3.3.)
> Should we really reject lone surrogates for UTF-7?
No, I meant only UTF-8/16/32; UTF-7 is fine as is.
|
|||
| msg152420 - (view) | Author: Kang-Hao (Kenny) Lu (kennyluck) | Date: 2012-02-01 00:36 | |
> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
> * make the surrogatepass error handler work for utf-16 and utf-32. (I
> should be able to finish this by today)
Unfortunately this took longer than I thought but here comes the patch.
>> * fix an error in the error handler for utf-16-le. (In, Python3.2
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00"
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2). Rejecting surrogates will
> go in 3.3 only. (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)
This turns out to be just two liners so I fixed that on the way. I can create separate patch with separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I feel less willing to do that for 3.3.
You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed issue #13913 for this.
Also, the strcmps are quite crappy. I am working on issue #13916 (disallow the "surrogatepass" handler for non utf-* encodings). As long as we have that we can examine individual character instead...
In this patch, The "encoding" attribute for UnicodeDecodeException is now changed to return utf16(be|le) for utf-16. This is necessary info for "surrogatepass" to work although admittedly this is rather ugly. Any good idea? A new attribute for Unicode(Decode|Encode)Exception might be helpful but utf-16/32 are fairly uncommon encodings anyway and we should not add more burden for, say, utf-8.
>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.
Good to know.
|
|||
| msg159129 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2012-04-24 09:54 | |
> * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
The patch for issue14579 fixes this in Python 3.2.
The patch for issue14624 fixes this in Python 3.3.
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2013-05-05 13:11:10 | serhiy.storchaka | link | issue17909 dependencies |
| 2012-11-04 20:46:00 | serhiy.storchaka | set | stage: test needed -> patch review versions: + Python 3.4, - Python 3.3 |
| 2012-04-24 09:54:07 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg159129 |
| 2012-02-01 00:36:10 | kennyluck | set | files:
+ surrogatepass_for_utf-16&32.patch messages: + msg152420 |
| 2012-01-30 08:51:05 | ezio.melotti | set | messages: + msg152316 |
| 2012-01-30 07:29:59 | kennyluck | set | files:
+ utf-16&32_reject_surrogates.patch nosy: + kennyluck messages: + msg152310 |
| 2011-11-29 21:15:37 | haypo | set | messages: + msg148614 |
| 2011-11-29 21:13:31 | haypo | set | files:
+ utf_16_32_surrogates.patch keywords: + patch dependencies: - Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c messages: + msg148613 |
| 2011-11-29 20:42:29 | haypo | set | messages: + msg148603 |
| 2011-10-26 04:34:05 | ezio.melotti | set | dependencies: + Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c |
| 2011-09-04 12:46:14 | lemburg | set | messages: + msg143495 |
| 2011-09-04 10:49:30 | ezio.melotti | create | |
