classification
Title: UTF-16 and UTF-32 codecs should reject (lone) surrogates
Type: behavior Stage: patch review
Components: Unicode Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: ezio.melotti, gvanrossum, haypo, kennyluck, lemburg, loewis, serhiy.storchaka, tchrist
Priority: high Keywords: patch

Created on 2011-09-04 10:49 by ezio.melotti, last changed 2012-11-04 20:46 by serhiy.storchaka.

Files
File name Uploaded Description Edit
utf_16_32_surrogates.patch haypo, 2011-11-29 21:13 review
utf-16&32_reject_surrogates.patch kennyluck, 2012-01-30 07:29 rejects surrogate and calls the error handlers review
surrogatepass_for_utf-16&32.patch kennyluck, 2012-02-01 00:36 surrogatepass erro handler for utf-16/32
Messages (9)
msg143490 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-04 10:49
From Chapter 03 of the Unicode Standard 6[0], D91:
"""
• UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
• Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.
"""
I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and encode a non-BMP character using a  valid surrogate pair, but it should reject lone surrogates both during encoding and decoding.

On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to U+10FFFF (including (lone) surrogates), but it's not able to decode lone surrogates (not sure if this is by design or if it just fails because it expects another (missing) surrogate).

----------------------------------------------

From Chapter 03 of the Unicode Standard 6[0], D90:
"""
• UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
• Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.
"""
I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both during encoding and during decoding.

On Python 3, the utf-32 codec can encode and decode all the codepoints from U+0000 to U+10FFFF (including surrogates).

----------------------------------------------

I think that:
  * this should be fixed in 3.3;
  * it's a bug, so the fix /might/ be backported to 3.2.  Hoverver it's also a fairly big change in behavior, so it might be better to leave it for 3.3 only;
  * it's better to leave 2.7 alone, even the utf-8 codec is broken there;
  * the surrogatepass error handler should work with the utf-16 and utf-32 codecs too.


Note that this has been already reported in #3672, but eventually only the utf-8 codec was fixed.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
msg143495 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-09-04 12:46
Ezio Melotti wrote:
> 
> New submission from Ezio Melotti <ezio.melotti@gmail.com>:
> 
>>From Chapter 03 of the Unicode Standard 6[0], D91:
> """
> • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
> • Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.
> """
> I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and encode a non-BMP character using a  valid surrogate pair, but it should reject lone surrogates both during encoding and decoding.
> 
> On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to U+10FFFF (including (lone) surrogates), but it's not able to decode lone surrogates (not sure if this is by design or if it just fails because it expects another (missing) surrogate).
> 
> ----------------------------------------------
> 
>>From Chapter 03 of the Unicode Standard 6[0], D90:
> """
> • UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
> • Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.
> """
> I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both during encoding and during decoding.
> 
> On Python 3, the utf-32 codec can encode and decode all the codepoints from U+0000 to U+10FFFF (including surrogates).
> 
> ----------------------------------------------
> 
> I think that:
>   * this should be fixed in 3.3;
>   * it's a bug, so the fix /might/ be backported to 3.2.  Hoverver it's also a fairly big change in behavior, so it might be better to leave it for 3.3 only;
>   * it's better to leave 2.7 alone, even the utf-8 codec is broken there;
>   * the surrogatepass error handler should work with the utf-16 and utf-32 codecs too.
> 
> 
> Note that this has been already reported in #3672, but eventually only the utf-8 codec was fixed.

All UTF codecs should reject lone surrogates in strict error mode,
but let them pass using the surrogatepass error handler (the UTF-8
codec already does) and apply the usual error handling for ignore
and replace.
msg148603 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-11-29 20:42
Python 3.3 has a strange behaviour:

>>> '\uDBFF\uDFFF'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'
>>> '\U0010ffff'.encode('utf-16-le').decode('utf-16-le')
'\U0010ffff'

I would expect text.decode(encoding).encode(encoding)==text or an encode or decode error.

So I agree that the encoder should reject lone surogates.
msg148613 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-11-29 21:13
Patch rejecting surrogates in UTF-16 and UTF-32 encoders.

I don't think that Python 2.7 and 3.2 should be changed in a minor version.
msg148614 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-11-29 21:15
Hum, my patch doesn't call the error handler.
msg152310 - (view) Author: Kang-Hao (Kenny) Lu (kennyluck) Date: 2012-01-30 07:29
Attached patch does the following beyond what the patch from haypo does:
  * call the error handler
  * reject 0xd800~0xdfff when decoding utf-32

The followings are on my TODO list, although this patch doesn't depend on any of these and can be reviewed and landed separately:
  * make the surrogatepass error handler work for utf-16 and utf-32. (I should be able to finish this by today)
  * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)
  * make unicode_encode_call_errorhandler return bytes so that we can simplify this patch. (This arguably belongs to a separate bug so I'll file it when needed)

> All UTF codecs should reject lone surrogates in strict error mode,

Should we really reject lone surrogates for UTF-7? There's a test in test_codecs.py that tests "\udc80" to be encoded b"+3IA-" (. Given that UTF-7 is not really part of the Unicode Standard and it is more like a "data encoding" than a "text encoding" to me, I am not sure it's a good idea.

> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.

For 'replace', the patch now emits b"\x00?" instead of b"?" so that UTF-16 stream doesn't get corrupted. It is not "usual" and not matching

  # Implements the ``replace`` error handling: malformed data is replaced
  # with a suitable replacement character such as ``'?'`` in bytestrings 
  # and ``'\ufffd'`` in Unicode strings.

in the documentation. What do we do? Are there other encodings that are not ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think it would be better to use encoded U+fffd whenever possible and fall back to '?'. What do you think?

Some other self comments on my patch:
  * In the STORECHAR macro for utf-16 and utf-32, I change all instances of "ch & 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know if this is actually better or if this makes any difference at all.
  * The code for utf-16 and utf-32 are duplicates of the uft-8 one. That one's complexity comes from issue #8092 . Not sure if there are ways to simplify these. For example, are there suitable functions there so that we don't need to check integer overflow at these places?
msg152316 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-01-30 08:51
Thanks for the patch!

>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" 
> instead of "A" for some reason)

This should probably be done on a separate patch that will be applied to 3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will go in 3.3 only.  (Note that lot of Unicode-related code changed between 3.2 and 3.3.)

> Should we really reject lone surrogates for UTF-7?

No, I meant only UTF-8/16/32; UTF-7 is fine as is.
msg152420 - (view) Author: Kang-Hao (Kenny) Lu (kennyluck) Date: 2012-02-01 00:36
> The followings are on my TODO list, although this patch doesn't depend
> on any of these and can be reviewed and landed separately:
>  * make the surrogatepass error handler work for utf-16 and utf-32. (I
>    should be able to finish this by today)

Unfortunately this took longer than I thought but here comes the patch.

>>  * fix an error in the error handler for utf-16-le. (In, Python3.2 
>> b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" 
>> instead of "A" for some reason)
>
> This should probably be done on a separate patch that will be applied
> to 3.2/3.3 (assuming that it can go to 3.2).  Rejecting surrogates will
> go in 3.3 only.  (Note that lot of Unicode-related code changed between
> 3.2 and 3.3.)

This turns out to be just two liners so I fixed that on the way. I can create separate patch with separate test for 3.2 (certainly doable) and even for 3.3, but since the test is now part of test_lone_surrogates, I feel less willing to do that for 3.3.

You might notice the codec naming inconsistency (utf-16-be and utf16be for encoding and decoding respectively). I have filed issue #13913 for this.

Also, the strcmps are quite crappy. I am working on issue #13916 (disallow the "surrogatepass" handler for non utf-* encodings). As long as we have that we can examine individual character instead...

In this patch, The "encoding" attribute for UnicodeDecodeException is now changed to return utf16(be|le) for utf-16. This is necessary info for "surrogatepass" to work although admittedly this is rather ugly. Any good idea? A new attribute for Unicode(Decode|Encode)Exception might be helpful but utf-16/32 are fairly uncommon encodings anyway and we should not add more burden for, say, utf-8.

>> Should we really reject lone surrogates for UTF-7?
>
> No, I meant only UTF-8/16/32; UTF-7 is fine as is.

Good to know.
msg159129 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-04-24 09:54
>  * fix an error in the error handler for utf-16-le. (In, Python3.2 b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A" for some reason)

The patch for issue14579 fixes this in Python 3.2.

The patch for issue14624 fixes this in Python 3.3.
History
Date User Action Args
2013-05-05 13:11:10serhiy.storchakalinkissue17909 dependencies
2012-11-04 20:46:00serhiy.storchakasetstage: test needed -> patch review
versions: + Python 3.4, - Python 3.3
2012-04-24 09:54:07serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg159129
2012-02-01 00:36:10kennylucksetfiles: + surrogatepass_for_utf-16&32.patch

messages: + msg152420
2012-01-30 08:51:05ezio.melottisetmessages: + msg152316
2012-01-30 07:29:59kennylucksetfiles: + utf-16&32_reject_surrogates.patch
nosy: + kennyluck
messages: + msg152310

2011-11-29 21:15:37hayposetmessages: + msg148614
2011-11-29 21:13:31hayposetfiles: + utf_16_32_surrogates.patch
keywords: + patch
dependencies: - Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c
messages: + msg148613
2011-11-29 20:42:29hayposetmessages: + msg148603
2011-10-26 04:34:05ezio.melottisetdependencies: + Refactor code using unicode_encode_call_errorhandler() in unicodeobject.c
2011-09-04 12:46:14lemburgsetmessages: + msg143495
2011-09-04 10:49:30ezio.melotticreate