Message 143490 - Python tracker


Author	ezio.melotti
Recipients	ezio.melotti, gvanrossum, lemburg, loewis, tchrist, vstinner
Date	2011-09-04.10:49:29
Message-id	<1315133370.95.0.855318807974.issue12892@psf.upfronthosting.co.za>

Content

From Chapter 3 of the Unicode Standard 6.0 [0], D91:
"""
• UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
• Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.
"""
I.e. UTF-16 should be able to correctly decode a valid surrogate pair, and to encode a non-BMP character using a valid surrogate pair, but it should reject lone surrogates during both encoding and decoding.
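
To make the D91 rule concrete, here is a minimal sketch of a well-formedness check for little-endian UTF-16 bytes without a BOM. The function name utf16le_is_well_formed is made up for this illustration; it is not part of Python's codecs and not how the codec is implemented.

```python
import struct

def utf16le_is_well_formed(data):
    """Return True if data is well-formed UTF-16-LE according to D91.

    Hypothetical helper for illustration only.
    """
    if len(data) % 2:
        return False  # a 16-bit code unit is cut in half
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            return False  # low surrogate without a preceding high surrogate
        else:
            i += 1
    return True

print(utf16le_is_well_formed(b"\x00\xd8\x00\xdc"))  # surrogate pair for U+10000
print(utf16le_is_well_formed(b"\x00\xd8"))          # lone high surrogate
```

The first call checks a valid pair and the second a lone surrogate, which are exactly the two cases the definition distinguishes.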

On Python 3, the utf-16 codec can encode every code point from U+0000 to U+10FFFF, lone surrogates included, but it cannot decode lone surrogates (I'm not sure whether this is by design or whether it simply fails because it expects a second surrogate that is missing).
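
The behaviour described here can be checked on whatever interpreter is at hand with a small probe like the one below. The helper name probe is invented for this sketch, and the outcome for the lone surrogate depends on the Python version, so no expected output is given.

```python
def probe(action, *args):
    """Run a codec call; return its result, or the error class name on failure."""
    try:
        return action(*args)
    except UnicodeError as exc:
        return type(exc).__name__

lone = "\ud800"      # lone high surrogate
pair = "\U00010000"  # non-BMP character, needs a surrogate pair in UTF-16

results = {
    "encode pair": probe(pair.encode, "utf-16-le"),
    "decode pair": probe(b"\x00\xd8\x00\xdc".decode, "utf-16-le"),
    "encode lone": probe(lone.encode, "utf-16-le"),
    "decode lone": probe(b"\x00\xd8".decode, "utf-16-le"),
}

for label, outcome in results.items():
    # ascii() keeps the output printable even if a surrogate comes back.
    print(label, "->", ascii(outcome))
```

A codec that follows D91 would report an error for both "lone" rows and succeed for both "pair" rows.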

----------------------------------------------

From Chapter 3 of the Unicode Standard 6.0 [0], D90:
"""
• UTF-32 encoding form: The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.
• Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.
"""
I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs, both during encoding and during decoding.

On Python 3, the utf-32 codec can encode and decode every code point from U+0000 to U+10FFFF, surrogates included.
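
As a rough illustration of the gap between D90 and the codec, the sketch below implements a strict check for little-endian UTF-32 bytes and compares it with what the running interpreter's utf-32-le decoder accepts. The names is_scalar_value, utf32le_is_well_formed and codec_accepts are assumptions made for this example, and the codec column will differ between Python versions.

```python
import struct

def is_scalar_value(cp):
    """True for code points in U+0000..U+10FFFF that are not surrogates."""
    return cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def utf32le_is_well_formed(data):
    """Strict D90 check: every 32-bit code unit must be a Unicode scalar value."""
    if len(data) % 4:
        return False  # a 32-bit code unit is cut short
    units = struct.unpack("<%dI" % (len(data) // 4), data)
    return all(is_scalar_value(unit) for unit in units)

def codec_accepts(data):
    """Report whether the interpreter's utf-32-le decoder accepts data."""
    try:
        data.decode("utf-32-le")
    except UnicodeDecodeError:
        return False
    return True

samples = {
    "U+0041": b"\x41\x00\x00\x00",
    "U+10000": b"\x00\x00\x01\x00",
    "lone surrogate": b"\x00\xd8\x00\x00",
    "surrogate pair": b"\x00\xd8\x00\x00\x00\xdc\x00\x00",
}

for name, data in samples.items():
    print(name, "strict:", utf32le_is_well_formed(data), "codec:", codec_accepts(data))
```

Any row where the two columns disagree is a case where the codec is more lenient than the standard.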

----------------------------------------------

I think that:
* this should be fixed in 3.3;
* it's a bug, so the fix /might/ be backported to 3.2. However, it's also a fairly big change in behavior, so it might be better to leave it for 3.3 only;
* it's better to leave 2.7 alone, since even the utf-8 codec is broken there;
* the surrogatepass error handler should work with the utf-16 and utf-32 codecs too.
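
The last point can be sketched as a round trip through each codec. This assumes an interpreter in which surrogatepass is accepted by the utf-16 and utf-32 codecs (Python 3.4 and later, according to the codecs documentation); on the versions discussed in this message it only worked with utf-8. The helper name round_trip is made up for the example.

```python
lone = "\ud800"  # lone high surrogate

def round_trip(text, codec):
    """Encode and decode text, letting surrogates through in both directions."""
    data = text.encode(codec, "surrogatepass")
    return data, data.decode(codec, "surrogatepass")

for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data, back = round_trip(lone, codec)
    # ascii() avoids printing a raw surrogate to the terminal.
    print(codec, data, ascii(back), back == lone)
```

With the handler in place the surrogate survives the round trip, while the default strict handler can stay conformant and reject it.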

Note that this has already been reported in #3672, but in the end only the utf-8 codec was fixed.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

History
Date	User	Action	Args
2011-09-04 10:49:31	ezio.melotti	set	recipients: + ezio.melotti, lemburg, gvanrossum, loewis, vstinner, tchrist
2011-09-04 10:49:30	ezio.melotti	set	messageid: <1315133370.95.0.855318807974.issue12892@psf.upfronthosting.co.za>
2011-09-04 10:49:30	ezio.melotti	link	issue12892 messages
2011-09-04 10:49:29	ezio.melotti	create