Message 71889 - Python tracker


Author	Rhamphoryncus
Recipients	Rhamphoryncus
Date	2008-08-24.21:56:50
Message-id	<1219615011.7.0.279537729369.issue3672@psf.upfronthosting.co.za>
In-reply-to

Content

The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
UTF-32 should be treated as errors.  Lone surrogates in UTF-16 should
probably be treated as errors too (but only during encoding/decoding;
unicode objects on UTF-16 builds should allow them to be created through
slicing).

http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40
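
The rule described above amounts to rejecting any code point in the range U+D800 to U+DFFF. A minimal sketch of that check follows, written for Python 3 rather than the Python 2 sessions in this report. The helper name `contains_surrogate` is hypothetical and is not part of any codec.

```python
def contains_surrogate(text):
    """Return True if any code point in text lies in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)
```

A strict encoder or decoder could run a test like this and raise an error instead of passing the surrogate through.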

Lone surrogate in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'

Surrogate pair in UTF-8:
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'
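
To see why these byte sequences land on surrogates, the three-byte UTF-8 layout can be unpacked by hand. This is only an illustrative sketch in Python 3. The function name `decode_three_byte` is made up for the example and does no validation of the lead or continuation bytes.

```python
def decode_three_byte(b0, b1, b2):
    """Combine the payload bits of a 3-byte UTF-8 sequence
    (1110xxxx 10xxxxxx 10xxxxxx) into a code point."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# The two sequences from the sessions above.
first = decode_three_byte(0xED, 0xA0, 0x81)   # 0xD000 | 0x0800 | 0x01
second = decode_three_byte(0xED, 0xB0, 0x80)  # 0xD000 | 0x0C00 | 0x00

# Both results fall inside the surrogate range U+D800..U+DFFF,
# which is what a strict UTF-8 decoder would have to reject.
both_surrogates = all(0xD800 <= cp <= 0xDFFF for cp in (first, second))
```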

On a UTF-32 build, encoding a surrogate pair with UTF-16 and then decoding
it again produces the proper non-surrogate scalar value, so the round trip
changes the string.  This has security implications, although they will
rarely arise because characters outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'
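
The scalar value in that result follows from the standard UTF-16 surrogate pair arithmetic. A small sketch in Python 3; the function name `combine_surrogates` is hypothetical:

```python
def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to its scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

scalar = combine_surrogates(0xD801, 0xDC00)
```

Here `scalar` is 0x10400, matching the `u'\U00010400'` shown above.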

Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'
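
The output above is a little-endian BOM (FF FE) followed by the single code unit 0xD801. One way a strict codec might detect this is to scan the 16-bit code units for surrogates that are not part of a high/low pair. The following is a sketch under that assumption, in Python 3. `lone_surrogate_indexes` is a hypothetical helper, not an existing API.

```python
import struct

def lone_surrogate_indexes(units):
    """Return the indexes of unpaired surrogates in a sequence of
    16-bit code units."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate is valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            lone.append(i)
        i += 1
    return lone

# Strip the 2-byte BOM from the output above and unpack the rest
# as little-endian 16-bit code units.
payload = b'\xff\xfe\x01\xd8'[2:]
units = struct.unpack('<%dH' % (len(payload) // 2), payload)
bad = lone_surrogate_indexes(units)
```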


I have received a report from a user who decoded bad data with
x.decode('utf-8', 'replace') and then got an error from Gtk+ when the
ill-formed surrogates reached it.
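
Until the codecs reject such input, a caller in that situation could scrub the decoded text before handing it to a strict consumer. This is a caller-side workaround sketched in Python 3, not the fix proposed in this issue. The name `replace_surrogates` is hypothetical.

```python
import re

# Matches any single code point in the surrogate range U+D800..U+DFFF.
_SURROGATES = re.compile('[\ud800-\udfff]')

def replace_surrogates(text):
    """Replace every surrogate code point with U+FFFD REPLACEMENT CHARACTER."""
    return _SURROGATES.sub('\ufffd', text)
```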

Fixing this would cause issue 3297 to blow up loudly, rather than silently.

History
Date	User	Action	Args
2008-08-24 21:56:51	Rhamphoryncus	set	recipients: + Rhamphoryncus
2008-08-24 21:56:51	Rhamphoryncus	set	messageid: <1219615011.7.0.279537729369.issue3672@psf.upfronthosting.co.za>
2008-08-24 21:56:51	Rhamphoryncus	link	issue3672 messages
2008-08-24 21:56:50	Rhamphoryncus	create