Author Rhamphoryncus
Recipients Rhamphoryncus
Date 2008-08-24.21:56:50
Message-id <1219615011.7.0.279537729369.issue3672@psf.upfronthosting.co.za>
In-reply-to
Content
The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
UTF-32 should be treated as errors.  Lone surrogates in UTF-16 should
probably be treated as errors too (but only during encoding/decoding;
unicode objects on UTF-16 builds should allow them to be created through
slicing).

http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40
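
A minimal sketch of the check this policy implies, assuming Python 3, where a str is always a sequence of code points. The helper names surrogate_positions and is_well_formed are hypothetical, made up for illustration; they are not part of any codec API.

```python
def surrogate_positions(text):
    """Return the indexes of all surrogate code points (U+D800..U+DFFF)."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_well_formed(text):
    """True if text contains no surrogate code points at all."""
    return not surrogate_positions(text)


lone = 'a\ud801b'          # one lone surrogate at index 1
astral = '\U00010400'      # a real non-BMP scalar value, no surrogates
```

A codec following the FAQ would refuse any input for which is_well_formed returns False, while slicing would remain free to produce such strings.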

Lone surrogate in UTF-8:
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'

Surrogate pair in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'
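
For comparison, a sketch of the strict behaviour being asked for, assuming a current Python 3 interpreter, where the UTF-8 decoder rejects these byte sequences unless the caller opts in with the 'surrogatepass' error handler. The helper decodes_strictly is a hypothetical name used only for this illustration.

```python
def decodes_strictly(data):
    """True if data is accepted by the strict UTF-8 decoder."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


lone = b'\xed\xa0\x81'               # encoded form of U+D801
pair = b'\xed\xa0\x81\xed\xb0\x80'   # encoded forms of U+D801 U+DC00
astral = b'\xf0\x90\x90\x80'         # well-formed UTF-8 for U+10400

# Opting in explicitly still yields the surrogate code point.
passed = lone.decode('utf-8', 'surrogatepass')
```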

On a UTF-32 build, encoding a surrogate pair as UTF-16 and then decoding it
again produces the proper non-surrogate scalar value, so the round trip
changes the string.  This has security implications, although they should
rarely arise because characters outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'
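
The same round trip can be sketched in Python 3, assuming the 'surrogatepass' error handler is used to get past the encoder (the strict encoder refuses the pair). The variable names are illustrative only.

```python
pair = '\ud801\udc00'   # two surrogate code points, not one character

# 'surrogatepass' lets the UTF-16 encoder emit the surrogates as code units.
encoded = pair.encode('utf-16-le', 'surrogatepass')

# The decoder sees a valid surrogate pair and merges it into one scalar value.
decoded = encoded.decode('utf-16-le')

changed = decoded != pair   # the string did not survive the round trip
```

A string that was compared or validated before the round trip is therefore not equal to the string that comes out of it.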

Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'
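
A sketch of the encoder behaviour proposed here, assuming a current Python 3 interpreter, where the strict UTF-8, UTF-16 and UTF-32 encoders all refuse a lone surrogate. The helper encode_or_error is a hypothetical name for this illustration.

```python
def encode_or_error(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError a strict encode raised."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


lone = '\ud801'
results = {enc: encode_or_error(lone, enc)
           for enc in ('utf-8', 'utf-16', 'utf-32')}

# A genuine non-BMP character still encodes to a surrogate pair in UTF-16.
astral_bytes = encode_or_error('\U00010400', 'utf-16-le')
```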


I have received a report from a user who decoded bad data with
x.decode('utf-8', 'replace') and then got an error from Gtk+ when the
ill-formed surrogates reached it.
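
Until the codecs reject such input, a caller could scrub the text before handing it to a library that requires well-formed UTF-8. This is only a sketch of one possible workaround, written for Python 3; scrub_surrogates is a hypothetical helper, and replacing with U+FFFD is an assumption about what the caller wants.

```python
def scrub_surrogates(text, replacement='\ufffd'):
    """Replace every surrogate code point (U+D800..U+DFFF) in text."""
    return ''.join(replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch
                   for ch in text)


dirty = 'ok\ud801'
clean = scrub_surrogates(dirty)

# The scrubbed text is accepted by the strict UTF-8 encoder.
clean_bytes = clean.encode('utf-8')
```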

Fixing this would cause issue 3297 to blow up loudly, rather than silently.
History
Date User Action Args
2008-08-24 21:56:51  Rhamphoryncus  set  recipients: + Rhamphoryncus
2008-08-24 21:56:51  Rhamphoryncus  set  messageid: <1219615011.7.0.279537729369.issue3672@psf.upfronthosting.co.za>
2008-08-24 21:56:51  Rhamphoryncus  link  issue3672 messages
2008-08-24 21:56:50  Rhamphoryncus  create