Message71889
The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or
UTF-32 should be treated as errors. Lone surrogates in UTF-16 should
probably be treated as errors too (but only during encoding/decoding;
unicode objects on UTF-16 builds should allow them to be created through
slicing).
http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40
Lone surrogate in UTF-8 (effectively CESU-8):
>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'
Surrogate pair in UTF-8:
>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'
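For anyone who wants to check for this on their own side, here is a minimal sketch. It assumes Python 3, where the strict UTF-8 decoder already rejects these byte sequences, so the `surrogatepass` error handler is used to reproduce the lenient behaviour shown above. `find_surrogates` is a hypothetical helper name, not part of any library.

```python
def find_surrogates(text):
    """Return the indexes of code points in the surrogate range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


# The same bytes as the surrogate pair example, decoded leniently.
lenient = b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8', 'surrogatepass')

# A strict decoder should refuse the same input outright.
try:
    b'\xed\xa0\x81\xed\xb0\x80'.decode('utf-8')
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
```

A caller could run `find_surrogates` on decoded text and treat a non-empty result as ill-formed input.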
On a UTF-32 build, encoding a surrogate pair as UTF-16 and then decoding it
again produces the proper non-surrogate scalar value, so the string changes
across the round trip. This has security implications, although they will
rarely come up because characters outside the BMP are rare:
>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'
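To illustrate why the round trip matters, here is a rough sketch of a filter being bypassed. It assumes Python 3.4 or later, where `surrogatepass` is accepted by the UTF-16 codecs and stands in for the lenient encoder described above. `BLOCKED` and `passes_filter` are hypothetical names made up for the illustration.

```python
BLOCKED = '\U00010400'  # hypothetical character that a filter wants to reject


def passes_filter(text):
    """Return True if the blocked character does not appear in text."""
    return BLOCKED not in text


# Two surrogate code points that together spell U+10400 in UTF-16.
pair = '\ud801\udc00'

# Lenient encode, then ordinary decode: the pair collapses into one scalar value.
round_tripped = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
```

Here `pair` gets through the filter, but the string that comes out of the round trip contains the blocked character.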
Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails
(correctly), but encoding one does not:
>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'
I have gotten a report of a user decoding bad data using
x.decode('utf-8', 'replace'), then getting an error from Gtk+ when the
ill-formed surrogates reached it.
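Until the codecs reject surrogates themselves, a caller in that position could scrub them before handing text to another library. The following is only a sketch of that workaround, assuming Python 3 and using `surrogatepass` to simulate the lenient decode; `scrub_surrogates` is a hypothetical helper, and replacing with U+FFFD is one possible policy, not necessarily what the reporter needed.

```python
REPLACEMENT = '\ufffd'  # U+FFFD REPLACEMENT CHARACTER


def scrub_surrogates(text):
    """Replace every code point in U+D800..U+DFFF with U+FFFD."""
    return ''.join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


# Simulate bad data that a lenient decoder let through.
dirty = b'ok\xed\xa0\x81'.decode('utf-8', 'surrogatepass')
clean = scrub_surrogates(dirty)
```

After scrubbing, `clean` can be encoded with the strict UTF-8 codec without raising an error.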
Fixing this would cause issue 3297 to blow up loudly, rather than silently.
Date | User | Action | Args
2008-08-24 21:56:51 | Rhamphoryncus | set | recipients: + Rhamphoryncus
2008-08-24 21:56:51 | Rhamphoryncus | set | messageid: <1219615011.7.0.279537729369.issue3672@psf.upfronthosting.co.za>
2008-08-24 21:56:51 | Rhamphoryncus | link | issue3672 messages
2008-08-24 21:56:50 | Rhamphoryncus | create |