Message 147457 - Python tracker

Author	ezio.melotti
Recipients	ezio.melotti, loewis, petri.lehtinen, pitrou
Date	2011-11-12.01:55:53
Message-id	<1321062955.22.0.0733620991294.issue13333@psf.upfronthosting.co.za>
In-reply-to

Content
FWIW Wikipedia says "Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates) and then in modified Base64."

So one possible interpretation is that when encoding a non-BMP character, it should first be converted into a surrogate pair, and then each of the surrogates should be encoded just like any other 16-bit code unit.
While decoding, it seems reasonable to do the opposite, i.e. recombine the surrogate pair.
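
To make that interpretation concrete, here is a minimal sketch of the arithmetic involved. It is not the codec's actual implementation; the helper names (`to_surrogate_pair`, `from_surrogate_pair`, `modified_base64`) are made up for illustration, and the Base64 helper only covers the payload between the `+` and `-` shift characters.

```python
import base64


def to_surrogate_pair(cp):
    """Split a non-BMP code point (U+10000..U+10FFFF) into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a non-BMP code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def modified_base64(units):
    """Encode 16-bit code units as big-endian bytes, then Base64 without '=' padding."""
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return base64.b64encode(raw).rstrip(b"=").decode("ascii")


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))           # the two surrogate code units
print(modified_base64([high, low]))  # Base64 payload for U+10000
print(hex(from_surrogate_pair(high, low)))
```

Under this reading, the decoder would apply `from_surrogate_pair` whenever a high surrogate is immediately followed by a low one.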

The RFC doesn't say anything about lone surrogates, but I think that the fact that surrogates are used internally doesn't necessarily mean that the codec should be able to encode/decode them when they are not paired.  The other UTF-* codecs reject them, but that's because it is explicitly forbidden by their respective standards.
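
For comparison, a small sketch of how a lone surrogate is rejected by one of the other UTF-* codecs. This assumes Python 3 and uses UTF-8 only; `rejects_lone_surrogate` is a hypothetical helper, and the `surrogatepass` error handler is shown just to illustrate the explicit opt-in that exists for that codec.

```python
lone = "\ud800"  # an unpaired high surrogate


def rejects_lone_surrogate(codec):
    """Return True if encoding a lone surrogate with `codec` raises an error."""
    try:
        lone.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


print(rejects_lone_surrogate("utf-8"))

# Explicit opt-in: the 'surrogatepass' handler lets the surrogate through.
raw = lone.encode("utf-8", "surrogatepass")
print(raw)
```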

So I'm +1 on recombining them while decoding, and ±0 on allowing lone surrogates.

History
Date	User	Action	Args
2011-11-12 01:55:55	ezio.melotti	set	recipients: + ezio.melotti, loewis, pitrou, petri.lehtinen
2011-11-12 01:55:55	ezio.melotti	set	messageid: <1321062955.22.0.0733620991294.issue13333@psf.upfronthosting.co.za>
2011-11-12 01:55:54	ezio.melotti	link	issue13333 messages
2011-11-12 01:55:53	ezio.melotti	create