Message 93595 - Python tracker

Author	lemburg
Recipients	Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date	2009-10-05.11:51:26
Message-id	<4AC9DDBD.8050107@egenix.com>
In-reply-to	<1254741389.65.0.268065306507.issue5127@psf.upfronthosting.co.za>

Content
This is off-topic for the tracker item, but I'll reply anyway:

Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
>>> We might keep the old public API for compatibility, but it should be
>>> clearly marked as broken for non-BMP scalar values.
> 
>> That has always been the case. UCS2 doesn't support surrogates.
> 
>> However, we have been slowly moving in the direction of making
>> the UCS2 storage appear like UTF-16 to the Python programmer.
> 
> UCS2 died long ago, is there any reason why we keep using a UCS2 that
> "appears" like UTF-16 instead of real UTF-16?

UCS2 is how we store Unicode in Python for narrow builds internally.
It's a storage format, not an encoding.

However, on narrow builds such as the Windows builds, you will sometimes
want to create Unicode strings that use non-BMP code points. Since
both UCS2 and UCS4 can represent the UTF-16 encoding, it's handy to
expose a bit of automatic conversion at the Python level to make
things easier for the programmer.
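
To illustrate the kind of conversion meant here, the following is a minimal sketch of how a non-BMP code point maps to a UTF-16 surrogate pair. It assumes a current Python 3 interpreter, where narrow builds no longer exist, so the arithmetic is done by hand; the helper names `to_surrogate_pair` and `utf16_code_units` are made up for this example and are not part of any Python API.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text as encoded by the UTF-16 codec."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
pair = to_surrogate_pair(0x1D11E)
units = utf16_code_units(chr(0x1D11E))
print([hex(u) for u in pair], [hex(u) for u in units])
```

The hand-computed pair and the code units produced by the codec agree, which is the sense in which 16-bit storage can carry UTF-16 data.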

>> This process is not yet complete and will likely never complete
>> since it must still be possible to create things like lone
>> surrogates for processing purposes, so care has to be taken
>> when using non-BMP code points on narrow builds.
> 
> I don't exactly know all the details of the current implementation, but
> -- from what I understand reading this (correct me if I'm wrong) -- it
> seems that the implementation is half-UCS2 to allow things like the
> processing of lone surrogates and half-UTF16 (or UTF-16-compatible) to
> work with surrogate pairs and hence with chars outside the BMP.
> 
> What are the use cases for processing the lone surrogates? Wouldn't it be
> better to use UTF-16 and disallow them (since they are illegal) and
> possibly provide some other way to deal with them (if it's really needed)?

No, because Python is meant to be used for working on all Unicode
code points. Lone surrogates are not allowed in transfer encodings
such as UTF-16 or UTF-8, but they are valid Unicode code points and
you need to be able to work with them, since you may want to construct
surrogate pairs by hand or get lone surrogates as a result of slicing a
Unicode string.
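
As a rough sketch of the distinction drawn above, the snippet below holds a lone surrogate in a string, shows that the strict UTF-8 and UTF-16 codecs refuse to encode it, and combines a surrogate pair by hand. It assumes a current Python 3 interpreter rather than the narrow build discussed in this thread; `combine_surrogates` and `can_encode` are hypothetical helper names introduced only for illustration.

```python
def combine_surrogates(high, low):
    """Combine a high and a low surrogate into one non-BMP code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate out of range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate out of range")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def can_encode(text, encoding="utf-8"):
    """Report whether text encodes with the codec's default strict handler."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


lone = "\ud834"                      # a lone high surrogate, held in a str
print(len(lone), hex(ord(lone)))     # it is a single code point
print(can_encode(lone, "utf-8"))     # strict UTF-8 rejects it
print(can_encode(lone, "utf-16"))    # so does strict UTF-16
print(hex(combine_surrogates(0xD834, 0xDD1E)))
```

The string object accepts the lone surrogate as a code point; only the transfer encodings reject it.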

History
Date	User	Action	Args
2009-10-05 11:51:28	lemburg	set	recipients: + lemburg, amaury.forgeotdarc, Rhamphoryncus, vstinner, ezio.melotti, bupjae
2009-10-05 11:51:27	lemburg	link	issue5127 messages
2009-10-05 11:51:26	lemburg	create