Message 93590 - Python tracker


Author	ezio.melotti
Recipients	Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date	2009-10-05.11:16:27
Message-id	<1254741389.65.0.268065306507.issue5127@psf.upfronthosting.co.za>
In-reply-to

Content

>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.

> That has always been the case. UCS2 doesn't support surrogates.

> However, we have been slowly moving into the direction of making
> the UCS2 storage appear like UTF-16 to the Python programmer.

UCS2 died long ago. Is there any reason why we keep using a UCS2 that
"appears" like UTF-16 instead of real UTF-16?

> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
> surrogates for processing purposes, so care has to be taken
> when using non-BMP code points on narrow builds.

I don't know all the details of the current implementation, but
-- from what I understand reading this (correct me if I'm wrong) -- it
seems that the implementation is half-UCS2, to allow things like the
processing of lone surrogates, and half-UTF-16 (or UTF-16-compatible), to
work with surrogate pairs and hence with chars outside the BMP.
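
To make the "surrogate pair" part concrete, here is a minimal sketch of the
standard UTF-16 arithmetic that maps a code point outside the BMP to two
16-bit code units and back. The helper names to_surrogate_pair and
from_surrogate_pair are made up for illustration; they are not part of the
Python API or of the implementation being discussed.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

A storage that holds these two code units but indexes them as two separate
"characters" is what makes non-BMP handling fragile on narrow builds.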

What are the use cases for processing lone surrogates? Wouldn't it be
better to use UTF-16 and disallow them (since they are illegal) and
possibly provide some other way to deal with them (if it's really needed)?
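
As a rough illustration of "disallow them, but provide some other way", the
sketch below shows how a recent Python 3 behaves, as far as I know: the
strict UTF-16 codec refuses a lone surrogate, while the "surrogatepass"
error handler lets it through explicitly. This is only an example of the
idea; it says nothing about how narrow builds behaved when this message was
written.

```python
lone = "\ud800"  # a high surrogate with no low surrogate following it

# Strict encoding treats the lone surrogate as an error.
try:
    lone.encode("utf-16-le")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# An explicit opt-in still allows round-tripping it when really needed.
raw = lone.encode("utf-16-le", "surrogatepass")
restored = raw.decode("utf-16-le", "surrogatepass")

print(strict_rejected, raw, restored == lone)
```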

History
Date	User	Action	Args
2009-10-05 11:16:29	ezio.melotti	set	recipients: + ezio.melotti, lemburg, amaury.forgeotdarc, Rhamphoryncus, vstinner, bupjae
2009-10-05 11:16:29	ezio.melotti	set	messageid: <1254741389.65.0.268065306507.issue5127@psf.upfronthosting.co.za>
2009-10-05 11:16:27	ezio.melotti	link	issue5127 messages
2009-10-05 11:16:27	ezio.melotti	create