Message 93585 - Python tracker


Author	lemburg
Recipients	Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date	2009-10-05.09:03:07
Message-id	<4AC9B649.7040308@egenix.com>
In-reply-to	<1254692819.07.0.257919015869.issue5127@psf.upfronthosting.co.za>

Content
Adam Olsen wrote:
> 
> Adam Olsen <rhamph@gmail.com> added the comment:
> 
> Surrogates aren't optional features of UTF-16, we really need to get
> this fixed.  That includes .isalpha().

We use UCS2 on narrow Python builds, not UTF-16.

> We might keep the old public API for compatibility, but it should be
> clearly marked as broken for non-BMP scalar values.

That has always been the case. UCS2 doesn't support surrogates.

However, we have been slowly moving in the direction of making
the UCS2 storage appear like UTF-16 to the Python programmer.

This process is not yet complete and will likely never be complete,
since it must still be possible to create things like lone
surrogates for processing purposes, so care has to be taken
when using non-BMP code points on narrow builds.
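
To make the surrogate issue concrete, here is a minimal sketch, written for
current Python 3 (which no longer has narrow builds), of how a non-BMP code
point maps to the two 16-bit code units that UTF-16 uses for it. The helper
names to_surrogate_pair and from_surrogate_pair are made up for illustration
and are not part of any CPython API.

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10400 DESERET CAPITAL LETTER LONG I lies outside the BMP.
high, low = to_surrogate_pair(0x10400)

# A lone surrogate can still be created as a one-element string,
# which is the "processing purposes" case mentioned above.
lone = chr(high)
```

A type lookup that only ever sees one 16-bit unit at a time gets the
surrogate, not the letter: chr(0x10400).isalpha() is true for the full code
point, while the lone high surrogate is not alphabetic.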

> I don't see a problem with changing 2.x.  The existing behaviour is
> broken for non-BMP scalar values, so surely nobody can claim dependence
> on it.

No, but changing the APIs from 16-bit integers to 32-bit integers
does require a recompile of all code using them. Otherwise you
end up with segfaults.

Also, the Unicode type database itself uses Py_UNICODE, so
case mapping would fail for non-BMP code points.
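
As a rough illustration of why a 16-bit parameter type cannot carry these
code points, the sketch below uses ctypes.c_uint16 as a stand-in for a
16-bit Py_UNICODE. This is an assumption made for demonstration only: it
does not call the real type database API, and it relies on the case mappings
of a current Python 3.

```python
import ctypes

cp = 0x10400  # DESERET CAPITAL LETTER LONG I, outside the BMP

# ctypes integer types do no overflow checking, so this behaves like
# passing the value through a 16-bit C parameter: the top bits are lost.
truncated = ctypes.c_uint16(cp).value

# Lowercasing the full code point gives the expected Deseret small letter.
full_lower = chr(cp).lower()

# Lowercasing the truncated value maps an unrelated BMP character instead.
wrong_lower = chr(truncated).lower()

# Looking up one surrogate on its own finds no case mapping at all.
surrogate_lower = chr(0xD801).lower()
```

In this sketch the truncated value is 0x0400, a Cyrillic letter, so the
lookup silently answers for the wrong character; the lone surrogate comes
back unchanged.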

So if we want to support accessing non-BMP type information
on narrow builds, we'd need to change the complete
Unicode type database API to work with UCS4 code points and then
provide a backwards compatible C API using Py_UNICODE. Due
to the UCS2/UCS4 API renaming done in unicodeobject.h, this
would amount to exposing both the UCS2 and the UCS4 variants
of the APIs on narrow builds.
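
The shape of that proposal could look roughly like the following Python
model. The function names isalpha_ucs4 and isalpha_ucs2 are hypothetical
and only mimic the idea of a wide lookup plus a backwards compatible 16-bit
entry point; the real change would be made in the C API.

```python
import unicodedata


def isalpha_ucs4(cp):
    """Hypothetical wide lookup: accepts any code point up to 0x10FFFF."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range: %#x" % cp)
    # All letter categories (Lu, Ll, Lt, Lm, Lo) start with "L".
    return unicodedata.category(chr(cp)).startswith("L")


def isalpha_ucs2(unit):
    """Hypothetical legacy entry point limited to 16-bit values."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits: %#x" % unit)
    # Old callers keep their signature; the work is done by the wide variant.
    return isalpha_ucs4(unit)
```

Existing callers would keep using the 16-bit variant unchanged, while new
code could pass full code points to the wide one.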

With such an approach we would not break the binary API and
would still get the full UCS4 range of code points in the type
database. The change would be possible in Python 2.x and
3.x (which now both use the same strategy with respect to change
management).

Would someone be willing to work on this?

History
Date	User	Action	Args
2009-10-05 09:03:09	lemburg	set	recipients: + lemburg, amaury.forgeotdarc, Rhamphoryncus, vstinner, ezio.melotti, bupjae
2009-10-05 09:03:08	lemburg	link	issue5127 messages
2009-10-05 09:03:07	lemburg	create