Message 93611 - Python tracker


Author	Rhamphoryncus
Recipients	Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date	2009-10-05.17:25:47
Message-id	<aac2c7cb0910051025o71f16947v79ca24546a3935f4@mail.gmail.com>
In-reply-to	<4AC9B649.7040308@egenix.com>

Content
On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> We use UCS2 on narrow Python builds, not UTF-16.
>
>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.
>
> That has always been the case. UCS2 doesn't support surrogates.
>
> However, we have been slowly moving into the direction of making
> the UCS2 storage appear like UTF-16 to the Python programmer.
>
> This process is not yet complete and will likely never complete
> since it must still be possible to create things like lone
> surrogates for processing purposes, so care has to be taken
> when using non-BMP code points on narrow builds.

Balderdash.  We expose UTF-16 code units, not UCS-2.  Guido has made
this quite clear.

UTF-16 was designed as an easy transition from UCS-2.  Indeed, if your
code only does searches or joins existing strings then it will Just
Work; declare it UTF-16 and you are done.  We have a lot more work to
do than that (as in this bug report), and we can't reasonably prevent
the user from splitting surrogate pairs via poor code, but a 95%
solution doesn't mean we suddenly revert all the way back to UCS-2.
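
To make the "Just Work" claim concrete, here is a minimal sketch. It
is not code from this issue: it simulates narrow-build storage as a
list of 16-bit code units, and the helper names (utf16_units,
units_to_str, find_units) are made up for illustration. It only shows
that joining and searching whole strings never needs to look inside a
surrogate pair.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def units_to_str(units):
    """Decode a list of UTF-16 code units back into a str."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def find_units(haystack, needle):
    """Naive subsequence search over code units; returns an index or -1."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so it takes
# two code units: a high surrogate followed by a low surrogate.
clef = utf16_units("\U0001D11E")

# Joining existing strings is plain concatenation of code units.
joined = utf16_units("a") + clef + utf16_units("b")
roundtrip = units_to_str(joined)

# Searching for a well-formed string matches on whole pairs.
clef_index = find_units(joined, clef)
```

Here clef_index is a code unit offset, not a count of scalar values,
which is the part that still needs care.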

If the intent really was to use UCS-2 then a correctly functioning
UTF-16 codec would join a surrogate pair into a single scalar value,
then raise an error because it's outside the range representable in
UCS-2.  That's not very helpful though; obviously, it's much better to
use UTF-16 internally.
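
As a rough sketch of what that strict behaviour would look like
(store_as_ucs2 is a hypothetical name, not an existing API, and this
assumes an interpreter whose UTF-16 decoder yields full scalar
values):

```python
def store_as_ucs2(utf16_bytes):
    """Decode UTF-16-LE input, then refuse anything UCS-2 cannot hold.

    Hypothetical illustration only: a real UCS-2 store has no way to
    represent scalar values above U+FFFF, so it would have to reject them.
    """
    # The decoder joins each surrogate pair into one scalar value.
    text = utf16_bytes.decode("utf-16-le")
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(
                "U+%04X is not representable in UCS-2" % ord(ch))
    # Every remaining value fits in a single 16-bit unit.
    return [ord(ch) for ch in text]
```

BMP-only input passes through unchanged, while any input containing a
surrogate pair is rejected outright instead of being stored as two
code units.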

"The alternative (no matter what the configure flag is called) is
UTF-16, not UCS-2 though: there is support for surrogate pairs in
various places, including the \U escape and the UTF-8 codec."
http://mail.python.org/pipermail/python-dev/2008-July/080892.html

"If you find places where the Python core or standard library is doing
Unicode processing that would break when surrogates are present you
should file a bug. However this does not mean that every bit of code
that slices a string at an arbitrary point (and hence risks slicing in
the middle of a surrogate) is incorrect -- it all depends on what is
done next with the slice."
http://mail.python.org/pipermail/python-dev/2008-July/080900.html
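
A small sketch of the "it all depends on what is done next" point,
again simulating code units with 2-byte chunks rather than using a
real narrow build (decodes_cleanly is an illustrative helper, not an
existing function):

```python
text = "x\U0001D11Ey"
data = text.encode("utf-16-le")

# One 2-byte chunk per UTF-16 code unit: x, high surrogate,
# low surrogate, y.
units = [data[i:i + 2] for i in range(0, len(data), 2)]

# Slice at an arbitrary point that happens to split the pair.
head, tail = units[:2], units[2:]

# If the next step is to put the pieces back together, nothing is lost.
rejoined = b"".join(head + tail).decode("utf-16-le")

def decodes_cleanly(chunks):
    """Return True if the chunks form well-formed UTF-16 on their own."""
    try:
        b"".join(chunks).decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False
```

Rejoining the halves gives back the original text, while handing
either half to a strict codec on its own fails, so only the second
kind of use is a bug.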

History
Date	User	Action	Args
2009-10-05 17:25:50	Rhamphoryncus	set	recipients: + Rhamphoryncus, lemburg, amaury.forgeotdarc, vstinner, ezio.melotti, bupjae
2009-10-05 17:25:49	Rhamphoryncus	link	issue5127 messages
2009-10-05 17:25:47	Rhamphoryncus	create