Author lemburg
Recipients Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date 2009-10-05.18:10:17
SpamBayes Score 0.0
Marked as misclassified No
Message-id <>
In-reply-to <>
Adam Olsen wrote:
> Adam Olsen <> added the comment:
> On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <> wrote:
>> We use UCS2 on narrow Python builds, not UTF-16.
>>> We might keep the old public API for compatibility, but it should be
>>> clearly marked as broken for non-BMP scalar values.
>> That has always been the case. UCS2 doesn't support surrogates.
>> However, we have been slowly moving into the direction of making
>> the UCS2 storage appear like UTF-16 to the Python programmer.
>> This process is not yet complete and will likely never complete
>> since it must still be possible to create things line lone
>> surrogates for processing purposes, so care has to be taken
>> when using non-BMP code points on narrow builds.
> Balderdash.  We expose UTF-16 code units, not UCS-2.  Guido has made
> this quite clear.
> UTF-16 was designed as an easy transition from UCS-2.  Indeed, if your
> code only does searches or joins existing strings then it will Just
> Work; declare it UTF-16 and you are done.  We have a lot more work to
> do than that (as in this bug report), and we can't reasonably prevent
> the user from splitting surrogate pairs via poor code, but a 95%
> solution doesn't mean we suddenly revert all the way back to UCS-2.
> If the intent really was to use UCS-2 then a correctly functioning
> UTF-16 codec would join a surrogate pair into a single scalar value,
> then raise an error because it's outside the range representable in
> UCS-2.  That's not very helpful though; obviously, it's much better to
> use UTF-16 internally.
> "The alternative (no matter what the configure flag is called) is
> UTF-16, not UCS-2 though: there is support for surrogate pairs in
> various places, including the \U escape and the UTF-8 codec."
> "If you find places where the Python core or standard library is doing
> Unicode processing that would break when surrogates are present you
> should file a bug. However this does not mean that every bit of code
> that slices a string at an arbitrary point (and hence risks slicing in
> the middle of a surrogate) is incorrect -- it all depends on what is
> done next with the slice."

All this is just nitpicking, really. UCS2 is a character set,
UTF-16 an encoding.

It so happens that when the Unicode consortium realized
that 16 bit would not be enough to represent all scripts of the
world, they added the concept of surrogates and reserved a few
ranges of code points in UCS2 to represent these extra code
points which are not part of UCS2, but the extensions UCS4.

The conversion of these surrogate pairs to UCS4 code point
values is what you find defined in UTF-16.

If we were to implement Unicode using UTF-16 as storage format,
we would not be able to store single lone surrogates, since these
are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
code points, etc.

PEP 100 really says it all:

    This [internal] format will hold UTF-16 encodings of the corresponding
    Unicode ordinals.  The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    Future implementations can extend the 16 bit restriction to the
    full set of all UTF-16 addressable characters (around 1M

Note that I wrote the PEP and worked on the implementation at a time
when Unicode 2.x was still in use wide-spread use (mostly on Windows)
and 3.0 was just being release:

But all that is off-topic for this ticket, so please let's just
stop such discussions.
Date User Action Args
2009-10-05 18:10:20lemburgsetrecipients: + lemburg, amaury.forgeotdarc, Rhamphoryncus, vstinner, ezio.melotti, bupjae
2009-10-05 18:10:18lemburglinkissue5127 messages
2009-10-05 18:10:17lemburgcreate