This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author lemburg
Recipients Rhamphoryncus, benjamin.peterson, ezio.melotti, lemburg, terry.reedy
Date 2008-09-01.20:08:57
SpamBayes Score 5.87721e-08
Marked as misclassified No
Message-id <48BC4BD8.10508@egenix.com>
In-reply-to <1220045591.76.0.520722098503.issue3297@psf.upfronthosting.co.za>
Content
On 2008-08-29 23:33, Terry J. Reedy wrote:
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> "Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16
> vs. UTF-32)"
> 
> I recently read most of the Unicode 5 standard and as near as I could
> tell it no longer uses the term UCS, if it ever did. 

UCS2 and UCS4 are terms which stem from the versions of Unicode
that were current at the time of adding Unicode support to Python,
ie. in the year 2000 when ISO 10646 and the Unicode spec co-existed.

See http://en.wikipedia.org/wiki/Universal_Character_Set for details.

UTF-16 is a transfer encoding that is based on UCS2 by adding
surrogate pair interpretations. UTF-32 is the same for UCS4,
but also restricting the range of valid code points to the range
covered by UTF-16.

Whether surrogates are supported or not and how they are supported
depends entirely on the codecs you use to convert the internal
format to some encoding.

> "If it really was UCS-2, the repr wouldn't be u'\U00010123' on windows. 
> It'd be a pair of ill-formed code units instead."

You are mixing the internal representation of Unicode code points
with the result of passing those values through one of the codecs,
e.g. the unicode-escape codec is responsible for converting between
the string representation u'\U00010123' and the internal representation.

Also note that because Python can be built using two different internal
representations, the results of the codecs may vary depending on
platform.

BTW: There's no such thing as an ill-formed code unit. What you probably
mean is an "ill-formed code unit sequence". However, those refer
to the output or accepted input values of a codec, not the internal
representation.

Please also note that because Python can be used to build valid
and parse possibly invalid Unicode encoding data, it has to have
the ability to work with Unicode code points regardless of whether
they can be interpreted as lone surrogates or not (hence the usage
of the terms UCS2/UCS4 which don't support surrogates).

Whether the codecs should raise exceptions and possibly let an
error handler decide whether or not to accept and/or generate
ill-formed code unit sequences is another question.

I hope that clears up the reasoning for using UCS2/UCS4 rather
than UTF-16/UTF-32 when referring to the internal Unicode representation
of Python.
History
Date User Action Args
2008-09-01 20:09:00lemburgsetrecipients: + lemburg, terry.reedy, Rhamphoryncus, benjamin.peterson, ezio.melotti
2008-09-01 20:08:59lemburglinkissue3297 messages
2008-09-01 20:08:57lemburgcreate