Author ezio.melotti
Recipients Arfrever, ezio.melotti, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date 2011-08-14.04:54:38
SpamBayes Score 8.01581e-14
Marked as misclassified No
Message-id <1313297680.15.0.194063822462.issue12729@psf.upfronthosting.co.za>
In-reply-to
Content
> It is simply a design error to pretend that the number of characters
> is the number of code units instead of code points.  A terrible and
> ugly one, but it does not mean you are UCS-2.

If you are referring to the value returned by len(unicode_string), it is the number of code units.  This is a matter of "practicality beats purity".  Returning the number of code units is O(1) (num_of_bytes/2).  Calculating the number of characters instead requires scanning the whole string looking for surrogates and counting each surrogate pair as 1 character.  It was therefore decided that it was not worth slowing down the common case just to be 100% accurate in the "uncommon" case.
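To illustrate the O(n) scan (a minimal sketch, written for Python 3 where the UTF-16 code units have to be materialized explicitly, since this is what a narrow build would store internally):

```python
import array

def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units,
    treating each high/low surrogate pair as a single code point."""
    count = i = 0
    while i < len(units):
        if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2  # surrogate pair -> one code point
        else:
            i += 1
        count += 1
    return count

# 'a' + U+10000 + 'b' is 3 code points but 4 UTF-16 code units,
# so a narrow build's len() would report 4
units = array.array('H', 'a\U00010000b'.encode('utf-16-le'))
print(len(units), count_code_points(units))  # 4 3
```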

That said, it would be nice to have an API (maybe in unicodedata, or as new str methods?) able to return the number of code units, code points, graphemes, etc., but I'm not sure it should be the default behavior of len().
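For instance, the counts can diverge on the same string (a deliberately naive sketch: real grapheme segmentation follows UAX #29 and handles many more cases, e.g. ZWJ sequences; the helper name is mine, not an existing API):

```python
import unicodedata

def count_graphemes_naive(s):
    # Naive approximation: a combining mark extends the previous
    # grapheme, so only non-combining characters start a new one.
    return sum(1 for ch in s if not unicodedata.combining(ch))

s = 'e\u0301'  # 'e' + COMBINING ACUTE ACCENT
print(len(s))                    # 2 code points on Python 3
print(count_graphemes_naive(s))  # 1 grapheme
```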

> The ugly terrible design error is digusting and wrong, just as much
> in Python as in Java, and perhaps moreso because of the idiocy of
> narrow builds even existing.

Again, wide builds use twice as much space as narrow ones, but on the other hand you can have fast and correct behavior with e.g. len().  If people don't care about/don't need to use non-BMP chars and would rather use less space, they can do so.  Until we agree that the difference in space used/speed is no longer relevant and/or that non-BMP characters become common enough to prefer the "correct" behavior over the "fast-but-inaccurate" one, we will probably keep both.
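Code that needs to know which build it is running on can check sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF on a wide one:

```python
import sys

# On narrow builds non-BMP characters are stored as surrogate pairs
# and len() counts UTF-16 code units; on wide builds len() counts
# code points directly.
if sys.maxunicode == 0xFFFF:
    print("narrow build")
else:
    print("wide build")
```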

> I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
> broken in a bunch of ways.  You should be raising as exception in
> all kinds of places and you aren't.

I am aware of some problems with the UTF-8 codec on Python 2.  It used to follow RFC 2279 until last year, and has now been updated to follow RFC 3629.
However, for backward compatibility, it still encodes/decodes surrogates.  This broken behavior has been kept because, on Python 2, you can encode every code point with UTF-8 and decode it back without errors:
>>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
>>>
and breaking this invariant would probably do more harm than good.  I proposed adding a "real" UTF-8 codec to Python 2, but no one seems to care enough about it.

Also note that this is fixed in Python 3:
>>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
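For code that genuinely needs the old permissive round-trip on Python 3, the 'surrogatepass' error handler restores it explicitly while keeping the default codec strict:

```python
# Strict UTF-8 rejects lone surrogates, but the 'surrogatepass'
# error handler lets them round-trip on request.
b = '\ud800'.encode('utf-8', 'surrogatepass')
print(b)  # b'\xed\xa0\x80'
assert b.decode('utf-8', 'surrogatepass') == '\ud800'
```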

>  I can see I need to bug report this stuff to.  

If you find other places where it's broken (on Python 2 and/or Python 3), please do report them, and feel free to add me to the nosy list.  If you can also provide a failing test case and/or point to the relevant parts of the Unicode standard, that would be great.