Author tchrist
Recipients Arfrever, ezio.melotti, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date 2011-08-14.06:09:42
Message-id <5187.1313302168@chthon>
In-reply-to <1313297680.15.0.194063822462.issue12729@psf.upfronthosting.co.za>
Content
Ezio Melotti <ezio.melotti@gmail.com> added the comment:

>> It is simply a design error to pretend that the number of characters
>> is the number of code units instead of code points.  A terrible and
>> ugly one, but it does not mean you are UCS-2.

> If you are referring to the value returned by len(unicode_string), it
> is the number of code units.  This is a matter of "practicality beats
> purity".  Returning the number of code units is O(1) (num_of_bytes/2).
> To calculate the number of characters it's instead necessary to scan
> all the string looking for surrogates and then count any surrogate
> pair as 1 character.  It was therefore decided that it was not worth
> slowing down the common case just to be 100% accurate in the
> "uncommon" case.

If speed is more important than correctness, I can make any algorithm
infinitely fast.  Given the choice between correct and quick, I will 
take correct every single time.

Plus your strings are immutable! You know how long they are and they
never change.  Correctness comes at a negligible cost.

It was a bad choice to return the wrong answer.
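
To make the two counts concrete, here is a minimal sketch. It assumes
Python 3.3 or later, where len() already counts code points; the helper
name utf16_code_units is hypothetical and only reproduces the number a
narrow build would have reported.

```python
def utf16_code_units(s):
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" writes no byte order mark.
    return len(s.encode("utf-16-le")) // 2

s = "a\U0001F600b"                 # one non-BMP character between two ASCII letters
code_points = len(s)               # counts code points on Python 3.3+
code_units = utf16_code_units(s)   # the non-BMP character needs a surrogate pair
```

The two numbers differ by one for every non-BMP character in the string.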

> That said it would be nice to have an API (maybe in unicodedata or as
> new str methods?) able to return the number of code units, code
> points, graphemes, etc, but I'm not sure that it should be the default
> behavior of len().

Always code points, never code units.  I even use a class whose length
method returns the grapheme count, because even code points aren't good
enough.  Yes of course graphemes have to be counted.  Big deal.   How 
would you like it if you said to move three to the left in vim and 
it *didn't* count each grapheme as one position?  Madness.
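
A rough sketch of why code points are still not enough. It assumes
Python 3 and uses only the standard unicodedata module. The function
approx_grapheme_count is hypothetical and is NOT the full grapheme
cluster algorithm from UAX #29: it simply skips code points whose
general category is a mark, so it ignores Hangul jamo, emoji ZWJ
sequences, CRLF and other cases.

```python
import unicodedata

def approx_grapheme_count(s):
    # Count code points that are not marks (categories Mn, Mc, Me).
    # Approximation only: real grapheme segmentation follows UAX #29.
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))

s = "e\u0301x"                      # "e" + COMBINING ACUTE ACCENT, then "x"
code_points = len(s)
graphemes = approx_grapheme_count(s)
```

Here the string holds three code points but a user would see two
characters.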

>> The ugly terrible design error is disgusting and wrong, just as much
>> in Python as in Java, and perhaps more so because of the idiocy of
>> narrow builds even existing.

> Again, wide builds use twice as much space as narrow ones, but
> on the other hand you can have fast and correct behavior with e.g.
> len().  If people don't care about/don't need to use non-BMP chars and
> would rather use less space, they can do so.  Until we agree that the
> difference in space used/speed is no longer relevant and/or that non-
> BMP characters become common enough to prefer the "correct behavior"
> over the "fast-but-inaccurate" one, we will probably keep both.

Which is why I always put loud warnings in my Unicode-related Python
programs that they do not work right on Unicode if running under
a narrow build.  I almost feel I should just exit.
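
A minimal sketch of such a warning, assuming the check is done through
sys.maxunicode, which is 0xFFFF on a narrow build and 0x10FFFF on a wide
one. The function names are hypothetical. Since Python 3.3 the value is
always 0x10FFFF, so the warning can only fire on older interpreters.

```python
import sys
import warnings

def is_narrow_build():
    # Narrow builds cannot represent code points above U+FFFF directly.
    return sys.maxunicode < 0x10FFFF

def warn_if_narrow():
    # Emit a loud warning once and report whether it was needed.
    if is_narrow_build():
        warnings.warn("narrow build: non-BMP characters will be miscounted",
                      RuntimeWarning)
        return True
    return False

narrow = warn_if_narrow()
```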

>> I haven't checked its UTF-16 codecs, but Python's UTF-8 codec is
>> broken in a bunch of ways.  You should be raising an exception in
>> all kinds of places and you aren't.

> I am aware of some problems of the UTF-8 codec on Python 2.  It used
> to follow RFC 2279 until last year and now it's been updated to follow
> RFC 3629.

Unicode says you can't put surrogates or noncharacters in a UTF-anything 
stream.  It's a bug to do so and pretend it's a UTF-whatever.

Perl has an encoding form, which it does not call "UTF-8", that you 
can use the UTF-8 algorithm on for any code point, including noncharacters
and surrogates and even non-Unicode code points far above 0x10_FFFF, up
to in fact 0xFFFF_FFFF_FFFF_FFFF on 64-bit machines.  It's the internal
format we use in memory.  But we don't call it real UTF-8, either.
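
As an illustration of "the UTF-8 algorithm on any code point", here is a
sketch in Python 3. It is an assumption-laden stand-in, not Perl's
actual internal format: it follows the older RFC 2279 layout of up to
six bytes, so it stops at 0x7FFF_FFFF rather than going all the way to
64 bits. The name encode_extended is hypothetical.

```python
def encode_extended(cp):
    """Encode one integer with the UTF-8 bit layout, RFC 2279 style (sketch)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])                      # single byte, ASCII range
    # (byte count, exclusive upper limit, lead-byte prefix)
    forms = ((2, 0x800, 0xC0), (3, 0x10000, 0xE0), (4, 0x200000, 0xF0),
             (5, 0x4000000, 0xF8), (6, 0x80000000, 0xFC))
    for nbytes, limit, lead in forms:
        if cp < limit:
            first = lead | (cp >> (6 * (nbytes - 1)))
            # Continuation bytes carry 6 payload bits each, high bits first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F)
                    for i in range(nbytes - 2, -1, -1)]
            return bytes([first] + tail)
    raise ValueError("beyond the 31-bit range this sketch handles")
```

No validity checks are applied, so surrogates, noncharacters and values
above 0x10_FFFF all get bytes. That is exactly why the result must not
be labeled UTF-8.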

It sounds like this is the kind of thing that would be useful to you.

> However, for backward compatibility, it still encodes/decodes
> surrogate pairs.  This broken behavior has been kept because on Python
> 2, you can encode every code point with UTF-8, and decode it back
> without errors:

No, that's not UTF-8 then.  By definition.  See the Unicode Standard.

> >>> x = [unichr(c).encode('utf-8') for c in range(0x110000)]
> >>>

> and breaking this invariant would probably do more harm than good.

Why?  Create something called utf8-extended or utf8-lax or utf8-nonstrict
or something.  But you really can't call it UTF-8 and do that.  

We actually equate "UTF-8" and "utf8-strict".  Our internal extended
UTF-8 is something else.  It seems like you're still doing the old
relaxed version we used to have until 2003 or so.  It seems useful
to be able to have both flavors, the strict and the relaxed one,
and to call them different things.  
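
For what it is worth, Python 3 can already approximate the two flavors
through codec error handlers. The sketch below assumes Python 3.1 or
later, where the "surrogatepass" handler exists for the utf-8 codec; the
wrapper names are hypothetical. It only covers surrogates, not
noncharacters or code points above 0x10_FFFF.

```python
def encode_strict(s):
    # Default handler: raises UnicodeEncodeError on any surrogate.
    return s.encode("utf-8")

def encode_lax(s):
    # "surrogatepass" writes surrogates with the UTF-8 bit layout anyway.
    return s.encode("utf-8", "surrogatepass")

def decode_lax(b):
    return b.decode("utf-8", "surrogatepass")

lone = "\ud800"
try:
    encode_strict(lone)
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

lax_bytes = encode_lax(lone)
round_trip = decode_lax(lax_bytes)
```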

Perl defaults to the relaxed one, which gives warnings not exceptions,
if you do things like setting PERLUNICODE to S or SD and such for the
default I/O encoding.  If you actually use "UTF-8" as the encoding on
the stream, though, you get the version that gives exceptions instead.

    "UTF-8" = "utf8-strict"   strictly by the standard, raises exceptions otherwise
    "utf8"                    loosely only, emits warnings on encoding illegal things

We currently only emit warnings or raise exceptions on I/O, not on chr
operations and such.  We used to raise exceptions on things like
chr(0xD800), but that was a mistake caused by misunderstanding the in-
memory requirements being different from stream requirements.  It's
really really subtle and you have to read the standard very closely to
realize this.

So you are perfectly free to use chr(0x20FFFF) in your own code.  This is
really useful for out-of-band sentinels and such.  However, if you try to
send it out a loose utf8 stream, you get a mandatory warning, and if you
try to send it out a strict UTF-8 stream, you get an exception.
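
The same in-memory versus stream split can be observed from Python 3,
with one difference worth noting: chr() there refuses anything above
0x10FFFF. A small sketch, assuming Python 3:

```python
lone = chr(0xD800)            # a lone surrogate is accepted in memory
in_memory_ok = (len(lone) == 1)

try:
    lone.encode("utf-8")      # the strict stream encoding refuses it
    stream_ok = True
except UnicodeEncodeError:
    stream_ok = False

try:
    chr(0x20FFFF)             # beyond the Unicode range
    beyond_ok = True
except ValueError:
    beyond_ok = False
```

So out-of-band sentinels above 0x10FFFF are not available in a Python
str and would have to live in some other structure, such as a list of
integers.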

In fact, if you remember the old "migrate ASCII trick" from the tr program,
doing something like this to turn on the high bit to mark characters in some way:

    tr[\0-\x7F][\x80-\xFF]

        (that's what killed WordStar BTW, as they used that trick
         on their ASCII internally so couldn't port to 8-bit
         encodings.  Ooops.)

Given the full 32- or 64-bit (or higher) character range for internal use, 
you can actually do this as a sort of corresponding transform:

    tr[\0-\x{10_FFFF}][\x{20_0000}-\x{3F_FFFF}]

Just don't try to output it. :)  For internal use only.  Blah blah.
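
A sketch of the same transform in Python 3. Because a Python str cannot
hold code points above 0x10FFFF, this version assumes the marked text is
kept as a list of integers; OFFSET, mark and unmark are hypothetical
names.

```python
OFFSET = 0x200000   # first code point past the shifted range's lower bound

def mark(s):
    # Shift every code point up by OFFSET, out of the Unicode range.
    return [ord(ch) + OFFSET for ch in s]

def unmark(cps):
    # Undo the shift for marked values; leave unmarked ones alone.
    return "".join(chr(cp - OFFSET if cp >= OFFSET else cp) for cp in cps)

marked = mark("abc")
restored = unmark(marked)
```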

(Hm, I just realized you couldn't ever do that sort of thing at all on a
narrow build because you're stuck with UTF-16.  On a wide build, though,
you could, because you'd have UTF-32.  Not that I'm suggesting it!!!)

Yes, that's not necessarily the best way to do most of what one might
naively try using it for, but there are all kinds of interesting things you
can do when your characters' internal code points don't have the same upper
bound as Unicode.

It's taken us years to unravel all this Unicode stuff so it's usable.  We used
to do a lot of really, um, unfortunate things, whether too many errors or too few.
I'm certainly not suggesting you go down those roads.  In some ways, Python's
Unicode support reminds me of ours from rather a long time ago.

We've worked pretty hard at Unicode in Perl for the last few years, although
even ten years ago we already supported \X, all regex properties, and full
casemapping and full casefolding.  So there's always been a strong Unicode
sensitivity in Perl.  It's just taken us a long long long long time to 
get all the kinks out.

I don't imagine most of the Python devel team knows Perl very well, and maybe
not even Java or ICU.  So I get the idea that there isn't as much awareness of
Unicode in your team as there tends to be in those others. From my point of
view, learning from other people's mistakes is a way to get ahead without
incurring all the learning-bumps oneself, so if there's a way to do that for
you, that could be to your benefit, and I'm very happy to share some of our
blunders so you can avoid them yourselves.

> I proposed to add a "real" utf-8 codec on Python 2, but no one seems
> to care enough about it.

Hm.  See previous paragraph. :)

> Also note that this is fixed in Python3:
> >>> x = [chr(c).encode('utf-8') for c in range(0x110000)]
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

Yes, I've noticed that Python3 is better about some of this,
but it doesn't detect the 66 noncharacter code points.  

I haven't checked on decoding yet, but I bet it's the same.
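
For reference, the 66 noncharacters are U+FDD0 through U+FDEF plus the
last two code points of each of the 17 planes. The sketch below, which
assumes Python 3, counts them and shows the utf-8 codec letting one
through in both directions; is_noncharacter is a hypothetical helper.

```python
def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

count = sum(1 for cp in range(0x110000) if is_noncharacter(cp))

encoded = "\uffff".encode("utf-8")   # no exception is raised
decoded = encoded.decode("utf-8")    # and none on the way back in
```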

I think having something that does the lax Python2 way and something
else that does the stricter Standard way makes the most sense.  

>>  I can see I need to bug report this stuff too.

> If you find other places where it's broken (both on Python 2 and/or
> Python 3), please do and feel free to add me to the nosy.  If you can
> also provide a failing test case and/or point to the relevant parts of
> the Unicode standard, it would be great.

I'll wait to report it till I have all my references at the ready.

I can probably pretty easily find the part of the Unicode Standard where it
says no UTF can contain code points that are illegal for interchange.
Finding the part that explains that/why you can and indeed must be able to 
have them internally is going to be harder, but I know it's there.

Also, there is a tr18 update that adds a bit of clarification about how it
is sometimes ok to allow a regex engine in a UTF-16 language to find
unpaired surrogates, like checking whether "foo\x{D800}bar" matches
the pattern /\p{Cs}/.  You could never have that string read in from a valid
UTF-{8,16,32} stream, but because it can happen in your program, you have
to be able to match it.  So they finally admit this in the next tr18 update.
But it's still a bit odd, eh?  (And no, that doesn't make it UCS-2! :)
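
The equivalent check can be sketched in Python 3. The standard re module
has no \p{...} property syntax, so this assumes an explicit character
class over the surrogate range as a substitute for \p{Cs}, with
unicodedata used to confirm the general category.

```python
import re
import unicodedata

s = "foo\ud800bar"                          # can exist in memory, never in a valid UTF stream
lone_surrogate = re.compile("[\ud800-\udfff]")

match = lone_surrogate.search(s)
position = match.start() if match else -1
category = unicodedata.category("\ud800")   # general category of the matched code point
```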

--tom