Author tchrist
Recipients Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date 2011-08-15.02:17:23
Message-id <26823.1313374629@chthon>
In-reply-to <1313368013.46.0.107285515249.issue12729@psf.upfronthosting.co.za>
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Mon, 15 Aug 2011 00:26:53 -0000: 

> PS: The OSCON link in msg142036 currently gives me 404 not found

Sorry, I wrote 

     http://training.perl.com/OSCON/index.html

but meant 

     http://training.perl.com/OSCON2011/index.html

I'll fix it on the server in a short spell.

I am trying to keep the document up to date as I learn more, so it
isn't precisely the talk I gave in Portland.

> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

So I'm finding.  Perhaps that's why I keep getting confused.  I do have a pretty firm
notion of what UCS-2 and UTF-16 are, so I sometimes get self-contradictory results.
Can you think of anywhere that Python acts like UCS-2 and not UTF-16?  I'm not sure I
have found one, although the regex thing might count.
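
For instance, here is what my narrow 3.2 build does (quoting from
memory, so forgive me if the transcript isn't byte-for-byte):

    >>> s = "\N{MATHEMATICAL BOLD CAPITAL A}"   # U+1D400, past the BMP
    >>> len(s)      # counts 16-bit code units, not codepoints
    2
    >>> s[0]        # indexing exposes the high surrogate
    '\ud835'

Strict UCS-2 would have rejected the non-BMP character outright;
storing it as a surrogate pair that round-trips through the codecs is
UTF-16 behavior, even if len and indexing aren't.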

Thank you guys for being so helpful and understanding.

> They support non-BMP chars but only partially, because, BY DESIGN*,
> indexing and len are by code units, not codepoints. 

That's what Java did, too, and for the same reason.  They had a UCS-2
implementation for Unicode 1.1, so when Unicode 2.0 came out and they
learned that they would need more than 16 bits, they piggybacked
UTF-16 on top of it instead of going for UTF-8 or UTF-32, and they're
still paying that price, to my mind heavily and continually.

Do you use Java?  It is very like Python in many of its 16-bit character issues.
Most of the length and indexing type functions address things by code unit
only, not codepoint.  But they would never claim to be UCS-2.
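
The codepoint-true count is easy enough to recover by hand; it is the
same arithmetic Java's String.codePointCount() does.  A little sketch
of mine, not anybody's shipping API:

    def codepoint_len(s):
        # Code units minus one per surrogate pair; on a wide build
        # no pairs occur and this is just len(s).
        return len(s) - sum(1 for i in range(len(s) - 1)
                            if '\ud800' <= s[i] <= '\udbff'
                            and '\udc00' <= s[i + 1] <= '\udfff')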

Oh, I realize why they did it.  For one thing, they had bytecode out there
that they had to support.  For another, they had some pretty low-level APIs
that didn't have enough flexibility of abstraction, so old source had to keep
working as before, even though this penalized the future.  Forever, kinda.

While I wish they had done better, and kinda think they could have, it
isn't my place to say.  I wasn't there (well, not paying attention) when
this was all happening, because I was so underwhelmed by how annoyingly
overhyped it was.  A billion dollars of marketing can't be wrong, you know?
I know that smart people looked at it, seriously.  I just find the cure
they devised to be more in the problem set than the solution set.

I like how Python works on wide builds, especially with Python 3.  I was
pretty surprised that the symbolic names weren't working right on the
earlier version of the 2.6 wide build I tried them on.

I now have both wide and narrow builds of both 2.7 and 3.2 installed,
so that shouldn't happen again.

> They are documented as being UCS-2 because that is what M-A Lemburg,
> the original designer and writer of Python's unicode type and the unicode-
> capable re module, wants them to be called. The link to msg142037,
> which is one of 50+ in the thread (and many or most others disagree),
> pretty well explains his viewpoint.

Count me as one of those many/most others who disagree. :)

> The positive side is that we deliver more than we promise. The
> negative side is that by not promising what perhaps we should, we
> allow ourselves not to deliver what perhaps we should.

It is always better to deliver more than you say than to deliver less.

> * While I think this design decision may have been OK a decade ago for
>   a first implementation of an *optional* text type, I do not think it
>   so for the future for revised implementations of what is now *the*
>   text type. I think narrow builds can and should be revised and
>   upgraded to index, slice, and measure by codepoints. 

Yes, I think so, too.  If you look at the growth curve of UTF-8 alone,
it has followed a mathematically exponential growth curve in the
first decade of this century.  I suspect that will turn into an S
curve with asymptotic shoulders any time now.  I haven't looked
at it lately, so maybe it already has.  I know that the huge corpora
I work with at work are all absolutely 100% Unicode now.  Thank XML
for that.

> Here is my current idea:

> If the code unit stream contains any non-BMP characters (i.e., surrogate
> pairs of 16-bit code units), construct a sequence of *indexes* of such
> characters (pairs). The fixed length of the string in codepoints is
> n-k, where n is the number of code units (the current length) and k is
> the length of the auxiliary sequence and the number of pairs. For
> indexing, look up the character index in the list of indexes by binary
> search and increment the codepoint index by the index of the index
> found to get the corresponding code unit index. (I have omitted the
> details needed to avoid off-by-1 errors.)

> This would make indexing O(log(k)) when there are surrogates. If that
> is really a problem because k is a substantial fraction of a 'large'
> n, then one should use a wide build. By using a separate internal
> class, there would be no time or space penalty for all-BMP text. I
> will work on a prototype in Python.

You are a brave man, and good.  Bravo!
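
Just to be sure I follow you, here is how I read that scheme in code.
This is my own sketch of your idea, surely not your prototype; I'm
using bare ints for the 16-bit code units just for illustration:

    import bisect

    class CodepointView:
        def __init__(self, units):
            self.units = list(units)    # 16-bit code units as ints
            starts = [i for i in range(len(self.units))
                      if self._pair_at(i)]
            # Codepoint index of each pair: its unit index minus the
            # number of earlier pairs, each of which ate an extra unit.
            self.pair_cps = [p - j for j, p in enumerate(starts)]

        def _pair_at(self, i):
            # True when a surrogate pair starts at unit index i.
            return (i + 1 < len(self.units)
                    and 0xD800 <= self.units[i] <= 0xDBFF
                    and 0xDC00 <= self.units[i + 1] <= 0xDFFF)

        def __len__(self):
            # Your n - k: code units minus one per surrogate pair.
            return len(self.units) - len(self.pair_cps)

        def __getitem__(self, i):
            if not 0 <= i < len(self):
                raise IndexError(i)
            # O(log k): every pair at a codepoint index below i
            # pushes the code unit index right by one.
            u = i + bisect.bisect_left(self.pair_cps, i)
            if self._pair_at(u):
                hi, lo = self.units[u], self.units[u + 1]
                return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            return self.units[u]

For "a", U+1F600, "b" stored as [0x61, 0xD83D, 0xDE00, 0x62], that
gives a len() of 3 and view[1] == 0x1F600, and all-BMP text pays
nothing beyond an empty pair list.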

It may be that that was the sort of thing that Larry was talking to me
about 6-8 months ago regarding how to construct a better way to access
strings by grapheme index.  

Everyone always talks about how important they're sure O(1) access must be, and how they
therefore absolutely have to have it no matter the tradeoffs.  But without two
implementations to compare real-world access patterns against, one really can't know.
I know that index/rindex and substring operations are very rare in how I myself process
strings, but I have seen how Python people turn to those all the time when I would
reflexively use a pattern match.  So usage patterns may vary; hence the desire for real
comparisons.  I'm perfectly willing to be convinced, but I want to see real data.

If I get time, I'll check into whether the Perl 6 people have any real data about that.
I had thought that Parrot was currently using ICU4C for its string handling, which
may mean they're afflicted by UTF-16, something I wouldn't think they would tolerate,
especially since they need code points above 0x10_FFFF for their Normalization Form G
(Grapheme).  Piggybacking that on UTF-16 would require stealing some Private Use code
point to act as a multilevel surrogate so that UTF-16 is infinitely extensible the way
UTF-8 is.  Not sure what I think about that, but it's been mentioned as a loophole
escape for when Unicode has to renege on its 21-bit promise.  I sure hope everyone has
stopped using UTF-16 by then myself.  It's trouble enough right now.

Hm, now that I think about it, ICU just might use int sequences internally, so
UTF-32, for its own strings, so that might be it.  Yes, I see they too are going for
O(1) access.  Nonetheless, a careful enough UTF-16 implementation with a rich enough
API is able to access all of Unicode with no trouble.  It's just that the Java API
from Sun is not one such.  The Perl 6 spec is all about graphemes, and graphemes are
all about code points, which means an implementation could work around 16-bit code
units so the user never has to think about them, just like the Perl 5 implementation
works around 8-bit code units and never lets the user notice them.
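
In fact the workaround fits in a dozen lines.  Here is a sketch of
the sort of thing I mean, mine alone and not anything Parrot or ICU
actually ships:

    def iter_codepoints(s):
        # Walk a narrow-build string, joining surrogate pairs so the
        # caller only ever sees whole codepoints; on a wide build no
        # pairs occur and this degenerates to map(ord, s).
        i, n = 0, len(s)
        while i < n:
            hi = ord(s[i])
            if 0xD800 <= hi <= 0xDBFF and i + 1 < n:
                lo = ord(s[i + 1])
                if 0xDC00 <= lo <= 0xDFFF:
                    yield 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                    i += 2
                    continue
            yield hi
            i += 1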

Hey, if you can build TCP out of IP, anything is possible. :)

--tom