Message 143061 - Python tracker

Author	tchrist
Recipients	Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date	2011-08-27.11:51:45
Message-id	<26480.1314445868@chthon>
In-reply-to	<CAP7+vJLTw8Ya0hJzfcZLNqJ3p3wGF68k2rPzaR5D_dKae6XEVw@mail.gmail.com>

Content

Guido van Rossum <report@bugs.python.org> wrote
   on Sat, 27 Aug 2011 03:26:21 -0000: 

> To me, making (default) iteration deviate from indexing is anathema.

So long as there's a way to iterate through a string some other way
than by code unit, that's fine.  However, the Java way of 16-bit code
units is so annoying because there often aren't code point APIs, and 
so you get a lot of niggling errors creeping in.  This is part of why
I strongly prefer wide builds, so that code point and code unit are the
same thing again.

> However, there is nothing wrong with providing a library function that
> takes a string and returns an iterator that iterates over code points,
> joining surrogate pairs as needed. You could even have one that
> iterates over characters (I think Tom calls them graphemes), if that
> is well-defined and useful.

"Character" can sometimes be a confusing term when it means something
different to us programmers than it does to users.  Code point to mean the
integer is a lot clearer to us but to no one else.  At work I often just
give in and go along with the crowd and say character for the number that
sits in a char or wchar_t or Character variable, even though of course
that's a code point.  I only rebel when they start calling code units 
characters, which (inexperienced) Java people tend to do, because that
leads to surrogate splitting and related errors.
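
The library function described in the quoted text, one that joins
surrogate pairs while iterating, might look roughly like the Python sketch
below.  The name code_points and the choice to take a sequence of 16-bit
code units as plain integers are assumptions made for illustration only;
unpaired surrogates are passed through unchanged rather than rejected.

```python
def code_points(units):
    """Yield code points from an iterable of 16-bit code units (ints),
    joining each high/low surrogate pair into one code point."""
    pending = None                      # high surrogate awaiting its partner
    for unit in units:
        if pending is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Valid pair: combine into a supplementary code point.
                yield 0x10000 + ((pending - 0xD800) << 10) + (unit - 0xDC00)
                pending = None
                continue
            yield pending               # unpaired high surrogate, pass through
            pending = None
        if 0xD800 <= unit <= 0xDBFF:
            pending = unit
        else:
            yield unit
    if pending is not None:
        yield pending                   # trailing unpaired high surrogate


# U+1D11E MUSICAL SYMBOL G CLEF is the pair D834 DD1E in UTF-16.
units = [0x61, 0xD834, 0xDD1E, 0x62]
print([hex(cp) for cp in code_points(units)])   # ['0x61', '0x1d11e', '0x62']
```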

By grapheme I mean something the user perceives as a single character.  In
full Unicodese, this is an extended grapheme cluster.  These are code point
sequences that start with a grapheme base and have zero or more grapheme
extenders following it.  For our purposes, that's *mostly* like saying you
have a non-Mark followed by any number of Mark code points, the main
exception being that a CR followed by a LF also counts as a single grapheme
in Unicode.
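
That "mostly" rule can be sketched in Python with nothing but the standard
unicodedata module.  This is only the approximation stated above (a
non-Mark followed by any number of Marks, plus CR LF), not the full
extended grapheme cluster algorithm, and simple_graphemes is a made-up
name.

```python
import unicodedata

def simple_graphemes(text):
    """Approximate grapheme clusters: a non-Mark code point followed by
    any number of Mark code points, with CR LF kept together."""
    cluster = ""
    for ch in text:
        joins = cluster and (
            unicodedata.category(ch).startswith("M")   # Mn, Mc or Me
            or (cluster == "\r" and ch == "\n")        # CR LF is one unit
        )
        if joins:
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


clusters = list(simple_graphemes("e\u0301le\u0300ve"))
print(len(clusters))        # 5 clusters from 7 code points
```

Swapping two user-perceived characters then becomes a swap of two list
elements followed by "".join(), instead of a swap of two code points.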

If you are in an editor and wanted to swap two "characters", the one 
under the user's cursor and the one next to it, you have to deal with
graphemes not individual code points, or else you'd get the wrong answer.
Imagine swapping the last two characters of the first string below,
or the first two characters of the second one:

    contrôlée    contro\x{302}le\x{301}e
    élève        e\x{301}le\x{300}ve        

While you can sometimes fake a correct answer by considering things
in NFC not NFD, that doesn't work in the general case, as there
are only a few compatibility glyphs for round-tripping for legacy
encodings (like ISO 8859-1) compared with infinitely many combinations
of combining marks.  Particularly in mathematics and in phonetics, 
you often end up using marks on characters for which no pre-combined
variant glyph exists.  Here's the IPA for a couple of Spanish words
with their tight (phonetic, not phonemic) transcriptions:

        anécdota    [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
        rincón      [rĩŋˈkõ̞n]

    NFD:
        ane\x{301}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinco\x{301}n      [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]

    NFC:
        an\x{E9}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinc\x{F3}n      [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

So combining marks don't "just go away" in NFC, and you really do have to
deal with them.  Notice that to get the tabs right (your favorite subject :),
you have to deal with print widths, which is another place that you get
into trouble if you only count code points.
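
A quick way to check this from Python is to normalize to NFC and count the
Mark code points that are left over.  The helper name count_marks is an
assumption; unicodedata.normalize and unicodedata.category are the
standard library calls.

```python
import unicodedata

def count_marks(text):
    """Count code points whose general category is a Mark (Mn, Mc, Me)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))

samples = [
    "e\u0301",          # e + acute: a precomposed form exists
    "\u0263\u031E",     # IPA gamma + down tack: no precomposed form
    "o\u031E\u0303",    # o + down tack + tilde: only the tilde composes
]
for s in samples:
    nfc = unicodedata.normalize("NFC", s)
    print(ascii(s), "->", ascii(nfc), "marks left:", count_marks(nfc))
```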

BTW, did you know that the stress mark used in the phonetics above
is actually a (modifier) letter in Unicode, not punctuation?

    # uniprops -a 2c8
    U+02C8 ‹ˈ› \N{MODIFIER LETTER VERTICAL LINE}
        \w \pL \p{L_} \p{Lm}
    All Any Alnum Alpha Alphabetic Assigned InSpacingModifierLetters Case_Ignorable CI Common Zyyy Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Print Spacing_Modifier_Letters Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Spacing_Modifier_Letters Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=BB Line_Break=Break_Before LB=BB Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE Word_Break=LE _Case_Ignorable _X_Begin

That means those would all be matched by \w+, as unlike \p{alpha},
\p{word} includes not just \pL etc but also all the combining marks.
That's how you want it to work, although I think you have to use
regex not re in Python to get that.
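
For comparison, here is a small check using only Python's standard
library.  It looks up the same code point with unicodedata and then shows
what the stdlib re module does with \w; the behaviour noted in the
comments is what I would expect from Python 3's re, so treat it as
something to verify on your own build rather than as a specification.

```python
import re
import unicodedata

stress = "\u02C8"
print(unicodedata.name(stress))        # MODIFIER LETTER VERTICAL LINE
print(unicodedata.category(stress))    # Lm, i.e. a letter, not punctuation

# The modifier letter itself counts as a word character.
print(bool(re.fullmatch(r"\w", stress)))

# Combining marks do not, so \w+ splits a decomposed word at each mark.
print(re.findall(r"\w+", "e\u0301le\u0300ve"))
```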

Iterating by grapheme is easy in a regex engine that supports \X.
Instead of using "." to match a code point, you use a \X to match
a grapheme.  So the swapping problem goes away, and many others.
To capture a pair of graphemes for swapping you'd use (\X)(\X), and
to grab the first 6 graphemes without breaking them up you'd use \X{6}.
That means to iterate by grapheme you just split up your string one
\X at a time.

Here's a real-world example:

In the vim editor, when you're editing UTF-8 as I am this mail message,
because it is all about user-perceived characters, they actually use "." to
match an entire grapheme.  This is different from the way perl and
everybody else use "." for a code point, not a grapheme.  If I did s/^.//
or s/.$// in vim, I would need s/^\X// or s/\X$// in perl.  Similarly,
to swap "characters" with the "xp" command, it will grab the entire \X.
Put some of those phonetic transcriptions above into a vim buffer and play
with them to see what I mean.

Imagine using a format like "%-6.6s" on "contrôlée": that should produce
"contrô" not "contro".  That's because code points with the property
Bidi_Class=Non_Spacing_Mark (BC=NSM) do not advance the cursor, they just
stack up.

It gets even worse in that some code points advance the cursor by two
not by zero or one.  These include those with the East_Asian_Width
property value Full or Wide.  And they aren't always Asian characters,
either.  For example, these code points all have the EA=W property, so
take up two print columns:

     〈  U+2329 LEFT-POINTING ANGLE BRACKET
     〉  U+232A RIGHT-POINTING ANGLE BRACKET
     〃  U+3003 DITTO MARK
     〜  U+301C WAVE DASH
     〝  U+301D REVERSED DOUBLE PRIME QUOTATION MARK
     〞  U+301E DOUBLE PRIME QUOTATION MARK
     〟  U+301F LOW DOUBLE PRIME QUOTATION MARK
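
A rough column count along these lines can be sketched in Python with
unicodedata.east_asian_width.  The function name columns is made up, and
the rules are deliberately crude: nonspacing and enclosing marks count as
zero, Wide and Full code points as two, everything else as one.  Control
characters and ambiguous-width code points are not treated specially.

```python
import unicodedata

def columns(text):
    """Estimate how many print columns text occupies."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue                    # stacks on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2                  # East Asian Wide or Fullwidth
        else:
            total += 1
    return total


print(columns("contro\u0302le\u0301e"))   # 9 columns from 11 code points
print(columns("\u3003"))                  # 2: DITTO MARK is Wide
```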

Perl's built-in string indexing, and hence its substrings, is strictly 
by code point and not by grapheme.  This is really frustrating at times,
because something like this screws up:

    printf "%-6.6s", "contrôlée";
    printf "%-6.6s", "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";

Logically, those should produce "contrô" and "a̠ˈne̞ɣ̞ð̞", but of course
when considering only code points, they won't.  Well, not unless the 
1st is in NFC, but there's no hope for the second.
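
Python's % formatting truncates by code point in the same way.  The sketch
below contrasts it with a hypothetical helper, ljust_truncate, that keeps
marks attached to their base character.  It assumes every non-Mark code
point is exactly one column wide, so it ignores the wide characters
listed earlier.

```python
import unicodedata

def ljust_truncate(text, width):
    """Like "%-{width}.{width}s", but counting only non-Mark code points
    so that combining marks stay with their base character."""
    out, cols = [], 0
    for ch in text:
        if not unicodedata.category(ch).startswith("M"):
            if cols == width:
                break                   # the next base would not fit
            cols += 1
        out.append(ch)
    return "".join(out) + " " * (width - cols)


word = "contro\u0302le\u0301e"                   # "contrôlée" in NFD
print(ascii("%-6.6s" % word))                    # 'contro'
print(ascii(ljust_truncate(word, 6)))            # 'contro\u0302'
```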

Perl does have a grapheme cluster string class which provides a way 
to figure out the columns and also allows for substring operation by
grapheme. But it is not at all integrated into anything, which makes 
it tedious to use.

    use Unicode::GCString;  # on CPAN only, not yet in core

    my $string   = "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";
    my $gcstring = Unicode::GCString->new($string);
    my $colwidth = $gcstring->columns;
    if ($colwidth > 6) {
        print $gcstring->substr(0,6);    # truncate by grapheme, not code point
    } else {
        print $gcstring;                 # left-justify, as "%-6.6s" would
        print " " x (6 - $colwidth);
    }

Isn't that simply horrible?  You *will* get the right answer that way, but
what a pain!  Really, there needs to be a way for the built-in formatters
to understand graphemes.  But first, I think, you have to have the regex
engine understand them.  Matthew's regex does, because it supports \X.

There's a lot more to dealing with Unicode text than just extending the
character repertoire.  How much should be fundamental to the language and how
much should be relegated to modules isn't always clear.  I do know I've had
to rewrite a *lot* of standard Unix tools to deal with Unicode properly.
For the wc(1) rewrite I only needed to consider graphemes with \X and 
Unicode line break sequences with \R, but other tools need better smarts.
For example, just getting the fmt(1) rewrite to wrap lines in paragraphs 
correctly requires understanding not just graphemes but the Unicode 
Linebreak Algorithm, which in turn relies upon understanding the print
widths for grapheme cluster strings and East Asian wide or full characters.

It's something you only want to do once and never think about again. :(

--tom

History
Date	User	Action	Args
2011-08-27 11:51:50	tchrist	set	recipients: + tchrist, lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray
2011-08-27 11:51:48	tchrist	link	issue12729 messages
2011-08-27 11:51:45	tchrist	create