Message 144708 - Python tracker

Author	tchrist
Recipients	ezio.melotti, gvanrossum, lemburg, loewis, mrabarnett, tchrist, terry.reedy
Date	2011-09-30.22:07:24
Message-id	<29624.1317420430@chthon>
In-reply-to	<1317414642.85.0.0881821462071.issue12753@psf.upfronthosting.co.za>

Content

>Ezio Melotti <ezio.melotti@gmail.com> added the comment:

> Leaving named sequences for unicodedata.lookup() only (and not for
> \N{}) makes sense.

There are certainly advantages to that strategy: you don't have to
deal with [\N{sequence}] issues.  If the argument to unicodedata.lookup()
can be any of name, alias, or sequence, that seems ok.  \N{} should
still do aliases, though, since those don't have the complication that
sequences have.

You may wish unicodedata.name() to return the alias in preference, however.
That's what we do.  And of course, there is no issue of sequences there.
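
To make that split concrete, here is a minimal sketch. It assumes a Python
release newer than this thread (3.3 or later), in which unicodedata.lookup()
accepts names, aliases, and named sequences while \N{} accepts names and
aliases only. The helper literal_compiles() is a name made up for this
illustration.

```python
import unicodedata

ALIAS = "BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS"
SEQUENCE = "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE"

# lookup() resolves a corrected alias to the single character it names.
alias_char = unicodedata.lookup(ALIAS)

# lookup() also resolves a named sequence, returning a multi-character string.
sequence_str = unicodedata.lookup(SEQUENCE)


def literal_compiles(name):
    # Hypothetical helper: True if a string literal using \N{name} compiles.
    try:
        eval('"\\N{%s}"' % name)
        return True
    except SyntaxError:
        return False


print("%04X" % ord(alias_char))
print(".".join("%04X" % ord(c) for c in sequence_str))
print(literal_compiles(ALIAS), literal_compiles(SEQUENCE))
```

Under that assumption the alias is usable in both places, and the sequence
only through the function call.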

The rest of this perhaps painfully long message is just elaboration
and icing on what I've said above.

--tom

> The list of aliases is so small (11 entries) that I'm not sure using a
> binary search for it would bring any advantage.  Having a single
> lookup algorithm that looks in both tables doesn't work because the
> aliases lookup must be in _getcode for \N{...} to work, whereas the
> lookup of named sequences will happen in unicodedata_lookup
> (Modules/unicodedata.c:1187).  I think we can leave the for loop over
> aliases in _getcode and implement a separate (and binary) search in
> unicodedata_lookup for the named sequences.  Does that sound fine?

If you mean, is it ok to add just the aliases and not the named sequences to
\N{}, it is certainly better than not doing so at all.  Plus that way you do
*not* have to figure out what in the world to do with [^a-c\N{sequence}],
since that would have to be something like (?!\N{sequence})[^a-c], which is
hardly obvious, especially if \N{sequence} actually starts with [a-c].
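
As a rough illustration of that rewriting, the sketch below builds the
lookahead form by hand with Python's re module.  The sequence is spelled out
as its two code points, and the names SEQUENCE, not_abc_or_sequence, and
accepts are invented for the example; nothing here is an existing \N{}
feature of the regex engine.

```python
import re

# Stand-in for \N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}: U+0100 U+0300.
SEQUENCE = "\u0100\u0300"

# Hand-written expansion of the hypothetical [^a-c\N{sequence}]:
# refuse to start at the sequence, then take one code point outside a-c.
not_abc_or_sequence = re.compile("(?!" + re.escape(SEQUENCE) + ")[^a-c]")


def accepts(text):
    # True if the pattern matches at the start of text.
    return not_abc_or_sequence.match(text) is not None


for sample in ["d", "a", SEQUENCE, "\u0100x"]:
    print(ascii(sample), accepts(sample))
```

Note the asymmetry: a rejection looks at two code points, but a successful
match consumes only one.  That is part of why the expansion is hardly obvious.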

However, because the one namespace comprises all three of names,
aliases, and named sequences, it might be best to have a functional
(meaning, non-regex) API that allows one to do a fetch on the whole
namespace, or on each individual component.

The ICU library supports this sort of thing.  In ICU4J's Java bindings, 
we find this:

    static int getCharFromExtendedName(String name)
       [icu] Find a Unicode character by either its name and return its code point value.
    static int getCharFromName(String name)
       [icu] Finds a Unicode code point by its most current Unicode name and return its code point value.
    static int getCharFromName1_0(String name)
       [icu] Find a Unicode character by its version 1.0 Unicode name and return its code point value.
    static int getCharFromNameAlias(String name)
       [icu] Find a Unicode character by its corrected name alias and return its code point value.

The first one obviously has a bug in its definition, as the English
doesn't scan.  Looking at the full definition is even worse.  Rather
than dig out the src jar, I looked at ICU4C, but its own bindings are
completely different.  There you have only one function, with an enum to
say what namespace to access:

    UChar32 u_charFromName(UCharNameChoice nameChoice,
                           const char *name,
                           UErrorCode *pErrorCode)

The UCharNameChoice enum tells what sort of thing you want:

    U_UNICODE_CHAR_NAME,
    U_UNICODE_10_CHAR_NAME,
    U_EXTENDED_CHAR_NAME,
    U_CHAR_NAME_ALIAS,          
    U_CHAR_NAME_CHOICE_COUNT

Looking at the src for the Java is no more immediately illuminating, 
but I think that "extended" may refer to a union of the old 1.0 names 
with the current names.
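
The one-function-plus-selector shape could be sketched in Python roughly as
follows.  Everything in it is hypothetical: NameChoice, string_from_name, and
the two one-entry tables are invented for illustration (the tables stand in
for the NameAliases and NamedSequences data), and unlike u_charFromName it
returns a string rather than a code point so that sequences fit.

```python
import enum
import unicodedata


class NameChoice(enum.Enum):
    NAME = "name"
    ALIAS = "alias"
    SEQUENCE = "sequence"
    ANY = "any"


# One-entry excerpts, hand-copied for the example only.
ALIASES = {
    "BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS": "\U0001D0C5",
}
SEQUENCES = {
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300",
}


def _primary_name(name):
    # Accept a hit only when NAME is the character's primary name.
    try:
        found = unicodedata.lookup(name)
    except KeyError:
        return None
    if len(found) == 1 and unicodedata.name(found, "") == name.upper():
        return found
    return None


def string_from_name(name, choice=NameChoice.ANY):
    # Return the string for NAME within the chosen namespace, or None.
    finders = {
        NameChoice.NAME: _primary_name,
        NameChoice.ALIAS: ALIASES.get,
        NameChoice.SEQUENCE: SEQUENCES.get,
    }
    if choice is NameChoice.ANY:
        for find in finders.values():
            found = find(name)
            if found is not None:
                return found
        return None
    return finders[choice](name)
```

A caller who wants only real names passes NameChoice.NAME; a caller who does
not care which part of the namespace a name lives in leaves the default.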

Now I'll tell you what Perl does.  I do this not to say it is "right",
but just to show you one possible strategy.  I also am in the middle
of writing about this for the Camel, so it is in my head.

Perl does not provide the old 1.0 names at all.  We don't have a Unicode
1.0 legacy to support, which makes this cleaner.  However, we do provide
for the names of the C0 and C1 Control Codes, because apart from Unicode
1.0, the standard doesn't condescend to name the ASCII or Latin1 control
codes.

We also provide for certain well known aliases from the Names file:
anything that says "* commonly abbreviated as ...", so things like LRO
and ZWJ and such.

Perl makes no distinction between anything in the namespace when using
the \N{} form for string and regex escapes.  That means when you use
"\N{...}" or /\N{...}/, you don't know which it is, nor can you.
(And yes, the bracketed character class issue is annoying and unsolved.)

However, the "functional" API does make a slight distinction.  

 -- charnames::vianame() takes a name or alias (as a string) and returns
    a single integer code point.

    eg: This therefore converts "LATIN SMALL LETTER A" into 0x61.
        It also converts both
            BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
        and
            BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
        into 0x1D0C5.  See below.

 -- charnames::string_vianame() takes a string name, alias, *or* sequence,
    and gives back a string.

    eg: This therefore converts "LATIN SMALL LETTER A" into "a".
        Since it has a string return instead of an int, it now also
        handles everything from the NamedSequences file as well.
        (See below.)

 -- charnames::viacode() takes an integer and gives back the official alias
    if there is one, and the official name if there is not.

    eg: This converts 0x61 into "LATIN SMALL LETTER A".
        It also converts 0x1D0C5 into "BYZANTINE MUSICAL SYMBOL FTHORA
        SKLIRON CHROMA VASIS".

Consider

    BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS

That was an error, and there is an official alias fixing it:

    BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

(That's FHTORA vs FTHORA.)

You may use either as the name, and if you reverse the code 
point to name, you get the replacement alias.

 % perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS")'
 1D0C5

 % perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")'
 1D0C5

 % perl -mcharnames -wle 'print charnames::viacode(charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"))'
 BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

So on round-tripping, I gave it the "wrong" one (the original) and it gave
me back the "right" one (the replacement).
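
For comparison, here is a minimal Python sketch of the same alias-preferring
round trip.  unicodedata.name() on its own reports the original FHTORA
spelling, so the sketch layers a correction table on top.  CORRECTIONS,
preferred_name, and round_trip are hypothetical names, and the table holds
just the one entry discussed here.

```python
import unicodedata

ORIGINAL = "BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"

# Hypothetical correction table: code point -> corrected alias (excerpt).
CORRECTIONS = {
    0x1D0C5: "BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS",
}


def preferred_name(code_point):
    # Prefer the corrected alias; fall back to the character's own name.
    corrected = CORRECTIONS.get(code_point)
    if corrected is not None:
        return corrected
    return unicodedata.name(chr(code_point))


def round_trip(name):
    # Name -> code point -> preferred name, like viacode(vianame(name)).
    return preferred_name(ord(unicodedata.lookup(name)))


print(unicodedata.name("\U0001D0C5"))
print(round_trip(ORIGINAL))
```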

Using the \N{} thing, it again doesn't matter:

 % perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS}"'
 1D0C5

 % perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}"'
 1D0C5

The interesting thing is the named sequences. string_vianame() works just fine on those:

 % perl -mcharnames -wle 'print length charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
 2

 % perl -mcharnames -wle 'printf "U+%v04X\n",  charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
 U+0100.0300

And that works fine with \N{} as well (provided you don't try charclasses):

 % perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 Ā̀

 % perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"' | uniquote -v
 \N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}

 % perl -mcharnames=:full -wle 'print length "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 2

 % perl -mcharnames=:full -wle 'printf "U+%v04X\n", "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 U+0100.0300

It's kinda sad that for \N{} and sequences you can't just "do the right
thing" with strings and say that charclass stuff just isn't supported.
But my guess is that this simply won't work because you don't have
first class regexes.  If you pass both of these to the regex engine,
they should behave the same (and would, assuming the regex compiler
knows about \N{} escapes):

    "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
    r'\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}'

However, that falls apart if you do

    "[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]"
    r'[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]'

The two differ because the compiler will do the substitution early on
the first one but not the second.  This seems a problem, eh?  So I guess
you can't do it at all?  Or could you document it?   I think there
is no good solution here.  Perl can and does actually do something
quite reasonable in the noncharclass case, but that is because we
know that we are compiling a regex in virtually all scenarios.

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN SMALL LETTER A}/'
    (?^u:\N{U+61})

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON}/'
    (?^u:\N{U+100})

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    (?^u:\N{U+100.300})

So you can do:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ /\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    1

And it is just fine.  The issue is that there are ways for you to get
yourself into trouble if you do string-string stuff:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
    1
    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "^[\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]+\$"'
    1

That works, but only accidentally, because of course every character in
U+0100.0300 is either U+0100 or U+0300, and a bracketed class holding
those two code points matches each of them individually.
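
The same accident can be reproduced with Python's re module once the
sequence has already been substituted into an ordinary string.  This is only
a sketch: the sequence is written out as its two code points, and the
variable names are made up for the example.

```python
import re

# What the compiler leaves behind after early substitution of the sequence.
SEQUENCE = "\u0100\u0300"

# By the time re sees these, the class holds two independent code points.
anchored_class = re.compile("^[" + SEQUENCE + "]+$")
negated_class = re.compile("[^" + SEQUENCE + "]")

samples = [SEQUENCE, "\u0300\u0100", "\u0300\u0300\u0300"]
for sample in samples:
    print(ascii(sample), anchored_class.match(sample) is not None)

# The negated class rejects a bare U+0100 even though it is not the sequence.
print(negated_class.match("\u0100") is None)
```

The positive class happily accepts the two code points in the wrong order or
repeated, and the negated class rejects either one standing alone, which is
not what "anything but this sequence" would mean.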

This is not a solved problem.

I hope this helps.

--tom

History
Date	User	Action	Args
2011-09-30 22:07:27	tchrist	set	recipients: + tchrist, lemburg, gvanrossum, loewis, terry.reedy, ezio.melotti, mrabarnett
2011-09-30 22:07:26	tchrist	link	issue12753 messages
2011-09-30 22:07:24	tchrist	create