This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
Type: enhancement Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder: Unicode case mappings are incorrect
View: 4610
Assigned To: ezio.melotti Nosy List: belopolsky, ezio.melotti, flox, gvanrossum, lemburg, loewis, mrabarnett, python-dev, tchrist, terry.reedy
Priority: normal Keywords: needs review, patch

Created on 2011-08-15 17:48 by tchrist, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
nametests.py tchrist, 2011-08-15 19:51 test case to check unicodedata.lookup and \N{} against named chars AND formal alias AND named sequences
issue12753.diff ezio.melotti, 2011-09-30 08:59 patch to add the aliases review
issue12753-2.diff ezio.melotti, 2011-10-01 02:15 patch to add the aliases and named sequences review
issue12753-3.diff ezio.melotti, 2011-10-02 07:34 patch to add the aliases and named sequences + tests + doc review
issue12753-4.diff ezio.melotti, 2011-10-11 02:11 patch to add the aliases and named sequences + tests + doc review
Messages (37)
msg142136 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 17:48
Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python.  (If this is construed to be an extant bug rather than an absent feature, you probably want to change this from a wish to a bug in the ticket.)

This is a problem because aliases correct errors in the original names, and are the preferred versions.  For example, ISO screwed up when they called U+01A2 LATIN CAPITAL LETTER OI.  It is actually LATIN CAPITAL LETTER GHA according to the file NameAliases.txt in the Unicode Character Database.  However, Python blows up when you try to use this:

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER GHA}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name
    Exit 1
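A sketch of the sort of workaround this forces today: fall back from the corrected alias to the original UCD name by hand. The helper name and the hand-maintained alias table are illustrative, not any proposed API (on interpreters where this issue is fixed, the first lookup already succeeds):

```python
import unicodedata

# Hand-maintained map from corrected alias to original UCD name,
# needed only on interpreters that reject the alias outright.
ALIAS_TO_NAME = {"LATIN CAPITAL LETTER GHA": "LATIN CAPITAL LETTER OI"}

def lookup_with_alias(name):
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # The alias was rejected; retry with the original (wrong) name.
        return unicodedata.lookup(ALIAS_TO_NAME[name])

print(lookup_with_alias("LATIN CAPITAL LETTER GHA"))  # Ƣ
```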

This is unfortunate, because the formal aliases correct egregious blunders, such as the Standard reading "BRAKCET" instead of "BRACKET":

$ uninames '^\s+%'
 Ƣ  01A2        LATIN CAPITAL LETTER OI
        % LATIN CAPITAL LETTER GHA
 ƣ  01A3        LATIN SMALL LETTER OI
        % LATIN SMALL LETTER GHA
        * Pan-Turkic Latin alphabets
 ೞ  0CDE        KANNADA LETTER FA
        % KANNADA LETTER LLLA
        * obsolete historic letter
        * name is a mistake for LLLA
 ຝ  0E9D        LAO LETTER FO TAM
        % LAO LETTER FO FON
        = fo fa
        * name is a mistake for fo sung
 ຟ  0E9F        LAO LETTER FO SUNG
        % LAO LETTER FO FAY
        * name is a mistake for fo tam
 ຣ  0EA3        LAO LETTER LO LING
        % LAO LETTER RO
        = ro rot
        * name is a mistake, lo ling is the mnemonic for 0EA5
 ລ  0EA5        LAO LETTER LO LOOT
        % LAO LETTER LO
        = lo ling
        * name is a mistake, lo loot is the mnemonic for 0EA3
 ࿐  0FD0        TIBETAN MARK BSKA- SHOG GI MGO RGYAN
        % TIBETAN MARK BKA- SHOG GI MGO RGYAN
        * used in Bhutan
 ꀕ A015        YI SYLLABLE WU
        % YI SYLLABLE ITERATION MARK
        * name is a misnomer
 ︘ FE18        PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
        % PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
        * misspelling of "BRACKET" in character name is a known defect
        # <vertical> 3017
 𝃅  1D0C5       BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
        % BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
        * misspelling of "FTHORA" in character name is a known defect

In Perl, \N{...} grants access to the single, shared, common namespace of Unicode character names, formal aliases, and named sequences without distinction:

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")'
    Ƣ

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'  | uniquote -x
    \x{1A2}
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")' | uniquote -x
    \x{1A2}

It is my suggestion that Python do the same thing. There are currently only 11 of these.  

The third element in this shared namespace, named sequences, consists of multiple code points masquerading under one name.  They come from the NamedSequences.txt file in the Unicode Character Database.  An example entry is:

    LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300
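The entry format is simple: the field before the semicolon is the sequence name, the rest is space-separated hex code points. A minimal sketch of parsing one such entry (the function name is illustrative):

```python
# Parse one NamedSequences.txt entry into (name, string of code points).
def parse_named_sequence(line):
    name, _, codepoints = line.partition(";")
    # Each field after the ';' is a hex code point; join them into a string.
    return name.strip(), "".join(chr(int(cp, 16)) for cp in codepoints.split())

name, seq = parse_named_sequence(
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300")
print(name, len(seq))  # the sequence is two code points long
```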

There are 418 of these named sequences as of Unicode 6.0.0.  This shows that Perl can also access named sequences:

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
  Ā̀

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
  \x{100}\x{300}

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'            
  ㇷ゚

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
  \x{31F7}\x{309A}


Since it is a single namespace, it makes sense that all members of that namespace should be accessible using \N{...} as a sort of equal-opportunity accessor mechanism, and it does not make sense for them not to be.

Just make sure you take only the approved named sequences from the NamedSequences.txt file. It would be unwise to give users access to the provisional sequences located in a neighboring file I shall not name :) because those are not guaranteed never to be withdrawn the way the others are, and so you would risk introducing an incompatibility.

If you look at the ICU UCharacter class, you can see that they provide a more
msg142145 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 19:51
Here’s the right test file for the right ticket.
msg142502 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-19 22:50
I verified that the test file raises the quoted SyntaxError on 3.2 on Win7. This:

>>> "\N{LATIN CAPITAL LETTER GHA}"
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name

is most likely a result of this:

>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    unicodedata.lookup("LATIN CAPITAL LETTER GHA")
KeyError: "undefined character name 'LATIN CAPITAL LETTER GHA'"

Although the lookup comes first in nametests.py, it is never executed because of the later SyntaxError.

The Reference for string literals says:
"\N{name} Character named name in the Unicode database"

The doc for unicodedata says
"This module provides access to the Unicode Character Database (UCD) which defines character properties for all Unicode characters. The data contained in this database is compiled from the UCD version 6.0.0.

The module uses the same names and symbols as defined by Unicode Standard Annex #44, “Unicode Character Database”." 
http://www.unicode.org/reports/tr44/tr44-6.html

So the question is, what are the 'names' therein defined?
All such should be valid inputs to 
"unicodedata.lookup(name) Look up character by name."

The annex refers to http://www.unicode.org/Public/6.0.0/ucd/
This contains NamesList.txt, derived from UnicodeData.txt. Unicodedata must be using just the latter. The ucd directory also contains NameAliases.txt, NamedSequences.txt, and the file of provisional named sequences.

As best I can tell, the annex plus files are a bit ambiguous as to  'Unicode character name'. The following quote seems neutral: "the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names." The following: "Unicode character names constitute a special case. Formally, they are values of the Name property." points toward UnicodeData.txt, which lists the Name property along with others. However, "Unicode character name, as published in the Unicode names list," indirectly points toward including aliases. NamesList.txt says it contains the "Final Unicode 6.0 names list." (but one which "should not be parsed for machine-readable information". It includes all 11 aliases in NameAliases.txt. 

My current opinion is that adding the aliases might be done in current releases. It certainly would serve any user who does not know to misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Adding named sequences is definitely a feature request. The definition of .lookup(name) would be enlarged to "Look up character by name, alias, or named sequence" with reference to the specific files. The meaning of \N{} would also have to be enlarged.

Minimal test code might be:

from unicodedata import lookup
AssertEqual(lookup("LATIN CAPITAL LETTER GHA"), "\u01a2")
AssertEqual(lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE"),
   "\u0100\u0300")
plus a test that "\N{LATIN CAPITAL LETTER GHA}" and
"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" compile without error (I have no idea how to write that).

---
> "If you look at the ICU UCharacter class, you can see that they provide a more"

More what ;-)
I presume ICU = International Components for Unicode, icu-project.org/
"Offers a portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N)."
[appears to be free, open source, and possibly usable within Python]
msg142506 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-19 23:26
"Terry J. Reedy" <report@bugs.python.org> wrote
   on Fri, 19 Aug 2011 22:50:58 -0000: 

> My current opinion is that adding the aliases might be done in current
> releases. It certainly would serve the any user who does not know to
> misspell 'FTHORA' as 'FHTORA' for just one of the 17 'FTHORA' chars.

Yes, I think the 11 aliases pose no problem.  It's amazing the trouble
you get into from having a fat-fingered amanuensis typing your laws 
into indelible stone tablets.

> Adding named sequences is definitely a feature request. The definition
> of .lookup(name) would be enlarged to "Look up character by name,
> alias, or named sequence" with reference to the specific files. The
> meaning of \N{} would also have to be enlarged.

But these do.  The problem is bracketed character classes.  
Yes, if you got named reference into the regex compiler as a raw
string, it could in theory rewrite

    [abc\N{seq}] 

as 

    (?:[abc]|\N{seq})

but that doesn't help if the sequence got replaced as a string escape.
At which point you have different behavior in the two lookalike cases.
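The rewrite sketched above can be simulated by hand with Python's re, expanding the sequence into an alternation before the pattern is compiled (the sequence is hard-coded here, since \N{} cannot name it):

```python
import re

# LATIN CAPITAL LETTER A WITH MACRON AND GRAVE, written out by hand.
SEQ = "\u0100\u0300"

# [abc\N{seq}] rewritten manually as (?:[abc]|seq):
pattern = re.compile("(?:[abc]|%s)" % re.escape(SEQ))

assert pattern.fullmatch("a")            # single chars from the class
assert pattern.fullmatch(SEQ)            # the whole two-code-point sequence
assert not pattern.fullmatch("\u0100")   # but not half of it
```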

If you ask how we do this in Perl, the answer is "poorly".  It really only
works well in strings, not charclasses, although there is a proposal to do
a rewrite during compilation like I've spelled out above.  Seems messy for
something that might(?) not get much use.  But it would be nice for \N{} to
work to access the whole namespace without prejudice.  I have a feeling
this may be a case of trying to keep one's cake and eating it too, as
the two goals seem to rule each other out.

>> "If you look at the ICU UCharacter class, you can see that they provide a more"

> More what ;-)

More expressive set of lookup functions where it is clear which thing
you are getting.  I believe the ICU regexes only support one-char returns
for \N{...}, not multis per the sequences.  But I may not be looking
at the right docs for ICU; not sure.

> I presume ICU =International Components for Unicode, icu-project.org/
> "Offers a portable set of C/C++ and Java libraries for Unicode support,
> software internationalization (I18N) and globalization (G11N)." [appears
> to be free, open source, and possibly usable within Python]

Well, there are some Python bindings for ICU that I was eager to try out,
because I wanted to see whether I could get at full/real Unicode collation
that way, but I had trouble getting the Python bindings to compile.  Not
sure why.  The documentation for the Python bindings isn't very um wordy,
and it isn't clear how tightly integrated it all is: there's talk about C++
strings that kind of scares me. :)

Hm, and maybe they are only for Python 2 not Python 3, which I try to do
all my Python stuff in because it seems like it has a better Unicode model.

--tom
msg142507 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2011-08-19 23:36
For the "Line_Break" property, one of the possible values is "Inseparable", with 2 permitted aliases, the shorter "IN" (which is reasonable) and "Inseperable" (ouch!).
msg142508 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-19 23:57
Matthew Barnett <report@bugs.python.org> wrote
   on Fri, 19 Aug 2011 23:36:45 -0000: 

> For the "Line_Break" property, one of the possible values is
> "Inseparable", with 2 permitted aliases, the shorter "IN" (which 
> is reasonable) and "Inseperable" (ouch!).

Yeah, I've shaken my head at that one, too.

It's one thing to make an alias for something you typo'd in the first 
place, but to have something that's correct which you then make a typo 
alias for is just encouraging bad/sloppy/wrong behavior.

    Bidi_Class=Paragraph_Separator
    Bidi_Class=Common_Separator
    Bidi_Class=European_Separator
    Bidi_Class=Segment_Separator
    General_Category=Line_Separator
    General_Category=Paragraph_Separator
    General_Category=Separator
    General_Category=Space_Separator
    Line_Break=Inseparable
    Line_Break=Inseperable

And then there's this set, which makes you wonder
why they couldn't spell at least *one* of them out:

    Sentence_Break=Sep SB=SE
    Sentence_Break=Sp  SB=Sp

You really have to look those up to realize they're two different things:

    SB ; SE        ; Sep
    SB ; SP        ; Sp

And that none of them have something like SB=Space or SB=Separator
so you know what you're talking about.  Grrr.

--tom
msg143043 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-26 21:26
+1 on the feature request.
msg144679 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-30 08:59
The attached patch changes Tools/unicode/makeunicodedata.py to create a list of names and codepoints taken from http://www.unicode.org/Public/6.0.0/ucd/NameAliases.txt and adds it to Modules/unicodename_db.h.
During the lookup the _getcode function at Modules/unicodedata.c:1055 loops over the 11 aliases and checks if any of those match.
The patch also includes tests for both unicodedata.lookup and \N{}.

I'm not sure this is the best way to implement this, and someone will probably want to review and tweak both the approach and the C code, but it works fine:
>>> "\N{LATIN CAPITAL LETTER GHA}"
'Ƣ'
>>> import unicodedata
>>> unicodedata.lookup("LATIN CAPITAL LETTER GHA")
'Ƣ'
>>> "\N{LATIN CAPITAL LETTER OI}"
'Ƣ'
>>> unicodedata.lookup("LATIN CAPITAL LETTER OI")
'Ƣ'

The patch doesn't include changes for NamedSequences.txt.
msg144681 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-09-30 10:00
I propose to use a better lookup algorithm using binary search, and then integrate the NamedSequences into this as well. The search result could be a record

 struct {
   char *name;
   int len;
   Py_UCS4 chars[3]; /* no sequence is more than 3 chars */
 }

You would have two tables for these: one for the aliases, and one for the named sequences.

_getcode would continue to return a single char only, and thus not support named sequences. lookup could well return strings longer than 1, but only in 3.3.

I'm not sure that \N escapes should support named sequences: people rightfully expect that each escaped element in a string literal constitutes exactly one character.
msg144703 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-30 20:30
Leaving named sequences for unicodedata.lookup() only (and not for \N{}) makes sense.

The list of aliases is so small (11 entries) that I'm not sure using a binary search for it would bring any advantage.  Having a single lookup algorithm that looks in both tables doesn't work because the aliases lookup must be in _getcode for \N{...} to work, whereas the lookup of named sequences will happen in unicodedata_lookup (Modules/unicodedata.c:1187).
I think we can leave the for loop over aliases in _getcode and implement a separate (and binary) search in unicodedata_lookup for the named sequences.  Does that sound fine?
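A Python model of that split lookup (the tables here are abridged and illustrative, not the real generated data): a plain loop over the handful of aliases, a binary search over the sorted named-sequence table.

```python
from bisect import bisect_left

# Abridged stand-ins for the generated tables.
ALIASES = [("LATIN CAPITAL LETTER GHA", 0x01A2)]
SEQUENCES = sorted([
    ("KATAKANA LETTER AINU P", "\u31F7\u309A"),
    ("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE", "\u0100\u0300"),
])
_SEQ_NAMES = [n for n, _ in SEQUENCES]

def lookup(name):
    for alias, cp in ALIASES:            # 11 entries: a linear scan is fine
        if alias == name:
            return chr(cp)
    i = bisect_left(_SEQ_NAMES, name)    # sequences: binary search by name
    if i < len(_SEQ_NAMES) and _SEQ_NAMES[i] == name:
        return SEQUENCES[i][1]
    raise KeyError("undefined character name %r" % name)
```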
msg144708 - (view) Author: Tom Christiansen (tchrist) Date: 2011-09-30 22:07
>Ezio Melotti <ezio.melotti@gmail.com> added the comment:

> Leaving named sequences for unicodedata.lookup() only (and not for
> \N{}) makes sense.

There are certainly advantages to that strategy: you don't have to
deal with [\N{sequence}] issues.  If the argument to unicode.lookup()
can be any of name, alias, or sequence, that seems ok.  \N{} should
still do aliases, though, since those don't have the complication that
sequences have.

You may wish unicode.name() to return the alias in preference, however.
That's what we do.  And of course, there is no issue of sequences there.

The rest of this perhaps painfully long message is just elaboration
and icing on what I've said above.

--tom

> The list of aliases is so small (11 entries) that I'm not sure using a
> binary search for it would bring any advantage.  Having a single
> lookup algorithm that looks in both tables doesn't work because the
> aliases lookup must be in _getcode for \N{...} to work, whereas the
> lookup of named sequences will happen in unicodedata_lookup
> (Modules/unicodedata.c:1187).  I think we can leave the for loop over
> aliases in _getcode and implement a separate (and binary) search in
> unicodedata_lookup for the named sequences.  Does that sound fine?

If you mean, is it ok to add just the aliases and not the named sequences to
\N{}, it is certainly better than not doing so at all.  Plus that way you do
*not* have to figure out what in the world to do with [^a-c\N{sequence}],
since that would have to be something like (?!\N{sequence})[^a-c], which is 
hardly obvious, especially if \N{sequence} actually starts with [a-c].

However, because the one namespace comprises all three of names,
aliases, and named sequences, it might be best to have a functional
(meaning, non-regex) API that allows one to do a fetch on the whole
namespace, or on each individual component.

The ICU library supports this sort of thing.  In ICU4J's Java bindings, 
we find this:

    static int getCharFromExtendedName(String name) 
       [icu] Find a Unicode character by either its name and return its code point value.
    static int	getCharFromName(String name) 
       [icu] Finds a Unicode code point by its most current Unicode name and return its code point value.
    static int	getCharFromName1_0(String name) 
       [icu] Find a Unicode character by its version 1.0 Unicode name and return its code point value.
    static int	getCharFromNameAlias(String name) 
       [icu] Find a Unicode character by its corrected name alias and return its code point value.

The first one obviously has a bug in its definition, as the English
doesn't scan.  Looking at the full definition is even worse.  Rather
than dig out the src jar, I looked at ICU4C, but its own bindings are
completely different.  There you have only one function, with an enum to
say what namespace to access:

    UChar32 u_charFromName  (       UCharNameChoice         nameChoice, 
		    const char *    name, 
		    UErrorCode *    pErrorCode 
	    )

The UCharNameChoice enum tells what sort of thing you want:

    U_UNICODE_CHAR_NAME,
    U_UNICODE_10_CHAR_NAME,
    U_EXTENDED_CHAR_NAME,
    U_CHAR_NAME_ALIAS,          
    U_CHAR_NAME_CHOICE_COUNT

Looking at the src for the Java is no more immediately illuminating, 
but I think that "extended" may refer to a union of the old 1.0 names 
with the current names.

Now I'll tell you what Perl does.  I do this not to say it is "right",
but just to show you one possible strategy.  I also am in the middle
of writing about this for the Camel, so it is in my head.

Perl does not provide the old 1.0 names at all.  We don't have a Unicode
1.0 legacy to support, which makes this cleaner.  However, we do provide
for the names of the C0 and C1 Control Codes, because apart from Unicode
1.0, they don't condescend to name the ASCII or Latin1 control codes.  

We also provide for certain well known aliases from the Names file:
anything that says "* commonly abbreviated as ...", so things like LRO
and ZWJ and such.

Perl makes no distinction between anything in the namespace when using
the \N{} form for string and regex escapes.  That means when you use
"\N{...}" or /\N{...}/, you don't know which it is, nor can you.
(And yes, the bracketed character class issue is annoying and unsolved.)

However, the "functional" API does make a slight distinction.  

 -- charnames::vianame() takes a name or alias (as a string) and returns a single 
	integer code point.

	eg: This therefore converts "LATIN SMALL LETTER A" into 0x61.
	    It also converts both 
		BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
	    and 
		BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
	    into 0x1D0C5.  See below.

 -- charnames::string_vianame() takes a string name, alias, *or* sequence, 
	and gives back a string.   

	eg: This therefore converts "LATIN SMALL LETTER A" into "a".
            Since it has a string return instead of an int, it now also
            handles everything from NamedSequences file as well. (See below.)

 -- charnames::viacode() takes an integer and gives back the official alias 
	if there is one, and the official name if there is not.

	eg: This converts 0x61 into "LATIN SMALL LETTER A".
            It also converts 0x1D0C5 into "BYZANTINE MUSICAL SYMBOL FTHORA
            SKLIRON CHROMA VASIS".

Consider

    BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS

That was an error, and there is an official alias fixing it:

    BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

(That's FHTORA vs FTHORA.)

You may use either as the name, and if you reverse the code 
point to name, you get the replacement alias.

 % perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS")'
 1D0C5

 % perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")'
 1D0C5

 % perl -mcharnames -wle 'print charnames::viacode(charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"))'
 BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

So on round-tripping, I gave it the "wrong" one (the original) and it gave
me back the "right" one (the replacement).

Using the \N{} thing, it again doesn't matter:

 % perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS}"'
 1D0C5

 % perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}"'
 1D0C5

The interesting thing is the named sequences. string_vianame() works just fine on those:

 % perl -mcharnames -wle 'print length charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
 2

 % perl -mcharnames -wle 'printf "U+%v04X\n",  charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
 U+0100.0300

And that works fine with \N{} as well (provided you don't try charclasses):

 % perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 Ā̀

 % perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"' | uniquote -v
 \N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}

 % perl -mcharnames=:full -wle 'print length "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 2

 % perl -mcharnames=:full -wle 'printf "U+%v04X\n", "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 U+0100.0300

It's kinda sad that for \N{} and sequences you can't just "do the right
thing" with strings and say that charclass stuff just isn't supported.
But my guess is that this simply won't work because you don't have 
first class regexes.  If you pass both of these to the regex engine,
they should behave the same (and would, assuming the regex compiler
knows about \N{} escapes):

    "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
    r'\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}'

However, that falls apart if you do 

    "[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]"
    r'[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]'

Because the compiler will do the substitution early on the first
one but not the second.  This seems a problem, eh?  So I guess
you can't do it at all?  Or could you document it?   I think there
is no good solution here.  Perl can and does actually do something
quite reasonable in the noncharclass case, but that is because we
know that we are compiling a regex in virtually all scenarios.

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN SMALL LETTER A}/'
    (?^u:\N{U+61})

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON}/'
    (?^u:\N{U+100})

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    (?^u:\N{U+100.300})

So you can do:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ /\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    1

And it is just fine.  The issue is that there are ways for you to get
yourself into trouble if you do string-string stuff:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
    1
    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "^[\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]+\$"'
    1

That works, but only accidentally, because of course U+0100.0300 contains
nothing but either U+0100 or U+0300.
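The same accident reproduces in Python's re: inside [...] the sequence's two code points become independent set members, so any string built from either of them matches, in any order.

```python
import re

seq = "\u0100\u0300"  # LATIN CAPITAL LETTER A WITH MACRON AND GRAVE
cls = re.compile("^[%s]+$" % seq)

assert cls.match(seq)              # looks right...
assert cls.match(seq[::-1])        # ...but the reversed order matches too
assert cls.match("\u0300\u0300")   # as does any mix of the two code points
```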

This is not a solved problem.

I hope this helps.

--tom
msg144716 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-01 02:15
Attached a new patch that adds support for named sequences (still needs some test and can probably be improved).

> There are certainly advantages to that strategy: you don't have to
> deal with [\N{sequence}] issues.

I assume with [] you mean a regex character class, right?

> If the argument to unicode.lookup() can be any of name, alias, or 
> sequence, that seems ok. 

With my latest patch, all 3 are supported.

> \N{} should still do aliases, though, since those don't have the 
> complication that sequences have.

\N{} will only support names and aliases (maybe this can go in 2.7/3.2 too).

> You may wish unicode.name() to return the alias in preference,
> however. That's what we do.  And of course, there is no issue of 
> sequences there.

This can be done for 3.3, but I wonder if it might create problems.  People might use unicodedata.name() to get a name and use it elsewhere, and the other side might not be aware of aliases.
msg144738 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-01 15:04
> Does that sound fine?

Yes, that's fine as well.
msg144739 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-01 15:17
> You may wish unicode.name() to return the alias in preference, however.

-1. .name() is documented (and users familiar with it expect it) as
returning the name of the character from the UCD.

It doesn't really matter much to me if it's non-sensical - it's just
a label. Notice that many characters have names like "CJK UNIFIED
IDEOGRAPH-4E20", which isn't very descriptive, either. What does matter
is that the name returned matches the same name in many other places
in the net, which (rightfully) all use the UCD name (they might provide
the alias as well if they are aware of aliases, but often don't).
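On a CPython where this issue's fix landed (3.3 and later, as far as I can tell), the asymmetry Martin argues for is observable: lookup() accepts both spellings, while name() keeps returning the UCD Name property, misspelling and all.

```python
import unicodedata

fixed = "BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS"     # alias
original = "BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"  # UCD Name

# Both names resolve to the same code point...
assert unicodedata.lookup(fixed) == unicodedata.lookup(original) == "\U0001D0C5"
# ...but name() sticks to the (misspelled) UCD name.
assert unicodedata.name("\U0001D0C5") == original
```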

> If you mean, is it ok to add just the aliases and not the named sequences to
> \N{}, it is certainly better than not doing so at all.  Plus that way you do
> *not* have to figure out what in the world to do with [^a-c\N{sequence}],

Python doesn't use regexes in the language parser, but does do \N
escapes in the parser. So there is no way this transformation could
possibly be made - except when you are talking about escapes in regexes,
and not escapes in Unicode strings.

> Perl does not provide the old 1.0 names at all.  We don't have a Unicode
> 1.0 legacy to support, which makes this cleaner.  However, we do provide
> for the names of the C0 and C1 Control Codes, because apart from Unicode
> 1.0, they don't condescend to name the ASCII or Latin1 control codes.  

If there would be a reasonably official source for these names, and one
that guarantees that there is no collision with UCD names, I could
accept doing so for Python as well.

> We also provide for certain well known aliases from the Names file:
> anything that says "* commonly abbreviated as ...", so things like LRO
> and ZWJ and such.

-1. Readability counts, writability not so much (I know this is
different for Perl :-). If there is too much aliasing, people will
wonder what these codes actually mean.
msg144757 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-02 05:33
>> Perl does not provide the old 1.0 names at all.  We don't have a Unicode
>> 1.0 legacy to support, which makes this cleaner.  However, we do provide
>> for the names of the C0 and C1 Control Codes, because apart from Unicode
> 1.0, they don't condescend to name the ASCII or Latin1 control codes.

> If there would be a reasonably official source for these names, and one
> that guarantees that there is no collision with UCD names, I could
> accept doing so for Python as well.

The C0 and C1 control code names don't change.  There is/was one stability
issue where they screwed up, because they ended up having a UAX (required)
and a UTS (not required) fighting because of the dumb stuff they did with
the Emoji names. They neglected to prefix them with "Emoji ..." or some
such, the way things like "GREEK ... LETTER ..." or "MATHEMATICAL ..." or
"MUSICAL ..." did.  The problem is they stole BELL without calling it EMOJI
BELL.  This is the C0 name for Control-G.  Dimwits.

The problem with official names is that they have things in them that you
are not expected in names.  Do you really and truly mean to tell me you
think it is somehow **good** that people are forced to write

    \N{LINE FEED (LF)}

Rather than the more obvious pair of 

    \N{LINE FEED}
    \N{LF}

??

If so, then I don't understand that.  Nobody in their right 
mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LF}"'
    U+000A
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{LINE FEED (LF)}"'
    U+000A

    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEL}"'
    U+0085
    % perl -Mcharnames=:full -le 'printf "U+%04X\n", ord "\N{NEXT LINE (NEL)}"'
    U+0085

>> We also provide for certain well known aliases from the Names file:
>> anything that says "* commonly abbreviated as ...", so things like LRO
>> and ZWJ and such.

> -1. Readability counts, writability not so much (I know this is
> different for Perl :-). 

I actually very strongly resent and rebuff that entire mindset in the most
extreme way possible.  Well-written Perl code is perfectly readable by
people who speak that language.  If you find Perl code that isn't readable,
it is by definition not well-written.

*PLEASE* don't start.  

Yes, I just got done driving 16 hours and am overtired, but it's 
something I've been fighting against all of my professional career.
It's a "leyenda negra".

> If there is too much aliasing, people will
> wonder what these codes actually mean.

There are 15 "commonly abbreviated as" aliases in the Names.txt file.

    * commonly abbreviated as NBSP
    * commonly abbreviated as SHY
    * commonly abbreviated as CGJ
    * commonly abbreviated ZWSP
    * commonly abbreviated ZWNJ
    * commonly abbreviated ZWJ
    * commonly abbreviated LRM
    * commonly abbreviated RLM
    * commonly abbreviated LRE
    * commonly abbreviated RLE
    * commonly abbreviated PDF
    * commonly abbreviated LRO
    * commonly abbreviated RLO
    * commonly abbreviated NNBSP
    * commonly abbreviated WJ

All of the standards documents *talk* about things like LRO and ZWNJ.
I guess the standards aren't "readable" then, right? :)

From the charnames manpage, which shows that we really don't just make
these up as we feel like (although we could; see below).  They're all from
this or that standard:

    ALIASES
       A few aliases have been defined for convenience: instead
       of having to use the official names

           LINE FEED (LF)
           FORM FEED (FF)
           CARRIAGE RETURN (CR)
           NEXT LINE (NEL)

       (yes, with parentheses), one can use

           LINE FEED
           FORM FEED
           CARRIAGE RETURN
           NEXT LINE
           LF
           FF
           CR
           NEL

       All the other standard abbreviations for the controls,
       such as "ACK" for "ACKNOWLEDGE" also can be used.

       One can also use

           BYTE ORDER MARK
           BOM

       and these abbreviations

           Abbreviation        Full Name

           CGJ                 COMBINING GRAPHEME JOINER
           FVS1                MONGOLIAN FREE VARIATION SELECTOR ONE
           FVS2                MONGOLIAN FREE VARIATION SELECTOR TWO
           FVS3                MONGOLIAN FREE VARIATION SELECTOR THREE
           LRE                 LEFT-TO-RIGHT EMBEDDING
           LRM                 LEFT-TO-RIGHT MARK
           LRO                 LEFT-TO-RIGHT OVERRIDE
           MMSP                MEDIUM MATHEMATICAL SPACE
           MVS                 MONGOLIAN VOWEL SEPARATOR
           NBSP                NO-BREAK SPACE
           NNBSP               NARROW NO-BREAK SPACE
           PDF                 POP DIRECTIONAL FORMATTING
           RLE                 RIGHT-TO-LEFT EMBEDDING
           RLM                 RIGHT-TO-LEFT MARK
           RLO                 RIGHT-TO-LEFT OVERRIDE
           SHY                 SOFT HYPHEN
           VS1                 VARIATION SELECTOR-1
           .
           .
           .
           VS256               VARIATION SELECTOR-256
           WJ                  WORD JOINER
           ZWJ                 ZERO WIDTH JOINER
           ZWNJ                ZERO WIDTH NON-JOINER
           ZWSP                ZERO WIDTH SPACE

       For backward compatibility one can use the old names for
       certain C0 and C1 controls

           old                         new

           FILE SEPARATOR              INFORMATION SEPARATOR FOUR
           GROUP SEPARATOR             INFORMATION SEPARATOR THREE
           HORIZONTAL TABULATION       CHARACTER TABULATION
           HORIZONTAL TABULATION SET   CHARACTER TABULATION SET
           HORIZONTAL TABULATION WITH JUSTIFICATION    CHARACTER TABULATION
                                                       WITH JUSTIFICATION
           PARTIAL LINE DOWN           PARTIAL LINE FORWARD
           PARTIAL LINE UP             PARTIAL LINE BACKWARD
           RECORD SEPARATOR            INFORMATION SEPARATOR TWO
           REVERSE INDEX               REVERSE LINE FEED
           UNIT SEPARATOR              INFORMATION SEPARATOR ONE
           VERTICAL TABULATION         LINE TABULATION
           VERTICAL TABULATION SET     LINE TABULATION SET

       but the old names in addition to giving the character will
       also give a warning about being deprecated.

       And finally, certain published variants are usable,
       including some for controls that have no Unicode names:

           name                                   character

           END OF PROTECTED AREA                  END OF GUARDED AREA, U+0097
           HIGH OCTET PRESET                      U+0081
           HOP                                    U+0081
           IND                                    U+0084
           INDEX                                  U+0084
           PAD                                    U+0080
           PADDING CHARACTER                      U+0080
           PRIVATE USE 1                          PRIVATE USE ONE, U+0091
           PRIVATE USE 2                          PRIVATE USE TWO, U+0092
           SGC                                    U+0099
           SINGLE GRAPHIC CHARACTER INTRODUCER    U+0099
           SINGLE-SHIFT 2                         SINGLE SHIFT TWO, U+008E
           SINGLE-SHIFT 3                         SINGLE SHIFT THREE, U+008F
           START OF PROTECTED AREA                START OF GUARDED AREA, U+0096

    perl v5.14.0                2011-05-07                          2

Those are the defaults.  They are overridable.  That's because we feel that
people should be able to name their character constants however they feel
makes sense for them.  If they get tired of typing 

    \N{LATIN SMALL LETTER U WITH DIAERESIS}

let alone

    \N{LATIN CAPITAL LETTER THORN WITH STROKE THROUGH DESCENDER}

then they can, because there is a mechanism for making aliases:

    use charnames ":full", ":alias" => {
	U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
	u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
    };

That way you can do 

    s/\N{U_uml}/UE/;
    s/\N{u_uml}/ue/;

This is probably not as persuasive as the private-use case described below.

It is important to remember that all charname bindings in Perl are attached
to a *lexically-scoped declaration.  It is completely constrained to
operate only within that lexical scope.  That's why the compiler replaces
things like

    use charnames ":full", ":alias" => {
	U_uml => "LATIN CAPITAL LETTER U WITH DIAERESIS",
	u_uml => "LATIN SMALL LETTER U WITH DIAERESIS",
    };

    my $find_u_uml = qr/\N{u_uml}/i;

    print "Search pattern is: $find_u_uml\n";

Which dutifully prints out:

    Search pattern is: (?^ui:\N{U+FC})

So charname bindings are never "hard to read" because the effect is
completely lexically constrained, and can never leak outside of the scope.
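Python can't scope this lexically, but a module-local alias table plus a small helper gives a rough approximation (a sketch; the `n()` helper and the alias names here are invented for illustration):

```python
import unicodedata

# Module-local alias table -- a stand-in for Perl's lexically scoped ":alias".
_ALIASES = {
    "U_uml": "\N{LATIN CAPITAL LETTER U WITH DIAERESIS}",
    "u_uml": "\N{LATIN SMALL LETTER U WITH DIAERESIS}",
}

def n(name):
    """Resolve a local alias first, then fall back to the official UCD name."""
    return _ALIASES.get(name) or unicodedata.lookup(name)

assert n("u_uml") == "\u00fc"                      # local alias
assert n("GREEK SMALL LETTER ALPHA") == "\u03b1"   # official names still work
```

The difference, of course, is that this resolves at run time and the table is ordinary module state, not a compile-time, scope-bound binding.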

I realize (or at least, believe) that Python has no notion of nested
lexical scopes, and like many things, this sort of thing can therefore
never work there because of that.

The most persuasive use-case for user-defined names is for private-use
area code points.  These will never have an official name.  But it is 
just fine to use them.  Don't they deserve a better name, one that makes
sense within your own program that uses them?  Of course they do.

For example, Apple has a bunch of private-use glyphs they use all the time.
In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate
logo/glyph thingie of an apple with a bite taken out of it.  (Microsoft
also has a bunch of these.)  If you upgrade MacRoman to Unicode, you will
find that that 0xF0 maps to code point U+F8FF using the regular converter.

Now what are you supposed to do in your program when you want a named character
there?  You certainly do not want to make users put an opaque magic number
as a Unicode escape.  That is always really lame, because the whole reason 
we have \N{...} escapes is so we don't have to put mysterious unreadable magic
numbers in our code!!

So all you do is 

    use charnames ":alias" => {
        "APPLE LOGO" => 0xF8FF,
    };

and now you can use \N{APPLE LOGO} anywhere within that lexical scope.  The
compiler will dutifully resolve it to U+F8FF, since all name lookups happen
at compile-time.  And it cannot leak out of the scope.

I assert that this facility makes your program more readable, and its
absence  makes your program less readable.

Private use characters are important in Asian texts, but they are also
important for other things.  For example, Unicode intends to get around
to allocating Tengwar up in the SMP.  However, lots of stupid old code
can't use full Unicode, being constrained to UCS-2 only.  So many Tengwar
fonts start at a different base, and put it in the private use area instead
of the SMP.  Here are two constants:

    use constant {
        TB_CONSCRIPT_UNICODE_REGISTRY    => 0x00_E000,  # private use
        TB_UNICODE_CONSORTIUM            => 0x01_6080,  # where it will really go
    };

I have an entire Tengwar module that makes heavy use of named 
private-use characters.  All I do is this:

    use constant TENGWAR_BASE => TB_CONSCRIPT_UNICODE_REGISTRY;

    use charnames ":alias" => { 
      reverse (
        (TENGWAR_BASE + 0x00) => "TENGWAR LETTER TINCO",
        (TENGWAR_BASE + 0x01) => "TENGWAR LETTER PARMA",
        (TENGWAR_BASE + 0x02) => "TENGWAR LETTER CALMA",
        (TENGWAR_BASE + 0x03) => "TENGWAR LETTER QUESSE",
        (TENGWAR_BASE + 0x04) => "TENGWAR LETTER ANDO",
        ....
      )
    };

Now you can write \N{TENGWAR LETTER TINCO} etc.  See how slick that is?
Consider the alternative.  Magic numbers.  Worse, magic numbers with funny
calculations in them.  That is just so wrong that it completely justifies
letting people name things how they want to, so long as they don't make
other people do the same.  What people do in the privacy of their own
lexical scope is their own business.

It gets better.  Perl lets you define your character properties, too.
Therefore I can write things like \p{Is_Tengwar_Decimal} and such.
Right now I have these properties:

    In_Tengwar, Is_Tengwar
    In_Tengwar_Alphanumerics
    In_Tengwar_Consonants, In_Tengwar_Vowels, In_Tengwar_Alphabetics
    In_Tengwar_Numerals, Is_Tengwar_Decimal, Is_Tengwar_Duodecimal
    In_Tengwar_Punctuation
    In_Tengwar_Marks 

So I have code in my Tengwar module that does stuff like this, using
my own named characters (which again, are compile-time resolved and 
work only within this lexical scope):

     chr( $1 + ord("\N{TENGWAR DIGIT ZERO}") )

Not to mention this using my own properties:

    $TENGWAR_GRAPHEME_RX = qr/(?:(?=\p{In_Tengwar})\P{In_Tengwar_Marks}\p{In_Tengwar_Marks}*)|\p{In_Tengwar_Marks}/x;

Actually, I'm fibbing.  I *never* write regexes all on one line like
that: they are abhorrent to me.  The pattern really looks like this in
the code:

    $TENGWAR_GRAPHEME_RX = qr{
        (?:
            (?= \p{In_Tengwar} ) \P{In_Tengwar_Marks}   # Either one basechar...
            \p{In_Tengwar_Marks} *                      # ... plus 0 or more marks
        ) | 
            \p{In_Tengwar_Marks}                        # or else a naked unpaired mark.
    }x;
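For what it's worth, Python's re module supports the same whitespace-and-comments style via re.VERBOSE; here is a rough equivalent, with hypothetical PUA ranges standing in for the Tengwar properties (the lookahead from the Perl original is dropped for simplicity):

```python
import re

# Assumed, illustrative ranges: base letters and combining marks somewhere
# in the ConScript PUA block.  Real Tengwar fonts vary.
TENGWAR_GRAPHEME_RX = re.compile(r"""
    (?:
        [\uE000-\uE03F]        # either one base character...
        [\uE040-\uE05F] *      # ...plus 0 or more marks
    ) |
        [\uE040-\uE05F]        # or else a naked unpaired mark
""", re.VERBOSE)

assert TENGWAR_GRAPHEME_RX.match("\uE000\uE040")   # base + mark
assert TENGWAR_GRAPHEME_RX.match("\uE040")         # naked mark
```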

People who write patterns without whitespace for cognitive chunking (plus
comments for explanation) are wicked wicked wicked.  Frankly I'm surprised 
Python doesn't require it. :)/2

Anyway, do you see how much better that is than opaque unreadable magic
numbers?  Can you just imagine the sheer horror of writing that sort of
code without the ability to define your own named characters *and* your 
own character properties?  It's beautiful, simple, clean, and readable.
I'll even go so far as to call it intuitive.

No, I don't expect Python to do this sort of thing.  You don't have proper
scoping, so you can't ever do it cleanly the way Perl can.

I just wanted to give a concrete example where flexibility leads to a 
much more readable program than inflexibility ever can.  

--tom

    "We hates magic numberses.  We hates them forevers!"
        --Sméagol the Hacker
msg144758 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-02 06:46
> The problem with official names is that they have things in them that 
> you would not expect in names.  Do you really and truly mean to tell 
> me you think it is somehow **good** that people are forced to write
>    \N{LINE FEED (LF)}
> Rather than the more obvious pair of 
>    \N{LINE FEED}
>    \N{LF}
> ??

Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely because that's a Unicode 1 name, and nowadays these codepoints are simply marked as '<control>'.

> If so, then I don't understand that.  Nobody in their right 
> mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

They probably don't, but they just write \n anyway.  I don't think we need to support any of these aliases, especially if they are not defined in the Unicode standard.

I'm also not sure humans use \N{...}: you don't want to write
  'R\N{LATIN SMALL LETTER E WITH ACUTE}sum\N{LATIN SMALL LETTER E WITH ACUTE}'
and you would need to look up the exact name somewhere anyway before using it (unless you know them by heart).
If 'R\xe9sum\xe9' or 'R\u00e9sum\u00e9' are too obscure and/or magic, you can always print() them and get 'Résumé' (or just write 'Résumé' directly in the source).

> All of the standards documents *talk* about things like LRO and ZWNJ.
> I guess the standards aren't "readable" then, right? :)

Right, I had to read down till the table with the meanings before figuring out what they were (and I already forgot it).

> The most persuasive use-case for user-defined names is for private-use
> area code points.  These will never have an official name.  But it is
> just fine to use them.  Don't they deserve a better name, one that 
> makes sense within your own program that uses them?  Of course they do.
>
> For example, Apple has a bunch of private-use glyphs they use all the time.
> In the 8-bit MacRoman encoding, the byte 0xF0 represents the Apple corporate
> logo/glyph thingie of an apple with a bite taken out of it.  (Microsoft
> also has a bunch of these.)  If you upgrade MacRoman to Unicode, you will
> find that that 0xF0 maps to code point U+F8FF using the regular converter.
>
> Now what are you supposed to do in your program when you want a named character
> there?  You certainly do not want to make users put an opaque magic number
> as a Unicode escape.  That is always really lame, because the whole reason 
> we have \N{...} escapes is so we don't have to put mysterious unreadable magic
> numbers in our code!!
>
> So all you do is 
>    use charnames ":alias" => {
>        "APPLE LOGO" => 0xF8FF,
>    };
>
> and now you can use \N{APPLE LOGO} anywhere within that lexical scope.  The
> compiler will dutifully resolve it to U+F8FF, since all name lookups happen
> at compile-time.  And it cannot leak out of the scope.

This is actually a good use case for \N{..}.

One way to solve that problem is doing:
    apples = {
        'APPLE': '\uF8FF',
        'GREEN APPLE': '\U0001F34F',
        'RED APPLE': '\U0001F34E',
    }
and then:
    print('I like {GREEN APPLE} and {RED APPLE}, but not {APPLE}.'.format(**apples))

This requires the format call for each string and it's a workaround, but at
least it is readable (I hope you don't have too many apples in your strings).

I guess we could add some way to define a global list of names, and that would probably be enough for most applications.  Making it per-module would be more complicated and maybe not too elegant.

> People who write patterns without whitespace for cognitive chunking (plus
> comments for explanation) are wicked wicked wicked.  Frankly I'm surprised 
> Python doesn't require it. :)/2

I actually find those *less* readable.  If there's something fancy in the regex, a comment *before* it is welcomed, but having to read a regex divided on several lines and remove meaningless whitespace and redundant comments just makes the parsing more difficult for me.
msg144760 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-02 07:34
Attached a new patch with more tests and doc.
msg144779 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-02 18:41
Ezio Melotti <report@bugs.python.org> wrote
   on Sun, 02 Oct 2011 06:46:26 -0000: 

> Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely
> because that's a Unicode 1 name, and nowadays these codepoints are simply
> marked as '<control>'.

Yes, but there are a lot of them, 65 of them in fact.  I do not care to 
see people being forced to use literal control characters or inscrutable
magic numbers.  It really bothers me that you have all these defined code 
points with properties and all that have no name.   People do use these.
Some of them a lot.  I don't mind \n and such -- and in fact, prefer them 
even -- but I feel I should not have to scratch my head over characters \033, \0177,
and brethren.  The C0 and C1 standards are not just inventions, so we use 
them.  Far better that one should write \N{ESCAPE} for \033 or \N{DELETE} 
for \0177, don't you think?  

>> If so, then I don't understand that.  Nobody in their right
>> mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

> They probably don't, but they just write \n anyway.  I don't think we need
> to support any of these aliases, especially if they are not defined in the
> Unicode standard.

If you look at Names.txt, there are significant "aliases" there for 
the C0/C1 stuff.  My bottom line is that I don't like to be forced
to use magic numbers.  I prefer to name my abstractions.  It is more
readable and more maintainable that way.

There are still "holes" of course.  Code point 128 has no name even in C1.
But something is better than nothing.  Plus at least in Perl we *can* give
things names if we want, per the APPLE LOGO example for U+F8FF.  So nothing
needs to remain nameless.  Why, you can even name your Kanji if you want, 
using whatever Romanization you prefer.  I think the private-use case
example is really motivating, but I have no idea how to do this for Python
because there is no lexical scope.  I suppose you could attach it to the
module, but that still doesn't really work because of how things get evaluated.
With a Perl compile-time use, we can change the compiler's ideas about
things, like adding function prototypes and even extending the base types:

    % perl -Mbigrat -le 'print 1/2 + 2/3 * 4/5'
    31/30

    % perl -Mbignum -le 'print 21->is_odd'
    1
    % perl -Mbignum -le 'print 18->is_odd'
    0

    % perl -Mbignum -le 'print substr(2**5000, -3)'
    376
    % perl -Mbignum -le 'print substr(2**5000-1, -3)'
    375

    % perl -Mbignum -le 'print length(2**5000)'
    1506
    % perl -Mbignum -le 'print length(10**5000)'
    5001

    % perl -Mbignum -le 'print ref 10**5000'
    Math::BigInt
    % perl -Mbigrat -le 'print ref 1/3'
    Math::BigRat

I recognize that redefining what sort of object the compiler treats some 
of its constants as is never going to happen in Python, but we actually
did manage that with charnames without having to subclass our strings:
the hook for \N{...} doesn't require object games like the ones above.

But it still has to happen at compile time, of course, so I don't know
what you could do in Python.  Is there any way to change how the compiler
behaves even vaguely along these lines?

The run-time looks of Python's unicodedata.lookup (like Perl's
charnames::viacode) and unicodedata.name (like Perl's charnames::viacode
on the ord) could be managed with a hook, but the compile-time lookups
of \N{...} I don't see any way around.  But I don't know anything about
Python's internals, so don't even know what is or is not possible.

I do note that if you could extend \N{...} the way we do with charname
aliases for private-use characters, the user could load something that 
did the C0 and C1 control if they wanted to.  I just don't know how to 
do that early enough that the Python compiler would see it.  Does your import
happen at run-time or at compile-time?  This would be some sort of
compile-time binding of constants.
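One concrete data point on that question: in CPython the \N{...} escape is resolved while the literal is being compiled, which can be seen because the finished character, not the name, sits in the code object's constants (a small check):

```python
# \N{...} is resolved at compile time: the code object's constants already
# contain the character itself, never the name.
code = compile(r'"\N{BULLET}"', "<demo>", "eval")
assert "\u2022" in code.co_consts
assert eval(code) == "\u2022"
```

Imports, by contrast, execute at run time, which is why a charnames-style import hook can't feed the compiler the way Perl's lexically scoped `use` can.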

>> Frankly I'm surprised
>> Python doesn't require it. :)/2

> I actually find those *less* readable.  If there's something fancy in the
> regex, a comment *before* it is welcomed, but having to read a regex divided
> on several lines and remove meaningless whitespace and redundant comments
> just makes the parsing more difficult for me.

Really?  White space makes things harder to read?  I thought Pythonistas
believed the opposite of that.  Whitespace is very useful for cognitive
chunking: you see how things logically group together.

Inomorewantaregexwithoutwhitespacethananyothercodeortext. :)

I do grant you that chatty comments may be a separate matter.

White space in patterns is also good when you have successive patterns
across multiple lines that have parts that are the same and parts that
are different, as in most of these, which is from a function to render
an English headline/book/movie/etc title into its proper casing:

    # put into lowercase if on our stop list, else titlecase
    s/  ( \pL [\pL']* )  /$stoplist{$1} ? lc($1) : ucfirst(lc($1))/xge;

    # capitalize a title's last word and its first word
    s/^ ( \pL [\pL']* )  /\u\L$1/x;  
    s/  ( \pL [\pL']* ) $/\u\L$1/x;  

    # treat parenthesized portion as a complete title
    s/ \( ( \pL [\pL']* )    /(\u\L$1/x;
    s/    ( \pL [\pL']* ) \) /\u\L$1)/x;

    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) ( \pL [\pL']* ) /$1\u\L$2/x;

Now, that isn't good code for all *kinds* of reasons, but white space
is not one of them.  Perhaps what it is best at demonstrating is why
Python goes about this the right way and that Perl does not.  Oh drat,
I'm about to attach this to the wrong bug.  But it was the dumb code
above that made me think about the following.

By virtue of having a "titlecase each word's first letter and lowercase the
rest" function in Python, you can put the logic in just one place, and
therefore if a bug is found, you can fix all code all at once.

But because Perl has always made it easy to grab "words" (actually,
traditional programming language identifiers) and diddle their case, 
people write this all the time:

    s/(\w+)/\u\L$1/g;

all the time, and that has all kinds of problems.  If you prefer the
functional approach, that is really

    s/(\w+)/ucfirst(lc($1))/ge;

but that is still wrong.

 1. Too much code duplication.  Yes, it's nice to see \pL[\pL']* 
    stand out on each line, but shouldn't that be in a variable, like

        $word = qr/\pL[\pL']*/;

 2. What is a "word"?  That code above is better than \w because it
    avoids numbers and underscores; however, it still uses letters
    only, not letters and marks, let alone number letters like Roman
    numerals.

 3. I see the apostrophe there, which is a good start, but what if 
    it is a RIGHT SINGLE QUOTATION MARK, as in "Henry’s"?  And 
    what about hyphens?  Those should not trigger capitalization
    in normal titles.

 4. It turns out that all code that does a titlecase on the first 
    character of a string it has already converted to lowercase has
    irreversibly lost information.  Unicode casing is not reversible.
    Using \w for convenience, these can do different things:

        s/(\w+)/\u\L$1/g;
        s/(\w)(\w*)/\u$1\L$2/g;

    or in the functional approach, 

        s/(\w+)/ucfirst(lc($1))/ge;
        s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

    Now while it is true that only these code points alone do the wrong 
    thing using the naïve approach under Unicode 6.0:

     % unichars -gas 'ucfirst ne ucfirst lc'
      İ  U+00130 GC=Lu SC=Latin        LATIN CAPITAL LETTER I WITH DOT ABOVE
      ϴ  U+003F4 GC=Lu SC=Greek        GREEK CAPITAL THETA SYMBOL
      ẞ  U+01E9E GC=Lu SC=Latin        LATIN CAPITAL LETTER SHARP S
      Ω  U+02126 GC=Lu SC=Greek        OHM SIGN
      K  U+0212A GC=Lu SC=Latin        KELVIN SIGN
      Å  U+0212B GC=Lu SC=Latin        ANGSTROM SIGN

    But it is still the wrong thing, and we never know what might happen
    in the future.
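The same irreversibility is visible from Python's own mappings; a spot check with two of the code points above (KELVIN SIGN and ANGSTROM SIGN):

```python
# lower() folds the compatibility characters to plain letters, so a
# lowercase-then-titlecase round trip cannot restore them.
kelvin = "\u212a"     # KELVIN SIGN
angstrom = "\u212b"   # ANGSTROM SIGN

assert kelvin.title() == kelvin               # titlecases to itself
assert kelvin.lower().title() == "K"          # round trip yields U+004B instead
assert angstrom.lower().title() == "\u00c5"   # ...and U+00C5, not U+212B
```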

I think Python is being smarter than Perl in simply providing people with a
titlecase-each-word('s-first-letter-and-lowercase-the-rest)-in-the-whole-string
function, because this means people won't be tempted to write

    s/(\w+)/ucfirst(lc($1))/ge;

all the time.  However, as I have written elsewhere, I question a lot of
its underlying assumptions.  It's clear that a "word" must in general
include not just Letters but also Marks, or else you get different
results in NFD and NFC, and the Unicode Standard is very against that.

However, the problem is that what a word is cannot be considered
independent of language.  Words in English can contain apostrophes
(whether written as an APOSTROPHE or as RIGHT SINGLE QUOTATION MARK) 
and hyphens (written as HYPHEN-MINUS, HYPHEN, and rarely even EN DASH).

Each of these is a single word:

    ’tisn’t
    anti‐intellectual
    earth–moon

The capitalization there should be 

    ’Tisn’t
    Anti‐intellectual
    Earth–Moon

Notice how you can't do the same with the first apostrophe+t as with the
second in "’Tisn’t".  That is all challenging to code correctly (did you
notice the EN DASH?), especially when you find something like
red‐violet–colored.  You probably want that to be Red‐violet–colored,
because it is not an equal compound like earth–moon or yin–yang, which
in correct orthography take an EN DASH not a HYPHEN, just as occurs
when you hyphenate an already hyphenated word like red‐violet against
colored, as in a red‐violet–colored flower.  English titling rules 
only capitalize the first word in hyphenated words, which is why it's
Anti‐intellectual not Anti-Intellectual.  

And of course, you can't actually create something in true English
titlecase without having a stop list of articles and (short)
prepositions, and paying attention to whether it is the first or last word
in the title, and whether it follows a colon or semicolon.  Consider that
phrasal verbs are construed to take adverbs not prepositions, and so
"Bringing In the Sheaves" would be the correct capitalization of that song,
since "to bring in" is a phrasal verb, but "A Ringing in My Ears" would be
right for that.  It is remarkably complicated.  

With English titlecasing, you have to respect what your publishing house
considers a "short" preposition.  A common cut-off is that short preps
have 4 or fewer characters, but I have seen longer cutoffs.  Here is one
rather exhaustive list of English prepositions sorted by length:

 2: as  at  by  in  of  on  to  up  vs

 3: but  for  off  out  per  pro  qua  via

 4: amid atop down from into like near next onto over
    pace past plus sans save than till upon with

<cutoff point for O'Reilly Media>

 5: about above after among below circa given minus
    round since thru times under until worth

 6: across amidst around before behind beside beyond
    during except inside toward unlike versus within

 7: against barring beneath besides between betwixt
    despite failing outside through thruout towards without

10: throughout underneath

The thing is that prepositions become adverbs in phrasal verbs, like "to
go out" or "to come in", and all adverbs are capitalized.  So a complete
solution requires actual parsing of English!!!!  Just say no -- or stronger.

Merely getting something like this right:

    the lord of the rings: the fellowship of the ring  # Unicode lowercase
    THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING  # Unicode uppercase
    The Lord of the Rings: The Fellowship of the Ring  # English titlecase

is going to take a bit of work.  So is 

    the sad tale of king henry ⅷ   and caterina de aragón  # Unicode lowercase
    THE SAD TALE OF KING HENRY Ⅷ   AND CATERINA DE ARAGÓN  # Unicode uppercase
    The Sad Tale of King Henry Ⅷ   and Caterina de Aragón  # English titlecase

(and that must give the same answer in NFC vs NFD, of course.)

Plus what to do with something like num2ascii is ill-defined in English,
because having digits in the middle of a word is a very new phenomenon.
Yes, Y2K gets caps, but that is for another reason.  There is no agreement
on what one should do with num2ascii or people42see.  A function name
shouldn't be capitalized at all of course.

And that is just English.  Other languages have completely different rules.
For example, per Wikipedia's entry on the colon:

    In Finnish and Swedish, the colon can appear inside words in a
    manner similar to the English apostrophe, between a word (or
    abbreviation, especially an acronym) and its grammatical (mostly
    genitive) suffixes. In Swedish, it also occurs in names, for example
    Antonia Ax:son Johnson (Ax:son for Axelson). In Finnish it is used
    in loanwords and abbreviations; e.g., USA:han for the illative case
    of "USA". For loanwords ending orthographically in a consonant but
    phonetically in a vowel, the apostrophe is used instead: e.g. show'n
    for the genitive case of the English loan "show" or Versailles'n for
    the French place name Versailles.

Isn't that tricky!  I guess that you would have to treat punctuation
that has a word character immediately following it (and immediately 
preceding it) as being part of the word, and that it doesn't signal
that a change in case is merited.

I'm really not sure.  It is not obvious what the right thing to do is here.

I do believe that Python's titlecase function can and should be fixed to
work correctly with Unicode.  There really is no excuse for turning Aragón
into AragóN, for example, or not doing the right thing with ⅷ and Ⅷ.

I fear the only thing you can do with the confusion of Unicode titlecase
and English titlecase is to explain that properly rendering English titles
and headlines is a much more complicated job which you will not even
attempt.  (And shouldn't.  English titlecase is clearly too specialized for a
general function.)

However, I'm still bothered by things with apostrophes though.

    can't 
    isn't 
    wouldn't've
    Bill's
    'tisn't

since I can't countenance the obviously wrong:

    Can'T 
    Isn'T 
    Wouldn'T'Ve
    Bill'S
    'Tisn'T

with the last the hardest to get right.  I do have code that correctly
handles English words and code that correctly handles English titles,
but it is much trickier than the titlecase() function.

And Swedes might be upset seeing Antonia Ax:Son Johnson instead 
of Antonia Ax:son Johnson.

Maybe we should just go back to the Pythonic equivalent of 

    s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

where \w is specifically per tr18's Annex C, and give up on punctuation
altogether, with a footnoted caveat or something.  I wouldn't complain
about that.  The rest is just too, too hard.  Wouldn't you agree?

Thank you very much for all your hard work -- and patience with me.

--tom
msg144783 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-10-02 21:56
> Really?  White space makes things harder to read?  I thought Pythonistas
> believed the opposite of that.

I was surprised at that too ;-). One person's opinion in a specific 
context. Don't generalize.

> English titling rules
> only capitalize the first word in hyphenated words, which is why it's
> Anti‐intellectual not Anti-Intellectual.

Except that I can imagine someone using the latter as a noun to make the 
work more officious or something. There are no official English titling 
rules and as you noted, publishers vary. I agree that str.title should 
do something sensible based on Unicode, with the improvements you mentioned.
msg144802 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-03 02:25
>> Really?  White space makes things harder to read?  I thought Pythonistas
>> believed the opposite of that.

> I was surprised at that too ;-). One person's opinion in a specific 
> context. Don't generalize.

The example I initially showed probably wasn't the best for that.
Mostly I was trying to demonstrate how useful it is to have user-defined
properties is all.  But I have not asked for that (I have asked for properties,
though).

>> English titling rules
>> only capitalize the first word in hyphenated words, which is why it's
>> Anti‐intellectual not Anti-Intellectual.

> Except that I can imagine someone using the latter as a noun to make the 
> work more officious or something. 

If Good-Looking looks more officious than Good-looking, I bet GOOD-LOOKING
is better still. :)

> There are no official English titling rules and as you noted,
> publishers vary. 

If there aren't any rules, then how come all book and movie titles always
look the same?  :)  I don't think anyone would argue with these two:

 1. Capitalize the first word, the last word, and the word right after a
    colon (or semicolon).

 2. Capitalize all intervening words except for articles (a, an, the)
    and short prepositions.

Those are the basic rules.  The main problem is that "short" isn't
well defined--and indeed, there are even places where "preposition" 
isn't well defined either.  
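For what it's worth, those two rules can be sketched in a few lines; the small-word list below is my own illustrative guess, not anything from a style guide:

```python
# Articles, short conjunctions, and short prepositions left lowercase when
# they fall in the middle of a title.  This list is illustrative, not from
# any particular style guide -- "short" is exactly the ill-defined part.
SMALL_WORDS = {'a', 'an', 'the',
               'and', 'but', 'or', 'nor',
               'at', 'by', 'for', 'in', 'of', 'on', 'to', 'up'}

def headline_case(title):
    words = title.split()
    out = []
    for i, w in enumerate(words):
        first = (i == 0)
        last = (i == len(words) - 1)
        after_colon = i > 0 and words[i - 1].endswith((':', ';'))
        # Rule 1: always capitalize first, last, and post-colon words.
        # Rule 2: capitalize everything else except the small words.
        if first or last or after_colon or w.lower() not in SMALL_WORDS:
            out.append(w[:1].upper() + w[1:].lower())
        else:
            out.append(w.lower())
    return ' '.join(out)
```

This ignores hyphenation, acronyms, and everything else publishers argue about; it is only the two uncontroversial rules.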

English has sentence casing (capitalize only the first word) and headline
casing (capitalize most words).  It's problematic that computer people call
capitalizing each word titlecasing, since in English this is never correct.

    http://www.chicagomanualofstyle.org/CMS_FAQ/CapitalizationTitles/CapitalizationTitles23.html

     Although Chicago style lowercases prepositions (but see CMOS 8.157
     for exceptions), some style guides uppercase them. Ask your editor
     for a style guide.

I myself usually fall back to the Chicago Manual of Style or the Oxford
Guide to Style.  I don't think I do anything that neither of them says to do.

But I completely agree that this should *not* be in the titlecase()
function.  I think the docs for the function might perhaps say something
about how it does not mean correct English headline case when it says
titlecase, but that's largely just nitpicking.

> I agree that str.title should do something sensible
> based on Unicode, with the improvements you mentioned.

One of the goals of Unicode is that casing not be language dependent.  And
they almost got there, too.  The Turkic I is the most notable exception.

Did you know there is a problem with all the case stuff in Python?  It 
was clearly put in before they had realized that they needed to have
things other than Lu/Lt/Ll have casing properties.  That's why there is
a difference between GC=Ll and the Lowercase property.

    str.islower()

    Return true if all cased characters in the string are lowercase and
    there is at least one cased character, false otherwise. Cased
    characters are those with general category property being one of
    “Lu”, “Ll”, or “Lt” and lowercase characters are those with general
    category property “Ll”.

    http://docs.python.org/release/3.2/library/stdtypes.html

That really isn't right.  A cased character is one with the Unicode "Cased"
property, and a lowercase character is one with the Unicode "Lowercase"
property.  The General Category is actually immaterial here.
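That divergence is easy to demonstrate with a character that is a number by General Category but cased by property, such as SMALL ROMAN NUMERAL EIGHT:

```python
import unicodedata

viii = '\u2177'  # SMALL ROMAN NUMERAL EIGHT

# Its General Category is Nl (letter-like number), not Ll ...
print(unicodedata.category(viii))   # Nl
# ... yet it is cased: it carries the Lowercase property and has an
# uppercase mapping to ROMAN NUMERAL EIGHT, U+2167.
print(viii.upper() == '\u2167')     # True
```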

I've spent all bloody day trying to model Python's islower, isupper, and istitle
functions, but I get all kinds of errors, both in the definitions and in the
models of the definitions.    Under both 2.7 and 3.2, I get all these bugs:

    ᶜ not islower() but has at least one cased character with all cased characters lowercase!
    ᴰ not islower() but has at least one cased character with all cased characters lowercase!
    ⓚ not islower() but has at least one cased character with all cased characters lowercase!
    ͅ not islower() but has at least one cased character with all cased characters lowercase!
    Ⅷ not isupper() but has at least one cased character with all cased characters uppercase!
    Ⅷ not istitle() but should be
    ⅷ not islower() but has at least one cased character with all cased characters lowercase!
    2ⁿᵈ not islower() but has at least one cased character with all cased characters lowercase!
    2ᴺᴰ not islower() but has at least one cased character with all cased characters lowercase!
    Ὰͅ isupper() but fails to have at least one cased character with all cased characters uppercase!
    ThisIsInTitleCaseYouKnow not istitle() but should be
    Mᶜ isupper() but fails to have at least one cased character with all cased characters uppercase!
    ᶜM isupper() but fails to have at least one cased character with all cased characters uppercase!
    ᶜM istitle() but should not be
    MᶜKINLEY isupper() but fails to have at least one cased character with all cased characters uppercase!

I really don't understand.    BTW, I feel that MᶜKinley is titlecase in that lowercase
always follows uppercase and uppercase never follows itself.  And Python agrees with me.
But that same definition should vet ThisIsInTitleCaseYouKnow, but Python disagrees.

I really don't understand any of these functions.  I'm very sad.  I think they are
wrong, but maybe I am.  It is extremely confusing.

Shall I file a separate bug report?

--tom

from __future__ import unicode_literals
from __future__ import print_function

import regex

VERBOSE = 0 

data = [

  # first test the problem cases just one at a time
    "\N{MODIFIER LETTER SMALL C}",
    "\N{SUPERSCRIPT LATIN SMALL LETTER N}",
    "\N{MODIFIER LETTER CAPITAL D}", 
    "\N{CIRCLED LATIN SMALL LETTER K}",
    "\N{COMBINING GREEK YPOGEGRAMMENI}",
    "\N{ROMAN NUMERAL EIGHT}",
    "\N{SMALL ROMAN NUMERAL EIGHT}",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}",
    "\N{LATIN LETTER SMALL CAPITAL R}",

  # test superscripts
    "2\N{SUPERSCRIPT LATIN SMALL LETTER N}\N{MODIFIER LETTER SMALL D}", 
    "2\N{MODIFIER LETTER CAPITAL N}\N{MODIFIER LETTER CAPITAL D}",
    "2\N{FEMININE ORDINAL INDICATOR}", # as in "segunda"

  # test romans
    "ROMAN NUMERAL EIGHT IS \N{ROMAN NUMERAL EIGHT}",
    "roman numeral eight is \N{SMALL ROMAN NUMERAL EIGHT}",

  # test small caps
    "\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL E}",

  # test cased combining mark (this is in titlecase)
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}",
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}",

  # test cased symbols
    "circle  \N{CIRCLED LATIN SMALL LETTER K}",
    "CIRCLE  \N{CIRCLED LATIN CAPITAL LETTER K}",

  # test titlecased code point 3-way
    "\N{LATIN CAPITAL LETTER DZ}",
    "\N{LATIN CAPITAL LETTER DZ}UR",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}ur",
    "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}",
    "\N{LATIN SMALL LETTER DZ}ur",
    "\N{LATIN SMALL LETTER DZ}",

  # test titlecase

    "FBI", "F B I", "F.B.I",
    "HP Company", "H.P. Company",
    "ThisIsInTitleCaseYouKnow",

    "M\N{MODIFIER LETTER SMALL C}",
    "\N{MODIFIER LETTER SMALL C}M",

    "M\N{MODIFIER LETTER SMALL C}Kinley",  # titlecase
    "M\N{MODIFIER LETTER SMALL C}KINLEY",  # uppercase
    "m\N{MODIFIER LETTER SMALL C}kinley",  # lowercase

    # Return true if the string is a titlecased string and there
    # is at least one character, for example uppercase characters may
    # only follow uncased characters and lowercase characters only
    # cased ones. Return false otherwise.

    # Return true if all cased characters in the string are lowercase and there is at least one cased character,
]

for s in data:

  # "Return true if all cased characters in the string are lowercase 
  #  and there is at least one cased character"

    if s.islower():
        if not (        regex.search(r'\p{cased}', s) 
                and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
            print(s+" islower() but fails to have at least one cased character with all cased characters lowercase!")
    else:
        if (        regex.search(r'\p{cased}', s) 
            and not regex.search(r'(?=\p{cased})\P{LOWERCASE}', s)):
            print(s+" not islower() but has at least one cased character with all cased characters lowercase!")

  # "Return true if all cased characters in the string are uppercase 
  #  and there is at least one cased character"

    if s.isupper():
        if not (        regex.search(r'\p{cased}', s) 
                and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
            print(s+" isupper() but fails to have at least one cased character with all cased characters uppercase!")
    else:
        if (        regex.search(r'\p{cased}', s) 
            and not regex.search(r'(?=\p{cased})\P{UPPERCASE}', s)):
            print(s+" not isupper() but has at least one cased character with all cased characters uppercase!")

  # "Return true if the string is a titlecased string and there is at
  # least one character, for example uppercase characters may only
  # follow uncased characters and lowercase characters only cased ones."

    has_it  = s.istitle()
    want_it1 = (  
          # at least one title/uppercase
                regex.search(r'[\p{Lt}\p{uppercase}]', s) 
                  and not 
          # plus no title/uppercase follows cased character
               regex.search(r'(?<=\p{cased})[\p{Lt}\p{uppercase}]', s)
                  and not 
          # plus no lowercase follows uncased character
               regex.search(r'(?<=\P{CASED})\p{lowercase}', s)
              )

    want_it  = regex.search(r'''(?x) 
        ^ 
            (?:
                \P{CASED} * 
                [\p{Lt}\p{uppercase}] 
                (?! [\p{Lt}\p{uppercase}] )
                    \p{lowercase} *
            ) +
            \P{CASED} * 
        $
    ''', s)

    if VERBOSE:
        if has_it and want_it:
            print( s + " istitle() and should be (OK)")
        if not has_it and not want_it:
            print( s + " not istitle() and should not be (OK)")

    if has_it and not want_it:
        print( s + " istitle() but should not be")

    if want_it and not has_it:
        print( s + " not istitle() but should be")
msg144803 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-03 04:15
> But it still has to happen at compile time, of course, so I don't know
> what you could do in Python.  Is there any way to change how the compiler
> behaves even vaguely along these lines?

I think things like "from __future__ import ..." do something similar, but I'm not sure it will work in this case (also because you will have to provide the list of aliases somehow).

>> Really?  White space makes things harder to read?  I thought Pythonistas
>> believed the opposite of that.  Whitespace is very useful for cognitive
>> chunking: you see how things logically group together.

> I was surprised at that too ;-). One person's opinion in a specific 
> context. Don't generalize.

Also don't generalize my opinion regarding *where* whitespace makes things less readable: I was just talking about regexes.
What I was trying to say here is best summarized by a quote from Paul Graham's article "Succinctness is Power":
"""
If you're used to reading novels and newspaper articles, your first experience of reading a math paper can be dismaying. It could take half an hour to read a single page. And yet, I am pretty sure that the notation is not the problem, even though it may feel like it is. The math paper is hard to read because the ideas are hard. If you expressed the same ideas in prose (as mathematicians had to do before they evolved succinct notations), they wouldn't be any easier to read, because the paper would grow to the size of a book.
"""
Try replacing
  s/novels and newspaper articles|prose/Python code/g
  s/single page/single regex/
  s/math paper/regex/g.

To provide an example, I find:

# define a function to capitalize s
def my_capitalize(s):
    """This function capitalizes the argument s and returns it"""
    the_first_letter = s[0]  # 0 means the first char
    the_rest_of_s = s[1:]  # 1: means from the second till the end
    the_first_letter_uppercased = the_first_letter.upper()  # upper makes the string uppercase
    the_rest_of_s_lowercased = the_rest_of_s.lower()  # lower makes the string lowercase
    s_capitalized = the_first_letter_uppercased + the_rest_of_s_lowercased  # + concatenates
    return s_capitalized

less readable than:

def my_capitalize(s):
    return s[0].upper() + s[1:].lower()

You could argue that the first is much more explicit and in a way clearer, but overall I think you agree with me that it is less readable.  Also, this clearly depends on how well you know the notation you are reading: if you don't know it very well, you might still prefer the commented/verbose/extended/redundant version.  Another important thing to mention is that the notation of regular expressions is fairly simple (especially if you leave out look-arounds and the Unicode-related things that are not used too often), but having a similarly succinct notation for a whole programming language (like Perl) might not work as well.  (I'm not picking on Perl here; as you said, you can write readable programs if you don't abuse the notation, and the succinctness offered by the language has some advantages, but with Python we prefer more readable code, even if we have to be a little more verbose.)  Another example of a trade-off between verbosity and succinctness is the new string formatting mini-language.

> That really isn't right.  A cased character is one with the Unicode "Cased"
> property, and a lowercase character is one wiht the Unicode "Lowercase"
> property.  The General Category is actually immaterial here.

You might want to take a look and possibly add a comment on #12204 about this.

> I've spent all bloody day trying to model Python's islower, isupper, and istitle
> functions, but I get all kinds of errors, both in the definitions and in the
> models of the definitions.

If by "model" you mean "trying to figure out how they work", it's probably easier to look at the implementation (I assume you know enough C to understand what they do).  You can find the code for str.istitle() at http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358 and the actual implementation of some macros like Py_UNICODE_ISTITLE at http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

> I really don't understand any of these functions.  I'm very sad.  I think they are
> wrong, but maybe I am.  It is extremely confusing.

> Shall I file a separate bug report?

If after reading the code and/or the documentation you still think they are broken and/or that they can be improved, then you can open another issue.

BTW, instead of writing custom scripts to test things, it might be better to use unittest (see http://docs.python.org/py3k/library/unittest.html#basic-example), or even better write a patch for Lib/test/test_unicode.py.
Using unittest has the advantage that it is then easy to integrate those tests within our test suite; on the other hand, as soon as one assertion fails, the failure is reported without evaluating the following assertions in the method.
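For instance, a minimal unittest module exercising the documented str.islower() definition might look like this (a sketch, not actual test-suite code):

```python
import unittest

class IsLowerTests(unittest.TestCase):
    """Checks str.islower() against its documented definition: all cased
    characters are lowercase and there is at least one cased character."""

    def test_documented_definition(self):
        self.assertTrue('abc'.islower())
        self.assertTrue("can't".islower())  # uncased chars are ignored
        self.assertFalse('123'.islower())   # no cased character at all

# run with: python -m unittest <thisfile>
```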
msg144825 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-03 17:29
>> There are no official English titling rules and as you noted,
>> publishers vary. 
> 
> If there aren't any rules, then how come all book and movie titles always
> look the same?  :)

Can we please leave the English language out of this issue?
Else I will ask that Python uses German text-processing rules,
just so that this gets fewer comments :-)

As a point of order, please all try to stick at the issue at hand.
Linguistics discussions or general Unicode discussion have better
places than this bug tracker. I just had to stop reading Tom's
comments as too verbose (which is more difficult since it's in
a foreign language).
msg144827 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-03 17:44
The patch is pretty much complete, it just needs a review (I left some comments on the review page).
One thing that can be added is some compression for the names of the named sequences.  I'm not sure I can reuse the same compression used for the other names easily.  Does the size of the db really matter?  Are the new names using too much extra space?
msg144832 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-03 18:44
The patch needs to take versioning into account. It seems that NamedSequences were added in 4.1, and NameAliases in 5.0. So for the moment, when using 3.2 (i.e. when self is not NULL), it is fine to look up neither. Please put an assertion into makeunicodedata that this needs to be reviewed when an old version other than 3.2 needs to be supported.

The size of the DB does matter; there are frequent complaints about it. The named sequences take 20kB on my system; not sure whether that's too much. If you want to reduce the size (and also speedup lookup), you could use private-use characters, like so:
- add the named sequences as PUA characters to the names table of makeunicodename, in the range(P, P+418) (for some P).
- in lookup, check whether the _getcode result is in range(P,P+418). If so, subtract P from the code and use this as an index into _namedsequences.
- add a _getcode wrapper that filters out all private use characters, for regular lookup.
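That scheme might be sketched in pure Python roughly as follows (the base P, the table contents, and the helper names are illustrative, not the actual unicodedata internals):

```python
P = 0xF0000  # base of a Plane 15 private-use range (illustrative choice)

# Named sequences stored once, indexed by (pua_code - P).
_namedsequences = [
    '\u0100\u0300',  # LATIN CAPITAL LETTER A WITH MACRON AND GRAVE
    '\u0072\u0303',  # LATIN SMALL LETTER R WITH TILDE
]

# Stand-in for the generated name table: names map to codes, and the
# named sequences get PUA codes P, P+1, ... instead of real codepoints.
_name_to_code = {
    'LATIN CAPITAL LETTER A WITH MACRON AND GRAVE': P,
    'LATIN SMALL LETTER R WITH TILDE': P + 1,
    'LATIN SMALL LETTER A': 0x0061,
}

def lookup(name):
    code = _name_to_code[name]              # stands in for _getcode
    if P <= code < P + len(_namedsequences):
        # A PUA code: subtract P and expand into the stored sequence.
        return _namedsequences[code - P]
    return chr(code)
```

A real implementation would also filter the PUA range back out of regular \N{...} lookups, as the last bullet says.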
msg144836 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-03 18:57
Ezio Melotti <report@bugs.python.org> wrote
   on Mon, 03 Oct 2011 04:15:51 -0000: 

>> But it still has to happen at compile time, of course, so I don't know
>> what you could do in Python.  Is there any way to change how the compiler
>> behaves even vaguely along these lines?

> I think things like "from __future__ import ..." do something similar,
> but I'm not sure it will work in this case (also because you will have
> to provide the list of aliases somehow).

Ah yes, that's right.  Hm.  I bet then it *would* be possible, just perhaps
a bit of a run-around to get there.  Not a high priority, but interesting.

> less readable than:
> 
> def my_capitalize(s):
>    return s[0].upper() + s[1:].lower()

> You could argue that the first is much more explicit and in a way
> clearer, but overall I think you agree with me that is less readable.

Certainly.

It's a bit like the way bug rate per lines of code is invariant across
programming languages.  When you have more opcodes, it gets harder to
understand because there are more interactions and things to remember.

>> That really isn't right.  A cased character is one with the Unicode "Cased"
>> property, and a lowercase character is one wiht the Unicode "Lowercase"
>> property.  The General Category is actually immaterial here.

> You might want to take a look and possibly add a comment on #12204 about this.

>> I've spent all bloody day trying to model Python's islower, isupper, and istitle
>> functions, but I get all kinds of errors, both in the definitions and in the
>> models of the definitions.

> If by "model" you mean "trying to figure out how they work", it's
> probably easier to look at the implementation (I assume you know
> enough C to understand what they do).  You can find the code for
> str.istitle() at http://hg.python.org/cpython/file/default/Objects/unicodeobject.c#l10358
> and the actual implementation of some macros like Py_UNICODE_ISTITLE at
> http://hg.python.org/cpython/file/default/Objects/unicodectype.c.

Thanks, that helps immensely.  I'm completely fluent in C.  I've gone 
and built a tags file of your whole v3.2 source tree to help me navigate.

The main underlying problem is that the internal macros are defined in a
way that made sense a long time ago, but no longer do ever since (for
example) the Unicode lowercase property stopped being synonymous with
GC=Ll and started also including all code points with the
Other_Lowercase property as well.

The originating culprit is Tools/unicode/makeunicodedata.py.
It builds your tables only using UnicodeData.txt, which is
not enough.  For example:

    if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
	flags |= ALPHA_MASK
    if category == "Ll":
	flags |= LOWER_MASK
    if 'Line_Break' in properties or bidirectional == "B":
	flags |= LINEBREAK_MASK
	linebreaks.append(char)
    if category == "Zs" or bidirectional in ("WS", "B", "S"):
	flags |= SPACE_MASK
	spaces.append(char)
    if category == "Lt":
	flags |= TITLE_MASK
    if category == "Lu":
	flags |= UPPER_MASK

It needs to use DerivedCoreProperties.txt to figure out whether
something is Other_Uppercase, Other_Lowercase, etc. In particular:

    Alphabetic := Lu+Ll+Lt+Lm+Lo + Nl + Other_Alphabetic
    Lowercase  := Ll + Other_Lowercase
    Uppercase  := Lu + Other_Uppercase

This affects a lot of things, but you should be able to just fix it
in Tools/unicode/makeunicodedata.py and have all of them start
working correctly.
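The kind of parsing makeunicodedata.py would need can be sketched on a few sample lines in the DerivedCoreProperties.txt format (the function name is illustrative):

```python
def parse_derived(lines):
    """Collect {property: set of codepoints} from lines in the
    DerivedCoreProperties.txt format."""
    props = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()   # strip trailing comments
        if not line:
            continue
        field, prop = (part.strip() for part in line.split(';'))
        if '..' in field:                      # a codepoint range
            first, last = (int(cp, 16) for cp in field.split('..'))
        else:                                  # a single codepoint
            first = last = int(field, 16)
        props.setdefault(prop, set()).update(range(first, last + 1))
    return props

sample = [
    "0061..007A    ; Lowercase  # L&  [26] LATIN SMALL LETTER A..Z",
    "00AA          ; Lowercase  # Lo       FEMININE ORDINAL INDICATOR",
    "0041..005A    ; Uppercase  # L&  [26] LATIN CAPITAL LETTER A..Z",
]
props = parse_derived(sample)
```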

You will probably also want to add 

    Py_UCS4 _PyUnicode_IsWord(Py_UCS4 ch)

that uses the UTS#18 Annex C definition, so that you catch marks, too.
That definition is:

    Word := Alphabetic + Mc+Me+Mn + Nd + Pc

where Alphabetic is defined above to include Nl and Other_Alphabetic.
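A General-Category-only approximation of that definition looks like this; note it still misses the Other_Alphabetic code points, which is exactly the gap described above:

```python
import unicodedata

def is_word_char(ch):
    # Word := Alphabetic + Mc+Me+Mn + Nd + Pc (UTS#18 Annex C).
    # Approximation: General Category alone misses the Other_Alphabetic
    # code points, which require DerivedCoreProperties.txt.
    cat = unicodedata.category(ch)
    return cat in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl',  # Alphabetic (approx.)
                   'Mc', 'Me', 'Mn',                    # marks
                   'Nd',                                # decimal digits
                   'Pc')                                # connector punctuation
```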

Somewhat related is stuff like this:

    typedef struct {
	const Py_UCS4 upper;
	const Py_UCS4 lower;
	const Py_UCS4 title;
	const unsigned char decimal;
	const unsigned char digit;
	const unsigned short flags;
    } _PyUnicode_TypeRecord;

There are two different bugs here.  First, you are missing 

	const Py_UCS4 fold;

which is another field from UnicodeData.txt, one that is critical 
for doing case-insensitive matches correctly.

Second, there's also the problem that Py_UCS4 is an int.  That means you
are stuck with just the character-based simple versions of upper-, title-,
lower-, and foldcase.  You need to have fields for the full mappings, which
are now strings (well, int arrays) not single ints.  I'll use ??? for the
int-array type that I don't know:

	const ??? upper_full;
	const ??? lower_full;
	const ??? title_full;
	const ??? fold_full;

You will also need to extend the API from just

    Py_UCS4 _PyUnicode_ToUppercase(Py_UCS4 ch)

to something like

    ??? _PyUnicode_ToUppercase_Full(Py_UCS4 ch)

I don't know what the ??? return type is there, but it's whatever the
upper_full field in _PyUnicode_TypeRecord would be.

I know that Matthew Barnett has had to cover a bunch of these for his regex
module, including generating his own tables.  It might be possible to
piggy-back on that effort; certainly it would be desirable to try.
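On a Python that has adopted the full mappings at the str level (3.3 and later did), the simple-versus-full difference is easy to demonstrate:

```python
# Full (one-to-many) case mappings versus the simple one-to-one mappings
# described above; assumes Python 3.3+, where str adopted full mappings.

assert 'ß'.upper() == 'SS'               # full uppercase expands to two chars
assert 'ﬁle'.upper() == 'FILE'           # the fi ligature expands too
assert 'Straße'.casefold() == 'strasse'  # full case folding (str.casefold)
```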

> I really don't understand any of these functions.  I'm very sad.  I think they are
> wrong, but maybe I am.  It is extremely confusing.

>> Shall I file a separate bug report?

> If after reading the code and/or the documentation you still think
> they are broken and/or that they can be improved, then you can open
> another issue.

I hadn't actually *looked* at capitalize yet, because I stumbled over
these errors in the way-underlying code that necessarily supports it.
The errors in the definitions explain a lot of what I was seeing.

Ok, more bugs.  Consider this:

    static 
    int fixcapitalize(PyUnicodeObject *self)
    {
	Py_ssize_t len = self->length;
	Py_UNICODE *s = self->str;
	int status = 0;

	if (len == 0)
	    return 0;
	if (Py_UNICODE_ISLOWER(*s)) {
	    *s = Py_UNICODE_TOUPPER(*s);
	    status = 1;
	}
	s++;
	while (--len > 0) {
	    if (Py_UNICODE_ISUPPER(*s)) {
		*s = Py_UNICODE_TOLOWER(*s);
		status = 1;
	    }
	    s++;
	}
	return status;
    }

There are several bugs there.  First, you have to use the TITLECASE if there
is one, and only use the uppercase if there is no titlecase.  Uppercase
is wrong.

Second, you cannot decide to do the case change only if it starts out as a
certain case.  You have to do it unconditionally, especially since your
tests for whether something is upper or lower are wrong.  For example,
Roman numerals, the iota subscript, the circled letters, and a few other
things all are case-changing but are not themselves Letters in the
GC=Ll/Lu/Lt sense.  There are also cased letters in the GC=Lm
category, which you miss.  Unicode has properties like Cased that you
should be using to determine whether something is cased.  It also has
properties like Changes_When_Uppercased (aka CWU) that tell you whether
something will change.  For example, most of the small capitals are cased
code points that are considered lowercase and which do not change when
uppercased.  However, LATIN LETTER SMALL CAPITAL R (which is a lowercase
code point) actually does have an uppercase mapping.  Strange but true.
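A couple of those cased non-Letters, checkable with plain unicodedata and the str methods:

```python
import unicodedata

# CIRCLED LATIN SMALL LETTER K is a symbol (GC=So), yet it uppercases
# to CIRCLED LATIN CAPITAL LETTER K, U+24C0.
k = '\u24da'
assert unicodedata.category(k) == 'So'
assert k.upper() == '\u24c0'

# And the small capital that really does change when uppercased:
# LATIN LETTER SMALL CAPITAL R (U+0280) maps to LATIN LETTER YR (U+01A6).
assert '\u0280'.upper() == '\u01a6'
```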

Does this help at all?  I have to go to a meeting now.

--tom
msg144839 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-03 19:19
> The main underlying problem is that the internal macros are defined in a
> way that made sense a long time ago, but no longer do ever since (for
> example) the Unicode lowercase property stopped being synonymous with
> GC=Ll and started also including all code points with the
> Other_Lowercase property as well.

Tom: PLEASE focus on one issue at a time. This is about formal
aliases and named sequences, NOT about upper and lower case.
If you want to have a discussion about upper and lower case,
please open a separate issue. There I would explain why I
think your reasoning is flawed (i.e. just because your interpretation
of Unicode differs from Python's implementation doesn't already
make Python's implementation incorrect - just different).
msg145254 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-09 13:20
Here is a new patch that stores the names of aliases and named sequences in the Private Use Area.

To summarize a bit, this is what we want:
        | 6.0.0 | 3.2.0 |
--------+-------+-------+
\N{...} |   A   |   -   |
.name   |   -   |   -   |
.lookup |  A,NS |   -   |

I.e., \N{...} should only support aliases, unicodedata.lookup should support aliases and named sequences, unicodedata.name doesn't support either, and when 3.2.0 is used nothing is supported.

The function calls involved for these 3 functions are:

\N{...} and .lookup:
  _getcode
    _cmpname
      _getucname
    _check_alias

.name:
  _getucname

My patch adds an extra arg to _getcode and _getucname (I hope that's fine -- or are they public?).

_getcode is called by \N{...} and .lookup; both support aliases, so _getcode now resolves aliases by default.  Since only .lookup wants named sequences, _getcode now accepts an extra 'with_named_seq' arg and looks up named sequences only when its value is 1.  .lookup passes 1, gets the codepoint, and converts it to a sequence.  \N{...} passes 0 and doesn't get named sequences.

_getucname is called by .name and indirectly (through _cmpname) by .lookup and \N{...}.  Since _getcode takes care of deciding who gets aliases and sequences, _getucname now accepts an extra 'with_alias_and_seq' arg and looks up aliases and named sequences only when its value is 1.  _cmpname passes 1, gets aliases and named sequences and then lets _getcode decide what to do with them.  .name passes 0 and doesn't get aliases and named sequences.

All this happens on 6.0.0 only, when self != NULL (i.e. we are using 3.2.0) named sequences and aliases are ignored.

The patch doesn't include the changes to unicodename_db.h -- run makeunicodedata.py to get them.
I also added more tests to make sure that the names added in the PUA don't leak, and that ucd_3_2_0 is not affected.
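On a build with the patch applied (Python 3.3 and later), the table above can be checked directly; the example names come from NameAliases.txt and NamedSequences.txt:

```python
import unicodedata

# Aliases work in both \N{...} and lookup():
assert '\N{LATIN CAPITAL LETTER GHA}' == '\u01a2'
assert unicodedata.lookup('LATIN CAPITAL LETTER GHA') == '\u01a2'

# Named sequences work only in lookup(), and expand to several characters
# (using one in \N{...} is a SyntaxError):
assert unicodedata.lookup('LATIN SMALL LETTER R WITH TILDE') == '\u0072\u0303'

# name() keeps returning the original, immutable name, not the alias:
assert unicodedata.name('\u01a2') == 'LATIN CAPITAL LETTER OI'
```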
msg145263 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-09 15:21
Ezio Melotti <report@bugs.python.org> wrote
   on Sun, 09 Oct 2011 13:21:00 -0000: 

> Here is a new patch that stores the names of aliases and named
> sequences in the Private Use Area.

Looks good!  Thanks!

--tom
msg145327 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-11 02:15
(I had to re-upload the patch a couple of times to get the review button to work.  Apparently if there are some conflicts Rietveld fails to apply the patch, whereas hg is able to merge files without problems here.  Sorry for the noise.)
msg145401 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-12 16:00
If you don't use git-style diffs, Rietveld will much better accommodate patches that don't apply to tip cleanly. Unfortunately, hg git-style diffs don't indicate the base revision, so Rietveld guesses that the base line is tip, and then fails if it doesn't apply exactly.
msg146034 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-20 17:08
If the latest patch is fine I'll commit it shortly.
msg146036 - (view) Author: Tom Christiansen (tchrist) Date: 2011-10-20 17:11
Yes, it looks good.  Thank you very much.

-tom
msg146075 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-10-21 09:55
LGTM
msg146114 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-21 18:57
New changeset a985d733b3a3 by Ezio Melotti in branch 'default':
#12753: Add support for Unicode name aliases and named sequences.
http://hg.python.org/cpython/rev/a985d733b3a3
msg146129 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-21 20:22
I committed the patch and the buildbots seem happy.  Thanks for the report and the feedback!

Tom, about the problems you mentioned in msg144836, can you report it in a new issue or, if there are already issues about them, add a message there?
msg146135 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-21 21:24
New changeset 329b96fe4472 by Ezio Melotti in branch 'default':
#12753: fix compilation on Windows.
http://hg.python.org/cpython/rev/329b96fe4472
msg191737 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 22:32
> about the problems you mentioned in msg144836, can you report
> it in a new issue or, if there are already issues about them,
> add a message there?

I believe that would be #4610.
History
Date User Action Args
2022-04-11 14:57:20  admin  set  github: 56962
2013-06-23 22:32:03  belopolsky  set  superseder: Unicode case mappings are incorrect
                                      messages: + msg191737
                                      nosy: + belopolsky
2011-10-21 21:24:30  python-dev  set  messages: + msg146135
2011-10-21 20:22:13  ezio.melotti  set  status: open -> closed
                                        resolution: fixed
                                        messages: + msg146129
                                        stage: commit review -> resolved
2011-10-21 18:57:56  python-dev  set  nosy: + python-dev
                                      messages: + msg146114
2011-10-21 09:55:47  loewis  set  messages: + msg146075
2011-10-20 19:49:47  flox  set  nosy: + flox
2011-10-20 17:11:06  tchrist  set  messages: + msg146036
2011-10-20 17:08:37  ezio.melotti  set  messages: + msg146034
                                        stage: patch review -> commit review
2011-10-12 16:00:59  loewis  set  messages: + msg145401
2011-10-11 02:15:22  ezio.melotti  set  messages: + msg145327
2011-10-11 02:11:03  ezio.melotti  set  files: + issue12753-4.diff
2011-10-11 02:10:05  ezio.melotti  set  files: - issue12753-4.diff
2011-10-10 09:51:18  ezio.melotti  set  files: + issue12753-4.diff
2011-10-10 09:50:57  ezio.melotti  set  files: - issue12753-4.diff
2011-10-09 15:21:02  tchrist  set  messages: + msg145263
                                   title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
2011-10-09 13:20:58  ezio.melotti  set  files: + issue12753-4.diff
                                        messages: + msg145254
2011-10-03 19:19:39  loewis  set  messages: + msg144839
                                  title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
2011-10-03 18:57:20  tchrist  set  messages: + msg144836
                                   title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
2011-10-03 18:44:25  loewis  set  messages: + msg144832
2011-10-03 17:44:43  ezio.melotti  set  keywords: + needs review
                                        messages: + msg144827
2011-10-03 17:29:46  loewis  set  messages: + msg144825
                                  title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
2011-10-03 04:15:50  ezio.melotti  set  messages: + msg144803
2011-10-03 02:25:23  tchrist  set  messages: + msg144802
2011-10-02 21:56:58  terry.reedy  set  messages: + msg144783
2011-10-02 18:41:11  tchrist  set  messages: + msg144779
2011-10-02 07:34:21  ezio.melotti  set  files: + issue12753-3.diff
                                        messages: + msg144760
2011-10-02 06:46:26  ezio.melotti  set  messages: + msg144758
2011-10-02 05:33:41  tchrist  set  messages: + msg144757
                                   title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
2011-10-01 15:17:52  loewis  set  messages: + msg144739
2011-10-01 15:04:39  loewis  set  messages: + msg144738
                                  title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace -> \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
2011-10-01 02:15:42  ezio.melotti  set  files: + issue12753-2.diff
                                        messages: + msg144716
2011-09-30 22:07:26  tchrist  set  messages: + msg144708
2011-09-30 20:30:42  ezio.melotti  set  messages: + msg144703
2011-09-30 10:00:49  loewis  set  messages: + msg144681
2011-09-30 09:01:20  ezio.melotti  set  assignee: ezio.melotti
                                        stage: needs patch -> patch review
2011-09-30 08:59:10  ezio.melotti  set  files: + issue12753.diff
                                        nosy: + lemburg, loewis
                                        messages: + msg144679
                                        keywords: + patch
2011-08-26 21:26:44  gvanrossum  set  nosy: + gvanrossum
                                      messages: + msg143043
2011-08-19 23:57:41  tchrist  set  messages: + msg142508
2011-08-19 23:36:45  mrabarnett  set  messages: + msg142507
2011-08-19 23:26:18  tchrist  set  messages: + msg142506
2011-08-19 22:50:57  terry.reedy  set  nosy: + terry.reedy
                                       messages: + msg142502
                                       stage: test needed -> needs patch
2011-08-15 19:51:39  tchrist  set  files: + nametests.py
                                   messages: + msg142145
2011-08-15 17:55:23  ezio.melotti  set  nosy: + ezio.melotti
                                        stage: test needed
                                        components: + Unicode
                                        versions: - Python 3.1, Python 2.7, Python 3.2
2011-08-15 17:48:33  tchrist  create