classification
Title: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation
Type: enhancement Stage:
Components: Interpreter Core, Unicode Versions: Python 3.6, Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Jean-Michel.Fauth, Jim.Jewett, belopolsky, benjamin.peterson, ezio.melotti, mrabarnett, pitrou, python-dev, tchrist, Андрей Баксаляр
Priority: normal Keywords: patch

Created on 2011-08-11 21:39 by tchrist, last changed 2016-03-11 07:39 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description
mux.python tchrist, 2011-08-11 21:39 demo program showing all casemaps and casefolds for sample tricky dataset
casing-tests.py tchrist, 2011-08-26 23:55 test suite for casemapping functions, case checking functions, and casefolding of patterns, both simple and full
casing-results.txt ezio.melotti, 2011-08-28 05:54 results on 3.2/3.3 narrow/wide
full-casemapping.patch benjamin.peterson, 2012-01-08 03:54
full-casemapping.patch benjamin.peterson, 2012-01-10 03:49
full-casemapping.patch benjamin.peterson, 2012-01-11 03:37
full-casemapping.patch benjamin.peterson, 2012-01-11 20:20
pythonbug.png Андрей Баксаляр, 2016-03-10 20:21
Messages (27)
msg141928 - Author: Tom Christiansen (tchrist) Date: 2011-08-11 21:39
Python's casemapping functions only use what Unicode calls simple casemaps. These are only appropriate for functions that operate on single characters alone, not for those that operate on strings. The reason for this is that you get much better results with full casemapping. Java, Ruby, and Perl all do full casemapping for their equivalent functions that do string mapping, and Python should, too.

I include a program that has a bunch of mappings and foldings both simple and full.  Yes, it was machine-generated.
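The difference can be seen in a small sketch, assuming a Python whose str methods use full casemapping (3.3 or later, where the fix for this issue landed; earlier versions use simple maps and give different results). The sample strings are illustrative and are not taken from the attached program.

```python
# Full casemapping can change the length of a string; simple casemapping cannot.
samples = ["stra\u00dfe", "\ufb01sh"]   # sharp s, and the fi ligature

for s in samples:
    up = s.upper()
    print(f"{s!r} -> {up!r} (length {len(s)} -> {len(up)})")
```

With simple casemaps both strings would keep their length, because the sharp s and the ligature have no single-code-point uppercase in UnicodeData.txt.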
msg143036 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-26 21:11
I presume this applies to builtin str methods like .lower(), right?  I think it is a good thing to do for Python 3.3.

We'd need to define what should happen in edge cases, e.g. when (against all odds) a string happens to contain a lone surrogate or some other code point or sequence of code points that the Unicode standard considers illegal.  I think it should not fail but just leave those code points alone.

Does this require us to import more data files from the Unicode standard?  By itself that doesn't scare me.

Would this also affect .islower() and friends?
msg143051 - Author: Tom Christiansen (tchrist) Date: 2011-08-26 23:36
Guido van Rossum <report@bugs.python.org> wrote
   on Fri, 26 Aug 2011 21:11:24 -0000: 

> Guido van Rossum <guido@python.org> added the comment:

> I presume this applies to builtin str methods like .lower(), right?  I
> think it is a good thing to do for Python 3.3.

Yes, the full casemaps are for upper, title, and lowercase.  There is 
also a full casefold and turkic case fold (which is full), but you
don't have a casefold function so I guess that doesn't matter.

> We'd need to define what should happen in edge cases, e.g. when
> (against all odds) a string happens to contain a lone surrogate or
> some other code point or sequence of code points that the Unicode
> standard considers illegal.  I think it should not fail but just leave
> those code points alone.

Well, it's a funny thing.  There are properties given for all
Unicode code points, even noncharacter code points.  This
includes the casing properties, oddly enough.

From UnicodeData.txt, which has a few surrogate entries; notice
no casing is given:

    D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
    DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
    DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
    DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
    DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
    DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;

And in SpecialCasing.txt, which does not have surrogates but does have
a default clause:

    # This file is a supplement to the UnicodeData file.
    # It contains additional information about the casing of Unicode characters.
    # (For compatibility, the UnicodeData.txt file only contains case mappings for
    # characters where they are 1-1, and independent of context and language.
    # For more information, see the discussion of Case Mappings in the Unicode Standard.
    #
    # All code points not listed in this file that do not have a simple case mappings
    # in UnicodeData.txt map to themselves.

And in CaseFolding.txt, which also does not have surrogates but again does 
have a default clause:

    # The data supports both implementations that require simple case foldings
    # (where string lengths don't change), and implementations that allow full case folding
    # (where string lengths may grow). Note that where they can be supported, the
    # full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
    #
    # All code points not listed in this file map to themselves.

Taken all together, it follows that the surrogates have case{map,fold}s
back to themselves, since they have no case{map,fold}s listed.

It's ok to have arbitrary code points in memory, including surrogates and
the 66 noncharacters.  It just isn't legal to have them in a UTF stream
for "open interchange", whatever that means.  
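A quick way to check both points in Python 3 is sketched below. It assumes only the built-in str methods and the strict UTF-8 codec; the variable names are illustrative.

```python
# A lone surrogate has no case mapping, so casing leaves it untouched.
s = "a\ud800b"
upper = s.upper()
print([hex(ord(c)) for c in upper])

# The same string is fine in memory, but the strict UTF-8 codec
# refuses to encode it for interchange.
try:
    s.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False
print("encodable as UTF-8:", encoded_ok)
```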

> Does this require us to import more data files from the Unicode
> standard?  By itself that doesn't scare me.

One way or the other, yes, notably the SpecialCasing file for
casemapping and the CaseFolding file for casefolding (which you
should do anyway to fix re.I).  But you can and should process the
new files into some tighter format optimized for your own lookups.
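As a rough sketch of that preprocessing step, the following reads lines in the SpecialCasing.txt layout into a dictionary. The function names are hypothetical, the sample is a handful of entries rather than the whole file, and this is not meant to describe how CPython stores its tables. Conditional entries (such as Final_Sigma) are simply skipped here.

```python
# Sample lines in the SpecialCasing.txt layout:
#   code; lower; title; upper; [condition_list;] # comment
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""

def to_str(field):
    """Turn a field such as '0053 0073' into the string 'Ss'."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def parse_special_casing(text):
    """Return {char: (lower, title, upper)} for unconditional entries only."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields and fields[-1] == "":
            fields.pop()                       # empty piece after the last ';'
        if len(fields) != 4:
            continue                           # context- or language-dependent
        code, lower, title, upper = fields
        table[chr(int(code, 16))] = (to_str(lower), to_str(title), to_str(upper))
    return table

table = parse_special_casing(SAMPLE)
for ch, maps in table.items():
    print(f"U+{ord(ch):04X}", maps)
```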

Oddly, Java doesn't provide String methods that do full casing on
titlecase, even though they do so on lowercase and uppercase.  On
titlecase they only expose the simple casemaps via the Character class,
which are the ones from UnicodeData.  They recognize that this is a flaw, 
but it was too late to fix it for Java 7.

> Would this also affect .islower() and friends?

Well, it shouldn't, but .islower() and friends are already mistaken.
They seem to be checking for GC=Ll and such, but they need to be
checking the Unicode binary property Lowercase and such.  Watch:

    test 37 for string Ⅷ
    wanted <ⅷ> to be lowercase of <Ⅷ> but python disagrees
    wanted <Ⅷ> to be titlecase of <Ⅷ> but python disagrees
    wanted <Ⅷ> to be uppercase of <Ⅷ> but python disagrees
    test 37 failed 3 subtests

    test 39 for string Ⓚ
    wanted <ⓚ> to be lowercase of <Ⓚ> but python disagrees
    wanted <Ⓚ> to be titlecase of <Ⓚ> but python disagrees
    wanted <Ⓚ> to be uppercase of <Ⓚ> but python disagrees
    test 39 failed 3 subtests

That's because the Roman numerals are GC=Nl but still have
case and change case.  Similarly for the circled letters which
are GC=So but have case and change case.  Plus there's U+0345,
the iota subscript, which is GC=Mn but has case and changes case.
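The mismatch between general category and casing behaviour can be inspected with the standard unicodedata module. This is only a sketch; the three sample code points are the Roman numeral and circled letter from the tests above plus the iota subscript.

```python
import unicodedata

# None of these is a letter by general category, yet each one changes case.
samples = ["\u2167", "\u24c0", "\u0345"]

for ch in samples:
    gc = unicodedata.category(ch)
    changes = ch.lower() != ch or ch.upper() != ch
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: GC={gc}, changes case: {changes}")
```

A check based only on GC=Lu/Ll/Lt cannot classify any of the three as cased.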

I don't remember whether I've sent in my full test suite or not.  
If I haven't yet, I should attach it to the bug report.

--tom
msg143052 - Author: Tom Christiansen (tchrist) Date: 2011-08-26 23:55
Here’s my casing test suite; I thought I sent it in but the mux file here isn’t the full thing.

 It does several things, including letting you run it with regex vs re.  It also checks for the islower, etc functions. It has both simple and full (and turkic) maps and folds in it, but is configured to only check the simple versions for now.  The islower and isupper etc functions seem to be checking the wrong Unicode property.

Yes, it has my quaint Unixisms in it, because it needs to run with UTF-8 output, or you can't read what's going on.
msg143072 - Author: Tom Christiansen (tchrist) Date: 2011-08-27 14:48
Guido van Rossum <report@bugs.python.org> wrote
   on Fri, 26 Aug 2011 21:11:24 -0000: 

> Would this also affect .islower() and friends?

SHORT VERSION:  (7 lines)

    I don't believe so, but the relationship between lower() and islower()
    is not as clear to me as I would have thought, and more importantly,
    the code and the documentation for Python's islower() etc currently seem
    to disagree.  For future releases, I recommend fixing the code, but if
    compatibility is an issue, then perhaps for previous releases still in
    maintenance mode fixing only the documentation would possibly be good
    enough--your call.

=======================================================================

MEDIUM VERSION: (87 lines)

I was initially confused with Python's islower() family because of the way
they are defined to operate on full strings.  They don't check that
everything is lowercase even though they say they do.

 <  http://docs.python.org/py3k/library/stdtypes.html#sequence-types-str-bytes-bytearray-list-tuple-range

    str.lower()

        Return a copy of the string with all the cased characters [4]
        converted to lowercase.

    str.islower()

        Return true if all cased characters [4] in the string are lowercase 
        and there is at least one cased character, false otherwise.

    [4] (1, 2, 3, 4) Cased characters are those with general category
        property being one of “Lu” (Letter, uppercase), “Ll” (Letter,
        lowercase), or “Lt” (Letter, titlecase).

This is strange in several ways.  Of lesser importance is that
strings can be considered lowercase even if they don't match

    ^\p{lowercase}+$

Another is that the result of calling str.lower() may not be .islower().
I'm not sure what these are particularly for, since I myself would just use
a regex to get finer-grained control.  (I suppose it's because re doesn't
give access to the needed Unicode properties that this approach never
gained any traction in the Python community.)

However, the worst of this is that the documentation defines both cased
characters and lowercase characters *differently* from how Unicode
defines those very same terms.  This was quite confusing.

Unicode distinguishes Cased code points from Cased_*Letter* code points.
Python is using the Cased_Letter property but calling it Cased.  Cased is 
a proper superset of Cased_Letter.  From the DerivedCoreProperties file in
the Unicode Character Database:

    # Derived Property:   Cased (Cased)
    #  As defined by Unicode Standard Definition D120
    #  C has the Lowercase or Uppercase property or has a General_Category value of Titlecase_Letter.

In the same way, the Lowercase and Uppercase properties are not the same as
the Lowercase_*Letter* and Uppercase_*Letter* properties.  Rather, the former
are respectively proper supersets of the latter.  

    # Derived Property: Lowercase
    #  Generated from: Ll + Other_Lowercase

    [...]

    # Derived Property: Uppercase
    #  Generated from: Lu + Other_Uppercase

In all these, you almost always want the superset versions not the
restricted subset versions you are using.  If it were in the regex engine,
the user could select either.
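The subset/superset relationship can be sketched as follows. The Other_Lowercase set here is a small illustrative subset chosen by hand (the real list is in PropList.txt), and both function names are hypothetical.

```python
import unicodedata

# Illustrative subset of Other_Lowercase; the full list lives in PropList.txt.
OTHER_LOWERCASE_SUBSET = (
    {0x0345}                        # combining Greek ypogegrammeni
    | set(range(0x2170, 0x2180))    # small Roman numerals
    | set(range(0x24D0, 0x24EA))    # circled Latin small letters
)

def is_lowercase_letter(ch):
    """The restricted subset: General_Category is Ll."""
    return unicodedata.category(ch) == "Ll"

def is_lowercase(ch):
    """The derived property: Ll + Other_Lowercase (subset only, see above)."""
    return is_lowercase_letter(ch) or ord(ch) in OTHER_LOWERCASE_SUBSET

for ch in ["x", "\u2172", "\u24d0", "X"]:
    print(f"U+{ord(ch):04X}", is_lowercase_letter(ch), is_lowercase(ch))
```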

Java used to miss all these, too.  But in 1.7, they updated their character
methods to use the properties that they'd all along said they were using:

  < http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)

    public static boolean isLowerCase(char ch)
    Determines if the specified character is a lowercase character. 

     A character is lowercase if its general category type, provided by
     Character.getType(ch), is LOWERCASE_LETTER, or it has contributory
->   property Other_Lowercase as defined by the Unicode Standard.

    Note: This method cannot handle supplementary characters.  To
          support all Unicode characters, including supplementary
          characters, use the isLowerCase(int) method.

(And yes, that's where Java uses "character" to mean "code unit" 
 not "code point", alas.  No wonder people get confused.)

I'm pretty sure that Python needs to either update its documentation to
match its code, update its code to match its documentation, or both.  Java
chose to update the code to match the documentation, and this is the course
I would recommend if at all possible.  If you say you are checking for
cased code points, then you should use the Unicode definition of cased code
points not your own, and if you say you are checking for lowercase code
points, then you should use the Unicode definition not your own.  Both of
these require access to contributory properties from the UCD and not 
just general categories alone.

--tom

=======================================================================

LONG VERSION: (222 lines)

Essential tools I use for inspecting Unicode code points and their 
properties include

    http://training.perl.com/scripts/unichars
    http://training.perl.com/scripts/uniprops
    http://training.perl.com/scripts/uninames

And over the course of the day, these get used a fair bit, too:

    http://training.perl.com/scripts/uniquote
    http://training.perl.com/scripts/ucsort
    http://training.perl.com/scripts/unifmt

Here for example are (some of) the *non*-Letter code points that
are nonetheless considered lowercase or uppercase because
they have the Other_{Lower,Upper}case properties:

    % unichars -gs '\PL' '[\p{upper}\p{lower}]'
     ○ͅ  U+00345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI
     Ⅰ  U+02160 GC=Nl SC=Latin        ROMAN NUMERAL ONE
     Ⅱ  U+02161 GC=Nl SC=Latin        ROMAN NUMERAL TWO
     Ⅲ  U+02162 GC=Nl SC=Latin        ROMAN NUMERAL THREE
     [...]
     ⅰ  U+02170 GC=Nl SC=Latin        SMALL ROMAN NUMERAL ONE
     ⅱ  U+02171 GC=Nl SC=Latin        SMALL ROMAN NUMERAL TWO
     ⅲ  U+02172 GC=Nl SC=Latin        SMALL ROMAN NUMERAL THREE
     [...]
     Ⓐ  U+024B6 GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER A
     Ⓑ  U+024B7 GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER B
     Ⓒ  U+024B8 GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER C
     [...]
     ⓐ  U+024D0 GC=So SC=Common       CIRCLED LATIN SMALL LETTER A
     ⓑ  U+024D1 GC=So SC=Common       CIRCLED LATIN SMALL LETTER B
     ⓒ  U+024D2 GC=So SC=Common       CIRCLED LATIN SMALL LETTER C
     [...]

And here are (some of) the letters that are cased but which are
not Lu, Lt, or Ll (they're all Lm, in fact):

    % unichars -gs '\p{Lm}' '\p{cased}'  | ucsort
     ᴭ  U+1D2D GC=Lm SC=Latin        MODIFIER LETTER CAPITAL AE
     ᴬ  U+1D2C GC=Lm SC=Latin        MODIFIER LETTER CAPITAL A
     ᵃ  U+1D43 GC=Lm SC=Latin        MODIFIER LETTER SMALL A
     ₐ  U+2090 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER A
     ᵅ  U+1D45 GC=Lm SC=Latin        MODIFIER LETTER SMALL ALPHA
     ᴮ  U+1D2E GC=Lm SC=Latin        MODIFIER LETTER CAPITAL B
     ᵇ  U+1D47 GC=Lm SC=Latin        MODIFIER LETTER SMALL B
     [...]
     ʷ  U+02B7 GC=Lm SC=Latin        MODIFIER LETTER SMALL W
     ᵂ  U+1D42 GC=Lm SC=Latin        MODIFIER LETTER CAPITAL W
     ˣ  U+02E3 GC=Lm SC=Latin        MODIFIER LETTER SMALL X
     ₓ  U+2093 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER X
     ʸ  U+02B8 GC=Lm SC=Latin        MODIFIER LETTER SMALL Y
     ᶻ  U+1DBB GC=Lm SC=Latin        MODIFIER LETTER SMALL Z
     ᵝ  U+1D5D GC=Lm SC=Greek        MODIFIER LETTER SMALL BETA
     ᵞ  U+1D5E GC=Lm SC=Greek        MODIFIER LETTER SMALL GREEK GAMMA
     ᵟ  U+1D5F GC=Lm SC=Greek        MODIFIER LETTER SMALL DELTA
     ᶿ  U+1DBF GC=Lm SC=Greek        MODIFIER LETTER SMALL THETA
     ͺ  U+037A GC=Lm SC=Greek        GREEK YPOGEGRAMMENI
     ᵠ  U+1D60 GC=Lm SC=Greek        MODIFIER LETTER SMALL GREEK PHI
     ᵡ  U+1D61 GC=Lm SC=Greek        MODIFIER LETTER SMALL CHI
     ᵸ  U+1D78 GC=Lm SC=Cyrillic     MODIFIER LETTER CYRILLIC EN

Perversely, here are some of the modifier letters which are *not* cased:

    % unichars -gs '\p{Lm}' '\P{CASED}'  | ucsort
     ₕ  U+2095 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER H
     ʻ  U+02BB GC=Lm SC=Common       MODIFIER LETTER TURNED COMMA
     ʽ  U+02BD GC=Lm SC=Common       MODIFIER LETTER REVERSED COMMA
     ⁱ  U+2071 GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER I
     ₖ  U+2096 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER K
     ₗ  U+2097 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER L
     ₘ  U+2098 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER M
     ⁿ  U+207F GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER N
     ₙ  U+2099 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER N
     ₚ  U+209A GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER P
     ₛ  U+209B GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER S
     ₜ  U+209C GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER T
     ʹ  U+02B9 GC=Lm SC=Common       MODIFIER LETTER PRIME
     ʺ  U+02BA GC=Lm SC=Common       MODIFIER LETTER DOUBLE PRIME
     ˆ  U+02C6 GC=Lm SC=Common       MODIFIER LETTER CIRCUMFLEX ACCENT
     ˇ  U+02C7 GC=Lm SC=Common       CARON
     ˈ  U+02C8 GC=Lm SC=Common       MODIFIER LETTER VERTICAL LINE
     ˉ  U+02C9 GC=Lm SC=Common       MODIFIER LETTER MACRON
     ˊ  U+02CA GC=Lm SC=Common       MODIFIER LETTER ACUTE ACCENT
     ˋ  U+02CB GC=Lm SC=Common       MODIFIER LETTER GRAVE ACCENT
     ˌ  U+02CC GC=Lm SC=Common       MODIFIER LETTER LOW VERTICAL LINE

(Interesting how the commas sort as breath marks next to H.)

I cannot for the life of me figure out why Unicode deems these lowercase:

     ᵃ  U+1D43 GC=Lm SC=Latin        MODIFIER LETTER SMALL A
     ₐ  U+2090 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER A
     ᵅ  U+1D45 GC=Lm SC=Latin        MODIFIER LETTER SMALL ALPHA

yet these *not* to be cased:

     ⁱ  U+2071 GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER I
     ₘ  U+2098 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER M
     ⁿ  U+207F GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER N

All I know is what the tables tell me.

Here's a fair assortment of cased and noncased, case-changing and
non-casing code points.  The variation in binary properties is pretty wide.

    $ uniprops x 00aa 1d4e 2071 2172 df 262 1d401 1d42d 2117 24c5

    U+0078 ‹x› \N{LATIN SMALL LETTER X}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic ASCII Assigned Basic_Latin Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase PerlWord POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print POSIX_Word Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+00AA ‹ª› \N{FEMININE ORDINAL INDICATOR}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+1D4E <ᵎ> \N{MODIFIER LETTER SMALL TURNED I}
        \w \pL \p{L_} \p{Lm}
        All Any Alnum Alpha Alphabetic Assigned InPhoneticExtensions Case_Ignorable CI Cased Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Lower Lowercase Phonetic_Extensions Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+2071 <ⁱ> \N{SUPERSCRIPT LATIN SMALL LETTER I}
        \w \pL \p{L_} \p{Lm}
        All Any Alnum Alpha Alphabetic Assigned InSuperscriptsAndSubscripts Case_Ignorable CI Changes_When_NFKC_Casefolded CWKCF L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Print SD Soft_Dotted Superscripts_And_Subscripts Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word

    U+2172 <ⅲ> \N{SMALL ROMAN NUMERAL THREE}
        \w \pN \p{Nl}
        All Any Alnum Alpha Alphabetic Assigned InNumberForms Cased Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Nl N Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Latin Latn Letter_Number Lower Lowercase Number Number_Forms Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+00DF <ß> \N{LATIN SMALL LETTER SHARP S}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+0262 <ɢ> \N{LATIN LETTER SMALL CAPITAL G}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+1D401 <𝐁> \N{MATHEMATICAL BOLD CAPITAL B}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
        All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Math Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word

    U+1D42D <𝐭> \N{MATHEMATICAL BOLD SMALL T}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Math Mathematical_Alphanumeric_Symbols Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+2117 ‹℗› \N{SOUND RECORDING COPYRIGHT}
        \pS \p{So}
        All Any Assigned InLetterlikeSymbols Common Zyyy So S Gr_Base Grapheme_Base Graph GrBase Letterlike_Symbols Other_Symbol Print Symbol X_POSIX_Graph X_POSIX_Print

    U+24C5 ‹Ⓟ› \N{CIRCLED LATIN CAPITAL LETTER P}
        \w \pS \p{So}
        All Any Alnum Alpha Alphabetic Assigned InEnclosedAlphanumerics Cased Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Common Zyyy Enclosed_Alphanumerics So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol Print Symbol Upper Uppercase Word X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word

Unicode also has a Case_Ignorable (CI) character property, which I haven't 
thought much about but which might be useful.  

    http://www.unicode.org/reports/tr44/#Case_Ignorable

        Characters which are ignored for casing purposes. For more information,
        see D121 in Section 3.13, Default Case Algorithms in [Unicode].

        Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet

I'm not sure if you should think about these when doing your isupper()
test; maybe you should.  That way you wouldn't fail just because you had
a code point that was technically lowercase, like if someone used
"LEONARD MᶜCOY".  That funny ᶜ wouldn't count as a spoiler then, so that
"Leonard MᶜCoy".upper().isupper() could be true, as the ᶜ wouldn't
change but wouldn't count, either.  I haven't thought about this enough
though.  I'm not used to full string-based isupper() functions, so my
instincts may be wrong here.
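One way to experiment with that idea is sketched below. It approximates Case_Ignorable by general category alone, leaving out the Word_Break=MidLetter and Word_Break=MidNumLet parts of the derivation, and it assumes the ᶜ above is U+1D9C MODIFIER LETTER SMALL C. The function name is hypothetical.

```python
import unicodedata

# Approximation of Case_Ignorable using general categories only.
CI_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

def isupper_ignoring_ci(s):
    """Hypothetical isupper() variant that skips case-ignorable code points."""
    kept = "".join(c for c in s if unicodedata.category(c) not in CI_CATEGORIES)
    return kept.isupper()

name = "Leonard M\u1d9cCoy"
shouted = name.upper()          # the modifier letter has no uppercase mapping
print(shouted, isupper_ignoring_ci(shouted))
```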

The only code point that is both CWCM and also CI is the notorious

     ○ͅ  U+00345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI

Subscripts, superscripts, modifier letters, small capitals, and mathematical
letters *tend* to be cased code points that do not change when casemapped
or casefolded, although there are exceptions.

    % uninames small capital '\b\R\b'
     ʀ  0280        LATIN LETTER SMALL CAPITAL R
            * voiced uvular trill
            * Germanic, Old Norse
            * uppercase is 01A6
     ʁ  0281        LATIN LETTER SMALL CAPITAL INVERTED R
            * voiced uvular fricative or approximant
            x (modifier letter small capital inverted r - 02B6)
     ʶ  02B6        MODIFIER LETTER SMALL CAPITAL INVERTED R
            * preceding four used for r-coloring or r-offglides
            x (latin letter small capital inverted r - 0281)
            # <super> 0281
     ᴙ  1D19        LATIN LETTER SMALL CAPITAL REVERSED R
     ᴚ  1D1A        LATIN LETTER SMALL CAPITAL TURNED R
      ᷢ  1DE2       COMBINING LATIN LETTER SMALL CAPITAL R

   % uniprops 280 1a6
    U+0280 <ʀ> \N{LATIN LETTER SMALL CAPITAL R}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    U+01A6 <Ʀ> \N{LATIN LETTER YR}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
        All Any Alnum Alpha Alphabetic Assigned InLatinExtendedB Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_Extended_B Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word

That's right: the uppercase of LATIN LETTER SMALL CAPITAL R is LATIN LETTER
YR, and I don't know why.  No other small capital -- which are all considered
lowercase -- changes when casemapped.  Only this one alone.
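The pairing can be confirmed from Python with a minimal check, since it is an ordinary simple mapping in UnicodeData.txt:

```python
small_capital_r = "\u0280"   # LATIN LETTER SMALL CAPITAL R
latin_letter_yr = "\u01a6"   # LATIN LETTER YR

print(small_capital_r.upper() == latin_letter_yr)
print(latin_letter_yr.lower() == small_capital_r)
```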

Note that code points like U+00DF LATIN SMALL LETTER SHARP S
have these binary properties true because the normal/default sense of these
terms in Unicode is the full/string sense not the simple/character sense:

        Changes_When_Casefolded (CWCF) 
        Changes_When_Casemapped (CWCM)
        Changes_When_Titlecased (CWT) 
        Changes_When_Uppercased (CWU)

Those are true because the full uppercase map of "ß" is "SS" 
and the full casefold of "ß"  is "ss".
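For caseless matching, a minimal sketch using str.casefold() is shown below. That method was added in Python 3.3, after this message was written, so the example assumes 3.3 or later; the helper name is hypothetical and no normalization is applied.

```python
def caseless_equal(a, b):
    """Compare two strings using full case folding (no normalization)."""
    return a.casefold() == b.casefold()

print(repr("\u00df".casefold()))                     # full casefold of sharp s
print(caseless_equal("MASSE", "Ma\u00dfe"))          # full folding
print("MASSE".lower() == "Ma\u00dfe".lower())        # lowercasing alone
```

Comparing the lowercased forms is not enough here, because lowercasing leaves the sharp s as a single code point.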

--tom
msg143083 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-27 16:15
Thank you very much. We should fix the behavior in 3.3 for sure. I'm
thinking that we may be able to backport the behavior fix to 2.7 and
3.2 as well, since it just makes the behavior generally "better" (and
for most folks it won't matter anyway).

I'm not sure where the somewhat odd rules for .islower() come from, I
think in part from the desire to have "".islower() be False but "a
b".islower() to be True. Intuitively, this means that .islower() means
both "there is at least one lower case character" and "there are no
upper case characters", but not "all characters are lowercase". I
forget what we do w.r.t. titlecase, but the intuitive meaning should
not change. Although personally I don't have much of an intuition for
what titlecase means (and why it's important), perhaps because I'm not
familiar with any language where there is a third case for some
letters.
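The intuitive rule described above can be written down as a short sketch and compared with the built-in method. The helper name is hypothetical, and titlecase and other corner cases are deliberately ignored.

```python
def islower_as_described(s):
    """At least one lowercase character and no uppercase characters."""
    return any(c.islower() for c in s) and not any(c.isupper() for c in s)

for s in ["", "a b", "a B", "123"]:
    print(repr(s), s.islower(), islower_as_described(s))
```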
msg143084 - Author: Tom Christiansen (tchrist) Date: 2011-08-27 19:17
Guido van Rossum <report@bugs.python.org> wrote
   on Sat, 27 Aug 2011 16:15:33 -0000: 

> Although personally I don't have much of an intuition for what
> titlecase means (and why it's important), perhaps because I'm not
> familiar with any language where there is a third case for some
> letters.

Neither am I.  Even in "old-style" English with ae and oe, one wrote
ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
*Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

    (BTW, in French you really shouldn't split up the œ into oe, 
          nor in Old English, Old Norse, or Icelandic the æ in ae;
          although in contemporary English, it's usually ok to do so.)

I believe that almost but not quite all the sticky situations with
Unicode casing involve compatibility characters for clean round-trips
with legacy encodings.  Exceptions include the German sharp s (both of 
them now) and the two Greek lowercase sigmas.  Thank goodness we don't
use the long s in English anymore.  What is it with s's, anyway? :)

Most of the titlecase letters are in Greek, with a few in Armenian.
I know no Armenian (their letters all look the same to me :), and the
folks I talked to about the Greek are skeptical.  The German sharp s is
a red herring, because you can never have it as the first letter
(although it needn't be the last, as in Rußland).  That's no more
possible than having the old legacy ff ligature appear at the beginning
of an English word.

In any event, there are only 129 total code points that are
"problematic" in terms of their case, where by problematic 
I mean one or more of:

   --- titlecase differs from uppercase
   --- foldcase  differs from lowercase
   --- any of fold/lower/title/uppercase yields more than one code point
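The three criteria can be turned into a predicate, sketched here under the assumption of Python 3.3 or later (full casemaps and str.casefold()). The function name is hypothetical. The count it produces depends on the Unicode version bundled with the interpreter, so it need not match the figure given in this message.

```python
def is_case_problematic(ch):
    """Apply the three criteria above to a single code point."""
    fc, lc, tc, uc = ch.casefold(), ch.lower(), ch.title(), ch.upper()
    return tc != uc or fc != lc or any(len(m) > 1 for m in (fc, lc, tc, uc))

# Scan the BMP, skipping surrogates.
hits = [chr(cp) for cp in range(0x10000)
        if not 0xD800 <= cp <= 0xDFFF and is_case_problematic(chr(cp))]
print(len(hits))
```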

Of all these, it's the (now two!) sharp s's and the Turkic i that are the most annoying.
It's really quite a lot of trouble to go through for so few code points of so little
(perceived) use.  But I suppose you never know what new ones they'll uncover, either.
Here are those 129 case-problematicals arranged in UCA order.  Some of these have
normalization forms that decompose into graphemes with four code points (not shown).
There are a few other oddities, like the Kelvin sign and other "singletons", but these
are most of the trouble. They're all in the BMP; I guess we learned our lesson. :)

--tom

  1: U+0345 ○ͅ  COMBINING  GREEK YPOGEGRAMMENI
               fc=ι  U+3B9 lc=○ͅ  U+345 tc=Ι  U+399 uc=Ι  U+399 
  2: U+1E9A ẚ  LATIN SMALL LETTER A WITH RIGHT HALF RING
               fc=aʾ  U+61.2BE lc=ẚ  U+1E9A tc=Aʾ  U+41.2BE uc=Aʾ  U+41.2BE 
  3: U+01F3 dz  LATIN SMALL LETTER DZ
               fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  4: U+01F2 Dz  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
               fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  5: U+01F1 DZ  LATIN CAPITAL LETTER DZ
               fc=dz  U+1F3 lc=dz  U+1F3 tc=Dz  U+1F2 uc=DZ  U+1F1 
  6: U+01C6 dž  LATIN SMALL LETTER DZ WITH CARON
               fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  7: U+01C5 Dž  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
               fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  8: U+01C4 DŽ  LATIN CAPITAL LETTER DZ WITH CARON
               fc=dž  U+1C6 lc=dž  U+1C6 tc=Dž  U+1C5 uc=DŽ  U+1C4 
  9: U+FB00 ff  LATIN SMALL LIGATURE FF
               fc=ff  U+66.66 lc=ff  U+FB00 tc=Ff  U+46.66 uc=FF  U+46.46 
 10: U+FB03 ffi  LATIN SMALL LIGATURE FFI
               fc=ffi  U+66.66.69 lc=ffi  U+FB03 tc=Ffi  U+46.66.69 uc=FFI  U+46.46.49 
 11: U+FB04 ffl  LATIN SMALL LIGATURE FFL
               fc=ffl  U+66.66.6C lc=ffl  U+FB04 tc=Ffl  U+46.66.6C uc=FFL  U+46.46.4C 
 12: U+FB01 fi  LATIN SMALL LIGATURE FI
               fc=fi  U+66.69 lc=fi  U+FB01 tc=Fi  U+46.69 uc=FI  U+46.49 
 13: U+FB02 fl  LATIN SMALL LIGATURE FL
               fc=fl  U+66.6C lc=fl  U+FB02 tc=Fl  U+46.6C uc=FL  U+46.4C 
 14: U+1E96 ẖ  LATIN SMALL LETTER H WITH LINE BELOW
               fc=ẖ  U+68.331 lc=ẖ  U+1E96 tc=H̱  U+48.331 uc=H̱  U+48.331 
 15: U+0130 İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
               fc=i̇  U+69.307 lc=i̇  U+69.307 tc=İ  U+130 uc=İ  U+130 
 16: U+01F0 ǰ  LATIN SMALL LETTER J WITH CARON
               fc=ǰ  U+6A.30C lc=ǰ  U+1F0 tc=J̌  U+4A.30C uc=J̌  U+4A.30C 
 17: U+01C9 lj  LATIN SMALL LETTER LJ
               fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 18: U+01C8 Lj  LATIN CAPITAL LETTER L WITH SMALL LETTER J
               fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 19: U+01C7 LJ  LATIN CAPITAL LETTER LJ
               fc=lj  U+1C9 lc=lj  U+1C9 tc=Lj  U+1C8 uc=LJ  U+1C7 
 20: U+01CC nj  LATIN SMALL LETTER NJ
               fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 21: U+01CB Nj  LATIN CAPITAL LETTER N WITH SMALL LETTER J
               fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 22: U+01CA NJ  LATIN CAPITAL LETTER NJ
               fc=nj  U+1CC lc=nj  U+1CC tc=Nj  U+1CB uc=NJ  U+1CA 
 23: U+017F ſ  LATIN SMALL LETTER LONG S
               fc=s  U+73 lc=ſ  U+17F tc=S  U+53 uc=S  U+53 
 24: U+1E9B ẛ  LATIN SMALL LETTER LONG S WITH DOT ABOVE
               fc=ṡ  U+1E61 lc=ẛ  U+1E9B tc=Ṡ  U+1E60 uc=Ṡ  U+1E60 
 25: U+00DF ß  LATIN SMALL LETTER SHARP S
               fc=ss  U+73.73 lc=ß  U+DF tc=Ss  U+53.73 uc=SS  U+53.53 
 26: U+1E9E ẞ  LATIN CAPITAL LETTER SHARP S
               fc=ss  U+73.73 lc=ß  U+DF tc=ẞ  U+1E9E uc=ẞ  U+1E9E 
 27: U+FB06 st  LATIN SMALL LIGATURE ST
               fc=st  U+73.74 lc=st  U+FB06 tc=St  U+53.74 uc=ST  U+53.54 
 28: U+FB05 ſt  LATIN SMALL LIGATURE LONG S T
               fc=st  U+73.74 lc=ſt  U+FB05 tc=St  U+53.74 uc=ST  U+53.54 
 29: U+1E97 ẗ  LATIN SMALL LETTER T WITH DIAERESIS
               fc=ẗ  U+74.308 lc=ẗ  U+1E97 tc=T̈  U+54.308 uc=T̈  U+54.308 
 30: U+1E98 ẘ  LATIN SMALL LETTER W WITH RING ABOVE
               fc=ẘ  U+77.30A lc=ẘ  U+1E98 tc=W̊  U+57.30A uc=W̊  U+57.30A 
 31: U+1E99 ẙ  LATIN SMALL LETTER Y WITH RING ABOVE
               fc=ẙ  U+79.30A lc=ẙ  U+1E99 tc=Y̊  U+59.30A uc=Y̊  U+59.30A 
 32: U+0149 ʼn  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
               fc=ʼn  U+2BC.6E lc=ʼn  U+149 tc=ʼN  U+2BC.4E uc=ʼN  U+2BC.4E 
 33: U+1F84 ᾄ  GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
               fc=ἄι  U+1F04.3B9 lc=ᾄ  U+1F84 tc=ᾌ  U+1F8C uc=ἌΙ  U+1F0C.399 
 34: U+1F8C ᾌ  GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
               fc=ἄι  U+1F04.3B9 lc=ᾄ  U+1F84 tc=ᾌ  U+1F8C uc=ἌΙ  U+1F0C.399 
 35: U+1F82 ᾂ  GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
               fc=ἂι  U+1F02.3B9 lc=ᾂ  U+1F82 tc=ᾊ  U+1F8A uc=ἊΙ  U+1F0A.399 
 36: U+1F8A ᾊ  GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
               fc=ἂι  U+1F02.3B9 lc=ᾂ  U+1F82 tc=ᾊ  U+1F8A uc=ἊΙ  U+1F0A.399 
 37: U+1F86 ᾆ  GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
               fc=ἆι  U+1F06.3B9 lc=ᾆ  U+1F86 tc=ᾎ  U+1F8E uc=ἎΙ  U+1F0E.399 
 38: U+1F8E ᾎ  GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
               fc=ἆι  U+1F06.3B9 lc=ᾆ  U+1F86 tc=ᾎ  U+1F8E uc=ἎΙ  U+1F0E.399 
 39: U+1F80 ᾀ  GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
               fc=ἀι  U+1F00.3B9 lc=ᾀ  U+1F80 tc=ᾈ  U+1F88 uc=ἈΙ  U+1F08.399 
 40: U+1F88 ᾈ  GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
               fc=ἀι  U+1F00.3B9 lc=ᾀ  U+1F80 tc=ᾈ  U+1F88 uc=ἈΙ  U+1F08.399 
 41: U+1F85 ᾅ  GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
               fc=ἅι  U+1F05.3B9 lc=ᾅ  U+1F85 tc=ᾍ  U+1F8D uc=ἍΙ  U+1F0D.399 
 42: U+1F8D ᾍ  GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
               fc=ἅι  U+1F05.3B9 lc=ᾅ  U+1F85 tc=ᾍ  U+1F8D uc=ἍΙ  U+1F0D.399 
 43: U+1F83 ᾃ  GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
               fc=ἃι  U+1F03.3B9 lc=ᾃ  U+1F83 tc=ᾋ  U+1F8B uc=ἋΙ  U+1F0B.399 
 44: U+1F8B ᾋ  GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
               fc=ἃι  U+1F03.3B9 lc=ᾃ  U+1F83 tc=ᾋ  U+1F8B uc=ἋΙ  U+1F0B.399 
 45: U+1F87 ᾇ  GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
               fc=ἇι  U+1F07.3B9 lc=ᾇ  U+1F87 tc=ᾏ  U+1F8F uc=ἏΙ  U+1F0F.399 
 46: U+1F8F ᾏ  GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
               fc=ἇι  U+1F07.3B9 lc=ᾇ  U+1F87 tc=ᾏ  U+1F8F uc=ἏΙ  U+1F0F.399 
 47: U+1F81 ᾁ  GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
               fc=ἁι  U+1F01.3B9 lc=ᾁ  U+1F81 tc=ᾉ  U+1F89 uc=ἉΙ  U+1F09.399 
 48: U+1F89 ᾉ  GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
               fc=ἁι  U+1F01.3B9 lc=ᾁ  U+1F81 tc=ᾉ  U+1F89 uc=ἉΙ  U+1F09.399 
 49: U+1FB4 ᾴ  GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
               fc=άι  U+3AC.3B9 lc=ᾴ  U+1FB4 tc=Άͅ  U+386.345 uc=ΆΙ  U+386.399 
 50: U+1FB2 ᾲ  GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
               fc=ὰι  U+1F70.3B9 lc=ᾲ  U+1FB2 tc=Ὰͅ  U+1FBA.345 uc=ᾺΙ  U+1FBA.399 
 51: U+1FB6 ᾶ  GREEK SMALL LETTER ALPHA WITH PERISPOMENI
               fc=ᾶ  U+3B1.342 lc=ᾶ  U+1FB6 tc=Α͂  U+391.342 uc=Α͂  U+391.342 
 52: U+1FB7 ᾷ  GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
               fc=ᾶι  U+3B1.342.3B9 lc=ᾷ  U+1FB7 tc=ᾼ͂  U+391.342.345 uc=Α͂Ι  U+391.342.399 
 53: U+1FB3 ᾳ  GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
               fc=αι  U+3B1.3B9 lc=ᾳ  U+1FB3 tc=ᾼ  U+1FBC uc=ΑΙ  U+391.399 
 54: U+1FBC ᾼ  GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
               fc=αι  U+3B1.3B9 lc=ᾳ  U+1FB3 tc=ᾼ  U+1FBC uc=ΑΙ  U+391.399 
 55: U+03D0 ϐ  GREEK BETA SYMBOL
               fc=β  U+3B2 lc=ϐ  U+3D0 tc=Β  U+392 uc=Β  U+392 
 56: U+03F5 ϵ  GREEK LUNATE EPSILON SYMBOL
               fc=ε  U+3B5 lc=ϵ  U+3F5 tc=Ε  U+395 uc=Ε  U+395 
 57: U+1F94 ᾔ  GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
               fc=ἤι  U+1F24.3B9 lc=ᾔ  U+1F94 tc=ᾜ  U+1F9C uc=ἬΙ  U+1F2C.399 
 58: U+1F9C ᾜ  GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
               fc=ἤι  U+1F24.3B9 lc=ᾔ  U+1F94 tc=ᾜ  U+1F9C uc=ἬΙ  U+1F2C.399 
 59: U+1F92 ᾒ  GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
               fc=ἢι  U+1F22.3B9 lc=ᾒ  U+1F92 tc=ᾚ  U+1F9A uc=ἪΙ  U+1F2A.399 
 60: U+1F9A ᾚ  GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
               fc=ἢι  U+1F22.3B9 lc=ᾒ  U+1F92 tc=ᾚ  U+1F9A uc=ἪΙ  U+1F2A.399 
 61: U+1F96 ᾖ  GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
               fc=ἦι  U+1F26.3B9 lc=ᾖ  U+1F96 tc=ᾞ  U+1F9E uc=ἮΙ  U+1F2E.399 
 62: U+1F9E ᾞ  GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
               fc=ἦι  U+1F26.3B9 lc=ᾖ  U+1F96 tc=ᾞ  U+1F9E uc=ἮΙ  U+1F2E.399 
 63: U+1F90 ᾐ  GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
               fc=ἠι  U+1F20.3B9 lc=ᾐ  U+1F90 tc=ᾘ  U+1F98 uc=ἨΙ  U+1F28.399 
 64: U+1F98 ᾘ  GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
               fc=ἠι  U+1F20.3B9 lc=ᾐ  U+1F90 tc=ᾘ  U+1F98 uc=ἨΙ  U+1F28.399 
 65: U+1F95 ᾕ  GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
               fc=ἥι  U+1F25.3B9 lc=ᾕ  U+1F95 tc=ᾝ  U+1F9D uc=ἭΙ  U+1F2D.399 
 66: U+1F9D ᾝ  GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
               fc=ἥι  U+1F25.3B9 lc=ᾕ  U+1F95 tc=ᾝ  U+1F9D uc=ἭΙ  U+1F2D.399 
 67: U+1F93 ᾓ  GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
               fc=ἣι  U+1F23.3B9 lc=ᾓ  U+1F93 tc=ᾛ  U+1F9B uc=ἫΙ  U+1F2B.399 
 68: U+1F9B ᾛ  GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
               fc=ἣι  U+1F23.3B9 lc=ᾓ  U+1F93 tc=ᾛ  U+1F9B uc=ἫΙ  U+1F2B.399 
 69: U+1F97 ᾗ  GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
               fc=ἧι  U+1F27.3B9 lc=ᾗ  U+1F97 tc=ᾟ  U+1F9F uc=ἯΙ  U+1F2F.399 
 70: U+1F9F ᾟ  GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
               fc=ἧι  U+1F27.3B9 lc=ᾗ  U+1F97 tc=ᾟ  U+1F9F uc=ἯΙ  U+1F2F.399 
 71: U+1F91 ᾑ  GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
               fc=ἡι  U+1F21.3B9 lc=ᾑ  U+1F91 tc=ᾙ  U+1F99 uc=ἩΙ  U+1F29.399 
 72: U+1F99 ᾙ  GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
               fc=ἡι  U+1F21.3B9 lc=ᾑ  U+1F91 tc=ᾙ  U+1F99 uc=ἩΙ  U+1F29.399 
 73: U+1FC4 ῄ  GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
               fc=ήι  U+3AE.3B9 lc=ῄ  U+1FC4 tc=Ήͅ  U+389.345 uc=ΉΙ  U+389.399 
 74: U+1FC2 ῂ  GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
               fc=ὴι  U+1F74.3B9 lc=ῂ  U+1FC2 tc=Ὴͅ  U+1FCA.345 uc=ῊΙ  U+1FCA.399 
 75: U+1FC6 ῆ  GREEK SMALL LETTER ETA WITH PERISPOMENI
               fc=ῆ  U+3B7.342 lc=ῆ  U+1FC6 tc=Η͂  U+397.342 uc=Η͂  U+397.342 
 76: U+1FC7 ῇ  GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
               fc=ῆι  U+3B7.342.3B9 lc=ῇ  U+1FC7 tc=ῌ͂  U+397.342.345 uc=Η͂Ι  U+397.342.399 
 77: U+1FC3 ῃ  GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
               fc=ηι  U+3B7.3B9 lc=ῃ  U+1FC3 tc=ῌ  U+1FCC uc=ΗΙ  U+397.399 
 78: U+1FCC ῌ  GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
               fc=ηι  U+3B7.3B9 lc=ῃ  U+1FC3 tc=ῌ  U+1FCC uc=ΗΙ  U+397.399 
 79: U+03D1 ϑ  GREEK THETA SYMBOL
               fc=θ  U+3B8 lc=ϑ  U+3D1 tc=Θ  U+398 uc=Θ  U+398 
 80: U+1FBE ι  GREEK PROSGEGRAMMENI
               fc=ι  U+3B9 lc=ι  U+1FBE tc=Ι  U+399 uc=Ι  U+399 
 81: U+1FD6 ῖ  GREEK SMALL LETTER IOTA WITH PERISPOMENI
               fc=ῖ  U+3B9.342 lc=ῖ  U+1FD6 tc=Ι͂  U+399.342 uc=Ι͂  U+399.342 
 82: U+0390 ΐ  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
               fc=ΐ  U+3B9.308.301 lc=ΐ  U+390 tc=Ϊ́  U+399.308.301 uc=Ϊ́  U+399.308.301 
 83: U+1FD3 ΐ  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
               fc=ΐ  U+3B9.308.301 lc=ΐ  U+1FD3 tc=Ϊ́  U+399.308.301 uc=Ϊ́  U+399.308.301 
 84: U+1FD2 ῒ  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
               fc=ῒ  U+3B9.308.300 lc=ῒ  U+1FD2 tc=Ϊ̀  U+399.308.300 uc=Ϊ̀  U+399.308.300 
 85: U+1FD7 ῗ  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
               fc=ῗ  U+3B9.308.342 lc=ῗ  U+1FD7 tc=Ϊ͂  U+399.308.342 uc=Ϊ͂  U+399.308.342 
 86: U+03F0 ϰ  GREEK KAPPA SYMBOL
               fc=κ  U+3BA lc=ϰ  U+3F0 tc=Κ  U+39A uc=Κ  U+39A 
 87: U+00B5 µ  MICRO SIGN
               fc=μ  U+3BC lc=µ  U+B5 tc=Μ  U+39C uc=Μ  U+39C 
 88: U+03D6 ϖ  GREEK PI SYMBOL
               fc=π  U+3C0 lc=ϖ  U+3D6 tc=Π  U+3A0 uc=Π  U+3A0 
 89: U+03F1 ϱ  GREEK RHO SYMBOL
               fc=ρ  U+3C1 lc=ϱ  U+3F1 tc=Ρ  U+3A1 uc=Ρ  U+3A1 
 90: U+1FE4 ῤ  GREEK SMALL LETTER RHO WITH PSILI
               fc=ῤ  U+3C1.313 lc=ῤ  U+1FE4 tc=Ρ̓  U+3A1.313 uc=Ρ̓  U+3A1.313 
 91: U+03C2 ς  GREEK SMALL LETTER FINAL SIGMA
               fc=σ  U+3C3 lc=ς  U+3C2 tc=Σ  U+3A3 uc=Σ  U+3A3 
 92: U+1F50 ὐ  GREEK SMALL LETTER UPSILON WITH PSILI
               fc=ὐ  U+3C5.313 lc=ὐ  U+1F50 tc=Υ̓  U+3A5.313 uc=Υ̓  U+3A5.313 
 93: U+1F54 ὔ  GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
               fc=ὔ  U+3C5.313.301 lc=ὔ  U+1F54 tc=Υ̓́  U+3A5.313.301 uc=Υ̓́  U+3A5.313.301 
 94: U+1F52 ὒ  GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
               fc=ὒ  U+3C5.313.300 lc=ὒ  U+1F52 tc=Υ̓̀  U+3A5.313.300 uc=Υ̓̀  U+3A5.313.300 
 95: U+1F56 ὖ  GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
               fc=ὖ  U+3C5.313.342 lc=ὖ  U+1F56 tc=Υ̓͂  U+3A5.313.342 uc=Υ̓͂  U+3A5.313.342 
 96: U+1FE6 ῦ  GREEK SMALL LETTER UPSILON WITH PERISPOMENI
               fc=ῦ  U+3C5.342 lc=ῦ  U+1FE6 tc=Υ͂  U+3A5.342 uc=Υ͂  U+3A5.342 
 97: U+03B0 ΰ  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
               fc=ΰ  U+3C5.308.301 lc=ΰ  U+3B0 tc=Ϋ́  U+3A5.308.301 uc=Ϋ́  U+3A5.308.301 
 98: U+1FE3 ΰ  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
               fc=ΰ  U+3C5.308.301 lc=ΰ  U+1FE3 tc=Ϋ́  U+3A5.308.301 uc=Ϋ́  U+3A5.308.301 
 99: U+1FE2 ῢ  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
               fc=ῢ  U+3C5.308.300 lc=ῢ  U+1FE2 tc=Ϋ̀  U+3A5.308.300 uc=Ϋ̀  U+3A5.308.300 
100: U+1FE7 ῧ  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
               fc=ῧ  U+3C5.308.342 lc=ῧ  U+1FE7 tc=Ϋ͂  U+3A5.308.342 uc=Ϋ͂  U+3A5.308.342 
101: U+03D5 ϕ  GREEK PHI SYMBOL
               fc=φ  U+3C6 lc=ϕ  U+3D5 tc=Φ  U+3A6 uc=Φ  U+3A6 
102: U+1FA4 ᾤ  GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
               fc=ὤι  U+1F64.3B9 lc=ᾤ  U+1FA4 tc=ᾬ  U+1FAC uc=ὬΙ  U+1F6C.399 
103: U+1FAC ᾬ  GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
               fc=ὤι  U+1F64.3B9 lc=ᾤ  U+1FA4 tc=ᾬ  U+1FAC uc=ὬΙ  U+1F6C.399 
104: U+1FA2 ᾢ  GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
               fc=ὢι  U+1F62.3B9 lc=ᾢ  U+1FA2 tc=ᾪ  U+1FAA uc=ὪΙ  U+1F6A.399 
105: U+1FAA ᾪ  GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
               fc=ὢι  U+1F62.3B9 lc=ᾢ  U+1FA2 tc=ᾪ  U+1FAA uc=ὪΙ  U+1F6A.399 
106: U+1FA6 ᾦ  GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
               fc=ὦι  U+1F66.3B9 lc=ᾦ  U+1FA6 tc=ᾮ  U+1FAE uc=ὮΙ  U+1F6E.399 
107: U+1FAE ᾮ  GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
               fc=ὦι  U+1F66.3B9 lc=ᾦ  U+1FA6 tc=ᾮ  U+1FAE uc=ὮΙ  U+1F6E.399 
108: U+1FA0 ᾠ  GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
               fc=ὠι  U+1F60.3B9 lc=ᾠ  U+1FA0 tc=ᾨ  U+1FA8 uc=ὨΙ  U+1F68.399 
109: U+1FA8 ᾨ  GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
               fc=ὠι  U+1F60.3B9 lc=ᾠ  U+1FA0 tc=ᾨ  U+1FA8 uc=ὨΙ  U+1F68.399 
110: U+1FA5 ᾥ  GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
               fc=ὥι  U+1F65.3B9 lc=ᾥ  U+1FA5 tc=ᾭ  U+1FAD uc=ὭΙ  U+1F6D.399 
111: U+1FAD ᾭ  GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
               fc=ὥι  U+1F65.3B9 lc=ᾥ  U+1FA5 tc=ᾭ  U+1FAD uc=ὭΙ  U+1F6D.399 
112: U+1FA3 ᾣ  GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
               fc=ὣι  U+1F63.3B9 lc=ᾣ  U+1FA3 tc=ᾫ  U+1FAB uc=ὫΙ  U+1F6B.399 
113: U+1FAB ᾫ  GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
               fc=ὣι  U+1F63.3B9 lc=ᾣ  U+1FA3 tc=ᾫ  U+1FAB uc=ὫΙ  U+1F6B.399 
114: U+1FA7 ᾧ  GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
               fc=ὧι  U+1F67.3B9 lc=ᾧ  U+1FA7 tc=ᾯ  U+1FAF uc=ὯΙ  U+1F6F.399 
115: U+1FAF ᾯ  GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
               fc=ὧι  U+1F67.3B9 lc=ᾧ  U+1FA7 tc=ᾯ  U+1FAF uc=ὯΙ  U+1F6F.399 
116: U+1FA1 ᾡ  GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
               fc=ὡι  U+1F61.3B9 lc=ᾡ  U+1FA1 tc=ᾩ  U+1FA9 uc=ὩΙ  U+1F69.399 
117: U+1FA9 ᾩ  GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
               fc=ὡι  U+1F61.3B9 lc=ᾡ  U+1FA1 tc=ᾩ  U+1FA9 uc=ὩΙ  U+1F69.399 
118: U+1FF4 ῴ  GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
               fc=ώι  U+3CE.3B9 lc=ῴ  U+1FF4 tc=Ώͅ  U+38F.345 uc=ΏΙ  U+38F.399 
119: U+1FF2 ῲ  GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
               fc=ὼι  U+1F7C.3B9 lc=ῲ  U+1FF2 tc=Ὼͅ  U+1FFA.345 uc=ῺΙ  U+1FFA.399 
120: U+1FF6 ῶ  GREEK SMALL LETTER OMEGA WITH PERISPOMENI
               fc=ῶ  U+3C9.342 lc=ῶ  U+1FF6 tc=Ω͂  U+3A9.342 uc=Ω͂  U+3A9.342 
121: U+1FF7 ῷ  GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
               fc=ῶι  U+3C9.342.3B9 lc=ῷ  U+1FF7 tc=ῼ͂  U+3A9.342.345 uc=Ω͂Ι  U+3A9.342.399 
122: U+1FF3 ῳ  GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
               fc=ωι  U+3C9.3B9 lc=ῳ  U+1FF3 tc=ῼ  U+1FFC uc=ΩΙ  U+3A9.399 
123: U+1FFC ῼ  GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
               fc=ωι  U+3C9.3B9 lc=ῳ  U+1FF3 tc=ῼ  U+1FFC uc=ΩΙ  U+3A9.399 
124: U+0587 և  ARMENIAN SMALL LIGATURE ECH YIWN
               fc=եւ  U+565.582 lc=և  U+587 tc=Եւ  U+535.582 uc=ԵՒ  U+535.552 
125: U+FB14 ﬔ  ARMENIAN SMALL LIGATURE MEN ECH
               fc=մե  U+574.565 lc=ﬔ  U+FB14 tc=Մե  U+544.565 uc=ՄԵ  U+544.535 
126: U+FB15 ﬕ  ARMENIAN SMALL LIGATURE MEN INI
               fc=մի  U+574.56B lc=ﬕ  U+FB15 tc=Մի  U+544.56B uc=ՄԻ  U+544.53B 
127: U+FB17 ﬗ  ARMENIAN SMALL LIGATURE MEN XEH
               fc=մխ  U+574.56D lc=ﬗ  U+FB17 tc=Մխ  U+544.56D uc=ՄԽ  U+544.53D 
128: U+FB13 ﬓ  ARMENIAN SMALL LIGATURE MEN NOW
               fc=մն  U+574.576 lc=ﬓ  U+FB13 tc=Մն  U+544.576 uc=ՄՆ  U+544.546 
129: U+FB16 ﬖ  ARMENIAN SMALL LIGATURE VEW NOW
               fc=վն  U+57E.576 lc=ﬖ  U+FB16 tc=Վն  U+54E.576 uc=ՎՆ  U+54E.546
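
The three criteria at the top of this message can be turned into a rough filter. This is only a sketch, not the generator used to produce the listing: it assumes a Python (3.3 or later) whose str.lower(), str.upper(), str.title() and str.casefold() apply the full mappings, the helper names are made up for illustration, and the count it reports depends on the Unicode version bundled with the interpreter, so it need not be 129.

```python
def is_case_problematic(ch):
    """Return True if ch meets any of the three criteria listed above."""
    lower, upper, title, fold = ch.lower(), ch.upper(), ch.title(), ch.casefold()
    return (
        title != upper        # titlecase differs from uppercase
        or fold != lower      # foldcase differs from lowercase
        # any of fold/lower/title/uppercase yields more than one code point
        or any(len(s) > 1 for s in (fold, lower, title, upper))
    )


def case_problematic(code_points):
    """Collect the problematic characters among the given code points."""
    return [chr(cp) for cp in code_points if is_case_problematic(chr(cp))]


if __name__ == "__main__":
    # Scan only the BMP, which is where the message says they all live.
    found = case_problematic(range(0x10000))
    print(len(found), "case-problematic code points in the BMP")
```

Passing a different range (for example range(sys.maxunicode + 1)) would extend the scan beyond the BMP.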
msg143085 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-27 19:29
There are some oddities in Unicode case-folding.

Under full case-folding, both "\N{LATIN CAPITAL LETTER SHARP S}" and "\N{LATIN SMALL LETTER SHARP S}" fold to "ss", which means that those codepoints match each other.

However, under simple case-folding, they fold to themselves, which means that those codepoints _don't_ match each other.
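
A small check of the full-folding half of this, assuming a Python new enough to have str.casefold() (3.3 or later); caseless_equal is a hypothetical helper name, not part of any library:

```python
capital = "\N{LATIN CAPITAL LETTER SHARP S}"   # U+1E9E
small = "\N{LATIN SMALL LETTER SHARP S}"       # U+00DF

# str.casefold() applies full case folding, so both become "ss".
print(capital.casefold(), small.casefold())


def caseless_equal(a, b):
    """Compare two strings under full case folding."""
    return a.casefold() == b.casefold()


print(caseless_equal(capital, small))
print(caseless_equal("STRASSE", "stra\N{LATIN SMALL LETTER SHARP S}e"))
```

Python's str type has no built-in simple case folding, so the simple-folding half is not shown here.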
msg143086 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-27 20:04
> Neither am I.  Even in "old-style" English with ae and oe, one wrote
> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
> *Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

Trying to disprove you a bit:
http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg

but classical typographies seem to write either the uppercase Œ or the lowercase œ.

That said, I wonder why Unicode even includes ligatures like ff. Sounds like mission creep to me (and horrible annoyances for people like us).
msg143089 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-28 05:54
FTR, with the latest Python 3.2/3.3 (narrow) I get:
   Total failures:   58 / 500 ( 12%)
   Total successes: 442 / 500 ( 88%)
and with the latest Python 3.2/3.3 (wide) I get:
   Total failures:   52 / 500 ( 10%)
   Total successes: 448 / 500 ( 90%)
msg143110 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-28 17:27
Thanks Tom for such a clear explanation! I hope someone will implement
this. (Matthew, does this affect regex? I am guessing it does, for
case-insensitive matching?)
msg143119 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-08-28 18:56
The regex module currently uses simple case-folding, although I'm working towards full case-folding, as listed in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt.
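
For readers unfamiliar with that file, here is a rough sketch of how its records could be read. It is not the regex module's code. It assumes the documented line layout (code; status; mapping; # name) and the convention that full folding uses the C and F statuses while simple folding uses C and S; the function names and the four sample lines are chosen for illustration.

```python
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
FB00; F; 0066 0066; # LATIN SMALL LIGATURE FF
"""


def parse_case_folding(lines, statuses=("C", "F")):
    """Build a {character: folded string} table from CaseFolding.txt-style lines."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


def fold(text, table):
    """Fold text character by character using the table."""
    return "".join(table.get(ch, ch) for ch in text)


full = parse_case_folding(SAMPLE.splitlines())               # C + F
simple = parse_case_folding(SAMPLE.splitlines(), ("C", "S"))  # C + S
print(fold("A\u00df\ufb00", full), fold("A\u00df\ufb00", simple))
```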
msg143124 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-28 21:01
Antoine Pitrou <report@bugs.python.org> wrote on Sat, 27 Aug 2011 20:04:56 -0000: 

>> Neither am I.  Even in "old-style" English with ae and oe, one wrote
>> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
>> *Aesir.  Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

> Trying to disprove you a bit:
> http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg

> but classical typographies seem to write either the uppercase Œ or the
> lowercase œ.

That's what I meant: one only ever sees œufs or ŒUFS, never OEUFS.
French doesn't fit into ISO 8859-1.  That's one of the changes to
ISO-8859-15 compared with ISO-8859-1 (and Unicode):

    iso-8859-1   A4  ⇔  U+00A4  < ¤ >  \N{CURRENCY SIGN}
    iso-8859-15  A4  ⇒  U+20AC  < € >  \N{EURO SIGN}

    iso-8859-1   A6  ⇔  U+00A6  < ¦ >  \N{BROKEN BAR}
    iso-8859-15  A6  ⇒  U+0160  < Š >  \N{LATIN CAPITAL LETTER S WITH CARON}

    iso-8859-1   A8  ⇔  U+00A8  < ¨ >  \N{DIAERESIS}
    iso-8859-15  A8  ⇒  U+0161  < š >  \N{LATIN SMALL LETTER S WITH CARON}

    iso-8859-1   B4  ⇔  U+00B4  < ´ >  \N{ACUTE ACCENT}
    iso-8859-15  B4  ⇒  U+017D  < Ž >  \N{LATIN CAPITAL LETTER Z WITH CARON}

    iso-8859-1   B8  ⇔  U+00B8  < ¸ >  \N{CEDILLA}
    iso-8859-15  B8  ⇒  U+017E  < ž >  \N{LATIN SMALL LETTER Z WITH CARON}

    iso-8859-1   BC  ⇔  U+00BC  < ¼ >  \N{VULGAR FRACTION ONE QUARTER}
    iso-8859-15  BC  ⇒  U+0152  < Œ >  \N{LATIN CAPITAL LIGATURE OE}

    iso-8859-1   BD  ⇔  U+00BD  < ½ >  \N{VULGAR FRACTION ONE HALF}
    iso-8859-15  BD  ⇒  U+0153  < œ >  \N{LATIN SMALL LIGATURE OE}

    iso-8859-1   BE  ⇔  U+00BE  < ¾ >  \N{VULGAR FRACTION THREE QUARTERS}
    iso-8859-15  BE  ⇒  U+0178  < Ÿ >  \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}

> That said, I wonder why Unicode even includes ligatures like ff. Sounds
> like mission creep to me (and horrible annoyances for people like us).

I'm pretty sure that typographic ligatures are there for roundtripping
with legacy encodings.  I believe that œ/Œ is the only code point
with ligature in its name that you're "supposed" to still use, and
that all others should be figured out by modern fonting software.

--tom
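
The eight differing byte values in the table above can be checked with Python's standard codecs. This is a minimal sketch; fits is a hypothetical helper name.

```python
DIFFERING_BYTES = [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]

for byte in DIFFERING_BYTES:
    raw = bytes([byte])
    latin1 = raw.decode("iso-8859-1")
    latin9 = raw.decode("iso-8859-15")
    print(f"{byte:02X}  iso-8859-1: U+{ord(latin1):04X} {latin1}"
          f"   iso-8859-15: U+{ord(latin9):04X} {latin9}")


def fits(text, encoding):
    """Return True if text can be encoded losslessly in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(fits("\u0153ufs", "iso-8859-1"), fits("\u0153ufs", "iso-8859-15"))
```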
msg143145 - (view) Author: Jean-Michel Fauth (Jean-Michel.Fauth) Date: 2011-08-29 13:13
Œ, œ or even & are historically ligatures or "ligatured forms".
In the French typography, they are "single plain letters" and
they belong to the group of the 42 letters used in the French
typography.
Typographically speaking, using "oe" instead of "œ" is considered
as a mistake, while not using the ligatured forms for the groups
of letters like ff, ffi, ffl, fj, et, st is acceptable.

Microsoft with cp1252, Apple with mac-roman, Adobe and all
foundries and now "Unicode" are working correctly.

It should be noted, when "TeX" moved from the ascii to iso-8859-1
(more precisely "CorkEncoding") as default encoding, "they" saw
the problem and introduced the \oe or \OE commands.

From my understanding and my point of view on the subject, ISO has
somehow recognized its mistake by introducing iso-8859-15.
Unfortunately, it was too late.

To the subject: Œdipe: correct, Oedipe, OEdipe: incorrect.

I am no expert in that field, but the information one can
find on Wikipedia (French) regarding questions about
typography is generally correct.
msg143146 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-29 13:21
> Œ, œ or even & are historically ligatures or "ligatured forms".
> In the French typography, they are "single plain letters" and
> they belong to the group of the 42 letters used in the French
> typography.
> Typographically speaking, using "oe" instead of "œ" is considered
> as a mistake,

It's not only "typographically speaking", it's really a spelling error,
even in hand-written text :-)
msg143148 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-29 14:16
Antoine Pitrou <report@bugs.python.org> wrote
   on Mon, 29 Aug 2011 13:21:06 -0000: 

> It's not only "typographically speaking", it's really a spelling error,
> even in hand-written text :-)

Sure, and so too is omitting an accent mark or diaeresis.  But—alas!—you’ll
never convince most monoglot anglophones of that, the ones who keep wanting to
strip them from résumé, façade, châteaux, crème brûlée, fête, tête-à-tête, 
à la française, or naïveté, not to mention José, jalapeño, the erstwhile
American Secretary of State Federico Peña, or nearby Cañon City, Colorado, 
where I have family.  I think œnology has survived solely on its rarity, 
and the Encyclopædia Britannica is that way because the ligat(ur)ed letter
is in their actual trademark.

Cell phone users sending text messages have long suffered the grievous
injuries to their language(s) that naked ASCII imparts, but this is
nothing like the crossdressing nightmare called Greeklish, also variously
known as Grenglish, Latinoellinika/Λατινοελληνικά, or ASCII Greek.

    http://en.wikipedia.org/wiki/Greeklish

    [...] The reason for this is the fact that text written in Greeklish
    is considerably less aesthetically pleasing, and also much harder to
    read, compared to text written in the Greek alphabet. A non-Greek
    speaker/reader can guess this by this example: "δις ιζ χαρντ του
    ριντ" would be the way to write "this is hard to read" in English
    but utilizing the Greek alphabet.

I especially enjoy  George Baloglou’s "Byzantine" Grenglish, wherein:

    Ὀδυσσεύς    => Oducceus    instead of Odysseus
    Ἀχιλλεύς    => Axilleus    instead of Achilleus
    Σίσυφος     => Sicuphos    instead of Sisyphus
    Περικλῆς    => 5epiklhs    instead of Pericles
    Χθονός      => X8onos      instead of Chthonos
 Οι Ατρείδες    => Oi Atpeides instead of the Atreïdes

Terrible though the depredations upon the French language that may
have been committed by ASCII, surely these go even further. :)

--tom

        Η Ιλιάδα                                        H Iliada

Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλῆος           Mhnin aeide, 8ea, 5hlhiadeo Axilhos
οὐλομένην, ἣ μυρί’ Ἀχαιοῖς ἄλγε’ ἔθηκε,       oulomenhn, 'h mupi’ Axaiois alge’ e8hke,
πολλὰς δ’ ἰφθίμους ψυχὰς Ἄϊδι προῒαψεν        nollas d’ iph8imous yuxas Aidi npoiayen
ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν        'hpoon, autous de elopia teuxe kuneccin
οἰωνοῖσί τε πᾶσι· Διὸς δ’ ἐτελείετο βουλή·    oionoici te naci· Dios d’ eteleieto boulh·
ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε          eks o'u dh ta npota diacththn epicante
Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.    Atpeidhs te anaks andpon kai dios Axilleus.
msg150844 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-08 03:54
Here is a patch. I only dealt with case mappings and not titlecase. Doing titlecase properly requires word segmentation, which I think should be another patch/issue. This patch fixes swapcase(), capitalize(), upper(), and lower(). It does not include the changes to Objects/unicodetype_db.h because those are huge. Regenerate the database if you want to test it. Please review.
msg150998 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-10 03:49
New patch. I implemented it the way Antoine desired. It seems rather inefficient to be copying around so much data...
msg151016 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-10 14:03
__ap__'s implementation method is about 2x faster than mine.
msg151088 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-11 20:20
New patch with title casing mappings added.
msg151098 - (view) Author: Roundup Robot (python-dev) Date: 2012-01-11 23:17
New changeset f7e05d205a52 by Benjamin Peterson in branch 'default':
use full unicode mappings for upper/lower/title case (#12736)
http://hg.python.org/cpython/rev/f7e05d205a52
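
A quick way to see the effect of full mappings on an interpreter that includes this change (Python 3.3 or later), using a few characters from the listing earlier in the thread. The helper names code_points and casings are made up for this sketch.

```python
def code_points(s):
    """Render a string in the U+XX.YY notation used in the listing above."""
    return "U+" + ".".join(f"{ord(c):X}" for c in s)


def casings(ch):
    """Lower-, title- and uppercase of ch, as the str methods return them."""
    return {"lc": ch.lower(), "tc": ch.title(), "uc": ch.upper()}


for ch in ("\N{LATIN SMALL LETTER SHARP S}",            # U+00DF
           "\N{LATIN SMALL LIGATURE FI}",               # U+FB01
           "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"):  # U+0130
    parts = " ".join(f"{k}={v}  {code_points(v)}" for k, v in casings(ch).items())
    print(f"{code_points(ch)}: {parts}")
```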
msg151141 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-12 17:17
The currently applied patch ( http://hg.python.org/cpython/rev/f7e05d205a52 ) left some dead code in unicodeobject.c

function fixup ( http://hg.python.org/cpython/file/f7e05d205a52/Objects/unicodeobject.c#l9386 ) has a shortcut for when the fixer doesn't make any actual changes.  The removed fixers (like fixupper ) returned 0 rather than maxchar to indicate that.  The only remaining fixer, fix_decimal_and_space_to_ascii (line 8839), does not.  (I think fix_decimal_and_space_to_ascii *should* add a touched flag, but until it does, the shortcut dedup code is dead.)

Also, around line 10502, there is an #if 0 section with code that relied on one of the removed fixers; is it time to remove that section?
msg151311 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-16 00:24
Why was the delta-processing removed from the casing functions?

As best I can tell, the whole point of going through multiple levels of indirection (courtesy splitbins) is to maximize compression and minimize the amount of cache that unicode might occupy.

By using deltas, only one record is needed for each combination of (upper - lower, upper - title), which is generally only one or two combinations per script.  

Without deltas, nearly every cased letter needs its own record, and the index tables also get bigger. (It seems to be about 2.6 times as large, but cache effects may be worse, since letters from the same script will no longer be in the same record or the same index chain.)

If it is a concern about not enough room for flags, then the decimal/digit chars could be combined.  They are always the same, unless the number isn't decimal (in which case the flag is enough).
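
A toy illustration of the compression argument above. It is not CPython's actual record layout: the deltas here are taken relative to the code point itself, characters with multi-character mappings are simply skipped, and the function names are made up for the sketch.

```python
def single_mappings(cp):
    """Return (upper, lower, title) code points, or None if any mapping is multi-character."""
    ch = chr(cp)
    mapped = (ch.upper(), ch.lower(), ch.title())
    if any(len(m) != 1 for m in mapped):
        return None
    return tuple(ord(m) for m in mapped)


def count_records(code_points):
    """Count distinct records needed with absolute values versus deltas."""
    absolute, delta = set(), set()
    for cp in code_points:
        maps = single_mappings(cp)
        if maps is None:
            continue
        absolute.add(maps)                        # one record per letter pair
        delta.add(tuple(m - cp for m in maps))    # one record per offset pattern
    return len(absolute), len(delta)


ascii_letters = list(range(0x41, 0x5B)) + list(range(0x61, 0x7B))
print(count_records(ascii_letters))   # (26, 2)
```

For A-Z and a-z, absolute storage needs 26 distinct records, while deltas need only two: (0, +32, 0) for the capitals and (-32, 0, -32) for the small letters.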
msg151314 - (view) Author: Roundup Robot (python-dev) Date: 2012-01-16 02:19
New changeset 03ea95e3b497 by Benjamin Peterson in branch 'default':
delta encoding of upper/lower/title makes a glorious return (#12736)
http://hg.python.org/cpython/rev/03ea95e3b497
msg261517 - (view) Author: Андрей Баксаляр (Андрей Баксаляр) Date: 2016-03-10 17:37
The same problem with the Unicode case mapping is still present in Python 3.4.3. You can reproduce the bug with this code, for instance:

'ΰ'.upper().lower() == 'ΰ'

Strangely, the case swapping leads to character replacement:

b'\xce\xb0' → b'\xcf\x85\xcc\x88\xcc\x81'
msg261522 - (view) Author: Андрей Баксаляр (Андрей Баксаляр) Date: 2016-03-10 20:21
Interestingly, the bug is still reproducible in version 3.5.1, but fixed in 2.7.6.
msg261547 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2016-03-11 07:39
The full case mappings do not preserve normalization form.

>>> import unicodedata
>>> for c in 'ΰ'.upper().lower(): print(unicodedata.name(c))
... 
GREEK SMALL LETTER UPSILON
COMBINING DIAERESIS
COMBINING ACUTE ACCENT
>>> unicodedata.normalize('NFC', 'ΰ'.upper().lower()) == 'ΰ'
True
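
In script form, a comparison that tolerates this could look like the sketch below. same_after_nfc is a hypothetical helper name. Normalizing only hides differences in normalization form; a round trip such as ß → SS → ss still changes the text.

```python
import unicodedata


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


original = "\N{GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS}"  # U+03B0
round_tripped = original.upper().lower()

print(original.encode("utf-8"), round_tripped.encode("utf-8"))
print(round_tripped == original)                 # False: decomposed sequence
print(same_after_nfc(round_tripped, original))   # True
```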
History
Date                 User               Action  Args
2016-03-11 07:39:49  benjamin.peterson  set     messages: + msg261547
2016-03-10 20:44:03  gvanrossum         set     nosy: - gvanrossum
2016-03-10 20:42:37  SilentGhost        set     versions: + Python 3.4, Python 3.5, Python 3.6, - Python 2.7
2016-03-10 20:21:51  Андрей Баксаляр    set     files: + pythonbug.png
                                                messages: + msg261522
                                                versions: + Python 2.7, - Python 3.4
2016-03-10 17:37:31  Андрей Баксаляр    set     nosy: + Андрей Баксаляр
                                                messages: + msg261517
                                                versions: + Python 3.4, - Python 3.3
2013-06-23 23:56:10  belopolsky         link    issue4610 superseder
2012-01-16 02:19:31  python-dev         set     messages: + msg151314
2012-01-16 00:24:46  Jim.Jewett         set     messages: + msg151311
2012-01-12 17:17:24  Jim.Jewett         set     nosy: + Jim.Jewett
                                                messages: + msg151141
2012-01-11 23:23:51  benjamin.peterson  set     status: open -> closed
                                                resolution: fixed
2012-01-11 23:17:46  python-dev         set     nosy: + python-dev
                                                messages: + msg151098
2012-01-11 20:20:09  benjamin.peterson  set     files: + full-casemapping.patch
                                                messages: + msg151088
2012-01-11 03:38:21  benjamin.peterson  set     files: + full-casemapping.patch
2012-01-10 14:03:39  benjamin.peterson  set     messages: + msg151016
2012-01-10 03:49:31  benjamin.peterson  set     files: + full-casemapping.patch
                                                messages: + msg150998
2012-01-08 03:54:29  benjamin.peterson  set     files: + full-casemapping.patch
                                                nosy: + benjamin.peterson
                                                messages: + msg150844
                                                keywords: + patch
2011-08-29 14:16:04  tchrist            set     messages: + msg143148
2011-08-29 13:21:06  pitrou             set     messages: + msg143146
2011-08-29 13:13:57  Jean-Michel.Fauth  set     nosy: + Jean-Michel.Fauth
                                                messages: + msg143145
2011-08-28 21:01:49  tchrist            set     messages: + msg143124
2011-08-28 18:56:35  mrabarnett         set     messages: + msg143119
2011-08-28 17:27:28  gvanrossum         set     messages: + msg143110
2011-08-28 05:54:35  ezio.melotti       set     files: + casing-results.txt
                                                messages: + msg143089
2011-08-27 20:04:56  pitrou             set     nosy: + pitrou
                                                messages: + msg143086
2011-08-27 19:29:28  mrabarnett         set     messages: + msg143085
2011-08-27 19:17:30  tchrist            set     messages: + msg143084
2011-08-27 16:15:33  gvanrossum         set     messages: + msg143083
2011-08-27 14:48:38  tchrist            set     messages: + msg143072
2011-08-26 23:55:58  tchrist            set     files: + casing-tests.py
                                                messages: + msg143052
2011-08-26 23:36:17  tchrist            set     messages: + msg143051
2011-08-26 21:11:23  gvanrossum         set     nosy: + gvanrossum
                                                messages: + msg143036
2011-08-13 00:58:12  mrabarnett         set     nosy: + mrabarnett
2011-08-12 18:05:57  Arfrever           set     nosy: + Arfrever
2011-08-12 17:30:15  eric.araujo        set     components: + Interpreter Core, Unicode, - Library (Lib)
                                                versions: + Python 3.3, - Python 3.2
2011-08-12 00:17:23  ezio.melotti       set     nosy: + belopolsky, ezio.melotti
2011-08-11 21:39:44  tchrist            create