Message 143051 - Python tracker

Author	tchrist
Recipients	Arfrever, belopolsky, ezio.melotti, gvanrossum, mrabarnett, tchrist
Date	2011-08-26.23:36:16
Message-id	<19158.1314401763@chthon>
In-reply-to	<1314393084.09.0.900456558475.issue12736@psf.upfronthosting.co.za>

Content

Guido van Rossum <report@bugs.python.org> wrote
   on Fri, 26 Aug 2011 21:11:24 -0000: 

> Guido van Rossum <guido@python.org> added the comment:

> I presume this applies to builtin str methods like .lower(), right?  I
> think it is a good thing to do for Python 3.3.

Yes, the full casemaps are for upper, title, and lowercase.  There are
also a full casefold and a Turkic casefold (which is also full), but you
don't have a casefold function so I guess that doesn't matter.
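
For illustration, a minimal sketch of what "full" means here: a full
casemap may turn one code point into several.  The sketch assumes an
interpreter whose str methods already apply the full mappings, which is
an assumption and not true of the builds under discussion; the helper
name full_upper_lengths is hypothetical.

```python
# Sketch only: assumes str.upper() applies the full (one-to-many) case
# mappings from SpecialCasing.txt. That is an assumption about the interpreter.
samples = {
    "\u00df": "LATIN SMALL LETTER SHARP S",
    "\ufb01": "LATIN SMALL LIGATURE FI",
    "\u0149": "LATIN SMALL LETTER N PRECEDED BY APOSTROPHE",
}


def full_upper_lengths(chars):
    """Map each character to (uppercased string, length of that string)."""
    return {ch: (ch.upper(), len(ch.upper())) for ch in chars}


for ch, (up, n) in full_upper_lengths(samples).items():
    # A length greater than 1 means the mapping is not a simple 1-1 casemap.
    print(f"U+{ord(ch):04X} {samples[ch]}: upper={up!r} length={n}")
```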

> We'd need to define what should happen in edge cases, e.g. when
> (against all odds) a string happens to contain a lone surrogate or
> some other code point or sequence of code points that the Unicode
> standard considers illegal.  I think it should not fail but just leave
> those code points alone.

Well, it's a funny thing.  There are properties given for all
Unicode code points, even noncharacter code points.  This
includes the casing properties, oddly enough.

From UnicodeData.txt, which has a few surrogate entries; notice
no casing is given:

    D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
    DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
    DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
    DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
    DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
    DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;

And in SpecialCasing.txt, which does not have surrogates but does have
a default clause:

    # This file is a supplement to the UnicodeData file.
    # It contains additional information about the casing of Unicode characters.
    # (For compatibility, the UnicodeData.txt file only contains case mappings for
    # characters where they are 1-1, and independent of context and language.
    # For more information, see the discussion of Case Mappings in the Unicode Standard.
    #
    # All code points not listed in this file that do not have a simple case mappings
    # in UnicodeData.txt map to themselves.

And in CaseFolding.txt, which also does not have surrogates but again does 
have a default clause:

    # The data supports both implementations that require simple case foldings
    # (where string lengths don't change), and implementations that allow full case folding
    # (where string lengths may grow). Note that where they can be supported, the
    # full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
    #
    # All code points not listed in this file map to themselves.
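
The difference between simple and full folding, and the default clause,
can be sketched with a hand-copied excerpt in CaseFolding.txt format.
This is only an illustration: the excerpt covers five code points, and
the names load_folds, fold, SIMPLE and FULL are hypothetical.

```python
# Tiny excerpt in CaseFolding.txt format: code; status; mapping; # name
# Status C = common, F = full, S = simple.
EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0045; C; 0065; # LATIN CAPITAL LETTER E
004D; C; 006D; # LATIN CAPITAL LETTER M
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
"""


def load_folds(text, statuses):
    """Build {char: folded string} from entries whose status is in `statuses`."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [f.strip() for f in data.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


def fold(s, table):
    # Default clause: any code point not listed maps to itself.
    return "".join(table.get(ch, ch) for ch in s)


SIMPLE = load_folds(EXCERPT, {"C", "S"})  # string length never changes
FULL = load_folds(EXCERPT, {"C", "F"})    # string length may grow

print(fold("MASSE", FULL) == fold("Maße", FULL))      # full folding matches
print(fold("MASSE", SIMPLE) == fold("Maße", SIMPLE))  # simple folding does not
```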

Taken all together, it follows that the surrogates have case{map,fold}s
back to themselves, since they have no case{map,fold}s listed.

It's ok to have arbitrary code points in memory, including surrogates and
the 66 noncharacters.  It just isn't legal to have them in a UTF stream
for "open interchange", whatever that means.  
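
A small sketch of both points, assuming Python 3 strings (which can hold
lone surrogates in memory) and the strict UTF-8 codec.  The helper names
are hypothetical; the 66 noncharacters are U+FDD0..U+FDEF plus the last
two code points of each of the 17 planes.

```python
import unicodedata


def noncharacters():
    """Yield the noncharacter code points."""
    yield from range(0xFDD0, 0xFDF0)      # 32 code points in the BMP
    for plane in range(17):               # plus 2 per plane, 34 in total
        yield (plane << 16) | 0xFFFE
        yield (plane << 16) | 0xFFFF


def maps_to_itself(ch):
    """True if lower, upper and title casing all leave ch unchanged."""
    return ch.lower() == ch and ch.upper() == ch and ch.title() == ch


def can_encode_utf8(s):
    """True if the strict UTF-8 codec accepts s for output."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = "\ud800"  # a lone high surrogate, fine as an in-memory code point
print(unicodedata.category(lone), maps_to_itself(lone), can_encode_utf8(lone))
print(len(list(noncharacters())),
      all(maps_to_itself(chr(cp)) for cp in noncharacters()))
```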

> Does this require us to import more data files from the Unicode
> standard?  By itself that doesn't scare me.

One way or the other, yes, notably the SpecialCasing file for
casemapping and the CaseFolding file for casefolding (which you
should do anyway to fix re.I).  But you can and should process the
new files into some tighter format optimized for your own lookups.
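
As a rough idea of what "process into a tighter format" could look like,
here is a sketch that compiles SpecialCasing.txt-style lines into a plain
dict.  It is not how CPython does or should do it; the function names are
hypothetical, the excerpt is three lines, and conditional entries (those
with a context or language condition) are simply skipped.

```python
# Excerpt in SpecialCasing.txt format.
EXCERPT = """\
# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""


def to_str(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def compile_special_casing(text):
    """Return {char: (lower, title, upper)} for unconditional entries only."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.rstrip(";").split(";")]
        if len(fields) > 4:
            continue  # conditional mapping; needs context, skipped in this sketch
        code, lower, title, upper = fields
        table[to_str(code)] = (to_str(lower), to_str(title), to_str(upper))
    return table


TABLE = compile_special_casing(EXCERPT)
for ch, (lower, title, upper) in TABLE.items():
    print(f"U+{ord(ch):04X}: lower={lower!r} title={title!r} upper={upper!r}")
```

Code points absent from the table would fall back to the simple mappings
from UnicodeData, and failing that to themselves.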

Oddly, Java doesn't provide String methods that do full casing for
titlecase, even though they do so for lowercase and uppercase.  For
titlecase they only expose the simple casemaps via the Character class,
which are the ones from UnicodeData.  They recognize that this is a flaw,
but it was too late to fix it for Java 7.

> Would this also affect .islower() and friends?

Well, it shouldn't, but .islower() and friends are already mistaken.
They seem to be checking for GC=Ll and such, but they need to be
checking the Unicode binary property Lowercase and such.  Watch:

    test 37 for string Ⅷ
    wanted <ⅷ> to be lowercase of <Ⅷ> but python disagrees
    wanted <Ⅷ> to be titlecase of <Ⅷ> but python disagrees
    wanted <Ⅷ> to be uppercase of <Ⅷ> but python disagrees
    test 37 failed 3 subtests

    test 39 for string Ⓚ
    wanted <ⓚ> to be lowercase of <Ⓚ> but python disagrees
    wanted <Ⓚ> to be titlecase of <Ⓚ> but python disagrees
    wanted <Ⓚ> to be uppercase of <Ⓚ> but python disagrees
    test 39 failed 3 subtests

That's because the Roman numerals are GC=Nl but still have
case and change case.  Similarly for the circled letters which
are GC=So but have case and change case.  Plus there's U+0345,
the iota subscript, which is GC=Mn but has case and changes case.
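
The mismatch between general category and casing can be checked with a
short sketch.  The categories come from unicodedata; the "changes_case"
column assumes str.lower() and str.upper() apply the Unicode casemaps to
every code point regardless of category, which is an assumption about the
interpreter.  The helper name case_report is hypothetical.

```python
import unicodedata


def case_report(ch):
    """Report the general category of ch and whether casing changes it."""
    lower, upper = ch.lower(), ch.upper()
    return {
        "category": unicodedata.category(ch),
        "changes_case": lower != ch or upper != ch,
    }


for ch in ("\u2167",   # ROMAN NUMERAL EIGHT
           "\u24c0",   # CIRCLED LATIN CAPITAL LETTER K
           "\u0345"):  # COMBINING GREEK YPOGEGRAMMENI (iota subscript)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {case_report(ch)}")
```

A test keyed on GC=Lu, Ll, or Lt alone would report all three as caseless.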

I don't remember whether I've sent in my full test suite or not.  
If I haven't yet, I should attach it to the bug report.

--tom

History
Date	User	Action	Args
2011-08-26 23:36:22	tchrist	set	recipients: + tchrist, gvanrossum, belopolsky, ezio.melotti, mrabarnett, Arfrever
2011-08-26 23:36:17	tchrist	link	issue12736 messages
2011-08-26 23:36:17	tchrist	create