classification
Title: Python's casemapping functions are incorrect for non-BMP chars due to narrow/wide build issues
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Make the str.is* methods work with non-BMP chars on narrow builds
View: 9200
Assigned To: Nosy List: Arfrever, ezio.melotti, lemburg, loewis, mrabarnett, tchrist, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2011-08-11 19:10 by tchrist, last changed 2011-08-16 21:45 by loewis. This issue is now closed.

Files
File name Uploaded Description Edit
casemaps.python tchrist, 2011-08-11 19:10 demo of python casemapping functions being unreliable due to wide/narrow issues
casemaps.py terry.reedy, 2011-08-12 22:09 Revision of casemaps that runs on more machines
Messages (11)
msg141918 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-11 19:10
You cannot use Python's casemapping functions on Unicode data because they fail on narrow builds.  This makes it impossible to write portable code in Python that can cope with full Unicode.

I've tried several times to submit this bug, but the file selection widget blows up. I believe it was an Opera bug because I had a write lock on the file.  One more time.
msg141991 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-12 22:07
I agree that better masking of the narrow-wide build difference would be good as long as it does not severely impact normal performance. Revision of the test file (see below) shows that the 'bug' is that the .upper, .lower, and .title methods leave the tested non-BMP chars unchanged on narrow builds.

I am not sure if this is true of all upper-plane chars and whether this is by design or simply a matter of not catching up to an ever-expanding database. Hence, I am also not sure whether this is a bug report or feature request.

I made several changes in casemaps.python so I could run it and get better information:
* Rename to casemaps.py. Many of us use software that recognizes and special-cases the standard .py extension. All Python code files uploaded should use this.
* Remove the unused third-party regex import, which stops the test for most people.
* Remove the unnecessary PYTHONIOENCODING exit, which stops the test on Windows and possibly elsewhere. The file seems to run fine without it.
* Rewrite the test data using \Uxxxxxxxx (8 hex chars) escapes for the non-BMP chars. That will be required for new tests for test_unicode.py. (I believe the test suite avoids literal non-ASCII chars unless really necessary.) Besides which, all I see (on Windows) in Firefox is things like
"𐐼𐐯𐑅𐐨𐑉𐐯𐐻". IDLE just has empty boxes.
* Factor the tests so the output is easier to rewrite.
* Rewrite the test output to make comparisons easier. Writing the 'wrong' answer first, directly under the original, made it easy to see that the 'wrong' answer *is* the original, unchanged.

The revised version (to be uploaded separately) has the same 6 failures.
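
The check described in this message can be sketched in a few lines. This is only an illustration, not the attached casemaps.py: the names LOWER, TITLE, UPPER and check_casemaps are made up here, the test word is "Deseret" spelled in the Deseret script with \U escapes, and Python 3 string literals are assumed. On a narrow build (sys.maxunicode == 0xFFFF) the methods are reported above to return the input unchanged, so every check would fail; on a wide build they should all pass.

```python
import sys

# "Deseret" spelled in the Deseret script, in three casings (illustrative data).
LOWER = "\U0001043C\U0001042F\U00010445\U00010428\U00010449\U0001042F\U0001043B"
TITLE = "\U00010414\U0001042F\U00010445\U00010428\U00010449\U0001042F\U0001043B"
UPPER = "\U00010414\U00010407\U0001041D\U00010400\U00010421\U00010407\U00010413"


def check_casemaps():
    """Return (description, passed) pairs for the three casemapping methods."""
    return [
        ("lower -> upper", LOWER.upper() == UPPER),
        ("upper -> lower", UPPER.lower() == LOWER),
        ("lower -> title", LOWER.title() == TITLE),
    ]


if __name__ == "__main__":
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF on wide ones.
    print("build:", "narrow" if sys.maxunicode == 0xFFFF else "wide")
    for description, passed in check_casemaps():
        print(description, "ok" if passed else "FAILED")
```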
msg142115 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 10:59
This is actually a duplicate of #9200.

@Terry

> Besides which, all I see (on Windows) in Firefox is things like
> "𐐼𐐯𐑅𐐨𐑉𐐯𐐻".

Encoding problem.  Firefox thinks this is some iso-8859-*.  You can fix this by selecting 'Unicode (UTF-8)' from "View -> Character Encoding".
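
A rough sketch of how this kind of garbling arises, assuming Latin-1 as a stand-in for "some iso-8859-*" (the function names misdecode and repair are hypothetical): the UTF-8 bytes of one non-BMP char are read back as four single-byte chars, and re-encoding with the same wrong charset recovers the original.

```python
def misdecode(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode those bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="latin-1"):
    """Undo misdecode(): recover the original bytes and decode them as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    dee = "\U0001043C"  # DESERET SMALL LETTER DEE, UTF-8 bytes F0 90 90 BC
    garbled = misdecode(dee)
    print([hex(ord(ch)) for ch in garbled])
    print(repair(garbled) == dee)
```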

> IDLE just has empty boxes.

This is most likely because it doesn't use a font able to display those chars.
msg142135 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-15 17:47
My Firefox is already set at utf-8. More likely a font limitation. I will look again after installing one of the fonts Tom suggested.

The pair of boxes on IDLE are for the surrogate pairs. Perhaps tk does not even try to display a single char. I will experiment more when I have a more complete font.
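
For reference, here is a small sketch of how one non-BMP code point becomes the two UTF-16 code units that a narrow build stores, which would explain two boxes per char. The helper name to_surrogate_pair is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10414)  # DESERET CAPITAL LETTER DEE
    print("U+10414 -> %04X %04X" % (high, low))
```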
msg142137 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 17:51
> My Firefox is already set at utf-8.

Every page can specify the encoding it uses (in HTTP headers, <meta> tag and/or xml prologue).  If none of these are specified, afaik Firefox tries to detect the encoding, and sometimes fails.  What encoding does it show for you in the menu when you open the patch?
msg142138 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 17:55
>Terry J. Reedy <tjreedy@udel.edu> added the comment:

> My Firefox is already set at utf-8. More likely a font limitation. I
> will look again after installing one of the fonts Tom suggested.

Symbola is best for exotic glyphs, especially astral ones.

Alfios just looks nice as a normal default roman.

--tom
msg142139 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-15 18:02
You are right, FF switched on me without notice. Bad FF.
Thank you! What I now see makes much more sense.
[ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐡𐐇𐐓"  ],
and I now know to check on other pages (although Tom's Unicode talk slides still have boxes even in utf-8, so that must be a font lack).
msg142140 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 18:40
>Terry J. Reedy <tjreedy@udel.edu> added the comment:

> You are right, FF switched on me without notice. Bad FF. Thank you! What
> I now see makes much more sense.

>    [ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐡𐐇𐐓"  ],

> and I now know to check on other pages (although Tom's Unicode talk
> slides still have boxes even in utf-8, so that must be a font lack).

Do you have Symbola installed?  Here's Appendix I on Fonts for things that
should look right for the presentation to look right.  

    * I recommend two free fonts from George Douros at users.teilar.gr/~g1951d/ known to
      work with this presentation: his Alfios font for regular text, and his Symbola font
      for fancy emoji. If any of these don’t look right to you, you probably need to
      supplement your system fonts:

            Ligatures: fi ffi ff ffl fl β ẞ ſt st
            Math letters: 𝒜 𝒟 𝔅 𝔎 𝔼 𝔽
            Gothic & Deseret: 𐌸𐌼𐌽𐍂, 𐐔𐐯𐑅𐐨𐑉𐐯𐐻
            Symbols: ✔ ✅ 🐪 📖 🛂 🐍
            Emoticons: 😇 😈 😉 😨 😭 😱
            Upside‐down: ¡pɐəɥ ɹnoʎ uo ƃuᴉpuɐʇs ʎq sᴉɥʇ pɐəᴚ
            Combining characters: ◌̂,◌̃,◌⃞,◌̲,◌︀,◌̵,◌̷

    * The last line with combining characters is especially hard to get to look right. 
      You may find that the shareware font Everson Mono works when all else fails.

You do need Unicode 5.1 support for the LATIN CAPITAL LETTER SHARP S, and
you need Unicode 6.0 support for most of the emoji (I think Snow Leopard
has colorized versions of these).  The Ligature line above looks good in Alfios.

It turns out the issue may not always be the font used with combining chars so much as
whether and how well your browser supports true, dynamically generated combining characters,
or whether it runs stuff through NFC and looks for substitution glyphs.  I am not a GUI
person, so am mostly just guessing.

But this I find interesting:  If you look at slide 33 of my first talk or slide 5 of my
second talk, which are duplicates entitled Canonical Conundra, the second column which is
labelled Glyphs explicitly uses Times New Roman because of this issue.  Even so you can
tell it is doing the NFC trick, because lines 1+2 have the same NFC of \x{F5} or õ, as do
3+4+5 with \x{22D} or ȭ, and 6+7 with ō̃.

The glyphs from the first group are both identical, and so are all three of those in the
second group, as both the first two groups have a single precomposed character available
for their NFC.  In contrast, there is no single precomposed glyph available for 6+7, and
you can tell that it's stacking it on the fly using slightly less tight grouping rules
than the font has in the precomposed versions above it.
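
The NFC behaviour described above can be reproduced with Python's unicodedata module. This is a sketch only: the input strings are my own picks and not necessarily the exact rows on the slide, and nfc_code_points is a hypothetical helper. It shows two inputs collapsing to a single precomposed char and one that cannot.

```python
import unicodedata

TILDE = "\u0303"   # COMBINING TILDE
MACRON = "\u0304"  # COMBINING MACRON


def nfc_code_points(text):
    """Return the NFC form of text as a list of 'U+XXXX' strings."""
    normalized = unicodedata.normalize("NFC", text)
    return ["U+%04X" % ord(ch) for ch in normalized]


if __name__ == "__main__":
    for sample in ("o" + TILDE, "o" + TILDE + MACRON, "o" + MACRON + TILDE):
        print(nfc_code_points(sample))
```

The first two samples have a precomposed form (\x{F5} and \x{22D}); the third keeps a trailing combining tilde because no single precomposed char exists for it.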

I use Safari, but I am told Firefox looks ok, too.  Opera is my normal browser but it
does the copout I just described on combining chars without ever being able to
dynamically stack them if the copout fails, so I can't use it for this presentation.

--tom

  $ uniprops -a 'LATIN CAPITAL LETTER SHARP S' 'DESERET CAPITAL LETTER DEE' 'GOTHIC LETTER MANNA' 'SNAKE' 'FACE SCREAMING IN FEAR'

    U+1E9E <ẞ> \N{LATIN CAPITAL LETTER SHARP S}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
        All Any Alnum Alpha Alphabetic Assigned InLatinExtendedAdditional Cased Cased_Letter LC Changes_When_Casefolded CWCF
           Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base
           Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_Extended_Additional Uppercase_Letter Print Upper
           Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper
           X_POSIX_Word
        Age=5.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_Extended_Additional Canonical_Combining_Class=0
           Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None
           East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
           Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining
           JT=U Joining_Type=U Script=Latin Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN
           NV=NaN Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Latn Script=Latn Sentence_Break=UP
           Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

    U+10414 <𐐔> \N{DESERET CAPITAL LETTER DEE}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
        All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC Changes_When_Casefolded CWCF
           Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base
           Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word
           XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
        Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret Canonical_Combining_Class=0
           Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None
           Script=Deseret East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
           Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
           Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
           Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Dsrt Script=Dsrt
           Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

    U+1033C <𐌼> \N{GOTHIC LETTER MANNA}
        \w \pL \p{L_} \p{Lo}
        All Any Alnum Alpha Alphabetic Assigned InGothic Gothic Is_Gothic L Lo Goth Gr_Base Grapheme_Base Graph GrBase
           ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum
           X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
        Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Gothic Canonical_Combining_Class=0
           Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None
           East_Asian_Width=Neutral Script=Gothic Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
           Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
           Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
           Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 Script=Goth SC=Goth
           Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE Word_Break=LE _X_Begin

    U+1F40D <🐍> \N{SNAKE}
        \pS \p{So}
        All Any Assigned InMiscellaneousSymbolsAnd_Pictographs Common Zyyy So S Gr_Base Grapheme_Base Graph GrBase
           Miscellaneous_Symbols_And_Pictographs Other_Symbol Print Symbol X_POSIX_Graph X_POSIX_Print
        Age=6.0 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Miscellaneous_Symbols_And_Pictographs
           Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common
           Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
           Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
           Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
           Numeric_Value=NaN NV=NaN Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX
           Word_Break=Other WB=XX Word_Break=XX _X_Begin

    U+1F631 <😱> \N{FACE SCREAMING IN FEAR}
        \pS \p{So}
        All Any Assigned InEmoticons Common Zyyy Emoticons So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol Print Symbol
           X_POSIX_Graph X_POSIX_Print
        Age=6.0 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Emoticons Canonical_Combining_Class=0
           Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None
           DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
           Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining
           JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
           Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other WB=XX
           Word_Break=XX _X_Begin
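
A few of the properties in the uniprops output above can also be read from Python's own unicodedata module. This is a minimal sketch, assuming a Python 3 whose Unicode database is at least version 6.0; describe is a made-up helper, and it covers only the code point, general category, and bidi class.

```python
import unicodedata

NAMES = [
    "LATIN CAPITAL LETTER SHARP S",
    "DESERET CAPITAL LETTER DEE",
    "GOTHIC LETTER MANNA",
    "SNAKE",
    "FACE SCREAMING IN FEAR",
]


def describe(name):
    """Return (code point, general category, bidi class) for a character name."""
    ch = unicodedata.lookup(name)
    return ord(ch), unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    for name in NAMES:
        code_point, category, bidi = describe(name)
        print("U+%04X %s %s %s" % (code_point, category, bidi, name))
```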
msg142143 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-08-15 19:16
Adding Symbola filled in the symbols and emoticons lines.
The gothic chars are still missing even with Alfios.
msg142144 - (view) Author: Tom Christiansen (tchrist) Date: 2011-08-15 19:20
>Terry J. Reedy <tjreedy@udel.edu> added the comment:

>Adding Symbola filled in the symbols and emoticons lines.
>The gothic chars are still missing even with Alfios.

That's too bad, as the Gothic paternoster is kinda cute. :)

Hm, I wonder where I got them from.  I think there must 
be a way to figure that out using the Mac FontBook program,
but I don't know what it is other than pasting them in
the sample screen and scrolling through the fonts to see
how those get rendered.

--tom
msg142228 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-08-16 21:45
Python's casemapping functions are not at all untrustworthy or unreliable. They are entirely deterministic - just limited to the BMP in some builds (in a way that has already been discussed). Changing the title of the issue.
History
Date  User  Action  Args
2011-08-16 21:45:48  loewis  set  messages: + msg142228
    title: Python's casemapping functions are untrustworthy due to narrow/wide build issues -> Python's casemapping functions are incorrect for non-BMP chars due to narrow/wide build issues
2011-08-15 19:20:59  tchrist  set  messages: + msg142144
2011-08-15 19:16:51  terry.reedy  set  messages: + msg142143
2011-08-15 18:40:53  tchrist  set  messages: + msg142140
2011-08-15 18:02:07  terry.reedy  set  messages: + msg142139
2011-08-15 17:55:07  tchrist  set  messages: + msg142138
2011-08-15 17:51:42  ezio.melotti  set  messages: + msg142137
2011-08-15 17:47:01  terry.reedy  set  messages: + msg142135
2011-08-15 10:59:05  ezio.melotti  set  status: open -> closed
    resolution: duplicate
    messages: + msg142115
    superseder: Make the str.is* methods work with non-BMP chars on narrow builds
    stage: needs patch -> resolved
2011-08-13 00:57:12  mrabarnett  set  nosy: + mrabarnett
2011-08-12 22:09:52  terry.reedy  set  files: + casemaps.py
2011-08-12 22:07:32  terry.reedy  set  versions: + Python 3.2, Python 3.3
    nosy: + terry.reedy, lemburg, vstinner, loewis
    messages: + msg141991
    stage: needs patch
2011-08-12 18:02:56  Arfrever  set  nosy: + Arfrever
2011-08-11 22:53:10  ezio.melotti  set  nosy: + ezio.melotti
2011-08-11 20:34:37  skrah  set  messages: - msg141927
2011-08-11 20:30:47  sdaoden  set  nosy: - sdaoden
2011-08-11 20:30:22  sdaoden  set  nosy: + sdaoden
    messages: + msg141927
2011-08-11 19:10:04  tchrist  create