Issue 12730: Python's casemapping functions are incorrect for non-BMP chars due to narrow/wide build issues

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56939

classification

Title:	Python's casemapping functions are incorrect for non-BMP chars due to narrow/wide build issues
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.2, Python 3.3, Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Make the str.is* methods work with non-BMP chars on narrow builds (issue 9200)
Assigned To:		Nosy List:	Arfrever, ezio.melotti, lemburg, loewis, mrabarnett, tchrist, terry.reedy, vstinner
Priority:	normal	Keywords:

Created on 2011-08-11 19:10 by tchrist, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
casemaps.python	tchrist, 2011-08-11 19:10	demo of python casemapping functions being unreliable due to wide/narrow issues
casemaps.py	terry.reedy, 2011-08-12 22:09	Revision of casemaps that runs on more machines

Messages (11)
msg141918 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-11 19:10
You cannot use Python's casemapping functions on Unicode data because they fail on narrow builds. This makes it impossible to write portable code in Python that can cope with full Unicode. I've tried several times to submit this bug, but the file selection widget blows up. I believe it was an Opera bug because I had a write lock on the file. One more time.
msg141991 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-08-12 22:07
I agree that better masking of the narrow/wide build difference would be good, as long as it does not severely impact normal performance. A revision of the test file (see below) shows that the 'bug' is that the .upper, .lower, and .title methods leave the tested non-BMP chars unchanged on narrow builds. I am not sure whether this is true of all upper-plane chars, or whether it is by design or simply a matter of not catching up to an ever-expanding database. Hence, I am also not sure whether this is a bug report or a feature request.

I made several changes in casemaps.python so I could run it and get better information:

* Rename to casemaps.py. Many of us use software that recognizes and special-cases the standard .py extension. All Python code files uploaded should use this.
* Remove the unused third-party regex import, which stops the test for most people.
* Remove the unnecessary PYTHONIOENCODING exit, which stops the test on Windows and possibly elsewhere. The file seems to run fine without it.
* Rewrite the test data using \Uxxxxxxxx (8 hex chars) escapes for the non-BMP chars. That will be required for new tests for test_unicode.py. (I believe the test suite avoids literal non-ASCII chars unless really necessary.) Besides which, all I see (on Windows) in Firefox is things like "ð¼ð¯ð‘…ð¨ð‘‰ð¯ð»". IDLE just has empty boxes.
* Factor the tests so the output is easier to rewrite.
* Rewrite the test output to make comparisons easier. Writing the 'wrong' answer first, directly under the original, made it easy to see that the 'wrong' answer is the original, unchanged.

The revised version (to be uploaded separately) has the same 6 failures.
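The failure mode described in this message (non-BMP characters coming back unchanged) can be approximated on current Pythons. The following is only a rough sketch, not CPython's narrow-build code: the helper `simulate_narrow_casemap` is hypothetical, and it assumes that a narrow build effectively applied the case mapping to one UTF-16 code unit at a time, so that surrogate halves, which have no case mapping, passed through untouched. It also assumes Python 3.3 or later, where str methods work on whole code points.

```python
# Deseret test word written with \U escapes, as suggested for test_unicode.py.
LOWER = "\U0001043C\U0001042F\U00010445\U00010428\U00010449\U0001042F\U0001043B"
UPPER = "\U00010414\U00010407\U0001041D\U00010400\U00010421\U00010407\U00010413"


def simulate_narrow_casemap(text, method):
    """Apply str.upper or str.lower one UTF-16 code unit at a time.

    Hypothetical helper: it mimics the assumed narrow-build behaviour,
    in which surrogate code units are left unchanged.
    """
    if method not in ("upper", "lower"):
        raise ValueError("method must be 'upper' or 'lower'")
    data = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(data), 2):
        unit = data[i:i + 2]
        value = int.from_bytes(unit, "big")
        if 0xD800 <= value <= 0xDFFF:
            out += unit  # surrogate half: no case mapping applies
        else:
            out += getattr(chr(value), method)().encode("utf-16-be")
    return out.decode("utf-16-be")


if __name__ == "__main__":
    # Code-point-aware mapping converts the word; the simulation does not.
    print(LOWER.upper() == UPPER)
    print(simulate_narrow_casemap(LOWER, "upper") == LOWER)
```

Only upper and lower are simulated here, because titlecasing depends on word context and cannot be faithfully reproduced one code unit at a time.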
msg142115 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-15 10:59
This is actually a duplicate of #9200.

@Terry
> Besides which, all I see (on Windows) in Firefox is things like
> "ð¼ð¯ð‘…ð¨ð‘‰ð¯ð»".

Encoding problem. Firefox thinks this is some iso-8859-*. You can fix this by selecting 'Unicode (UTF-8)' from "View -> Character Encoding".

> IDLE just has empty boxes.

This is most likely because it doesn't use a font able to display those chars.
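The encoding mix-up diagnosed here is easy to reproduce. The sketch below assumes Latin-1 as a stand-in for whatever single-byte encoding the browser actually guessed (the thread only says "some iso-8859-*"); the variable names are illustrative.

```python
# DESERET SMALL LETTER DEE, the first character of the test word.
deseret_dee = "\U0001043C"

# The page is served as UTF-8: one character becomes four bytes.
utf8_bytes = deseret_dee.encode("utf-8")

# Assumed wrong guess: reading those bytes as Latin-1 yields four characters,
# starting with "ð", instead of one Deseret letter.
misread = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no bytes were lost along the way.
recovered = misread.encode("latin-1").decode("utf-8")

if __name__ == "__main__":
    print(utf8_bytes)
    print(len(misread), ascii(misread))
    print(recovered == deseret_dee)
```

Every astral character in this range starts with the byte 0xF0 in UTF-8, which is why each misread character in the garbled string begins with "ð".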
msg142135 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-08-15 17:47
My Firefox is already set at utf-8. More likely a font limitation. I will look again after installing one of the fonts Tom suggested. The pair of boxes on IDLE are for the surrogate pairs. Perhaps tk does not even try to display a single char. I will experiment more when I have a more complete font.
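For reference, the two boxes per character correspond to the two UTF-16 code units of a surrogate pair. A minimal sketch of the standard UTF-16 arithmetic follows; the function name `surrogate_pair` is illustrative.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = surrogate_pair(0x10414)  # DESERET CAPITAL LETTER DEE
    print("U+%04X U+%04X" % (high, low))
```

On a narrow build, len() of a one-character Deseret string would therefore report 2, one for each surrogate.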
msg142137 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-08-15 17:51
> My Firefox is already set at utf-8.

Every page can specify the encoding it uses (in HTTP headers, <meta> tag and/or xml prologue). If none of these are specified, afaik Firefox tries to detect the encoding, and sometimes fails. What encoding does it show for you in the menu when you open the patch?
msg142138 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-15 17:55
> Terry J. Reedy <tjreedy@udel.edu> added the comment:

> My Firefox is already set at utf-8. More likely a font limitation. I
> will look again after installing one of the fonts Tom suggested.

Symbola is best for exotic glyphs, especially astral ones. Alfios just looks nice as a normal default roman.

--tom
msg142139 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-08-15 18:02
You are right, FF switched on me without notice. Bad FF. Thank you! What I now see makes much more sense. [ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐡𐐇𐐓" ], and I now know to check on other pages (although Tom's Unicode talk slides still have boxes even in utf-8, so that must be a font lack).
msg142140 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-15 18:40
> Terry J. Reedy <tjreedy@udel.edu> added the comment:

> You are right, FF switched on me without notice. Bad FF. Thank you! What
> I now see makes much more sense.
> [ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐡𐐇𐐓" ],
> and I now know to check on other pages (although Tom's Unicode talk
> slides still have boxes even in utf-8, so that must be a font lack).

Do you have Symbola installed? Here's Appendix I on Fonts, listing things that should look right for the presentation to look right.

* I recommend two free fonts from George Douros at users.teilar.gr/~g1951d/ known to work with this presentation: his Alfios font for regular text, and his Symbola font for fancy emoji. If any of these don’t look right to you, you probably need to supplement your system fonts:

    Ligatures: ﬁ ﬃ ﬀ ﬄ ﬂ β ẞ ﬅ ﬆ
    Math letters: 𝒜 𝒟 𝔅 𝔎 𝔼 𝔽
    Gothic & Deseret: 𐌸𐌼𐌽𐍂, 𐐔𐐯𐑅𐐨𐑉𐐯𐐻
    Symbols: ✔ ✅ 🐪 📖 🛂 🐍
    Emoticons: 😇 😈 😉 😨 😭 😱
    Upside‐down: ¡pɐəɥ ɹnoʎ uo ƃuᴉpuɐʇs ʎq sᴉɥʇ pɐəᴚ
    Combining characters: ◌̂,◌̃,◌⃞,◌̲,◌︀,◌̵,◌̷

* The last line with combining characters is especially hard to get to look right. You may find that the shareware font Everson Mono works when all else fails. You do need Unicode 5.1 support for the LATIN CAPITAL LETTER SHARP S, and you need Unicode 6.0 support for most of the emoji (I think Snow Leopard has colorized versions of these).

The Ligature line above looks good in Alfios. It turns out that, with combining chars, it may not always be the font used so much as whether, and how well, your browser supports true dynamically generated combining characters, or whether it runs stuff through NFC and looks for substitution glyphs. I am not a GUI person, so am mostly just guessing.

But this I find interesting: if you look at slide 33 of my first talk or slide 5 of my second talk, which are duplicates entitled Canonical Conundra, the second column, which is labelled Glyphs, explicitly uses Times New Roman because of this issue. Even so you can tell it is doing the NFC trick, because lines 1+2 have the same NFC of \x{F5} or õ, as do 3+4+5 with \x{22D} or ȭ, and 6+7 with ō̃. The glyphs from the first group are both identical, and so are all three of those in the second group, as both of the first two groups have a single precomposed character available for their NFC. In contrast, there is no single precomposed glyph available for 6+7, and you can tell that it's stacking it on the fly using slightly less tight grouping rules than the font has in the precomposed versions above it.

I use Safari, but I am told Firefox looks ok, too. Opera is my normal browser, but it does the copout I just described on combining chars without ever being able to dynamically stack them if the copout fails, so I can't use it for this presentation.

--tom

$ uniprops -a 'LATIN CAPITAL LETTER SHARP S' 'DESERET CAPITAL LETTER DEE' 'GOTHIC LETTER MANNA' 'SNAKE' 'FACE SCREAMING IN FEAR'
U+1E9E <ẞ> \N{LATIN CAPITAL LETTER SHARP S}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned InLatinExtendedAdditional Cased Cased_Letter LC
    Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
    Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
    ID_Start IDS Letter L_ Latin Latn Latin_Extended_Additional Uppercase_Letter Print Upper
    Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph
    X_POSIX_Print X_POSIX_Upper X_POSIX_Word
    Age=5.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_Extended_Additional
    Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
    Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral
    Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
    Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup
    Joining_Type=Non_Joining JT=U Joining_Type=U Script=Latin Line_Break=AL Line_Break=Alphabetic
    LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=5.1 IN=5.1 Present_In=5.2
    IN=5.2 Present_In=6.0 IN=6.0 SC=Latn Script=Latn Sentence_Break=UP Sentence_Break=Upper SB=UP
    Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
U+10414 <𐐔> \N{DESERET CAPITAL LETTER DEE}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC
    Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
    Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase
    ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word
    XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print
    X_POSIX_Upper X_POSIX_Word
    Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret
    Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
    Canonical_Combining_Class=NR Decomposition_Type=None DT=None Script=Deseret
    East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
    Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
    Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
    Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
    Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
    Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
    SC=Dsrt Script=Dsrt Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE
    Word_Break=LE _X_Begin
U+1033C <𐌼> \N{GOTHIC LETTER MANNA}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InGothic Gothic Is_Gothic L Lo Goth Gr_Base
    Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word
    XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print
    X_POSIX_Word
    Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Gothic
    Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
    Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral
    Script=Gothic Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
    Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
    Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
    Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
    Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
    Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
    Script=Goth SC=Goth Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
    Word_Break=LE _X_Begin
U+1F40D <🐍> \N{SNAKE}
    \pS \p{So}
    All Any Assigned InMiscellaneousSymbolsAnd_Pictographs Common Zyyy So S Gr_Base
    Grapheme_Base Graph GrBase Miscellaneous_Symbols_And_Pictographs Other_Symbol Print Symbol
    X_POSIX_Graph X_POSIX_Print
    Age=6.0 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON
    Block=Miscellaneous_Symbols_And_Pictographs Canonical_Combining_Class=0
    Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common
    Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
    Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
    Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
    Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
    Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX
    Word_Break=Other WB=XX Word_Break=XX _X_Begin
U+1F631 <😱> \N{FACE SCREAMING IN FEAR}
    \pS \p{So}
    All Any Assigned InEmoticons Common Zyyy Emoticons So S Gr_Base Grapheme_Base Graph GrBase
    Other_Symbol Print Symbol X_POSIX_Graph X_POSIX_Print
    Age=6.0 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Emoticons
    Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
    Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None
    East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX
    Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
    Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
    Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
    Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX
    Word_Break=Other WB=XX Word_Break=XX _X_Begin
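The NFC grouping described in msg142140 can be checked with the standard unicodedata module. This is a small sketch: the exact input sequences on the slides are not reproduced in this thread, so the three decomposed sequences below are assumptions chosen to land in the same three groups (õ, ȭ, and the uncomposable ō plus tilde).

```python
import unicodedata


def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# o + COMBINING TILDE: a precomposed character exists (U+00F5).
o_tilde = nfc("o\u0303")

# o + COMBINING TILDE + COMBINING MACRON: also precomposed (U+022D).
o_tilde_macron = nfc("o\u0303\u0304")

# o + COMBINING MACRON + COMBINING TILDE: only o-macron composes (U+014D);
# the tilde stays a separate combining mark that a renderer must stack.
o_macron_tilde = nfc("o\u0304\u0303")

if __name__ == "__main__":
    for label, value in [("o_tilde", o_tilde),
                         ("o_tilde_macron", o_tilde_macron),
                         ("o_macron_tilde", o_macron_tilde)]:
        print(label, len(value), ascii(value))
```

The third case is the one where a browser cannot fall back on a substitution glyph and has to position the combining mark dynamically.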
msg142143 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-08-15 19:16
Adding Symbola filled in the symbols and emoticons lines. The gothic chars are still missing even with Alfios.
msg142144 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-15 19:20
> Terry J. Reedy <tjreedy@udel.edu> added the comment:

> Adding Symbola filled in the symbols and emoticons lines.
> The gothic chars are still missing even with Alfios.

That's too bad, as the Gothic paternoster is kinda cute. :)

Hm, I wonder where I got them from. I think there must be a way to figure that out using the Mac FontBook program, but I don't know what it is other than pasting them in the sample screen and scrolling through the fonts to see how those get rendered.

--tom
msg142228 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-08-16 21:45
Python's casemapping functions are not at all untrustworthy or unreliable. They are entirely deterministic - just limited to the BMP in some builds (in a way that has already been discussed). Changing the title of the issue.

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56939
2011-08-16 21:45:48	loewis	set	messages: + msg142228 title: Python's casemapping functions are untrustworthy due to narrow/wide build issues -> Python's casemapping functions are incorrect for non-BMP chars due to narrow/wide build issues
2011-08-15 19:20:59	tchrist	set	messages: + msg142144
2011-08-15 19:16:51	terry.reedy	set	messages: + msg142143
2011-08-15 18:40:53	tchrist	set	messages: + msg142140
2011-08-15 18:02:07	terry.reedy	set	messages: + msg142139
2011-08-15 17:55:07	tchrist	set	messages: + msg142138
2011-08-15 17:51:42	ezio.melotti	set	messages: + msg142137
2011-08-15 17:47:01	terry.reedy	set	messages: + msg142135
2011-08-15 10:59:05	ezio.melotti	set	status: open -> closed resolution: duplicate messages: + msg142115 superseder: Make the str.is* methods work with non-BMP chars on narrow builds stage: needs patch -> resolved
2011-08-13 00:57:12	mrabarnett	set	nosy: + mrabarnett
2011-08-12 22:09:52	terry.reedy	set	files: + casemaps.py
2011-08-12 22:07:32	terry.reedy	set	versions: + Python 3.2, Python 3.3 nosy: + terry.reedy, lemburg, vstinner, loewis messages: + msg141991 stage: needs patch
2011-08-12 18:02:56	Arfrever	set	nosy: + Arfrever
2011-08-11 22:53:10	ezio.melotti	set	nosy: + ezio.melotti
2011-08-11 20:34:37	skrah	set	messages: - msg141927
2011-08-11 20:30:47	sdaoden	set	nosy: - sdaoden
2011-08-11 20:30:22	sdaoden	set	nosy: + sdaoden messages: + msg141927
2011-08-11 19:10:04	tchrist	create