Issue12736
Created on 2011-08-11 21:39 by tchrist, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
mux.python | tchrist, 2011-08-11 21:39 | demo program showing all casemaps and casefolds for sample tricky dataset | ||
casing-tests.py | tchrist, 2011-08-26 23:55 | test suite for casemapping functions, case checking functions, and casefolding of patterns, both simple and full | ||
casing-results.txt | ezio.melotti, 2011-08-28 05:54 | results on 3.2/3.3 narrow/wide | ||
full-casemapping.patch | benjamin.peterson, 2012-01-08 03:54 | review | ||
full-casemapping.patch | benjamin.peterson, 2012-01-10 03:49 | review | ||
full-casemapping.patch | benjamin.peterson, 2012-01-11 03:37 | review | ||
full-casemapping.patch | benjamin.peterson, 2012-01-11 20:20 | review | ||
pythonbug.png | Андрей Баксаляр, 2016-03-10 20:21 |
Messages (27) | |||
---|---|---|---|
msg141928 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-11 21:39 | |
Python's casemapping functions only use what Unicode calls simple casemaps. These are only appropriate for functions that operate on single characters alone, not for those that operate on strings. The reason for this is that you get much better results with full casemapping. Java, Ruby, and Perl all do full casemapping for their equivalent functions that do string mapping, and Python should, too. I include a program that has a bunch of mappings and foldings, both simple and full. Yes, it was machine-generated. |
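For readers following along, here is a minimal sketch of the difference being reported. It assumes Python 3.3 or later, where `str.upper()` and `str.title()` apply the full casemaps; the three sample characters are picked for illustration and are not taken from the attached mux.python program.

```python
# Full casemaps may change the length of a string; simple casemaps never do.
# Assumption: Python 3.3+, where the str methods use the full mappings.
samples = {
    "\u00df": "LATIN SMALL LETTER SHARP S",
    "\ufb01": "LATIN SMALL LIGATURE FI",
    "\u0149": "LATIN SMALL LETTER N PRECEDED BY APOSTROPHE",
}

# Collect the full uppercase and titlecase of each sample character.
full_upper = {ch: ch.upper() for ch in samples}
full_title = {ch: ch.title() for ch in samples}

for ch, name in samples.items():
    print(
        f"U+{ord(ch):04X} {name}: "
        f"upper={ascii(full_upper[ch])} ({len(full_upper[ch])} code points), "
        f"title={ascii(full_title[ch])}"
    )
```

Each of these characters has no single-character uppercase, which is why a simple casemap has to leave it unchanged.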
|||
msg143036 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2011-08-26 21:11 | |
I presume this applies to builtin str methods like .lower(), right? I think it is a good thing to do for Python 3.3. We'd need to define what should happen in edge cases, e.g. when (against all odds) a string happens to contain a lone surrogate or some other code point or sequence of code points that the Unicode standard considers illegal. I think it should not fail but just leave those code points alone. Does this require us to import more data files from the Unicode standard? By itself that doesn't scare me. Would this also affect .islower() and friends? |
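The edge case raised here can be checked directly. This is only a sketch, assuming Python 3.3 or later: it builds a string containing a lone high surrogate and confirms that casemapping leaves that code point alone while still mapping its neighbours. The sample string is made up for illustration.

```python
# A lone high surrogate: legal in an in-memory str, not in a UTF stream.
lone = "\ud800"
text = "stra" + lone + "\u00dfe"

upper_text = text.upper()
lower_text = text.lower()

# ascii() keeps the output printable even on terminals that cannot
# encode a lone surrogate.
print(ascii(upper_text))
print(ascii(lower_text))
```

The surrogate survives both calls untouched, which matches the "do not fail, just leave those code points alone" behaviour proposed above.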
|||
msg143051 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-26 23:36 | |
Guido van Rossum <report@bugs.python.org> wrote on Fri, 26 Aug 2011 21:11:24 -0000: > Guido van Rossum <guido@python.org> added the comment: > I presume this applies to builtin str methods like .lower(), right? I > think it is a good thing to do for Python 3.3. Yes, the full casemaps are for upper, title, and lowercase. There is also a full casefold and turkic case fold (which is full), but you don't have a casefold function so I guess that doesn't matter. > We'd need to define what should happen in edge cases, e.g. when > (against all odds) a string happens to contain a lone surrogate or > some other code point or sequence of code points that the Unicode > standard considers illegal. I think it should not fail but just leave > those code points alone. Well, it's a funny thing. There are properties given for all Unicode code points, even noncharacter code points. This includes the casing properties, oddly enough. From UnicodeData.txt, which has a few surrogate entries; notice no casing is given: D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;; DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;; DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;; DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;; DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;; DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;; And in SpecialCasing.txt, which does not have surrogates but does have a default clause: # This file is a supplement to the UnicodeData file. # It contains additional information about the casing of Unicode characters. # (For compatibility, the UnicodeData.txt file only contains case mappings for # characters where they are 1-1, and independent of context and language. # For more information, see the discussion of Case Mappings in the Unicode Standard. # # All code points not listed in this file that do not have a simple case mappings # in UnicodeData.txt map to themselves. 
And in CaseFolding.txt, which also does not have surrogates but again does have a default clause: # The data supports both implementations that require simple case foldings # (where string lengths don't change), and implementations that allow full case folding # (where string lengths may grow). Note that where they can be supported, the # full case foldings are superior: for example, they allow "MASSE" and "Maße" to match. # # All code points not listed in this file map to themselves. Taken all together, it follows that the surrogates have case{map,fold}s back to themselves, since they have no case{map,fold}s listed. It's ok to have arbitrary code points in memory, including surrogates and the 66 noncharacters. It just isn't legal to have them in a UTF stream for "open interchange", whatever that means. > Does this require us to import more data files from the Unicode > standard? By itself that doesn't scare me. One way or the other, yes, notably the SpecialCasing file for casemapping and the CaseFolding file for casefolding (which you should do anyway to fix re.I). But you can and should process the new files into some tighter format optimized for your own lookups. Oddly, Java doesn't provide String methods that do full casing on titlecase, even though they do so on lowercase and uppercase. On titlecase they only expose the simple casemaps via the Character class, which are the ones from UnicodeData. They recognize that this is a flaw, but it was too late to fix it for Java 7. > Would this also affect .islower() and friends? Well, it shouldn't, but .islower() and friends are already mistaken. They seem to be checking for GC=Ll and such, but they need to be checking the Unicode binary property Lowercase and such. 
Watch: test 37 for string Ⅷ wanted <ⅷ> to be lowercase of <Ⅷ> but python disagrees wanted <Ⅷ> to be titlecase of <Ⅷ> but python disagrees wanted <Ⅷ> to be uppercase of <Ⅷ> but python disagrees test 37 failed 3 subtests test 39 for string Ⓚ wanted <ⓚ> to be lowercase of <Ⓚ> but python disagrees wanted <Ⓚ> to be titlecase of <Ⓚ> but python disagrees wanted <Ⓚ> to be uppercase of <Ⓚ> but python disagrees test 39 failed 3 subtests That's because the Roman numerals are GC=Nl but still have case and change case. Similarly for the circled letters which are GC=So but have case and change case. Plus there's U+0345, the iota subscript, which is GC=Mn but has case and changes case. I don't remember whether I've sent in my full test suite or not. If I haven't yet, I should attach it to the bug report. --tom |
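The failing cases above can be inspected with the standard `unicodedata` module. The sketch below assumes Python 3.3 or later, where the casemaps for these code points are applied; it prints each character's general category next to its case mappings, showing that GC alone does not predict whether a code point changes case.

```python
import unicodedata

# Code points from the failing tests: a Roman numeral (Nl), a circled
# letter (So), and the iota subscript (Mn).
code_points = ("\u2167", "\u24c0", "\u0345")

report = {}
for ch in code_points:
    report[ch] = {
        "name": unicodedata.name(ch),
        "gc": unicodedata.category(ch),
        "lower": ch.lower(),
        "upper": ch.upper(),
    }
    print(
        f"U+{ord(ch):04X} {report[ch]['name']} GC={report[ch]['gc']} "
        f"lower={ascii(report[ch]['lower'])} upper={ascii(report[ch]['upper'])}"
    )
```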
|||
msg143052 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-26 23:55 | |
Here’s my casing test suite; I thought I sent it in but the mux file here isn’t the full thing. It does several things, including letting you run it with regex vs re. It also checks for the islower, etc functions. It has both simple and full (and turkic) maps and folds in it, but is configured to only check the simple versions for now. The islower and isupper etc functions seem to be checking the wrong Unicode property. Yes, it has my quaint Unixisms in it, because it needs to run with UTF-8 output, or you can't read what's going on. |
|||
msg143072 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-27 14:48 | |
Guido van Rossum <report@bugs.python.org> wrote on Fri, 26 Aug 2011 21:11:24 -0000: > Would this also affect .islower() and friends? SHORT VERSION: (7 lines) I don't believe so, but the relationship between lower() and islower() is not as clear to me as I would have thought, and more importantly, the code and the documentation for Python's islower() etc currently seem to disagree. For future releases, I recommend fixing the code, but if compatibility is an issue, then perhaps for previous releases still in maintenance mode fixing only the documentation would possibly be good enough--your call. ======================================================================= MEDIUM VERSION: (87 lines) I was initially confused with Python's islower() family because of the way they are defined to operate on full strings. They don't check that everything is lowercase even though they say they do. < http://docs.python.org/py3k/library/stdtypes.html#sequence-types-str-bytes-bytearray-list-tuple-range str.lower() Return a copy of the string with all the cased characters [4] converted to lowercase. str.islower() Return true if all cased characters [4] in the string are lowercase and there is at least one cased character, false otherwise. [4] (1, 2, 3, 4) Cased characters are those with general category property being one of “Lu” (Letter, uppercase), “Ll” (Letter, lowercase), or “Lt” (Letter, titlecase). This is strange in several ways. Of lesser importance is that strings can be considered lowercase even if they don't match ^\p{lowercase}+$ Another is that the result of calling str.lower() may not be .islower(). I'm not sure what these are particularly for, since I myself would just use a regex to get finer-grained control. (I suppose that's because re doesn't give access to the Unicode properties needed that this approach never gained any traction in the Python community.) 
However, the worst of this is that the documentation defines both cased characters and lowercase characters *differently* from how Unicode defines those very same terms. This was quite confusing. Unicode distinguishes Cased code points from Cased_*Letter* code points. Python is using the Cased_Letter property but calling it Cased. Cased is a proper superset of Cased_Letter. From the DerivedCoreProperties file in the Unicode Character Database: # Derived Property: Cased (Cased) # As defined by Unicode Standard Definition D120 # C has the Lowercase or Uppercase property or has a General_Category value of Titlecase_Letter. In the same way, the Lowercase and Uppercase properties are not the same as the Lowercase_*Letter* and Uppercase_*Letter* properties. Rather, the former are respectively proper supersets of the latter. # Derived Property: Lowercase # Generated from: Ll + Other_Lowercase [...] # Derived Property: Uppercase # Generated from: Lu + Other_Uppercase In all these, you almost always want the superset versions, not the restricted subset versions you are using. If it were in the regex engine, the user could select either. Java used to miss all these, too. But in 1.7, they updated their character methods to use the properties that they'd all along said they were using: < http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char) public static boolean isLowerCase(char ch) Determines if the specified character is a lowercase character. A character is lowercase if its general category type, provided by Character.getType(ch), is LOWERCASE_LETTER, or it has contributory -> property Other_Lowercase as defined by the Unicode Standard. Note: This method cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the isLowerCase(int) method. (And yes, that's where Java uses "character" to mean "code unit" not "code point", alas. 
No wonder people get confused) I'm pretty sure that Python needs to either update its documentation to match its code, update its code to match its documentation, or both. Java chose to update the code to match the documentation, and this is the course I would recommend if at all possible. If you say you are checking for cased code points, then you should use the Unicode definition of cased code points not your own, and if you say you are checking for lowercase code points, then you should use the Unicode definition not your own. Both of these require access to contributory properties from the UCD and not just general categories alone. --tom ======================================================================= LONG VERSION: (222 lines) Essential tools I use for inspecting Unicode code points and their properties include http://training.perl.com/scripts/unichars http://training.perl.com/scripts/uniprops http://training.perl.com/scripts/uninames And over the course of the day, these get used a fair bit, too: http://training.perl.com/scripts/uniquote http://training.perl.com/scripts/ucsort http://training.perl.com/scripts/unifmt Here for example are (some of) the *non*-Letter code point that are nonetheless considered lowercase or uppercase because they have the Other_{Lower,Upper}case properties: % unichars -gs '\PL' '[\p{upper}\p{lower}]' ○ͅ U+00345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI Ⅰ U+02160 GC=Nl SC=Latin ROMAN NUMERAL ONE Ⅱ U+02161 GC=Nl SC=Latin ROMAN NUMERAL TWO Ⅲ U+02162 GC=Nl SC=Latin ROMAN NUMERAL THREE [...] ⅰ U+02170 GC=Nl SC=Latin SMALL ROMAN NUMERAL ONE ⅱ U+02171 GC=Nl SC=Latin SMALL ROMAN NUMERAL TWO ⅲ U+02172 GC=Nl SC=Latin SMALL ROMAN NUMERAL THREE [...] Ⓐ U+024B6 GC=So SC=Common CIRCLED LATIN CAPITAL LETTER A Ⓑ U+024B7 GC=So SC=Common CIRCLED LATIN CAPITAL LETTER B Ⓒ U+024B8 GC=So SC=Common CIRCLED LATIN CAPITAL LETTER C [...] 
ⓐ U+024D0 GC=So SC=Common CIRCLED LATIN SMALL LETTER A ⓑ U+024D1 GC=So SC=Common CIRCLED LATIN SMALL LETTER B ⓒ U+024D2 GC=So SC=Common CIRCLED LATIN SMALL LETTER C [...] And here are (some of) the letters that are cased but which are not Lu, Lt, or Ll (they're all Lm, in fact): % unichars -gs '\p{Lm}' '\p{cased}' | ucsort ᴭ U+1D2D GC=Lm SC=Latin MODIFIER LETTER CAPITAL AE ᴬ U+1D2C GC=Lm SC=Latin MODIFIER LETTER CAPITAL A ᵃ U+1D43 GC=Lm SC=Latin MODIFIER LETTER SMALL A ₐ U+2090 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER A ᵅ U+1D45 GC=Lm SC=Latin MODIFIER LETTER SMALL ALPHA ᴮ U+1D2E GC=Lm SC=Latin MODIFIER LETTER CAPITAL B ᵇ U+1D47 GC=Lm SC=Latin MODIFIER LETTER SMALL B [...] ʷ U+02B7 GC=Lm SC=Latin MODIFIER LETTER SMALL W ᵂ U+1D42 GC=Lm SC=Latin MODIFIER LETTER CAPITAL W ˣ U+02E3 GC=Lm SC=Latin MODIFIER LETTER SMALL X ₓ U+2093 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER X ʸ U+02B8 GC=Lm SC=Latin MODIFIER LETTER SMALL Y ᶻ U+1DBB GC=Lm SC=Latin MODIFIER LETTER SMALL Z ᵝ U+1D5D GC=Lm SC=Greek MODIFIER LETTER SMALL BETA ᵞ U+1D5E GC=Lm SC=Greek MODIFIER LETTER SMALL GREEK GAMMA ᵟ U+1D5F GC=Lm SC=Greek MODIFIER LETTER SMALL DELTA ᶿ U+1DBF GC=Lm SC=Greek MODIFIER LETTER SMALL THETA ͺ U+037A GC=Lm SC=Greek GREEK YPOGEGRAMMENI ᵠ U+1D60 GC=Lm SC=Greek MODIFIER LETTER SMALL GREEK PHI ᵡ U+1D61 GC=Lm SC=Greek MODIFIER LETTER SMALL CHI ᵸ U+1D78 GC=Lm SC=Cyrillic MODIFIER LETTER CYRILLIC EN Perversely, here are some of the modifier letters which are *not* cased: % unichars -gs '\p{Lm}' '\P{CASED}' | ucsort ₕ U+2095 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER H ʻ U+02BB GC=Lm SC=Common MODIFIER LETTER TURNED COMMA ʽ U+02BD GC=Lm SC=Common MODIFIER LETTER REVERSED COMMA ⁱ U+2071 GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER I ₖ U+2096 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER K ₗ U+2097 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER L ₘ U+2098 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER M ⁿ U+207F GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER N ₙ U+2099 GC=Lm SC=Latin LATIN 
SUBSCRIPT SMALL LETTER N ₚ U+209A GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER P ₛ U+209B GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER S ₜ U+209C GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER T ʹ U+02B9 GC=Lm SC=Common MODIFIER LETTER PRIME ʺ U+02BA GC=Lm SC=Common MODIFIER LETTER DOUBLE PRIME ˆ U+02C6 GC=Lm SC=Common MODIFIER LETTER CIRCUMFLEX ACCENT ˇ U+02C7 GC=Lm SC=Common CARON ˈ U+02C8 GC=Lm SC=Common MODIFIER LETTER VERTICAL LINE ˉ U+02C9 GC=Lm SC=Common MODIFIER LETTER MACRON ˊ U+02CA GC=Lm SC=Common MODIFIER LETTER ACUTE ACCENT ˋ U+02CB GC=Lm SC=Common MODIFIER LETTER GRAVE ACCENT ˌ U+02CC GC=Lm SC=Common MODIFIER LETTER LOW VERTICAL LINE (Interesting how the commas sort as breath marks next to H.) I cannot for the life of me figure out why Unicode deems these lowercase: ᵃ U+1D43 GC=Lm SC=Latin MODIFIER LETTER SMALL A ₐ U+2090 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER A ᵅ U+1D45 GC=Lm SC=Latin MODIFIER LETTER SMALL ALPHA yet these *not* to be cased: ⁱ U+2071 GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER I ₘ U+2098 GC=Lm SC=Latin LATIN SUBSCRIPT SMALL LETTER M ⁿ U+207F GC=Lm SC=Latin SUPERSCRIPT LATIN SMALL LETTER N All I know is that the tables tell me. Here's a fair assortment of cased and noncased, case-changing and non-casing code points. The variation in binary properties is pretty wide. 
$ uniprops x 00aa 1d4e 2071 2172 df 262 1d401 1d42d 2117 24c5 U+0078 ‹x› \N{LATIN SMALL LETTER X} \w \pL \p{LC} \p{L_} \p{L&} \p{Ll} All Any Alnum Alpha Alphabetic ASCII Assigned Basic_Latin Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase PerlWord POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print POSIX_Word Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+00AA ‹ª› \N{FEMININE ORDINAL INDICATOR} \w \pL \p{LC} \p{L_} \p{L&} \p{Ll} All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+1D4E <ᵎ> \N{MODIFIER LETTER SMALL TURNED I} \w \pL \p{L_} \p{Lm} All Any Alnum Alpha Alphabetic Assigned InPhoneticExtensions Case_Ignorable CI Cased Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Lower Lowercase Phonetic_Extensions Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+2071 <ⁱ> \N{SUPERSCRIPT LATIN SMALL LETTER I} \w \pL \p{L_} \p{Lm} All Any Alnum Alpha Alphabetic Assigned InSuperscriptsAndSubscripts Case_Ignorable CI Changes_When_NFKC_Casefolded CWKCF L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Print SD Soft_Dotted Superscripts_And_Subscripts Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word U+2172 <ⅲ> \N{SMALL 
ROMAN NUMERAL THREE} \w \pN \p{Nl} All Any Alnum Alpha Alphabetic Assigned InNumberForms Cased Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Nl N Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Latin Latn Letter_Number Lower Lowercase Number Number_Forms Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+00DF <ß> \N{LATIN SMALL LETTER SHARP S} \w \pL \p{LC} \p{L_} \p{L&} \p{Ll} All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+0262 <ɢ> \N{LATIN LETTER SMALL CAPITAL G} \w \pL \p{LC} \p{L_} \p{L&} \p{Ll} All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+1D401 <𝐁> \N{MATHEMATICAL BOLD CAPITAL B} \w \pL \p{LC} \p{L_} \p{L&} \p{Lu} All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Math Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word U+1D42D <𝐭> \N{MATHEMATICAL 
BOLD SMALL T} \w \pL \p{LC} \p{L_} \p{L&} \p{Ll} All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Math Mathematical_Alphanumeric_Symbols Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+2117 ‹℗› \N{SOUND RECORDING COPYRIGHT} \pS \p{So} All Any Assigned InLetterlikeSymbols Common Zyyy So S Gr_Base Grapheme_Base Graph GrBase Letterlike_Symbols Other_Symbol Print Symbol X_POSIX_Graph X_POSIX_Print U+24C5 ‹Ⓟ› \N{CIRCLED LATIN CAPITAL LETTER P} \w \pS \p{So} All Any Alnum Alpha Alphabetic Assigned InEnclosedAlphanumerics Cased Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Common Zyyy Enclosed_Alphanumerics So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol Print Symbol Upper Uppercase Word X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word Unicode also has a Case_Ignorable (CI) character property, which I haven't thought much about but which might be useful. http://www.unicode.org/reports/tr44/#Case_Ignorable Characters which are ignored for casing purposes. For more information, see D121 in Section 3.13, Default Case Algorithms in [Unicode]. Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet I'm not sure if you should think about these when doing your isupper() test; maybe you should. That way you wouldn't fail just because you had a code point that was technically lowercase, like if someone used "LEONARD MᶜCOY". That funny ᶜ wouldn't count as a spoiler then, so that "Leonard MᶜCoy".upper().isupper() could be true, as the ᶜ wouldn't change but wouldn't count, either. I haven't thought about this enough though. 
I'm not used to full string-based isupper() functions, so my instincts may be wrong here. The only code point that is both CWCM and also CI is the notorious ○ͅ U+00345 GC=Mn SC=Inherited COMBINING GREEK YPOGEGRAMMENI Subscripts, superscripts, modifier letters, small capitals, and mathematical letters *tend* to be cased code points that do not change when casemapped or casefolded, although there are exceptions. % uninames small capital '\b\R\b' ʀ 0280 LATIN LETTER SMALL CAPITAL R * voiced uvular trill * Germanic, Old Norse * uppercase is 01A6 ʁ 0281 LATIN LETTER SMALL CAPITAL INVERTED R * voiced uvular fricative or approximant x (modifier letter small capital inverted r - 02B6) ʶ 02B6 MODIFIER LETTER SMALL CAPITAL INVERTED R * preceding four used for r-coloring or r-offglides x (latin letter small capital inverted r - 0281) # <super> 0281 ᴙ 1D19 LATIN LETTER SMALL CAPITAL REVERSED R ᴚ 1D1A LATIN LETTER SMALL CAPITAL TURNED R ᷢ 1DE2 COMBINING LATIN LETTER SMALL CAPITAL R % uniprops 280 1a6 U+0280 <ʀ> \N{LATIN LETTER SMALL CAPITAL R} \w \pL \p{LC} \p{L_} \p{L&} \p{Ll} All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word U+01A6 <Ʀ> \N{LATIN LETTER YR} \w \pL \p{LC} \p{L_} \p{L&} \p{Lu} All Any Alnum Alpha Alphabetic Assigned InLatinExtendedB Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_Extended_B Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha 
X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word That's right: the uppercase of LATIN LETTER SMALL CAPITAL R is LATIN LETTER YR, and I don't know why. No other small capital -- which are all considered lowercase -- changes when casemapped. Only this one alone. Note that things like code points like U+00DF LATIN SMALL LETTER SHARP S have these binary properties true because the normal/default sense of these terms in Unicode is the full/string sense not the simple/character sense: Changes_When_Casefolded (CWCF) Changes_When_Casemapped (CWCM) Changes_When_Titlecased (CWT) Changes_When_Uppercased (CWU) Those are true because the full uppercase map of "ß" is "SS" and the full casefold of "ß" is "ss". --tom |
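The distinction between Cased and Cased_Letter drawn in this message can be approximated in a few lines. This is a sketch only: the helper names `is_cased` and `is_cased_letter` are hypothetical, and it assumes Python 3.3 or later, where `str.islower()` and `str.isupper()` test the derived Lowercase and Uppercase properties rather than GC=Ll and GC=Lu.

```python
import unicodedata

def is_cased_letter(ch):
    """Cased_Letter (LC): general category is Lu, Ll, or Lt."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")

def is_cased(ch):
    """Approximation of D120: Lowercase, Uppercase, or GC=Lt.

    Assumes str.islower()/str.isupper() reflect the derived properties.
    """
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"

# ⅲ (Nl), ǅ (Lt), mathematical bold small t (Ll), ℗ (So)
for ch in ("\u2172", "\u01c5", "\U0001d42d", "\u2117"):
    print(f"U+{ord(ch):04X} cased={is_cased(ch)} cased_letter={is_cased_letter(ch)}")

# A cased code point need not change when casemapped, so the result of
# upper() is not guaranteed to satisfy isupper().
bold_t = "\U0001d42d"
unchanged = bold_t.upper() == bold_t
still_not_upper = not bold_t.upper().isupper()

# The one small capital that does change: U+0280 uppercases to U+01A6.
small_cap_r_upper = "\u0280".upper()
```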
|||
msg143083 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2011-08-27 16:15 | |
Thank you very much. We should fix the behavior in 3.3 for sure. I'm thinking that we may be able to backport the behavior fix to 2.7 and 3.2 as well, since it just makes the behavior generally "better" (and for most folks it won't matter anyway). I'm not sure where the somewhat odd rules for .islower() come from; I think in part from the desire to have "".islower() be False but "a b".islower() be True. Intuitively, this means that .islower() means both "there is at least one lower case character" and "there are no upper case characters", but not "all characters are lowercase". I forget what we do w.r.t. titlecase, but the intuitive meaning should not change. Although personally I don't have much of an intuition for what titlecase means (and why it's important), perhaps because I'm not familiar with any language where there is a third case for some letters. |
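The intuitive rule described above can be restated as a short function. This is a hypothetical pure-Python sketch (the name `islower_sketch` is not part of any library), treating titlecase characters as disqualifying alongside uppercase ones; it is compared against the built-in on a few sample strings rather than claimed to be the actual implementation.

```python
def islower_sketch(s):
    """At least one lowercase character and no uppercase or titlecase ones."""
    seen_lower = False
    for ch in s:
        if ch.isupper() or ch.istitle():
            return False  # an uppercase or titlecase character disqualifies
        if ch.islower():
            seen_lower = True  # uncased characters are simply skipped
    return seen_lower

# Empty string, mixed with a space, mixed case, digits only, a titlecase
# digraph, and a word containing sharp s.
samples = ["", "a b", "aB", "123", "\u01c5", "stra\u00dfe"]
results = {s: islower_sketch(s) for s in samples}
for s, value in results.items():
    print(ascii(s), value, s.islower())
```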
|||
msg143084 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-27 19:17 | |
Guido van Rossum <report@bugs.python.org> wrote on Sat, 27 Aug 2011 16:15:33 -0000: > Although personally I don't have much of an intuition for what > titlecase means (and why it's important), perhaps because I'm not > familiar with any language where there is a third case for some > letters. Neither am I. Even in "old-style" English with ae and oe, one wrote ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or *Aesir. Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology. (BTW, in French you really shouldn't split up the œ into oe, nor in Old English, Old Norse, or Icelandic the æ in ae; although in contemporary English, it's usually ok to do so.) I believe that almost but not quite all the sticky situations with Unicode casing involve compatibility characters for clean round-trips with legacy encodings. Exceptions include the German sharp s (both of them now) and the two Greek lowercase sigmas. Thank goodness we don't use the long s in English anymore. What is it with s's, anyway? :) Most of the titlecase letters are in Greek, with a few in Armenian. I know no Armenian (their letters all look the same to me :), and the folks I talked to about the Greek are skeptical. The German sharp s is a red herring, because you can never have it as the first letter (although it needn't be the last, as in Rußland). That's no more possible than having the old legacy ff ligature appear at the beginning of an English word. In any event, there are only 129 total code points that are "problematic" in terms of their case, where by problematic I mean one or more of:
--- titlecase differs from uppercase
--- foldcase differs from lowercase
--- any of fold/lower/title/uppercase yields more than one code point
Of all these, it's the (now two!) sharp s's and the Turkic i that are the most annoying. It's really quite a lot of trouble to go through for so few code points of so little (perceived) use. 
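The three criteria can be written down as a predicate. This is a sketch under stated assumptions: it needs Python 3.3 or later for `str.casefold()` and the full mappings, the helper names are hypothetical, and the number of code points it flags depends on the Unicode version bundled with the interpreter, so it should not be expected to reproduce the figure of 129 quoted above.

```python
def case_forms(ch):
    """Return the fold/lower/title/upper forms of a single code point."""
    return {
        "fc": ch.casefold(),
        "lc": ch.lower(),
        "tc": ch.title(),
        "uc": ch.upper(),
    }

def is_case_problematic(ch):
    """True if the code point meets one or more of the three criteria."""
    forms = case_forms(ch)
    return (
        forms["tc"] != forms["uc"]                    # titlecase differs from uppercase
        or forms["fc"] != forms["lc"]                 # foldcase differs from lowercase
        or any(len(v) > 1 for v in forms.values())    # any form is multi-code-point
    )

def problematic_in_bmp():
    """Scan the BMP, skipping the surrogate range."""
    return [
        chr(cp)
        for cp in range(0x10000)
        if not 0xD800 <= cp <= 0xDFFF and is_case_problematic(chr(cp))
    ]

if __name__ == "__main__":
    found = problematic_in_bmp()
    print(len(found), "code points flagged by this sketch")
```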
But I suppose you never know what new ones they'll uncover, either. Here are those 129 case-problematicals arranged in UCA order. Some of these have normalization forms that decompose into graphemes with four code points (not shown). There are a few other oddities, like the Kelvin sign and other "singletons", but these are most of the trouble. They're all in the BMP; I guess we learned our lesson. :)

--tom

1: U+0345 ○ͅ COMBINING GREEK YPOGEGRAMMENI fc=ι U+3B9 lc=○ͅ U+345 tc=Ι U+399 uc=Ι U+399
2: U+1E9A ẚ LATIN SMALL LETTER A WITH RIGHT HALF RING fc=aʾ U+61.2BE lc=ẚ U+1E9A tc=Aʾ U+41.2BE uc=Aʾ U+41.2BE
3: U+01F3 dz LATIN SMALL LETTER DZ fc=dz U+1F3 lc=dz U+1F3 tc=Dz U+1F2 uc=DZ U+1F1
4: U+01F2 Dz LATIN CAPITAL LETTER D WITH SMALL LETTER Z fc=dz U+1F3 lc=dz U+1F3 tc=Dz U+1F2 uc=DZ U+1F1
5: U+01F1 DZ LATIN CAPITAL LETTER DZ fc=dz U+1F3 lc=dz U+1F3 tc=Dz U+1F2 uc=DZ U+1F1
6: U+01C6 dž LATIN SMALL LETTER DZ WITH CARON fc=dž U+1C6 lc=dž U+1C6 tc=Dž U+1C5 uc=DŽ U+1C4
7: U+01C5 Dž LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON fc=dž U+1C6 lc=dž U+1C6 tc=Dž U+1C5 uc=DŽ U+1C4
8: U+01C4 DŽ LATIN CAPITAL LETTER DZ WITH CARON fc=dž U+1C6 lc=dž U+1C6 tc=Dž U+1C5 uc=DŽ U+1C4
9: U+FB00 ff LATIN SMALL LIGATURE FF fc=ff U+66.66 lc=ff U+FB00 tc=Ff U+46.66 uc=FF U+46.46
10: U+FB03 ffi LATIN SMALL LIGATURE FFI fc=ffi U+66.66.69 lc=ffi U+FB03 tc=Ffi U+46.66.69 uc=FFI U+46.46.49
11: U+FB04 ffl LATIN SMALL LIGATURE FFL fc=ffl U+66.66.6C lc=ffl U+FB04 tc=Ffl U+46.66.6C uc=FFL U+46.46.4C
12: U+FB01 fi LATIN SMALL LIGATURE FI fc=fi U+66.69 lc=fi U+FB01 tc=Fi U+46.69 uc=FI U+46.49
13: U+FB02 fl LATIN SMALL LIGATURE FL fc=fl U+66.6C lc=fl U+FB02 tc=Fl U+46.6C uc=FL U+46.4C
14: U+1E96 ẖ LATIN SMALL LETTER H WITH LINE BELOW fc=ẖ U+68.331 lc=ẖ U+1E96 tc=H̱ U+48.331 uc=H̱ U+48.331
15: U+0130 İ LATIN CAPITAL LETTER I WITH DOT ABOVE fc=i̇ U+69.307 lc=i̇ U+69.307 tc=İ U+130 uc=İ U+130
16: U+01F0 ǰ LATIN SMALL LETTER J WITH CARON fc=ǰ U+6A.30C lc=ǰ U+1F0 tc=J̌ U+4A.30C uc=J̌ U+4A.30C
17: U+01C9 lj LATIN SMALL LETTER LJ fc=lj U+1C9 lc=lj U+1C9 tc=Lj U+1C8 uc=LJ U+1C7
18: U+01C8 Lj LATIN CAPITAL LETTER L WITH SMALL LETTER J fc=lj U+1C9 lc=lj U+1C9 tc=Lj U+1C8 uc=LJ U+1C7
19: U+01C7 LJ LATIN CAPITAL LETTER LJ fc=lj U+1C9 lc=lj U+1C9 tc=Lj U+1C8 uc=LJ U+1C7
20: U+01CC nj LATIN SMALL LETTER NJ fc=nj U+1CC lc=nj U+1CC tc=Nj U+1CB uc=NJ U+1CA
21: U+01CB Nj LATIN CAPITAL LETTER N WITH SMALL LETTER J fc=nj U+1CC lc=nj U+1CC tc=Nj U+1CB uc=NJ U+1CA
22: U+01CA NJ LATIN CAPITAL LETTER NJ fc=nj U+1CC lc=nj U+1CC tc=Nj U+1CB uc=NJ U+1CA
23: U+017F ſ LATIN SMALL LETTER LONG S fc=s U+73 lc=ſ U+17F tc=S U+53 uc=S U+53
24: U+1E9B ẛ LATIN SMALL LETTER LONG S WITH DOT ABOVE fc=ṡ U+1E61 lc=ẛ U+1E9B tc=Ṡ U+1E60 uc=Ṡ U+1E60
25: U+00DF ß LATIN SMALL LETTER SHARP S fc=ss U+73.73 lc=ß U+DF tc=Ss U+53.73 uc=SS U+53.53
26: U+1E9E ẞ LATIN CAPITAL LETTER SHARP S fc=ss U+73.73 lc=ß U+DF tc=ẞ U+1E9E uc=ẞ U+1E9E
27: U+FB06 st LATIN SMALL LIGATURE ST fc=st U+73.74 lc=st U+FB06 tc=St U+53.74 uc=ST U+53.54
28: U+FB05 ſt LATIN SMALL LIGATURE LONG S T fc=st U+73.74 lc=ſt U+FB05 tc=St U+53.74 uc=ST U+53.54
29: U+1E97 ẗ LATIN SMALL LETTER T WITH DIAERESIS fc=ẗ U+74.308 lc=ẗ U+1E97 tc=T̈ U+54.308 uc=T̈ U+54.308
30: U+1E98 ẘ LATIN SMALL LETTER W WITH RING ABOVE fc=ẘ U+77.30A lc=ẘ U+1E98 tc=W̊ U+57.30A uc=W̊ U+57.30A
31: U+1E99 ẙ LATIN SMALL LETTER Y WITH RING ABOVE fc=ẙ U+79.30A lc=ẙ U+1E99 tc=Y̊ U+59.30A uc=Y̊ U+59.30A
32: U+0149 ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE fc=ʼn U+2BC.6E lc=ʼn U+149 tc=ʼN U+2BC.4E uc=ʼN U+2BC.4E
33: U+1F84 ᾄ GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI fc=ἄι U+1F04.3B9 lc=ᾄ U+1F84 tc=ᾌ U+1F8C uc=ἌΙ U+1F0C.399
34: U+1F8C ᾌ GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI fc=ἄι U+1F04.3B9 lc=ᾄ U+1F84 tc=ᾌ U+1F8C uc=ἌΙ U+1F0C.399
35: U+1F82 ᾂ GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI fc=ἂι U+1F02.3B9 lc=ᾂ U+1F82 tc=ᾊ U+1F8A uc=ἊΙ U+1F0A.399
36: U+1F8A ᾊ GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI fc=ἂι U+1F02.3B9 lc=ᾂ U+1F82 tc=ᾊ U+1F8A uc=ἊΙ U+1F0A.399
37: U+1F86 ᾆ GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI fc=ἆι U+1F06.3B9 lc=ᾆ U+1F86 tc=ᾎ U+1F8E uc=ἎΙ U+1F0E.399
38: U+1F8E ᾎ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI fc=ἆι U+1F06.3B9 lc=ᾆ U+1F86 tc=ᾎ U+1F8E uc=ἎΙ U+1F0E.399
39: U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI fc=ἀι U+1F00.3B9 lc=ᾀ U+1F80 tc=ᾈ U+1F88 uc=ἈΙ U+1F08.399
40: U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI fc=ἀι U+1F00.3B9 lc=ᾀ U+1F80 tc=ᾈ U+1F88 uc=ἈΙ U+1F08.399
41: U+1F85 ᾅ GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI fc=ἅι U+1F05.3B9 lc=ᾅ U+1F85 tc=ᾍ U+1F8D uc=ἍΙ U+1F0D.399
42: U+1F8D ᾍ GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI fc=ἅι U+1F05.3B9 lc=ᾅ U+1F85 tc=ᾍ U+1F8D uc=ἍΙ U+1F0D.399
43: U+1F83 ᾃ GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI fc=ἃι U+1F03.3B9 lc=ᾃ U+1F83 tc=ᾋ U+1F8B uc=ἋΙ U+1F0B.399
44: U+1F8B ᾋ GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI fc=ἃι U+1F03.3B9 lc=ᾃ U+1F83 tc=ᾋ U+1F8B uc=ἋΙ U+1F0B.399
45: U+1F87 ᾇ GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI fc=ἇι U+1F07.3B9 lc=ᾇ U+1F87 tc=ᾏ U+1F8F uc=ἏΙ U+1F0F.399
46: U+1F8F ᾏ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI fc=ἇι U+1F07.3B9 lc=ᾇ U+1F87 tc=ᾏ U+1F8F uc=ἏΙ U+1F0F.399
47: U+1F81 ᾁ GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI fc=ἁι U+1F01.3B9 lc=ᾁ U+1F81 tc=ᾉ U+1F89 uc=ἉΙ U+1F09.399
48: U+1F89 ᾉ GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI fc=ἁι U+1F01.3B9 lc=ᾁ U+1F81 tc=ᾉ U+1F89 uc=ἉΙ U+1F09.399
49: U+1FB4 ᾴ GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI fc=άι U+3AC.3B9 lc=ᾴ U+1FB4 tc=Άͅ U+386.345 uc=ΆΙ U+386.399
50: U+1FB2 ᾲ GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI fc=ὰι U+1F70.3B9 lc=ᾲ U+1FB2 tc=Ὰͅ U+1FBA.345 uc=ᾺΙ U+1FBA.399
51: U+1FB6 ᾶ GREEK SMALL LETTER ALPHA WITH PERISPOMENI fc=ᾶ U+3B1.342 lc=ᾶ U+1FB6 tc=Α͂ U+391.342 uc=Α͂ U+391.342
52: U+1FB7 ᾷ GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI fc=ᾶι U+3B1.342.3B9 lc=ᾷ U+1FB7 tc=ᾼ͂ U+391.342.345 uc=Α͂Ι U+391.342.399
53: U+1FB3 ᾳ GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI fc=αι U+3B1.3B9 lc=ᾳ U+1FB3 tc=ᾼ U+1FBC uc=ΑΙ U+391.399
54: U+1FBC ᾼ GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI fc=αι U+3B1.3B9 lc=ᾳ U+1FB3 tc=ᾼ U+1FBC uc=ΑΙ U+391.399
55: U+03D0 ϐ GREEK BETA SYMBOL fc=β U+3B2 lc=ϐ U+3D0 tc=Β U+392 uc=Β U+392
56: U+03F5 ϵ GREEK LUNATE EPSILON SYMBOL fc=ε U+3B5 lc=ϵ U+3F5 tc=Ε U+395 uc=Ε U+395
57: U+1F94 ᾔ GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI fc=ἤι U+1F24.3B9 lc=ᾔ U+1F94 tc=ᾜ U+1F9C uc=ἬΙ U+1F2C.399
58: U+1F9C ᾜ GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI fc=ἤι U+1F24.3B9 lc=ᾔ U+1F94 tc=ᾜ U+1F9C uc=ἬΙ U+1F2C.399
59: U+1F92 ᾒ GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI fc=ἢι U+1F22.3B9 lc=ᾒ U+1F92 tc=ᾚ U+1F9A uc=ἪΙ U+1F2A.399
60: U+1F9A ᾚ GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI fc=ἢι U+1F22.3B9 lc=ᾒ U+1F92 tc=ᾚ U+1F9A uc=ἪΙ U+1F2A.399
61: U+1F96 ᾖ GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI fc=ἦι U+1F26.3B9 lc=ᾖ U+1F96 tc=ᾞ U+1F9E uc=ἮΙ U+1F2E.399
62: U+1F9E ᾞ GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI fc=ἦι U+1F26.3B9 lc=ᾖ U+1F96 tc=ᾞ U+1F9E uc=ἮΙ U+1F2E.399
63: U+1F90 ᾐ GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI fc=ἠι U+1F20.3B9 lc=ᾐ U+1F90 tc=ᾘ U+1F98 uc=ἨΙ U+1F28.399
64: U+1F98 ᾘ GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI fc=ἠι U+1F20.3B9 lc=ᾐ U+1F90 tc=ᾘ U+1F98 uc=ἨΙ U+1F28.399
65: U+1F95 ᾕ GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI fc=ἥι U+1F25.3B9 lc=ᾕ U+1F95 tc=ᾝ U+1F9D uc=ἭΙ U+1F2D.399
66: U+1F9D ᾝ GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI fc=ἥι U+1F25.3B9 lc=ᾕ U+1F95 tc=ᾝ U+1F9D uc=ἭΙ U+1F2D.399
67: U+1F93 ᾓ GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI fc=ἣι U+1F23.3B9 lc=ᾓ U+1F93 tc=ᾛ U+1F9B uc=ἫΙ U+1F2B.399
68: U+1F9B ᾛ GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI fc=ἣι U+1F23.3B9 lc=ᾓ U+1F93 tc=ᾛ U+1F9B uc=ἫΙ U+1F2B.399
69: U+1F97 ᾗ GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI fc=ἧι U+1F27.3B9 lc=ᾗ U+1F97 tc=ᾟ U+1F9F uc=ἯΙ U+1F2F.399
70: U+1F9F ᾟ GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI fc=ἧι U+1F27.3B9 lc=ᾗ U+1F97 tc=ᾟ U+1F9F uc=ἯΙ U+1F2F.399
71: U+1F91 ᾑ GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI fc=ἡι U+1F21.3B9 lc=ᾑ U+1F91 tc=ᾙ U+1F99 uc=ἩΙ U+1F29.399
72: U+1F99 ᾙ GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI fc=ἡι U+1F21.3B9 lc=ᾑ U+1F91 tc=ᾙ U+1F99 uc=ἩΙ U+1F29.399
73: U+1FC4 ῄ GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI fc=ήι U+3AE.3B9 lc=ῄ U+1FC4 tc=Ήͅ U+389.345 uc=ΉΙ U+389.399
74: U+1FC2 ῂ GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI fc=ὴι U+1F74.3B9 lc=ῂ U+1FC2 tc=Ὴͅ U+1FCA.345 uc=ῊΙ U+1FCA.399
75: U+1FC6 ῆ GREEK SMALL LETTER ETA WITH PERISPOMENI fc=ῆ U+3B7.342 lc=ῆ U+1FC6 tc=Η͂ U+397.342 uc=Η͂ U+397.342
76: U+1FC7 ῇ GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI fc=ῆι U+3B7.342.3B9 lc=ῇ U+1FC7 tc=ῌ͂ U+397.342.345 uc=Η͂Ι U+397.342.399
77: U+1FC3 ῃ GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI fc=ηι U+3B7.3B9 lc=ῃ U+1FC3 tc=ῌ U+1FCC uc=ΗΙ U+397.399
78: U+1FCC ῌ GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI fc=ηι U+3B7.3B9 lc=ῃ U+1FC3 tc=ῌ U+1FCC uc=ΗΙ U+397.399
79: U+03D1 ϑ GREEK THETA SYMBOL fc=θ U+3B8 lc=ϑ U+3D1 tc=Θ U+398 uc=Θ U+398
80: U+1FBE ι GREEK PROSGEGRAMMENI fc=ι U+3B9 lc=ι U+1FBE tc=Ι U+399 uc=Ι U+399
81: U+1FD6 ῖ GREEK SMALL LETTER IOTA WITH PERISPOMENI fc=ῖ U+3B9.342 lc=ῖ U+1FD6 tc=Ι͂ U+399.342 uc=Ι͂ U+399.342
82: U+0390 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS fc=ΐ U+3B9.308.301 lc=ΐ U+390 tc=Ϊ́ U+399.308.301 uc=Ϊ́ U+399.308.301
83: U+1FD3 ΐ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA fc=ΐ U+3B9.308.301 lc=ΐ U+1FD3 tc=Ϊ́ U+399.308.301 uc=Ϊ́ U+399.308.301
84: U+1FD2 ῒ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA fc=ῒ U+3B9.308.300 lc=ῒ U+1FD2 tc=Ϊ̀ U+399.308.300 uc=Ϊ̀ U+399.308.300
85: U+1FD7 ῗ GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI fc=ῗ U+3B9.308.342 lc=ῗ U+1FD7 tc=Ϊ͂ U+399.308.342 uc=Ϊ͂ U+399.308.342
86: U+03F0 ϰ GREEK KAPPA SYMBOL fc=κ U+3BA lc=ϰ U+3F0 tc=Κ U+39A uc=Κ U+39A
87: U+00B5 µ MICRO SIGN fc=μ U+3BC lc=µ U+B5 tc=Μ U+39C uc=Μ U+39C
88: U+03D6 ϖ GREEK PI SYMBOL fc=π U+3C0 lc=ϖ U+3D6 tc=Π U+3A0 uc=Π U+3A0
89: U+03F1 ϱ GREEK RHO SYMBOL fc=ρ U+3C1 lc=ϱ U+3F1 tc=Ρ U+3A1 uc=Ρ U+3A1
90: U+1FE4 ῤ GREEK SMALL LETTER RHO WITH PSILI fc=ῤ U+3C1.313 lc=ῤ U+1FE4 tc=Ρ̓ U+3A1.313 uc=Ρ̓ U+3A1.313
91: U+03C2 ς GREEK SMALL LETTER FINAL SIGMA fc=σ U+3C3 lc=ς U+3C2 tc=Σ U+3A3 uc=Σ U+3A3
92: U+1F50 ὐ GREEK SMALL LETTER UPSILON WITH PSILI fc=ὐ U+3C5.313 lc=ὐ U+1F50 tc=Υ̓ U+3A5.313 uc=Υ̓ U+3A5.313
93: U+1F54 ὔ GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA fc=ὔ U+3C5.313.301 lc=ὔ U+1F54 tc=Υ̓́ U+3A5.313.301 uc=Υ̓́ U+3A5.313.301
94: U+1F52 ὒ GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA fc=ὒ U+3C5.313.300 lc=ὒ U+1F52 tc=Υ̓̀ U+3A5.313.300 uc=Υ̓̀ U+3A5.313.300
95: U+1F56 ὖ GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI fc=ὖ U+3C5.313.342 lc=ὖ U+1F56 tc=Υ̓͂ U+3A5.313.342 uc=Υ̓͂ U+3A5.313.342
96: U+1FE6 ῦ GREEK SMALL LETTER UPSILON WITH PERISPOMENI fc=ῦ U+3C5.342 lc=ῦ U+1FE6 tc=Υ͂ U+3A5.342 uc=Υ͂ U+3A5.342
97: U+03B0 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS fc=ΰ U+3C5.308.301 lc=ΰ U+3B0 tc=Ϋ́ U+3A5.308.301 uc=Ϋ́ U+3A5.308.301
98: U+1FE3 ΰ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA fc=ΰ U+3C5.308.301 lc=ΰ U+1FE3 tc=Ϋ́ U+3A5.308.301 uc=Ϋ́ U+3A5.308.301
99: U+1FE2 ῢ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA fc=ῢ U+3C5.308.300 lc=ῢ U+1FE2 tc=Ϋ̀ U+3A5.308.300 uc=Ϋ̀ U+3A5.308.300
100: U+1FE7 ῧ GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI fc=ῧ U+3C5.308.342 lc=ῧ U+1FE7 tc=Ϋ͂ U+3A5.308.342 uc=Ϋ͂ U+3A5.308.342
101: U+03D5 ϕ GREEK PHI SYMBOL fc=φ U+3C6 lc=ϕ U+3D5 tc=Φ U+3A6 uc=Φ U+3A6
102: U+1FA4 ᾤ GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI fc=ὤι U+1F64.3B9 lc=ᾤ U+1FA4 tc=ᾬ U+1FAC uc=ὬΙ U+1F6C.399
103: U+1FAC ᾬ GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI fc=ὤι U+1F64.3B9 lc=ᾤ U+1FA4 tc=ᾬ U+1FAC uc=ὬΙ U+1F6C.399
104: U+1FA2 ᾢ GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI fc=ὢι U+1F62.3B9 lc=ᾢ U+1FA2 tc=ᾪ U+1FAA uc=ὪΙ U+1F6A.399
105: U+1FAA ᾪ GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI fc=ὢι U+1F62.3B9 lc=ᾢ U+1FA2 tc=ᾪ U+1FAA uc=ὪΙ U+1F6A.399
106: U+1FA6 ᾦ GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI fc=ὦι U+1F66.3B9 lc=ᾦ U+1FA6 tc=ᾮ U+1FAE uc=ὮΙ U+1F6E.399
107: U+1FAE ᾮ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI fc=ὦι U+1F66.3B9 lc=ᾦ U+1FA6 tc=ᾮ U+1FAE uc=ὮΙ U+1F6E.399
108: U+1FA0 ᾠ GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI fc=ὠι U+1F60.3B9 lc=ᾠ U+1FA0 tc=ᾨ U+1FA8 uc=ὨΙ U+1F68.399
109: U+1FA8 ᾨ GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI fc=ὠι U+1F60.3B9 lc=ᾠ U+1FA0 tc=ᾨ U+1FA8 uc=ὨΙ U+1F68.399
110: U+1FA5 ᾥ GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI fc=ὥι U+1F65.3B9 lc=ᾥ U+1FA5 tc=ᾭ U+1FAD uc=ὭΙ U+1F6D.399
111: U+1FAD ᾭ GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI fc=ὥι U+1F65.3B9 lc=ᾥ U+1FA5 tc=ᾭ U+1FAD uc=ὭΙ U+1F6D.399
112: U+1FA3 ᾣ GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI fc=ὣι U+1F63.3B9 lc=ᾣ U+1FA3 tc=ᾫ U+1FAB uc=ὫΙ U+1F6B.399
113: U+1FAB ᾫ GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI fc=ὣι U+1F63.3B9 lc=ᾣ U+1FA3 tc=ᾫ U+1FAB uc=ὫΙ U+1F6B.399
114: U+1FA7 ᾧ GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI fc=ὧι U+1F67.3B9 lc=ᾧ U+1FA7 tc=ᾯ U+1FAF uc=ὯΙ U+1F6F.399
115: U+1FAF ᾯ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI fc=ὧι U+1F67.3B9 lc=ᾧ U+1FA7 tc=ᾯ U+1FAF uc=ὯΙ U+1F6F.399
116: U+1FA1 ᾡ GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI fc=ὡι U+1F61.3B9 lc=ᾡ U+1FA1 tc=ᾩ U+1FA9 uc=ὩΙ U+1F69.399
117: U+1FA9 ᾩ GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI fc=ὡι U+1F61.3B9 lc=ᾡ U+1FA1 tc=ᾩ U+1FA9 uc=ὩΙ U+1F69.399
118: U+1FF4 ῴ GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI fc=ώι U+3CE.3B9 lc=ῴ U+1FF4 tc=Ώͅ U+38F.345 uc=ΏΙ U+38F.399
119: U+1FF2 ῲ GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI fc=ὼι U+1F7C.3B9 lc=ῲ U+1FF2 tc=Ὼͅ U+1FFA.345 uc=ῺΙ U+1FFA.399
120: U+1FF6 ῶ GREEK SMALL LETTER OMEGA WITH PERISPOMENI fc=ῶ U+3C9.342 lc=ῶ U+1FF6 tc=Ω͂ U+3A9.342 uc=Ω͂ U+3A9.342
121: U+1FF7 ῷ GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI fc=ῶι U+3C9.342.3B9 lc=ῷ U+1FF7 tc=ῼ͂ U+3A9.342.345 uc=Ω͂Ι U+3A9.342.399
122: U+1FF3 ῳ GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI fc=ωι U+3C9.3B9 lc=ῳ U+1FF3 tc=ῼ U+1FFC uc=ΩΙ U+3A9.399
123: U+1FFC ῼ GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI fc=ωι U+3C9.3B9 lc=ῳ U+1FF3 tc=ῼ U+1FFC uc=ΩΙ U+3A9.399
124: U+0587 և ARMENIAN SMALL LIGATURE ECH YIWN fc=եւ U+565.582 lc=և U+587 tc=Եւ U+535.582 uc=ԵՒ U+535.552
125: U+FB14 ﬔ ARMENIAN SMALL LIGATURE MEN ECH fc=մե U+574.565 lc=ﬔ U+FB14 tc=Մե U+544.565 uc=ՄԵ U+544.535
126: U+FB15 ﬕ ARMENIAN SMALL LIGATURE MEN INI fc=մի U+574.56B lc=ﬕ U+FB15 tc=Մի U+544.56B uc=ՄԻ U+544.53B
127: U+FB17 ﬗ ARMENIAN SMALL LIGATURE MEN XEH fc=մխ U+574.56D lc=ﬗ U+FB17 tc=Մխ U+544.56D uc=ՄԽ U+544.53D
128: U+FB13 ﬓ ARMENIAN SMALL LIGATURE MEN NOW fc=մն U+574.576 lc=ﬓ U+FB13 tc=Մն U+544.576 uc=ՄՆ U+544.546
129: U+FB16 ﬖ ARMENIAN SMALL LIGATURE VEW NOW fc=վն U+57E.576 lc=ﬖ U+FB16 tc=Վն U+54E.576 uc=ՎՆ U+54E.546 |
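A rough sketch of how one row of the listing above could be regenerated. It assumes Python 3.6 or later, where `str.casefold()` exists and the casing methods use the full mappings. The helper names `code_points` and `casing_row` are made up for this illustration and are not taken from the attached mux.python:

```python
def code_points(text):
    """Format a string as dotted hex code points, e.g. 'U+53.73'."""
    return "U+" + ".".join(f"{ord(ch):X}" for ch in text)


def casing_row(ch):
    """Return the fold, lower, title and upper forms of one character."""
    return {
        "fc": ch.casefold(),  # full case folding
        "lc": ch.lower(),     # full lowercase mapping
        "tc": ch.title(),     # full titlecase mapping
        "uc": ch.upper(),     # full uppercase mapping
    }


# Entry 25 of the listing: LATIN SMALL LETTER SHARP S
row = casing_row("\N{LATIN SMALL LETTER SHARP S}")
print(" ".join(f"{key}={value} {code_points(value)}" for key, value in row.items()))
```

On an interpreter with full case mappings, the printed row should agree with entry 25 above; on 3.2 and earlier, `casefold()` is missing and the sketch does not apply.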
|||
msg143085 - (view) | Author: Matthew Barnett (mrabarnett) * | Date: 2011-08-27 19:29 | |
There are some oddities in Unicode case-folding. Under full case-folding, both "\N{LATIN CAPITAL LETTER SHARP S}" and "\N{LATIN SMALL LETTER SHARP S}" fold to "ss", which means that those codepoints match each other. However, under simple case-folding, they fold to themselves, which means that those codepoints _don't_ match each other. |
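A minimal sketch of the full-folding behavior described above, assuming Python 3.3 or later, where `str.casefold()` applies full case folding:

```python
capital = "\N{LATIN CAPITAL LETTER SHARP S}"
small = "\N{LATIN SMALL LETTER SHARP S}"

# Under full case folding both code points fold to the two-letter string "ss",
# so a folded comparison treats them as equal.
folded_capital = capital.casefold()
folded_small = small.casefold()
print(folded_capital == folded_small)
```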
|||
msg143086 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2011-08-27 20:04 | |
> Neither am I. Even in "old-style" English with ae and oe, one wrote
> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
> *Aesir. Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

Trying to disprove you a bit:
http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg
but classical typographies seem to write either the uppercase Œ or the lowercase œ.

That said, I wonder why Unicode even includes ligatures like ff. Sounds like mission creep to me (and horrible annoyances for people like us). |
|||
msg143089 - (view) | Author: Ezio Melotti (ezio.melotti) * | Date: 2011-08-28 05:54 | |
FTR, with the latest Python 3.2/3.3 (narrow) I get: Total failures: 58 / 500 ( 12%) Total successes: 442 / 500 ( 88%) and with the latest Python 3.2/3.3 (wide) I get: Total failures: 52 / 500 ( 10%) Total successes: 448 / 500 ( 90%) |
|||
msg143110 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2011-08-28 17:27 | |
Thanks Tom for such a clear explanation! I hope someone will implement this. (Matthew, does this affect regex? I am guessing it does, for case-insensitive matching?) |
|||
msg143119 - (view) | Author: Matthew Barnett (mrabarnett) * | Date: 2011-08-28 18:56 | |
The regex module currently uses simple case-folding, although I'm working towards full case-folding, as listed in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt. |
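For orientation, a small sketch of how the status column of CaseFolding.txt separates simple from full folding. The sample lines are copied in the file's `code; status; mapping; # name` layout. The function name `parse_case_folding` is hypothetical, and this is not how the regex module itself is implemented:

```python
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
FB00; F; 0066 0066; # LATIN SMALL LIGATURE FF
"""


def parse_case_folding(lines, statuses=("C", "F")):
    """Build a folding table from the entries whose status is in `statuses`.

    C + F gives full folding; C + S gives simple folding; T is Turkic-only.
    """
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [field.strip() for field in data.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


full = parse_case_folding(SAMPLE.splitlines(), statuses=("C", "F"))
simple = parse_case_folding(SAMPLE.splitlines(), statuses=("C", "S"))
print(full["\u00df"], "\u00df" in simple)
```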
|||
msg143124 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-28 21:01 | |
Antoine Pitrou <report@bugs.python.org> wrote on Sat, 27 Aug 2011 20:04:56 -0000:

>> Neither am I. Even in "old-style" English with ae and oe, one wrote
>> ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
>> *Aesir. Similarly with ŒNOLOGY / Œnology / œnology, never *Oenology.

> Trying to disprove you a bit:
> http://ecx.images-amazon.com/images/I/51G6CH9XFFL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/51k7TmosPdL._SL500_AA300_.jpg
> http://ecx.images-amazon.com/images/I/518UzMeLFCL._SL500_AA300_.jpg
> but classical typographies seem to write either the uppercase Œ or the
> lowercase œ.

That's what I meant: one only ever sees œufs or ŒUFS, never OEUFS.

French doesn't fit into ISO 8859-1. That's one of the changes to ISO-8859-15 compared with ISO-8859-1 (and Unicode):

iso-8859-1  A4 ⇔ U+00A4 < ¤ > \N{CURRENCY SIGN}
iso-8859-15 A4 ⇒ U+20AC < € > \N{EURO SIGN}
iso-8859-1  A6 ⇔ U+00A6 < ¦ > \N{BROKEN BAR}
iso-8859-15 A6 ⇒ U+0160 < Š > \N{LATIN CAPITAL LETTER S WITH CARON}
iso-8859-1  A8 ⇔ U+00A8 < ¨ > \N{DIAERESIS}
iso-8859-15 A8 ⇒ U+0161 < š > \N{LATIN SMALL LETTER S WITH CARON}
iso-8859-1  B4 ⇔ U+00B4 < ´ > \N{ACUTE ACCENT}
iso-8859-15 B4 ⇒ U+017D < Ž > \N{LATIN CAPITAL LETTER Z WITH CARON}
iso-8859-1  B8 ⇔ U+00B8 < ¸ > \N{CEDILLA}
iso-8859-15 B8 ⇒ U+017E < ž > \N{LATIN SMALL LETTER Z WITH CARON}
iso-8859-1  BC ⇔ U+00BC < ¼ > \N{VULGAR FRACTION ONE QUARTER}
iso-8859-15 BC ⇒ U+0152 < Œ > \N{LATIN CAPITAL LIGATURE OE}
iso-8859-1  BD ⇔ U+00BD < ½ > \N{VULGAR FRACTION ONE HALF}
iso-8859-15 BD ⇒ U+0153 < œ > \N{LATIN SMALL LIGATURE OE}
iso-8859-1  BE ⇔ U+00BE < ¾ > \N{VULGAR FRACTION THREE QUARTERS}
iso-8859-15 BE ⇒ U+0178 < Ÿ > \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}

> That said, I wonder why Unicode even includes ligatures like ff. Sounds
> like mission creep to me (and horrible annoyances for people like us).

I'm pretty sure that typographic ligatures are there for roundtripping with legacy encodings. I believe that œ/Œ is the only code point with ligature in its name that you're "supposed" to still use, and that all others should be figured out by modern fonting software.

--tom |
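The eight differing byte values listed in this message can be checked with Python's standard codecs. This is only an illustrative sketch (Python 3.6+ for the f-string); the name `compare_codecs` is made up here:

```python
import unicodedata

# Byte values where ISO-8859-15 departs from ISO-8859-1, as listed above.
DIFFERING_BYTES = [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]


def compare_codecs(byte_values):
    """Decode each byte with both codecs and return (byte, latin1, latin9) rows."""
    rows = []
    for value in byte_values:
        raw = bytes([value])
        rows.append((value, raw.decode("iso-8859-1"), raw.decode("iso-8859-15")))
    return rows


rows = compare_codecs(DIFFERING_BYTES)
for value, latin1, latin9 in rows:
    print(f"{value:02X}  U+{ord(latin1):04X} {unicodedata.name(latin1)}"
          f"  |  U+{ord(latin9):04X} {unicodedata.name(latin9)}")
```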
|||
msg143145 - (view) | Author: Jean-Michel Fauth (Jean-Michel.Fauth) | Date: 2011-08-29 13:13 | |
Œ, œ or even & are historically ligatures or "ligatured forms". In French typography, they are "single plain letters" and they belong to the group of 42 letters used in French typography. Typographically speaking, using "oe" instead of "œ" is considered a mistake, while not using the ligatured forms for groups of letters like ff, ffi, ffl, fj, et, st is acceptable. Microsoft with cp1252, Apple with mac-roman, Adobe and all foundries, and now "Unicode", handle this correctly. It should be noted that when "TeX" moved from ASCII to iso-8859-1 (more precisely "CorkEncoding") as its default encoding, "they" saw the problem and introduced the \oe and \OE commands. From my understanding and my point of view on the subject, ISO has somehow recognized its mistake by introducing iso-8859-15. Unfortunately, it was too late. To the subject: Œdipe: correct; Oedipe, OEdipe: incorrect. Without being an expert in that field, I find that the information on Wikipedia (French) regarding questions about typography is generally correct. |
|||
msg143146 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2011-08-29 13:21 | |
> Œ, œ or even & are historically ligatures or "ligatured forms". > In the French typography, they are "single plain letters" and > they belong the group of the 42 letters used in the French > typography. > Typographically speaking, using "oe" instead of "œ" is considered > as a mistake, It's not only "typographically speaking", it's really a spelling error, even in hand-written text :-) |
|||
msg143148 - (view) | Author: Tom Christiansen (tchrist) | Date: 2011-08-29 14:16 | |
Antoine Pitrou <report@bugs.python.org> wrote on Mon, 29 Aug 2011 13:21:06 -0000:

> It's not only "typographically speaking", it's really a spelling error,
> even in hand-written text :-)

Sure, and so too is omitting an accent mark or diaeresis. But, alas, you’ll never convince most monoglot anglophones of that, the ones who keep wanting to strip them from résumé, façade, châteaux, crème brûlée, fête, tête-à-tête, à la française, or naïveté, not to mention José, jalapeño, the erstwhile American Secretary of State Federico Peña, or nearby Cañon City, Colorado, where I have family.

I think œnology has survived solely on its rarity, and the Encyclopædia Britannica is that way because the ligat(ur)ed letter is in their actual trademark.

Cell phone users sending text messages have long suffered the grievous injuries to their language(s) that naked ASCII imparts, but this is nothing like the crossdressing nightmare called Greeklish, also variously known as Grenglish, Latinoellinika/Λατινοελληνικά, or ASCII Greek.

http://en.wikipedia.org/wiki/Greeklish

[...] The reason for this is the fact that text written in Greeklish is considerably less aesthetically pleasing, and also much harder to read, compared to text written in the Greek alphabet. A non-Greek speaker/reader can guess this by this example: "δις ιζ χαρντ του ριντ" would be the way to write "this is hard to read" in English but utilizing the Greek alphabet.

I especially enjoy George Baloglou’s "Byzantine" Grenglish, wherein:

Ὀδυσσεύς => Oducceus instead of Odysseus
Ἀχιλλεύς => Axilleus instead of Achilleus
Σίσυφος => Sicuphos instead of Sisyphus
Περικλῆς => 5epiklhs instead of Pericles
Χθονός => X8onos instead of Chthonos
Οι Ατρείδες => Oi Atpeides instead of the Atreïdes

Terrible though the depredations upon the French language that may have been committed by ASCII, surely these go even further. :)

--tom

Η Ιλιάδα    H Iliada
Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλῆος    Mhnin aeide, 8ea, 5hlhiadeo Axilhos
οὐλομένην, ἣ μυρί’ Ἀχαιοῖς ἄλγε’ ἔθηκε,    oulomenhn, 'h mupi’ Axaiois alge’ e8hke,
πολλὰς δ’ ἰφθίμους ψυχὰς Ἄϊδι προῒαψεν    nollas d’ iph8imous yuxas Aidi npoiayen
ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν    'hpoon, autous de elopia teuxe kuneccin
οἰωνοῖσί τε πᾶσι· Διὸς δ’ ἐτελείετο βουλή·    oionoici te naci· Dios d’ eteleieto boulh·
ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε    eks o'u dh ta npota diacththn epicante
Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.    Atpeidhs te anaks andpon kai dios Axilleus. |
|||
msg150844 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2012-01-08 03:54 | |
Here is a patch. I only dealt with case mappings and not titlecase. Doing titlecase properly requires word segmentation, which I think should be another patch/issue. This patch fixes swapcase(), capitalize(), upper(), and lower(). It does not include the changes to Objects/unicodetype_db.h because those are huge. Regenerate the database if you want to test it. Please review. |
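A short illustration of what full case mapping means for `upper()` and `lower()`, assuming an interpreter that includes this change (Python 3.3 or later). The sample characters are taken from the listing earlier in this issue:

```python
sharp_s = "\N{LATIN SMALL LETTER SHARP S}"
fi_ligature = "\N{LATIN SMALL LIGATURE FI}"
n_apostrophe = "\N{LATIN SMALL LETTER N PRECEDED BY APOSTROPHE}"
dotted_capital_i = "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}"

# With full mappings a single code point may expand to several.
upper_sharp_s = sharp_s.upper()            # two code points
upper_fi = fi_ligature.upper()             # two code points
upper_n = n_apostrophe.upper()             # modifier apostrophe + capital N
lower_dotted_i = dotted_capital_i.lower()  # i + combining dot above

for original, mapped in [(sharp_s, upper_sharp_s), (fi_ligature, upper_fi),
                         (n_apostrophe, upper_n), (dotted_capital_i, lower_dotted_i)]:
    print(len(original), "->", len(mapped), ascii(mapped))
```

Because the result can be longer than the input, code that assumes `len(s.upper()) == len(s)` is no longer safe.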
|||
msg150998 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2012-01-10 03:49 | |
New patch. I implemented it the way Antoine desired. It seems rather inefficient to be copying around so much data... |
|||
msg151016 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2012-01-10 14:03 | |
__ap__'s implementation method is about 2x faster than mine. |
|||
msg151088 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2012-01-11 20:20 | |
New patch with title casing mappings added. |
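As a hedged illustration of why titlecase is a separate mapping from uppercase, the digraph letters and ligatures from the listing earlier in this issue behave as follows on Python 3.3 or later. The word fragments are arbitrary test strings:

```python
dz_caron = "\N{LATIN SMALL LETTER DZ WITH CARON}"
fi_ligature = "\N{LATIN SMALL LIGATURE FI}"

# The digraph has a distinct titlecase form (capital D, small z with caron).
title_digraph = (dz_caron + "ungla").title()
upper_digraph = (dz_caron + "ungla").upper()

# The ligature expands, and only its first letter is capitalised in titlecase.
title_ligature = (fi_ligature + "sh").title()
upper_ligature = (fi_ligature + "sh").upper()

print(ascii(title_digraph), ascii(upper_digraph))
print(title_ligature, upper_ligature)
```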
|||
msg151098 - (view) | Author: Roundup Robot (python-dev) | Date: 2012-01-11 23:17 | |
New changeset f7e05d205a52 by Benjamin Peterson in branch 'default': use full unicode mappings for upper/lower/title case (#12736) http://hg.python.org/cpython/rev/f7e05d205a52 |
|||
msg151141 - (view) | Author: Jim Jewett (Jim.Jewett) * | Date: 2012-01-12 17:17 | |
The currently applied patch ( http://hg.python.org/cpython/rev/f7e05d205a52 ) left some dead code in unicodeobject.c. The function fixup ( http://hg.python.org/cpython/file/f7e05d205a52/Objects/unicodeobject.c#l9386 ) has a shortcut for when the fixer doesn't make any actual changes. The removed fixers (like fixupper) returned 0 rather than maxchar to indicate that. The only remaining fixer, fix_decimal_and_space_to_ascii (line 8839), does not. (I think fix_decimal_and_space_to_ascii *should* add a touched flag, but until it does, the shortcut dedup code is dead.) Also, around line 10502, there is an #if 0 section with code that relied on one of the removed fixers; is it time to remove that section? |
|||
msg151311 - (view) | Author: Jim Jewett (Jim.Jewett) * | Date: 2012-01-16 00:24 | |
Why was the delta-processing removed from the casing functions? As best I can tell, the whole point of going through multiple levels of indirection (courtesy splitbins) is to maximize compression and minimize the amount of cache that unicode might occupy. By using deltas, only one record is needed for each combination of (upper - lower, upper - title), which is generally only one or two combinations per script. Without deltas, nearly every cased letter needs its own record, and the index tables also get bigger. (It seems to be about 2.6 times as large, but cache effects may be worse, since letters from the same script will no longer be in the same record or the same index chain.) If it is a concern about not enough room for flags, then the decimal/digit chars could be combined. They are always the same, unless the number isn't decimal (in which case the flag is enough). |
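The compression argument can be illustrated with a toy count of distinct records. This is only a sketch of the idea, not the actual database generator; `record_counts` is a hypothetical helper, and it looks only at single-character upper/lower mappings:

```python
def record_counts(code_points):
    """Count distinct (upper, lower) records stored absolutely vs. as deltas."""
    absolute_records = set()
    delta_records = set()
    for cp in code_points:
        ch = chr(cp)
        upper, lower = ch.upper(), ch.lower()
        if len(upper) != 1 or len(lower) != 1:
            continue  # multi-character (full) mappings would need separate handling
        absolute_records.add((ord(upper), ord(lower)))
        delta_records.add((ord(upper) - cp, ord(lower) - cp))
    return len(absolute_records), len(delta_records)


# A-Z and a-z: every pair needs its own absolute record, but only two deltas
# occur, (0, +32) for capitals and (-32, 0) for small letters.
ascii_letters = list(range(0x41, 0x5B)) + list(range(0x61, 0x7B))
absolute, delta = record_counts(ascii_letters)
print(absolute, delta)
```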
|||
msg151314 - (view) | Author: Roundup Robot (python-dev) | Date: 2012-01-16 02:19 | |
New changeset 03ea95e3b497 by Benjamin Peterson in branch 'default': delta encoding of upper/lower/title makes a glorious return (#12736) http://hg.python.org/cpython/rev/03ea95e3b497 |
|||
msg261517 - (view) | Author: Андрей Баксаляр (Андрей Баксаляр) | Date: 2016-03-10 17:37 | |
The same problem with Unicode case mapping is still present in Python 3.4.3. You can reproduce the bug with this code, for instance:
'ΰ'.upper().lower() == 'ΰ'
The case swapping strangely leads to a character replacement:
b'\xce\xb0' → b'\xcf\x85\xcc\x88\xcc\x81' |
|||
msg261522 - (view) | Author: Андрей Баксаляр (Андрей Баксаляр) | Date: 2016-03-10 20:21 | |
Interestingly, the bug is still reproducible in version 3.5.1, but fixed in 2.7.6. |
|||
msg261547 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2016-03-11 07:39 | |
The full case mappings do not preserve normalization form.
>>> import unicodedata
>>> for c in 'ΰ'.upper().lower(): print(unicodedata.name(c))
...
GREEK SMALL LETTER UPSILON
COMBINING DIAERESIS
COMBINING ACUTE ACCENT
>>> unicodedata.normalize('NFC', 'ΰ'.upper().lower()) == 'ΰ'
True |
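One way to compare strings so that such round trips do not matter is to normalize around the case folding. The helper below is a sketch modeled on the Unicode Standard's definition of canonical caseless matching (NFD, casefold, NFD); the name `canonical_caseless_equal` is an assumption of this example, not a standard-library function:

```python
import unicodedata


def canonical_caseless_key(text):
    """Return NFD(casefold(NFD(text))) as a comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def canonical_caseless_equal(left, right):
    """Compare two strings ignoring case and canonical-equivalent spellings."""
    return canonical_caseless_key(left) == canonical_caseless_key(right)


original = "\N{GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS}"
round_tripped = original.upper().lower()

print(round_tripped == original)                          # plain comparison
print(canonical_caseless_equal(round_tripped, original))  # normalized comparison
```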
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:20 | admin | set | github: 56945 |
2016-03-11 07:39:49 | benjamin.peterson | set | messages: + msg261547 |
2016-03-10 20:44:03 | gvanrossum | set | nosy: - gvanrossum |
2016-03-10 20:42:37 | SilentGhost | set | versions: + Python 3.4, Python 3.5, Python 3.6, - Python 2.7 |
2016-03-10 20:21:51 | Андрей Баксаляр | set | files: + pythonbug.png messages: + msg261522 versions: + Python 2.7, - Python 3.4 |
2016-03-10 17:37:31 | Андрей Баксаляр | set | nosy: + Андрей Баксаляр messages: + msg261517 versions: + Python 3.4, - Python 3.3 |
2013-06-23 23:56:10 | belopolsky | link | issue4610 superseder |
2012-01-16 02:19:31 | python-dev | set | messages: + msg151314 |
2012-01-16 00:24:46 | Jim.Jewett | set | messages: + msg151311 |
2012-01-12 17:17:24 | Jim.Jewett | set | nosy: + Jim.Jewett messages: + msg151141 |
2012-01-11 23:23:51 | benjamin.peterson | set | status: open -> closed resolution: fixed |
2012-01-11 23:17:46 | python-dev | set | nosy: + python-dev messages: + msg151098 |
2012-01-11 20:20:09 | benjamin.peterson | set | files: + full-casemapping.patch messages: + msg151088 |
2012-01-11 03:38:21 | benjamin.peterson | set | files: + full-casemapping.patch |
2012-01-10 14:03:39 | benjamin.peterson | set | messages: + msg151016 |
2012-01-10 03:49:31 | benjamin.peterson | set | files: + full-casemapping.patch messages: + msg150998 |
2012-01-08 03:54:29 | benjamin.peterson | set | files: + full-casemapping.patch nosy: + benjamin.peterson messages: + msg150844 keywords: + patch |
2011-08-29 14:16:04 | tchrist | set | messages: + msg143148 |
2011-08-29 13:21:06 | pitrou | set | messages: + msg143146 |
2011-08-29 13:13:57 | Jean-Michel.Fauth | set | nosy: + Jean-Michel.Fauth messages: + msg143145 |
2011-08-28 21:01:49 | tchrist | set | messages: + msg143124 |
2011-08-28 18:56:35 | mrabarnett | set | messages: + msg143119 |
2011-08-28 17:27:28 | gvanrossum | set | messages: + msg143110 |
2011-08-28 05:54:35 | ezio.melotti | set | files: + casing-results.txt messages: + msg143089 |
2011-08-27 20:04:56 | pitrou | set | nosy: + pitrou messages: + msg143086 |
2011-08-27 19:29:28 | mrabarnett | set | messages: + msg143085 |
2011-08-27 19:17:30 | tchrist | set | messages: + msg143084 |
2011-08-27 16:15:33 | gvanrossum | set | messages: + msg143083 |
2011-08-27 14:48:38 | tchrist | set | messages: + msg143072 |
2011-08-26 23:55:58 | tchrist | set | files: + casing-tests.py messages: + msg143052 |
2011-08-26 23:36:17 | tchrist | set | messages: + msg143051 |
2011-08-26 21:11:23 | gvanrossum | set | nosy: + gvanrossum messages: + msg143036 |
2011-08-13 00:58:12 | mrabarnett | set | nosy: + mrabarnett |
2011-08-12 18:05:57 | Arfrever | set | nosy: + Arfrever |
2011-08-12 17:30:15 | eric.araujo | set | components: + Interpreter Core, Unicode, - Library (Lib) versions: + Python 3.3, - Python 3.2 |
2011-08-12 00:17:23 | ezio.melotti | set | nosy: + belopolsky, ezio.melotti |
2011-08-11 21:39:44 | tchrist | create |