Message 349646 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Greg Price
Recipients	Greg Price, benjamin.peterson, ezio.melotti, lemburg, vstinner
Date	2019-08-14.05:42:20
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1565761340.97.0.731805225537.issue37848@roundup.psfhosted.org>
In-reply-to

Content
Splitting this out from #32771 for more specific discussion. Benjamin writes there that it would be good to: > implement the locale-specific case mappings of https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of the Unicode 12 standard in str.lower/upper/casefold. and adds that an implementation would require having available in the core the data on canonical combining classes, which is currently only in the unicodedata module. --- First, I'd like to better understand what functionality we have now and what else the standard describes. Reading https://www.unicode.org/Public/12.0.0/ucd/SpecialCasing.txt , I see * a bunch of rules that aren't language-specific * some other rules that are. I also see in makeunicodedata.py that we don't even parse the language-specific rules. Here's, IIUC, a demo of us correctly implementing the language-independent rules. One line in the data file reads: FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF And in fact the `lower`, `title`, and `upper` of `\uFB00` are those strings respectively: $ unicode --brief "$(./python -c \ 's="\ufb00"; print(" ".join((s.lower(), s.title(), s.upper())))')" ﬀ U+FB00 LATIN SMALL LIGATURE FF U+0020 SPACE F U+0046 LATIN CAPITAL LETTER F f U+0066 LATIN SMALL LETTER F U+0020 SPACE F U+0046 LATIN CAPITAL LETTER F F U+0046 LATIN CAPITAL LETTER F OK, great. --- Then here's something we don't implement. Another line in the file reads: 00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE IOW `'\u00CD'` should lowercase to `'\u0069\u0307\u0301'`, i.e.: i U+0069 LATIN SMALL LETTER I ̇ U+0307 COMBINING DOT ABOVE ́ U+0301 COMBINING ACUTE ACCENT ... but only in a Lithuanian (`lt`) locale. One question is: what would the right API for this be? I'm not sure I'd want `str.lower`'s results to depend on the process's current Unix locale... and I definitely wouldn't want to get that without some way of instead telling it what locale to use. (Either to use a single constant locale, or to use a per-user locale in e.g. a web application.) Perhaps `str.lower` and friends would take a keyword argument `locale`? Oh, one more link for reference: the said section of the standard is in this PDF: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf , near the end. And a related previous issue: #12736.

Splitting this out from #32771 for more specific discussion. Benjamin writes there that it would be good to:

> implement the locale-specific case mappings of https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of the Unicode 12 standard in str.lower/upper/casefold.

and adds that an implementation would require having available in the core the data on canonical combining classes, which is currently only in the unicodedata module.

---

First, I'd like to better understand what functionality we have now and what else the standard describes. Reading https://www.unicode.org/Public/12.0.0/ucd/SpecialCasing.txt , I see
* a bunch of rules that aren't language-specific
* some other rules that are.

I also see in makeunicodedata.py that we don't even parse the language-specific rules.

Here's, IIUC, a demo of us correctly implementing the language-independent rules. One line in the data file reads:

FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

And in fact the `lower`, `title`, and `upper` of `\uFB00` are those strings respectively:

$ unicode --brief "$(./python -c \
's="\ufb00"; print(" ".join((s.lower(), s.title(), s.upper())))')"
ﬀ U+FB00 LATIN SMALL LIGATURE FF
U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
f U+0066 LATIN SMALL LETTER F
U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
F U+0046 LATIN CAPITAL LETTER F

OK, great.

---

Then here's something we don't implement. Another line in the file reads:

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

IOW `'\u00CD'` should lowercase to `'\u0069\u0307\u0301'`, i.e.:

i U+0069 LATIN SMALL LETTER I
̇ U+0307 COMBINING DOT ABOVE
́ U+0301 COMBINING ACUTE ACCENT

... but only in a Lithuanian (`lt`) locale.

One question is: what would the right API for this be? I'm not sure I'd want `str.lower`'s results to depend on the process's current Unix locale... and I definitely wouldn't want to get that without some way of instead telling it what locale to use. (Either to use a single constant locale, or to use a per-user locale in e.g. a web application.) Perhaps `str.lower` and friends would take a keyword argument `locale`?

Oh, one more link for reference: the said section of the standard is in this PDF: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf , near the end.

And a related previous issue: #12736.

History
Date	User	Action	Args
2019-08-14 05:42:20	Greg Price	set	recipients: + Greg Price, lemburg, vstinner, benjamin.peterson, ezio.melotti
2019-08-14 05:42:20	Greg Price	set	messageid: <1565761340.97.0.731805225537.issue37848@roundup.psfhosted.org>
2019-08-14 05:42:20	Greg Price	link	issue37848 messages
2019-08-14 05:42:20	Greg Price	create