classification
Title: More fully implement Unicode's case mappings
Type:
Stage:
Components: Unicode
Versions: Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Greg Price, benjamin.peterson, ezio.melotti, lemburg, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2019-08-14 05:42 by Greg Price, last changed 2019-08-15 01:15 by benjamin.peterson.

Messages (9)
msg349646 - Author: Greg Price (Greg Price) Date: 2019-08-14 05:42
Splitting this out from #32771 for more specific discussion. Benjamin writes there that it would be good to:

> implement the locale-specific case mappings of https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt and §3.13 of the Unicode 12 standard in str.lower/upper/casefold.

and adds that an implementation would require having available in the core the data on canonical combining classes, which is currently only in the unicodedata module.
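
To make the combining-class dependency concrete, here is a minimal sketch of the `More_Above` context from §3.13: a character is followed by a mark of combining class 230 (Above), with no intervening character of class 0 or 230. The helper name `more_above` is made up for illustration, and it uses `unicodedata.combining`, which is where that data lives today.

```python
import unicodedata


def more_above(s, i):
    """Return True if s[i] is followed by a combining mark of class 230
    (Above) before any character of class 0 is reached."""
    for ch in s[i + 1:]:
        ccc = unicodedata.combining(ch)
        if ccc == 230:
            return True
        if ccc == 0:
            return False
    return False


print(unicodedata.combining("\u0307"))  # COMBINING DOT ABOVE -> 230
print(more_above("I\u0301", 0))         # acute accent directly above -> True
print(more_above("I\u0323\u0301", 0))   # dot below (class 220) is skipped -> True
print(more_above("Ia\u0301", 0))        # the accent belongs to "a" -> False
```

An implementation in the core would need the same lookup without importing `unicodedata`, which is the point Benjamin raises.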

---

First, I'd like to better understand what functionality we have now and what else the standard describes. Reading https://www.unicode.org/Public/12.0.0/ucd/SpecialCasing.txt, I see:
* a bunch of rules that aren't language-specific
* some other rules that are.

I also see in makeunicodedata.py that we don't even parse the language-specific rules.
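
For illustration, here is a rough sketch of what parsing those rules could look like. This is not what makeunicodedata.py does; the function name `parse_special_casing` and the returned dict layout are assumptions. It also assumes that lowercase condition tokens are language IDs and that anything else is a casing context.

```python
def parse_special_casing(line):
    """Parse one data line of SpecialCasing.txt.

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0]
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 4 or not fields[0]:
        return None

    def to_str(field):
        # "0069 0307 0301" -> "i\u0307\u0301"
        return "".join(chr(int(cp, 16)) for cp in field.split())

    # The optional fifth field holds the conditions, space-separated.
    conditions = fields[4].split() if len(fields) > 4 else []
    return {
        "char": to_str(fields[0]),
        "lower": to_str(fields[1]),
        "title": to_str(fields[2]),
        "upper": to_str(fields[3]),
        # Assumption: language IDs are all lowercase ("lt", "tr", "az").
        "languages": [c for c in conditions if c.islower()],
        "contexts": [c for c in conditions if not c.islower()],
    }


rule = parse_special_casing(
    "00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE"
)
print(rule["languages"], [hex(ord(c)) for c in rule["lower"]])
```

A rule with an empty `languages` list is language-independent; the others are the ones we skip today.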

Here's, IIUC, a demo of us correctly implementing the language-independent rules.  One line in the data file reads:

FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF

And in fact the `lower`, `title`, and `upper` of `\uFB00` are those strings respectively:

$ unicode --brief "$(./python -c \
   's="\ufb00"; print(" ".join((s.lower(), s.title(), s.upper())))')"
ﬀ U+FB00 LATIN SMALL LIGATURE FF
  U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
f U+0066 LATIN SMALL LETTER F
  U+0020 SPACE
F U+0046 LATIN CAPITAL LETTER F
F U+0046 LATIN CAPITAL LETTER F

OK, great.
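
The same check can be done without the external `unicode` tool. This is a small sketch using only `unicodedata.name` to label each resulting code point:

```python
import unicodedata

s = "\ufb00"  # LATIN SMALL LIGATURE FF
for label, result in (("lower", s.lower()),
                      ("title", s.title()),
                      ("upper", s.upper())):
    names = ", ".join(f"U+{ord(c):04X} {unicodedata.name(c)}" for c in result)
    print(f"{label}: {names}")
```

This prints one line per mapping, so it is easy to compare against the three fields of the SpecialCasing.txt entry.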

---


Then here's something we don't implement. Another line in the file reads:

00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE

IOW `'\u00CD'` should lowercase to `'\u0069\u0307\u0301'`, i.e.:

i U+0069 LATIN SMALL LETTER I
 ̇ U+0307 COMBINING DOT ABOVE
 ́ U+0301 COMBINING ACUTE ACCENT

... but only in a Lithuanian (`lt`) locale.
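
To show the gap, here is a sketch comparing what `str.lower` returns today with what the Lithuanian rule asks for. The helper `lower_lt` and the table `LT_LOWER` are hypothetical, and the table holds only the one rule quoted above; a real implementation would load every `lt` row from the data file and handle the context-dependent rules too.

```python
# Hypothetical table: only the unconditional "lt" rule quoted above.
LT_LOWER = str.maketrans({
    "\u00CD": "i\u0307\u0301",  # LATIN CAPITAL LETTER I WITH ACUTE
})


def lower_lt(s):
    """Apply the Lithuanian-specific mapping first, then the default lowercasing."""
    return s.translate(LT_LOWER).lower()


s = "\u00CD"
print([f"U+{ord(c):04X}" for c in s.lower()])    # today: ['U+00ED']
print([f"U+{ord(c):04X}" for c in lower_lt(s)])  # ['U+0069', 'U+0307', 'U+0301']
```

Translating before calling `lower()` keeps the existing whole-string behavior for everything the table does not cover.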

One question is: what would the right API for this be? I'm not sure I'd want `str.lower`'s results to depend on the process's current Unix locale... and I definitely wouldn't want to get that without some way of instead telling it what locale to use. (Either to use a single constant locale, or to use a per-user locale in e.g. a web application.)  Perhaps `str.lower` and friends would take a keyword argument `locale`?


Oh, one more link for reference: the section of the standard mentioned above (§3.13) is near the end of this PDF: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf


And a related previous issue: #12736.
msg349649 - Author: Greg Price (Greg Price) Date: 2019-08-14 06:02
Another previous discussion is #4610.
msg349659 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2019-08-14 07:40
The Unicode implementation is deliberately not locale specific and
this should not change.

If a locale specific mapping is requested, this should be done
explicitly by e.g. providing a parameter to str.lower() / upper() /
title().
msg349660 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2019-08-14 08:08
I believe that all locale specific things should be in the locale module, not in the str class.
msg349669 - Author: STINNER Victor (vstinner) (Python committer) Date: 2019-08-14 10:19
Handling locales correctly is a pain. Each platform uses different locale names (different Linux distributions, Windows, FreeBSD, macOS, etc.), for example en_US.UTF-8 vs en_US.utf8. There are also tons of bugs related to locale.getdefaultlocale(), which tries to be smart about parsing locales. I have been fixing locale encoding bugs for 10 years and I'm not done yet: there are still many open bugs.

I suggest that you first create a module on PyPI to experiment with getting the locale and to try implementing the Unicode algorithms that depend on the locale.

Maintaining Python is already expensive; I would prefer not to complicate the codebase too much with locales. There are already enough bugs waiting for you to fix ;-)
msg349732 - Author: Greg Price (Greg Price) Date: 2019-08-14 18:26
> I believe that all locale specific things should be in the locale module, not in the str class.

The locale module is all about doing things with the current process-global Unix locale. I don't think that'd be an appropriate interface for this -- if it's worth doing, it's worth doing in such a way that the same web server process can handle requests for Turkish-, Lithuanian-, and Spanish-speaking users without having to reset a global variable for each one.

> If a locale specific mapping is requested, this should be done
> explicitly by e.g. providing a parameter to str.lower() / upper() /
> title().

I like this design.

I said "locale" above, but that wasn't quite right, I think -- the file says e.g. `tr`, not `tr_TR` and `tr_CY`, and it describes the identifiers as "language IDs".  So perhaps

str.lower(*, lang=None)

?  And then

"I".lower(lang="tr") == "ı" == "\N{Latin small letter dotless I}"
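
Until something like that exists, the idea can be approximated in pure Python. This is a rough sketch with a made-up helper `lower_lang`, covering only the Turkish/Azerbaijani I rules as I read them in SpecialCasing.txt. The `Not_Before_Dot` and `After_I` contexts are simplified to the case where the dot above directly follows the I, and Lithuanian is left out.

```python
def lower_lang(s, lang=None):
    """Lowercase s, optionally applying the tr/az rules for the letter I.

    Simplification: a COMBINING DOT ABOVE is only recognized when it
    immediately follows "I" (no intervening combining marks).
    """
    if lang in ("tr", "az"):
        s = s.replace("I\u0307", "i")  # I + dot above -> plain i, dot removed
        s = s.replace("\u0130", "i")   # capital I with dot above -> i
        s = s.replace("I", "\u0131")   # remaining capital I -> dotless i
    return s.lower()


print(lower_lang("I", lang="tr") == "\N{LATIN SMALL LETTER DOTLESS I}")  # True
print(lower_lang("I") == "i")                                            # True
```

Because the language is an explicit argument, one process can lowercase with `lang="tr"` for one request and `lang=None` for the next without touching any global state.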
msg349733 - Author: Greg Price (Greg Price) Date: 2019-08-14 18:40
> Maintaining Python is already expensive [...] There are already enough bugs waiting for you to be fixed ;-)

BTW I basically agree with this. I think this is not a high-priority issue, and I have my eye on some of those bugs. :-)

I think the fact that it's per-*language* (despite my inaccurate phrasing in the OP), not per-locale, simplifies it somewhat -- for example the whole `.UTF-8` vs `.utf8` thing doesn't appear. And in particular, I think that if/when someone sits down to implement this and takes the time to carefully read and absorb the relevant pages of the standard, it's pretty feasible for the implementation to be a self-contained, relatively stable, low-bug piece of code.

And in general I think even if nobody implements it soon, it's useful to have an issue that can be pointed to for this feature, and especially so if the discussion clearly lays out what the feature involves and what different people's views are on the API. For example #18236 has been open for 6 years, but the discussion there was extremely helpful for me to understand it and work up a fix, after just being pointed to it by someone who'd searched the tracker on seeing me send in the doc fix GH-15019.
msg349738 - Author: Greg Price (Greg Price) Date: 2019-08-14 19:09
(I should add that it was only after doing the reading that produced the OP that I had a clear idea what I thought the priority of the issue was -- before doing that work I didn't have a clear sense of the scope of what it affects. Based on that SpecialCasing.txt file as of Unicode 12.0.0, I believe the functionality we don't currently support is entirely about the handling of certain versions of the Latin letter I, as treated in Lithuanian, Turkish, and Azerbaijani. Though one function of this issue thread is that it would be a great place to point out if there's another component to it!)
msg349782 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2019-08-15 01:15
Greg has read my mind. An optional parameter to upper/lower/casefold was exactly the API I was thinking of. No C locales or the locale module involved.
History
Date                 User               Action  Args
2019-08-15 01:15:14  benjamin.peterson  set     messages: + msg349782
2019-08-14 19:09:19  Greg Price         set     messages: + msg349738
2019-08-14 18:40:35  Greg Price         set     messages: + msg349733
2019-08-14 18:26:30  Greg Price         set     messages: + msg349732
2019-08-14 10:19:24  vstinner           set     messages: + msg349669
2019-08-14 08:08:29  serhiy.storchaka   set     nosy: + serhiy.storchaka; messages: + msg349660
2019-08-14 07:40:52  lemburg            set     messages: + msg349659
2019-08-14 06:02:48  Greg Price         set     messages: + msg349649
2019-08-14 05:42:20  Greg Price         create