classification
Title: Further improve casefold documentation
Type: Stage: needs patch
Components: Documentation Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Mariatta Nosy List: Jim.Jewett, Marc Richter, Mariatta, MrSupertash, benjamin.peterson, cheryl.sabella, docs@python, mark, rhettinger
Priority: normal Keywords:

Created on 2012-01-19 17:06 by Jim.Jewett, last changed 2020-08-24 17:39 by Jim.Jewett.

Messages (11)
msg151644 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2012-01-19 17:06
> http://hg.python.org/cpython/rev/0b5ce36a7a24
> changeset:   74515:0b5ce36a7a24


> +   Casefolding is similar to lowercasing but more aggressive because it is
> +   intended to remove all case distinctions in a string. For example, the German
> +   lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already
> +   lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold`
> +   converts it to ``"ss"``.

Perhaps add the recommendation to canonicalize as well.

A complete, but possibly too long, try is below:


Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold` converts it to ``"ss"``.  Note that most case-insensitive matches should also match compatibility equivalent characters.  

The casefolding algorithm is described in section 3.13 of the Unicode Standard.  Per D146, a compatibility caseless match can be achieved by

    from unicodedata import normalize
    def caseless_compat(string):
        nfd_string = normalize("NFD", string)
        nfkd1_string = normalize("NFKD", nfd_string.casefold())
        return normalize("NFKD", nfkd1_string.casefold())
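
As a rough usage sketch (the sample strings are illustrative assumptions, not taken from the Unicode Standard), the function above should let compatibility-equivalent strings compare equal where plain casefold() keeps them apart:

```python
from unicodedata import normalize

def caseless_compat(string):
    # Same steps as the function above: NFD first, then casefold + NFKD twice.
    nfd_string = normalize("NFD", string)
    nfkd1_string = normalize("NFKD", nfd_string.casefold())
    return normalize("NFKD", nfkd1_string.casefold())

# U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A, compatibility equivalent to "A".
fullwidth = "\uff21BC"

# Plain casefold() only folds case; the fullwidth letter stays fullwidth.
plain_match = fullwidth.casefold() == "abc".casefold()

# The compatibility caseless form also removes the fullwidth distinction.
compat_match = caseless_compat(fullwidth) == caseless_compat("abc")

print(plain_match, compat_match)
```

The escape sequence is used instead of the literal character so the example survives copy and paste.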
msg151645 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2012-01-19 17:09
Frankly, I do think that sample code is too long, but correctness matters ... perhaps a better solution would be to add either a method or a unicodedata function that does the work, then the extra note could just say

Note that most case-insensitive matches should also match compatibility equivalent characters; see unicodedata.compatibity_casefold
msg151665 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-20 01:12
It's a bit unfriendly to launch into discussion of "compatibility caseless matching" when the new reader probably has no idea what "compatibility-equivalence" is.
msg253662 - (view) Author: Mark Summerfield (mark) * Date: 2015-10-29 07:14
I think the str.casefold() docs are fine as far as they go, rightly covering what it _does_ rather than _how_, yet providing a reference for the details. But what they lack is more complete information. For example I discovered this:

>>> x = "ﬁles and shuﬄes"
>>> x
'ﬁles and shuﬄes'
>>> x.casefold()
'files and shuffles'

In view of this I would add one sentence:

    In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
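A minimal sketch of the ligature behavior, written with escape sequences so the ligature characters cannot be lost in transit. The sample string is the one from the session above; the variable names are just for illustration:

```python
# U+FB01 is LATIN SMALL LIGATURE FI, U+FB04 is LATIN SMALL LIGATURE FFL.
x = "\ufb01les and shu\ufb04es"

# The ligatures are already lowercase, so lower() leaves them alone.
lowered = x.lower()

# Full case folding expands each ligature into its plain letters.
folded = x.casefold()

print(len(x), len(lowered), len(folded))
print(folded)
```

The folded string is longer than the original, which is worth remembering when comparing lengths or offsets before and after casefold().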
msg253797 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-10-31 15:36
> In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".

+1 I would have found that sentence to be helpful.
msg327334 - (view) Author: Marc Richter (Marc Richter) Date: 2018-10-08 09:33
+1 as well.

To be honest, I did not understand what this function does in detail yet.
Not too long ago (2017), an uppercase variant of the special letter from this function's example (ß) was added to the official German orthography [1].
Does this function's behavior need to change now, or is the current behavior still expected? I'm still puzzled, and I think the whole function should get a clearer description.

[1]: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E
msg338689 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-03-23 16:53
Assigning to @Mariatta for the sprints.
msg375842 - (view) Author: Thorsten (MrSupertash) Date: 2020-08-24 13:48
The German example in the casefold documentation is plainly incorrect.

> Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".

It is not true that "ß" is equivalent to "ss", and it has not been since the orthography reform of 1996. The two are used in distinct cases: "ß" after a diphthong or a long/open vowel, "ss" after a short/closed vowel. The documentation correctly describes (in this case) how Python handles .casefold() for this letter, although the behavior itself is incorrect.

As mentioned before, in 2017 an official uppercase version of "ß" was introduced into German orthography: "ẞ". The documentation should describe the German example as currently incorrect behavior.

+1 to adding the previously mentioned sentence: In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
msg375844 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2020-08-24 13:52
Correctness of casefolding is defined by the Unicode standard, which currently states that "ß" folds to "ss".
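A minimal sketch of what that means in practice, assuming a Python 3 build whose Unicode database includes the capital sharp s (the sample word and variable names are illustrative choices):

```python
sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

results = {
    # Already lowercase, so lower() returns it unchanged.
    "lower": sharp_s.lower(),
    # Full case folding maps it to two plain letters.
    "casefold": sharp_s.casefold(),
    # The capital form lowercases to the small sharp s ...
    "capital lower": capital_sharp_s.lower(),
    # ... but folds to the same two letters as the small form.
    "capital casefold": capital_sharp_s.casefold(),
}
print(results)

# Caseless comparison of two spellings of the same word.
caseless_equal = "Stra\u00dfe".casefold() == "STRASSE".casefold()
print(caseless_equal)
```

The point is that casefold() is meant for caseless matching, not for producing correctly spelled text.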
msg375847 - (view) Author: Thorsten (MrSupertash) Date: 2020-08-24 15:01
I see. I found the documents. That's an issue. That usage is incorrect. It is still valid to uppercase "ß" to "SS", since "ẞ" is fairly new as an official German character, but the other way around is not valid.

As such, the current sentence in the documentation also just does not make sense.

>"Since it is already lowercase, lower() would do nothing to 'ß'"

Exactly. Why would it? It is nonsensical to change an already lowercase character with a lowercase function.

I suggest updating it to:

"For example, the Unicode standard for German lower case letter 'ß' prescribes full casefolding to 'ss'. Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to 'ss'.
In addition to full lowercasing, this function also expands ligatures, for example, 'ﬁ' becomes 'fi'."
msg375858 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2020-08-24 17:39
Unicode probably won't make the correction, because of backwards
compatibility.  I do support the sentence suggested in Thorsten's most
recent reply.  Is expanding ligatures the only other normalization it does?

Ideally, we should also mention that it shifts to the canonical case, which
is usually (but not always) lowercase.  I think Cherokee is one that folds
to the upper case.
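
A quick way to check the Cherokee case, as a sketch: it assumes a Python version whose Unicode database includes the Cherokee small letters (added in Unicode 8.0), and the variable names are illustrative.

```python
# U+13A0 is CHEROKEE LETTER A (uppercase); U+AB70 is CHEROKEE SMALL LETTER A.
upper_a = "\u13a0"
small_a = "\uab70"

# Case folding maps the small letter to the uppercase one.
folded_small = small_a.casefold()

# The uppercase letter is already in its folded form.
folded_upper = upper_a.casefold()

# lower() still goes the usual way, from uppercase to small.
lowered_upper = upper_a.lower()

print(folded_small == upper_a, folded_upper == upper_a, lowered_upper == small_a)
```

So for Cherokee, casefold() and lower() produce different results, which supports describing casefolding as a shift to a canonical case and not as aggressive lowercasing.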

History
Date                 User               Action  Args
2020-08-24 17:39:41  Jim.Jewett         set     messages: + msg375858
2020-08-24 15:01:42  MrSupertash        set     messages: + msg375847
2020-08-24 13:52:36  benjamin.peterson  set     messages: + msg375844
2020-08-24 13:48:30  MrSupertash        set     nosy: + MrSupertash
                                                messages: + msg375842
2019-03-23 16:53:57  cheryl.sabella     set     versions: + Python 3.7, Python 3.8, - Python 3.3
                                                nosy: + Mariatta, cheryl.sabella
                                                messages: + msg338689
                                                assignee: docs@python -> Mariatta
                                                stage: needs patch
2018-10-08 09:33:46  Marc Richter       set     nosy: + Marc Richter
                                                messages: + msg327334
2015-10-31 15:36:13  rhettinger         set     nosy: + rhettinger
                                                messages: + msg253797
2015-10-29 07:14:19  mark               set     nosy: + mark
                                                messages: + msg253662
2012-01-20 01:12:41  benjamin.peterson  set     messages: + msg151665
2012-01-19 17:09:52  Jim.Jewett         set     messages: + msg151645
2012-01-19 17:06:02  Jim.Jewett         create