classification
Title: Further improve casefold documentation
Type: Stage: needs patch
Components: Documentation Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Mariatta Nosy List: Jim.Jewett, Marc Richter, Mariatta, benjamin.peterson, cheryl.sabella, docs@python, mark, rhettinger
Priority: normal Keywords:

Created on 2012-01-19 17:06 by Jim.Jewett, last changed 2019-03-23 16:53 by cheryl.sabella.

Messages (7)
msg151644 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2012-01-19 17:06
> http://hg.python.org/cpython/rev/0b5ce36a7a24
> changeset:   74515:0b5ce36a7a24


> +   Casefolding is similar to lowercasing but more aggressive because it is
> +   intended to remove all case distinctions in a string. For example, the German
> +   lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already
> +   lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold`
> +   converts it to ``"ss"``.

Perhaps add the recommendation to canonicalize as well.

A complete, but possibly too long, try is below:


Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold` converts it to ``"ss"``.  Note that most case-insensitive matches should also match compatibility equivalent characters.  

The casefolding algorithm is described in section 3.13 of the Unicode Standard.  Per D146, a compatibility caseless match can be achieved by

    from unicodedata import normalize
    def caseless_compat(string):
        nfd_string = normalize("NFD", string)
        nfkd1_string = normalize("NFKD", nfd_string.casefold())
        return normalize("NFKD", nfkd1_string.casefold())
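
A usage sketch for the function above, assuming the transcription of D146 is right. The fullwidth sample string is an illustrative choice, not something taken from the Unicode Standard, and the function is restated only so the snippet runs on its own.

```python
from unicodedata import normalize


def caseless_compat(string):
    # D146: NFKD(casefold(NFKD(casefold(NFD(X)))))
    nfd_string = normalize("NFD", string)
    nfkd1_string = normalize("NFKD", nfd_string.casefold())
    return normalize("NFKD", nfkd1_string.casefold())


# Fullwidth letters U+FF21..U+FF23 are compatibility-equivalent to "ABC".
fullwidth = "\uff21\uff22\uff23"

# Plain casefold() keeps the fullwidth forms, so this comparison fails.
plain_match = fullwidth.casefold() == "abc".casefold()
# The compatibility caseless form also folds away the width distinction.
compat_match = caseless_compat(fullwidth) == caseless_compat("abc")

print(plain_match, compat_match)
```

With plain casefold() the two strings compare unequal; with caseless_compat() they compare equal.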
msg151645 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2012-01-19 17:09
Frankly, I do think that sample code is too long, but correctness matters ... perhaps a better solution would be to add either a method or a unicodedata function that does the work, then the extra note could just say

Note that most case-insensitive matches should also match compatibility equivalent characters; see unicodedata.compatibility_casefold
msg151665 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-20 01:12
It's a bit unfriendly to launch into a discussion of "compatibility caseless matching" when the new reader probably has no idea what "compatibility-equivalence" is.
msg253662 - (view) Author: Mark Summerfield (mark) * Date: 2015-10-29 07:14
I think the str.casefold() docs are fine as far as they go, rightly covering what it _does_ rather than _how_, yet providing a reference for the details. But what they lack is more complete information. For example I discovered this:

>>> x = "ﬁles and shuﬄes"
>>> x
'ﬁles and shuﬄes'
>>> x.casefold()
'files and shuffles'

In view of this I would add one sentence:

    In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
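
To see which other characters are affected, here is a minimal sketch that lists the code points where casefold() and lower() disagree. The helper name folds_beyond_lower and its limit parameter are hypothetical, chosen only for this illustration; the result depends on the Unicode version the interpreter was built with.

```python
def folds_beyond_lower(limit=0x10000):
    """Yield characters below `limit` whose casefold() differs from lower()."""
    for code_point in range(limit):
        char = chr(code_point)
        if char.casefold() != char.lower():
            yield char


# Print the affected characters in the Latin-1 range as a small sample.
for char in folds_beyond_lower(0x100):
    print(f"U+{ord(char):04X} {char!r} -> {char.casefold()!r}")
```

Raising the limit (up to sys.maxunicode + 1) extends the scan to the whole code space.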
msg253797 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-10-31 15:36
> In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".

+1 I would have found that sentence to be helpful.
msg327334 - (view) Author: Marc Richter (Marc Richter) Date: 2018-10-08 09:33
+1 as well.

To be honest, I still do not understand in detail what this function does.
In 2017, an uppercase variant of the special letter from this function's example (ß) was added to the official German orthography [1].
Does this function's behavior need to change as a result, or is the current behavior still expected? I'm still puzzled, and I think the whole function should get a clearer description.

[1]: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E
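
A quick check of how the capital letter behaves, as a sketch assuming any Python 3 release that has str.casefold() (3.3 or later). The variable names are illustrative only.

```python
sharp_s_lower = "\u00df"  # ß, LATIN SMALL LETTER SHARP S
sharp_s_upper = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

# lower() maps the capital form to the ordinary ß.
print(sharp_s_upper.lower() == sharp_s_lower)

# upper() on ß gives "SS", not the capital ẞ.
print(sharp_s_lower.upper())

# casefold() sends every spelling to the same string.
folded = {s.casefold() for s in (sharp_s_lower, sharp_s_upper, "SS", "ss")}
print(folded)
```

Since all four spellings fold to "ss", caseless comparison already treats ẞ, ß, "SS" and "ss" as equal; casefold() is meant for matching, not for display.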
msg338689 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-03-23 16:53
Assigning to @Mariatta for the sprints.
History
Date User Action Args
2019-03-23 16:53:57 cheryl.sabella set versions: + Python 3.7, Python 3.8, - Python 3.3; nosy: + Mariatta, cheryl.sabella; messages: + msg338689; assignee: docs@python -> Mariatta; stage: needs patch
2018-10-08 09:33:46 Marc Richter set nosy: + Marc Richter; messages: + msg327334
2015-10-31 15:36:13 rhettinger set nosy: + rhettinger; messages: + msg253797
2015-10-29 07:14:19 mark set nosy: + mark; messages: + msg253662
2012-01-20 01:12:41 benjamin.peterson set messages: + msg151665
2012-01-19 17:09:52 Jim.Jewett set messages: + msg151645
2012-01-19 17:06:02 Jim.Jewett create