Title: Speed up unicode normalization of ASCII strings
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 3.11
Status: closed Resolution: fixed
Assigned To: Nosy List: corona10, ezio.melotti, serhiy.storchaka, steven.daprano, vstinner
Priority: normal Keywords: patch

Created on 2021-08-24 02:07 by steven.daprano, last changed 2022-04-11 14:59 by admin.

PR 28283 merged corona10, 2021-09-11 05:37
PR 28293 merged corona10, 2021-09-11 18:51
Author: Steven D'Aprano (steven.daprano) Date: 2021-08-24 02:07
I think there is an opportunity to speed up some unicode normalisations significantly.

In 3.9 at least, the time taken by normalisation appears to depend on the length of the string, even for pure-ASCII input:

    >>> from timeit import Timer
    >>> setup="from unicodedata import normalize; s = 'reverse'"
    >>> t1 = Timer('normalize("NFKC", s)', setup=setup)
    >>> setup="from unicodedata import normalize; s = 'reverse'*1000"
    >>> t2 = Timer('normalize("NFKC", s)', setup=setup)
    >>> min(t1.repeat(repeat=7))
    >>> min(t2.repeat(repeat=7))

But ASCII strings are always in normalised form, for all four normalisation forms. In CPython, with PEP 393 (Flexible String Representation), it should be a constant-time operation to detect whether a string is pure ASCII, and avoid scanning the string or attempting the normalisation.
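
To illustrate the idea at the Python level, here is a minimal sketch of the fast path. It is not the merged patch (which lives in the C implementation of unicodedata); `normalize_fast` is a hypothetical wrapper name, and it relies on `str.isascii()` (Python 3.7+), which in CPython reads the flag stored by the PEP 393 representation rather than scanning the string.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_fast(form, s):
    """Hypothetical wrapper: skip the work for pure-ASCII input.

    ASCII characters have no decompositions and no combining marks,
    so an ASCII string is already in all four normalisation forms.
    """
    if form not in FORMS:
        # Keep the error behaviour of unicodedata.normalize for bad forms.
        raise ValueError("invalid normalization form")
    if s.isascii():
        # Nothing to do: return the input unchanged.
        return s
    return unicodedata.normalize(form, s)


ascii_text = "reverse" * 1000
for form in FORMS:
    # The fast path and the real function agree on ASCII input.
    assert normalize_fast(form, ascii_text) == unicodedata.normalize(form, ascii_text)
```

Non-ASCII input still goes through `unicodedata.normalize`, so results are unchanged there; only the ASCII case returns early.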
Author: STINNER Victor (vstinner) Date: 2021-09-07 20:13
Well, someone should write a PR for it.
Author: Dong-hee Na (corona10) Date: 2021-09-11 14:02
> Well, someone should write a PR for it.

Well, I sent a patch :)
Author: Serhiy Storchaka (serhiy.storchaka) Date: 2021-09-11 15:04
New changeset 9abd07e5963f966c4d6df8f4e4bf390ed8191066 by Dong-hee Na in branch 'main':
bpo-44987: Speed up unicode normalization of ASCII strings (GH-28283)
Author: Dong-hee Na (corona10) Date: 2021-09-11 19:06
New changeset 5277ffe12d492939544ff9c54a3aaf448b913fb3 by Dong-hee Na in branch 'main':
bpo-44987: Fix typo whatsnew 3.11 (GH-28293)
