classification
Title: Speed up unicode normalization of ASCII strings
Type: enhancement
Stage: resolved
Components: Unicode
Versions: Python 3.11
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: corona10, ezio.melotti, serhiy.storchaka, steven.daprano, vstinner
Priority: normal
Keywords: patch

Created on 2021-08-24 02:07 by steven.daprano, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 28283 merged corona10, 2021-09-11 05:37
PR 28293 merged corona10, 2021-09-11 18:51
Messages (5)
msg400192 - Author: Steven D'Aprano (steven.daprano) (Python committer) Date: 2021-08-24 02:07
I think there is an opportunity to speed up some unicode normalisations significantly.

In 3.9 at least, the time taken to normalise a string appears to depend on the length of the string, even when the string is pure ASCII:

    >>> from timeit import Timer
    >>> setup="from unicodedata import normalize; s = 'reverse'"
    >>> t1 = Timer('normalize("NFKC", s)', setup=setup)
    >>> setup="from unicodedata import normalize; s = 'reverse'*1000"
    >>> t2 = Timer('normalize("NFKC", s)', setup=setup)
    >>> 
    >>> min(t1.repeat(repeat=7))
    0.04854234401136637
    >>> min(t2.repeat(repeat=7))
    9.98313440399943

But ASCII strings are always in normalised form, for all four normalisation forms. In CPython, with PEP 393 (Flexible String Representation), it should be a constant-time operation to detect whether a string is pure ASCII, and avoid scanning the string or attempting the normalisation.
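The idea above can be sketched at the Python level. This is only an illustration, not the merged change (which lives in the C implementation of `unicodedata`): the helper name `normalize_ascii_fast` is hypothetical, and it assumes `str.isascii()` (available since Python 3.7) is cheap because CPython records ASCII-ness in the string object.

```python
import unicodedata

# The four normalisation forms mentioned in the report.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_ascii_fast(form, s):
    """Hypothetical helper: skip normalisation work for pure-ASCII input."""
    if form not in FORMS:
        raise ValueError(f"invalid normalization form: {form!r}")
    if s.isascii():
        # ASCII text is already normalised in all four forms,
        # so it can be returned unchanged without scanning it.
        return s
    # Anything else goes through the regular normalisation path.
    return unicodedata.normalize(form, s)


ascii_text = "reverse" * 1000
for form in FORMS:
    # The shortcut must agree with the standard library for ASCII input.
    assert normalize_ascii_fast(form, ascii_text) == unicodedata.normalize(form, ascii_text)

# Non-ASCII input still gets normalised: U+FB01 (fi ligature) decomposes under NFKC.
assert normalize_ascii_fast("NFKC", "\ufb01le") == "file"
```

With a shortcut like this, the cost for ASCII input no longer grows with the length of the string, which is the behaviour the report asks for.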
msg401342 - Author: STINNER Victor (vstinner) (Python committer) Date: 2021-09-07 20:13
Well, someone should write a PR for it.
msg401639 - Author: Dong-hee Na (corona10) (Python committer) Date: 2021-09-11 14:02
> Well, someone should write a PR for it.

Well, I sent a patch :)
msg401641 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2021-09-11 15:04
New changeset 9abd07e5963f966c4d6df8f4e4bf390ed8191066 by Dong-hee Na in branch 'main':
bpo-44987: Speed up unicode normalization of ASCII strings (GH-28283)
https://github.com/python/cpython/commit/9abd07e5963f966c4d6df8f4e4bf390ed8191066
msg401646 - Author: Dong-hee Na (corona10) (Python committer) Date: 2021-09-11 19:06
New changeset 5277ffe12d492939544ff9c54a3aaf448b913fb3 by Dong-hee Na in branch 'main':
bpo-44987: Fix typo whatsnew 3.11 (GH-28293)
https://github.com/python/cpython/commit/5277ffe12d492939544ff9c54a3aaf448b913fb3
History
Date User Action Args
2022-04-11 14:59:49 admin set github: 89150
2021-09-11 19:06:01 corona10 set messages: + msg401646
2021-09-11 18:51:02 corona10 set pull_requests: + pull_request26709
2021-09-11 15:05:19 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2021-09-11 15:04:42 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg401641
2021-09-11 14:02:00 corona10 set messages: + msg401639
2021-09-11 05:37:04 corona10 set keywords: + patch; stage: patch review; pull_requests: + pull_request26700
2021-09-11 05:11:59 corona10 set nosy: + corona10
2021-09-07 20:13:31 vstinner set messages: + msg401342
2021-08-24 02:07:52 steven.daprano create