
Author steven.daprano
Recipients ezio.melotti, steven.daprano, vstinner
Date 2021-08-24.02:07:52
Content
I think there is an opportunity to speed up some Unicode normalisations significantly.

In 3.9 at least, the normalisation time appears to depend on the length of the string, even for a pure-ASCII string that is already normalised:

    >>> setup="from unicodedata import normalize; s = 'reverse'"
    >>> t1 = Timer('normalize("NFKC", s)', setup=setup)
    >>> setup="from unicodedata import normalize; s = 'reverse'*1000"
    >>> t2 = Timer('normalize("NFKC", s)', setup=setup)
    >>> 
    >>> min(t1.repeat(repeat=7))
    0.04854234401136637
    >>> min(t2.repeat(repeat=7))
    9.98313440399943

But ASCII strings are already normalised under all four normalisation forms (NFC, NFD, NFKC and NFKD). In CPython, with PEP 393 (the Flexible String Representation), detecting whether a string is pure ASCII is a constant-time operation, so normalisation of a pure-ASCII string could return it unchanged instead of scanning it or attempting the normalisation.
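
For illustration, here is a minimal pure-Python sketch of the shortcut, using str.isascii() (which in CPython just reads the PEP 393 ascii flag, so the check is O(1)). The real change would presumably go in the C implementation of unicodedata.normalize; the wrapper below is only a demonstration:

    from unicodedata import normalize as _normalize

    def normalize(form, unistr):
        # ASCII strings are already in NFC, NFD, NFKC and NFKD form.
        # In CPython, str.isascii() reads the PEP 393 ascii flag
        # rather than scanning the string, so this check is O(1).
        if unistr.isascii():
            return unistr
        return _normalize(form, unistr)

With that shortcut, the 7000-character pure-ASCII string in the timing example above would be returned unchanged instead of being scanned and renormalised.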