
Author steven.daprano
Recipients ezio.melotti, steven.daprano, vstinner
Date 2021-08-24.02:07:52
Content
I think there is an opportunity to speed up some Unicode normalisations significantly.

In 3.9 at least, the normalisation time appears to depend on the length of the string, even for a pure-ASCII string that is already normalised:

    >>> setup="from unicodedata import normalize; s = 'reverse'"
    >>> t1 = Timer('normalize("NFKC", s)', setup=setup)
    >>> setup="from unicodedata import normalize; s = 'reverse'*1000"
    >>> t2 = Timer('normalize("NFKC", s)', setup=setup)
    >>> 
    >>> min(t1.repeat(repeat=7))
    0.04854234401136637
    >>> min(t2.repeat(repeat=7))
    9.98313440399943

But ASCII strings are already normalised under all four normalisation forms (NFC, NFD, NFKC and NFKD). In CPython, with PEP 393 (the Flexible String Representation), detecting whether a string is pure ASCII is a constant-time operation, so normalisation of a pure-ASCII string could return it unchanged instead of scanning it or attempting the normalisation.
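
For illustration, here is a minimal pure-Python sketch of the shortcut, using str.isascii() (which in CPython just reads the PEP 393 ascii flag, so the check is O(1)). The real change would presumably go in the C implementation of unicodedata.normalize; the wrapper below is only a demonstration:

    from unicodedata import normalize as _normalize

    def normalize(form, unistr):
        # ASCII strings are already in NFC, NFD, NFKC and NFKD form.
        # In CPython, str.isascii() reads the PEP 393 ascii flag
        # rather than scanning the string, so this check is O(1).
        if unistr.isascii():
            return unistr
        return _normalize(form, unistr)

With that shortcut, the 7000-character pure-ASCII string in the timing example above would be returned unchanged instead of being scanned and renormalised.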