Message400192
I think there is an opportunity to speed up some Unicode normalisations significantly.
In 3.9 at least, normalisation time appears to grow with the length of the string:
>>> from timeit import Timer
>>> setup = "from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup = "from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>>
>>> min(t1.repeat(repeat=7))
0.04854234401136637
>>> min(t2.repeat(repeat=7))
9.98313440399943
But ASCII strings are always in normalised form, for all four normalisation forms. In CPython, with PEP 393 (Flexible String Representation), it should be a constant-time operation to detect whether a string is pure ASCII, and avoid scanning the string or attempting the normalisation.
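A fast path along those lines could be sketched in pure Python. (`fast_normalize` is a hypothetical name for illustration; an actual fix would presumably live in the C implementation of `unicodedata`, where the PEP 393 ASCII flag is directly available.)

```python
from unicodedata import normalize


def fast_normalize(form, s):
    # Pure-ASCII strings are already in normal form under all four
    # normalisation forms (NFC, NFD, NFKC, NFKD), so they can be
    # returned unchanged.  In CPython, str.isascii() reads a flag
    # stored in the PEP 393 string header, so this check is
    # effectively constant time regardless of the string's length.
    if s.isascii():
        return s
    return normalize(form, s)
```

With this wrapper, `fast_normalize("NFKC", "reverse" * 1000)` returns immediately, while non-ASCII input still goes through the full normalisation.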
History:
Date                | User           | Action | Args
2021-08-24 02:07:52 | steven.daprano | set    | recipients: + steven.daprano, vstinner, ezio.melotti
2021-08-24 02:07:52 | steven.daprano | set    | messageid: <1629770872.82.0.815585920926.issue44987@roundup.psfhosted.org>
2021-08-24 02:07:52 | steven.daprano | link   | issue44987 messages
2021-08-24 02:07:52 | steven.daprano | create |