Message350651
In 3.8 we add a new function `unicodedata.is_normalized`. The result is equivalent to `str == unicodedata.normalize(form, str)`, but the implementation uses a version of the "quick check" algorithm from UAX #15 as an optimization to try to avoid having to copy the whole string. This was added in issue #32285, commit 2810dd7be.
However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.
Here's a quick demo on my desktop. We pass a long string made entirely out of a character for which the quick-check algorithm always says `NO`, it's not normalized:
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
's == unicodedata.normalize("NFD", s)'
50 loops, best of 5: 4.41 msec per loop
That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization.
Here it is after a patch that makes the algorithm run as in the standard:
$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).
The root cause of the issue is that our `is_normalized` static helper, which the new function relies on, was never written as a full implementation of the quick-check algorithm. The full algorithm can return YES, MAYBE, or NO; but originally this helper's only caller was the implementation of `unicodedata.normalize`, which only cares about YES vs. MAYBE-or-NO. So the helper often returns MAYBE when the standard algorithm would say NO.
(More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.)
That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE. Which for `unicodedata.is_normalized` means it has to go compute the whole normalized string. |
|
Date |
User |
Action |
Args |
2019-08-28 03:52:56 | Greg Price | set | recipients:
+ Greg Price |
2019-08-28 03:52:56 | Greg Price | set | messageid: <1566964376.46.0.968905414274.issue37966@roundup.psfhosted.org> |
2019-08-28 03:52:56 | Greg Price | link | issue37966 messages |
2019-08-28 03:52:55 | Greg Price | create | |
|