New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
is_normalized is much slower at "no" than the standard's algorithm #82147
Comments
In 3.8 we add a new function However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all. Here's a quick demo on my desktop. We pass a long string made entirely out of a character for which the quick-check algorithm always says $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop
$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
's == unicodedata.normalize("NFD", s)'
50 loops, best of 5: 4.41 msec per loop That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization. Here it is after a patch that makes the algorithm run as in the standard: $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1). The root cause of the issue is that our (More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.) That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE. Which for |
Thanks for that! |
Thanks Greg Price for this nice optimization! |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: