This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: is_normalized is much slower at "no" than the standard's algorithm
Type: Stage: resolved
Components: Unicode Versions: Python 3.9, Python 3.8
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Greg Price, Maxime Belanger, benjamin.peterson, ezio.melotti, miss-islington, serhiy.storchaka, steven.daprano, vstinner
Priority: normal Keywords: patch

Created on 2019-08-28 03:52 by Greg Price, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 15558 merged Greg Price, 2019-08-28 04:57
PR 15671 merged miss-islington, 2019-09-04 02:45
Messages (6)
msg350651 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-28 03:52
In 3.8 we add a new function `unicodedata.is_normalized`.  The result is equivalent to `str == unicodedata.normalize(form, str)`, but the implementation uses a version of the "quick check" algorithm from UAX #15 as an optimization to try to avoid having to copy the whole string.  This was added in issue #32285, commit 2810dd7be.

However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.

Here's a quick demo on my desktop.  We pass a long string made entirely out of a character for which the quick-check algorithm always says `NO`, it's not normalized:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    's == unicodedata.normalize("NFD", s)'
50 loops, best of 5: 4.41 msec per loop

That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization.

Here it is after a patch that makes the algorithm run as in the standard:

$ -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).

The root cause of the issue is that our `is_normalized` static helper, which the new function relies on, was never written as a full implementation of the quick-check algorithm.  The full algorithm can return YES, MAYBE, or NO; but originally this helper's only caller was the implementation of `unicodedata.normalize`, which only cares about YES vs. MAYBE-or-NO.  So the helper often returns MAYBE when the standard algorithm would say NO.

(More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.)

That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE.  Which for `unicodedata.is_normalized` means it has to go compute the whole normalized string.
msg350655 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-28 05:04
Fix posted, as GH-15558.

Adding cc's for the folks in the thread on #32285, where this function was originally added.
msg351110 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-04 02:46
New changeset 2f09413947d1ce0043de62ed2346f9a2b4e5880b by Benjamin Peterson (Greg Price) in branch 'master':
closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
msg351111 - (view) Author: Maxime Belanger (Maxime Belanger) Date: 2019-09-04 02:47
Thanks for that!
msg351112 - (view) Author: miss-islington (miss-islington) Date: 2019-09-04 03:03
New changeset 4dd1c9d9c2bca4744c70c9556b7051f4465ede3e by Miss Islington (bot) in branch '3.8':
closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558)
msg351127 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-04 13:56
Thanks Greg Price for this nice optimization!
Date User Action Args
2022-04-11 14:59:19adminsetgithub: 82147
2019-09-04 13:56:46vstinnersetmessages: + msg351127
2019-09-04 03:03:43miss-islingtonsetnosy: + miss-islington
messages: + msg351112
2019-09-04 02:47:17Maxime Belangersetmessages: + msg351111
2019-09-04 02:46:07benjamin.petersonsetstatus: open -> closed
resolution: fixed
messages: + msg351110

stage: patch review -> resolved
2019-09-04 02:45:57miss-islingtonsetpull_requests: + pull_request15335
2019-08-29 13:19:15vstinnersetnosy: + serhiy.storchaka
2019-08-28 05:04:42Greg Pricesetnosy: + ezio.melotti, steven.daprano, benjamin.peterson, vstinner, Maxime Belanger
versions: + Python 3.9
messages: + msg350655

components: + Unicode
title: is_normalized is much slower than the standard's algorithm -> is_normalized is much slower at "no" than the standard's algorithm
2019-08-28 04:57:51Greg Pricesetkeywords: + patch
stage: patch review
pull_requests: + pull_request15231
2019-08-28 03:52:56Greg Pricecreate