is_normalized is much slower at "no" than the standard's algorithm #82147

gnprice · 2019-08-28T03:52:56Z

BPO	37966
Nosy	@vstinner, @benjaminp, @ezio-melotti, @stevendaprano, @serhiy-storchaka, @gnprice, @miss-islington
PRs	bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558; [3.8] closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558) #15671

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2019-09-04.02:46:07.031>
created_at = <Date 2019-08-28.03:52:56.429>
labels = ['3.8', '3.9', 'expert-unicode']
title = 'is_normalized is much slower at "no" than the standard\'s algorithm'
updated_at = <Date 2019-09-04.13:56:46.537>
user = 'https://github.com/gnprice'

bugs.python.org fields:

activity = <Date 2019-09-04.13:56:46.537>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2019-09-04.02:46:07.031>
closer = 'benjamin.peterson'
components = ['Unicode']
creation = <Date 2019-08-28.03:52:56.429>
creator = 'Greg Price'
dependencies = []
files = []
hgrepos = []
issue_num = 37966
keywords = ['patch']
message_count = 6.0
messages = ['350651', '350655', '351110', '351111', '351112', '351127']
nosy_count = 8.0
nosy_names = ['vstinner', 'benjamin.peterson', 'ezio.melotti', 'steven.daprano', 'serhiy.storchaka', 'Maxime Belanger', 'Greg Price', 'miss-islington']
pr_nums = ['15558', '15671']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue37966'
versions = ['Python 3.8', 'Python 3.9']

gnprice · 2019-08-28T03:52:56Z

In 3.8 we add a new function unicodedata.is_normalized. The result is equivalent to str == unicodedata.normalize(form, str), but the implementation uses a version of the "quick check" algorithm from UAX #15 as an optimization to try to avoid having to copy the whole string. This was added in issue bpo-32285, commit 2810dd7.
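To make the stated equivalence concrete, here is a minimal sketch (it needs Python 3.8+, where unicodedata.is_normalized exists). The helper name is_normalized_slow and the sample strings are illustrative assumptions, not anything from the CPython source:

```python
import unicodedata

def is_normalized_slow(form, s):
    # Reference definition: normalize the whole string, then compare.
    return s == unicodedata.normalize(form, s)

# Hypothetical sample inputs: plain ASCII, a precomposed letter,
# its decomposed spelling, and the character used in the timings below.
samples = ["abc", "\u00e9", "e\u0301", "\uf900"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for s in samples:
        # Both spellings should always agree; only the cost differs.
        assert unicodedata.is_normalized(form, s) == is_normalized_slow(form, s)
```

The two should agree on every input; the point of is_normalized is only to reach the same answer without building the normalized copy when it can avoid it.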

However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.

Here's a quick demo on my desktop. We pass a long string made entirely out of a character for which the quick-check algorithm always says NO (i.e. the string is not normalized):

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    's == unicodedata.normalize("NFD", s)'
50 loops, best of 5: 4.41 msec per loop

That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization.

Here it is after a patch that makes the algorithm run as in the standard:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).

The root cause of the issue is that our is_normalized static helper, which the new function relies on, was never written as a full implementation of the quick-check algorithm. The full algorithm can return YES, MAYBE, or NO; but originally this helper's only caller was the implementation of unicodedata.normalize, which only cares about YES vs. MAYBE-or-NO. So the helper often returns MAYBE when the standard algorithm would say NO.

(More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.)
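For readers who don't have UAX #15 at hand, the three-valued quick check can be sketched roughly as below. This is a Python illustration of the algorithm's shape, not the C helper in unicodedata. The names QC, nfd_allowed and quick_check are hypothetical, and nfd_allowed only approximates the NFD_Quick_Check property from data the unicodedata module exposes publicly (a canonical decomposition mapping, or a Hangul syllable, means NO):

```python
import unicodedata
from enum import Enum

class QC(Enum):
    YES = "yes"
    MAYBE = "maybe"
    NO = "no"

def nfd_allowed(ch):
    """Approximate NFD_Quick_Check for one character (assumed derivation)."""
    # Hangul syllables decompose algorithmically, so check them by range.
    if 0xAC00 <= ord(ch) <= 0xD7A3:
        return QC.NO
    decomp = unicodedata.decomposition(ch)
    # A non-empty mapping without a "<tag>" prefix is a canonical decomposition.
    if decomp and not decomp.startswith("<"):
        return QC.NO
    return QC.YES

def quick_check(s, allowed=nfd_allowed):
    """Return QC.YES, QC.MAYBE or QC.NO for the string s."""
    result = QC.YES
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return QC.NO
        check = allowed(ch)
        if check is QC.NO:
            return QC.NO  # early exit: the rest of the string is irrelevant
        if check is QC.MAYBE:
            result = QC.MAYBE  # keep scanning; a later NO still wins
        last_ccc = ccc
    return result

# The string from the timings above is rejected at its first character.
verdict = quick_check("\uf900" * 500000)
```

The early return on NO is what makes the "no" case cheap: the scan stops at the first offending character instead of walking (or normalizing) the whole string. NFD happens to have no MAYBE characters, so the MAYBE branch only matters when a different allowed function is passed in, e.g. one for NFC.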

That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE, which for unicodedata.is_normalized means it has to go compute the whole normalized string.
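To illustrate why the distinction matters to one caller and not the other, here is a rough sketch of how the two callers consume the quick-check result. The function names and the quick_check parameter are hypothetical stand-ins for the C code, chosen only for this example:

```python
import unicodedata
from enum import Enum

class QC(Enum):
    YES = "yes"
    MAYBE = "maybe"
    NO = "no"

def normalize_sketch(form, s, quick_check):
    # normalize() only asks "already normalized?"; MAYBE and NO both
    # lead to the same work, so collapsing them is harmless here.
    if quick_check(form, s) is QC.YES:
        return s
    return unicodedata.normalize(form, s)

def is_normalized_sketch(form, s, quick_check):
    result = quick_check(form, s)
    if result is QC.YES:
        return True
    if result is QC.NO:
        return False  # cheap answer, no normalized copy needed
    # MAYBE: fall back to the expensive full comparison.
    return s == unicodedata.normalize(form, s)
```

With a helper that reports MAYBE where the standard says NO, is_normalized_sketch always lands in the last branch for such inputs, which matches the timings above.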

gnprice · 2019-08-28T05:04:42Z

Fix posted, as GH-15558.

Adding cc's for the folks in the thread on bpo-32285, where this function was originally added.

benjaminp · 2019-09-04T02:46:07Z

New changeset 2f09413 by Benjamin Peterson (Greg Price) in branch 'master':
closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
2f09413

MaximeBelanger · 2019-09-04T02:47:18Z

Thanks for that!

miss-islington · 2019-09-04T03:03:44Z

New changeset 4dd1c9d by Miss Islington (bot) in branch '3.8':
closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558)
4dd1c9d

vstinner · 2019-09-04T13:56:46Z

Thanks Greg Price for this nice optimization!

gnprice added the 3.8 only security fixes label Aug 28, 2019

gnprice added topic-unicode 3.9 only security fixes labels Aug 28, 2019

gnprice changed the title ~~is_normalized is much slower than the standard's algorithm~~ is_normalized is much slower at "no" than the standard's algorithm Aug 28, 2019

benjaminp closed this as completed Sep 4, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022
