is_normalized is much slower at "no" than the standard's algorithm #82147

Closed
gnprice opened this issue Aug 28, 2019 · 6 comments
Labels
3.8 only security fixes 3.9 only security fixes topic-unicode

Comments

@gnprice
Contributor

gnprice commented Aug 28, 2019

BPO 37966
Nosy @vstinner, @benjaminp, @ezio-melotti, @stevendaprano, @serhiy-storchaka, @gnprice, @miss-islington
PRs
  • bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558
  • [3.8] closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558) #15671
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-09-04.02:46:07.031>
    created_at = <Date 2019-08-28.03:52:56.429>
    labels = ['3.8', '3.9', 'expert-unicode']
    title = 'is_normalized is much slower at "no" than the standard\'s algorithm'
    updated_at = <Date 2019-09-04.13:56:46.537>
    user = 'https://github.com/gnprice'

    bugs.python.org fields:

    activity = <Date 2019-09-04.13:56:46.537>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-09-04.02:46:07.031>
    closer = 'benjamin.peterson'
    components = ['Unicode']
    creation = <Date 2019-08-28.03:52:56.429>
    creator = 'Greg Price'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 37966
    keywords = ['patch']
    message_count = 6.0
    messages = ['350651', '350655', '351110', '351111', '351112', '351127']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'benjamin.peterson', 'ezio.melotti', 'steven.daprano', 'serhiy.storchaka', 'Maxime Belanger', 'Greg Price', 'miss-islington']
    pr_nums = ['15558', '15671']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue37966'
    versions = ['Python 3.8', 'Python 3.9']

    @gnprice
    Contributor Author

    gnprice commented Aug 28, 2019

    In 3.8 we add a new function unicodedata.is_normalized. The result is equivalent to str == unicodedata.normalize(form, str), but the implementation uses a version of the "quick check" algorithm from UAX #15 as an optimization to try to avoid having to copy the whole string. This was added in issue bpo-32285, commit 2810dd7.
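A small sketch of that equivalence, using only the public unicodedata API (Python 3.8+). The sample strings and the helper name `agrees` are arbitrary choices for illustration, not anything from the patch:

```python
import unicodedata

# Arbitrary samples: plain ASCII, a precomposed letter, a decomposed pair,
# a CJK compatibility ideograph, and a precomposed Hangul syllable.
samples = ["abc", "\u00e9", "e\u0301", "\uf900", "\uac00"]
forms = ["NFC", "NFD", "NFKC", "NFKD"]

def agrees(form, s):
    """True if is_normalized matches the compare-against-normalize definition."""
    return unicodedata.is_normalized(form, s) == (s == unicodedata.normalize(form, s))

all_agree = all(agrees(form, s) for form in forms for s in samples)
print(all_agree)
```

The two sides always give the same answer; the only question in this issue is how much work is done to get there.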

    However, it turns out the code doesn't actually implement the same algorithm as UAX #15, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.

    Here's a quick demo on my desktop. We pass a long string made entirely out of a character for which the quick-check algorithm always says NO, it's not normalized:

    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
        'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop
    
    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
        's == unicodedata.normalize("NFD", s)'
    50 loops, best of 5: 4.41 msec per loop

    That's the same 4.4 ms (for a 1 MB string) with or without the attempted optimization.

    Here it is after a patch that makes the algorithm run as in the standard:

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
        'unicodedata.is_normalized("NFD", s)'
    5000000 loops, best of 5: 58.2 nsec per loop

    Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).

    The root cause of the issue is that our is_normalized static helper, which the new function relies on, was never written as a full implementation of the quick-check algorithm. The full algorithm can return YES, MAYBE, or NO; but originally this helper's only caller was the implementation of unicodedata.normalize, which only cares about YES vs. MAYBE-or-NO. So the helper often returns MAYBE when the standard algorithm would say NO.
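For reference, here is a rough pure-Python sketch of the standard's quick-check loop, restricted to NFD. This is not the C code in unicodedata. It assumes NFD_Quick_Check=No can be derived as "has a canonical decomposition" (including precomposed Hangul syllables), and the names `nfd_qc_is_no` and `quick_check_nfd` are made up for illustration. For NFD the property is only ever Yes or No, so MAYBE never arises in this particular sketch; it only shows up for NFC and NFKC.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables

def nfd_qc_is_no(ch):
    """Assumed stand-in for NFD_Quick_Check=No: ch has a canonical decomposition."""
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return True  # Hangul syllables decompose algorithmically
    decomp = unicodedata.decomposition(ch)
    # Compatibility mappings are tagged like "<compat> ..."; canonical ones are not.
    return bool(decomp) and not decomp.startswith("<")

def quick_check_nfd(s):
    """Return "YES" or "NO", stopping at the first character that forces NO."""
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and last_ccc > ccc:
            return "NO"  # combining marks out of canonical order
        if nfd_qc_is_no(ch):
            return "NO"  # this character can never occur in NFD text
        last_ccc = ccc
    return "YES"
```

On a string like "\uf900"*500000 this returns NO after looking at the first character, which is where the O(1) behavior comes from.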

    (More precisely, perhaps: it's fine that this helper was never a full implementation... but it didn't say that anywhere, even while referring to the standard algorithm, and as a result set us up for future confusion.)

    That's exactly what's happening in the example above: the standard quick-check algorithm would say NO, but our helper says MAYBE, which for unicodedata.is_normalized means it has to go compute the whole normalized string.
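To make the cost of a spurious MAYBE concrete, here is a sketch of how a caller can consume a three-valued quick-check result. The function `is_normalized_sketch` and its `quick_check` parameter are hypothetical names for illustration; the real logic lives in C inside unicodedata.

```python
import unicodedata

def is_normalized_sketch(form, s, quick_check):
    """Answer "is s normalized?" given a quick check returning YES, MAYBE or NO.

    quick_check is a hypothetical callable (form, s) -> "YES" | "MAYBE" | "NO".
    """
    result = quick_check(form, s)
    if result == "YES":
        return True   # definitely normalized, no copy needed
    if result == "NO":
        return False  # definitely not normalized, no copy needed
    # MAYBE: the quick check could not decide, so do the full O(N) work.
    return s == unicodedata.normalize(form, s)
```

Only the MAYBE branch pays for a full normalization, so a helper that reports MAYBE where the standard would report NO still gives the right answer, just at the slow path's price.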

    @gnprice gnprice added the 3.8 only security fixes label Aug 28, 2019
    @gnprice
    Contributor Author

    gnprice commented Aug 28, 2019

    Fix posted, as GH-15558.

    Adding cc's for the folks in the thread on bpo-32285, where this function was originally added.

    @gnprice gnprice added topic-unicode 3.9 only security fixes labels Aug 28, 2019
    @gnprice gnprice changed the title from 'is_normalized is much slower than the standard's algorithm' to 'is_normalized is much slower at "no" than the standard's algorithm' Aug 28, 2019
    @benjaminp
    Contributor

    New changeset 2f09413 by Benjamin Peterson (Greg Price) in branch 'master':
    closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
    2f09413

    @MaximeBelanger
    Mannequin

    MaximeBelanger mannequin commented Sep 4, 2019

    Thanks for that!

    @miss-islington
    Contributor

    New changeset 4dd1c9d by Miss Islington (bot) in branch '3.8':
    closes bpo-37966: Fully implement the UAX GH-15 quick-check algorithm. (GH-15558)
    4dd1c9d

    @vstinner
    Member

    vstinner commented Sep 4, 2019

    Thanks Greg Price for this nice optimization!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022