Speed up unicode normalization of ASCII strings #89150

Closed
stevendaprano opened this issue Aug 24, 2021 · 5 comments
Labels
3.11 (only security fixes), topic-unicode, type-feature (A feature request or enhancement)

Comments

@stevendaprano
Member

BPO 44987
Nosy @vstinner, @ezio-melotti, @stevendaprano, @serhiy-storchaka, @corona10
PRs
  • bpo-44987: Speed up unicode normalization of ASCII strings #28283
  • bpo-44987: Fix typo whatsnew 3.11 #28293
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-09-11.15:05:19.089>
    created_at = <Date 2021-08-24.02:07:52.797>
    labels = ['type-feature', 'expert-unicode', '3.11']
    title = 'Speed up unicode normalization of ASCII strings'
    updated_at = <Date 2021-09-11.19:06:01.839>
    user = 'https://github.com/stevendaprano'

    bugs.python.org fields:

    activity = <Date 2021-09-11.19:06:01.839>
    actor = 'corona10'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-09-11.15:05:19.089>
    closer = 'serhiy.storchaka'
    components = ['Unicode']
    creation = <Date 2021-08-24.02:07:52.797>
    creator = 'steven.daprano'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44987
    keywords = ['patch']
    message_count = 5.0
    messages = ['400192', '401342', '401639', '401641', '401646']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'ezio.melotti', 'steven.daprano', 'serhiy.storchaka', 'corona10']
    pr_nums = ['28283', '28293']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue44987'
    versions = ['Python 3.11']

    @stevendaprano
    Member Author

    I think there is an opportunity to speed up some unicode normalisations significantly.

    In 3.9 at least, the time taken by normalisation appears to depend on the length of the string, even when the string is pure ASCII:

        >>> from timeit import Timer
        >>> setup="from unicodedata import normalize; s = 'reverse'"
        >>> t1 = Timer('normalize("NFKC", s)', setup=setup)
        >>> setup="from unicodedata import normalize; s = 'reverse'*1000"
        >>> t2 = Timer('normalize("NFKC", s)', setup=setup)
        >>> 
        >>> min(t1.repeat(repeat=7))
        0.04854234401136637
        >>> min(t2.repeat(repeat=7))
        9.98313440399943

    But ASCII strings are always in normalised form, for all four normalisation forms. In CPython, with PEP 393 (Flexible String Representation), it should be a constant-time operation to detect whether a string is pure ASCII, and so avoid both scanning the string and attempting the normalisation.
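The idea can be sketched in pure Python. This is only an illustration, not the change that was merged into CPython: `normalize_fast` is a hypothetical wrapper name, and it assumes Python 3.7+ for `str.isascii()`, which in CPython reads a flag stored in the string object rather than scanning the characters.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_fast(form, s):
    """Hypothetical wrapper: skip normalisation for pure-ASCII strings."""
    if form not in FORMS:
        # Mirror unicodedata.normalize, which rejects unknown forms.
        raise ValueError(f"invalid normalization form: {form!r}")
    # ASCII text is already normalised in every form, so return it as-is.
    if s.isascii():
        return s
    return unicodedata.normalize(form, s)


ascii_text = "reverse" * 1000
for form in FORMS:
    # The stdlib result equals the input for ASCII text in all four forms.
    assert unicodedata.normalize(form, ascii_text) == ascii_text
    # The wrapper returns the very same object without normalising.
    assert normalize_fast(form, ascii_text) is ascii_text

# Non-ASCII input still goes through the real normalisation:
# U+FB01 (LATIN SMALL LIGATURE FI) becomes "fi" under NFKC.
assert normalize_fast("NFKC", "\ufb01") == "fi"
```

The wrapper only shows why the shortcut is safe. A real speed-up would presumably have to live inside `unicodedata.normalize` itself so that every caller benefits.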

    @stevendaprano added the 3.11, topic-unicode, and type-feature labels Aug 24, 2021
    @vstinner
    Member

    vstinner commented Sep 7, 2021

    Well, someone should write a PR for it.

    @corona10
    Member

    > Well, someone should write a PR for it.

    Well, I sent a patch :)

    @serhiy-storchaka
    Member

    New changeset 9abd07e by Dong-hee Na in branch 'main':
    bpo-44987: Speed up unicode normalization of ASCII strings (GH-28283)
    9abd07e

    @corona10
    Member

    New changeset 5277ffe by Dong-hee Na in branch 'main':
    bpo-44987: Fix typo whatsnew 3.11 (GH-28293)
    5277ffe

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022