In unicodedata, it should be possible to check a unistr's normal form without necessarily copying it #76466

Closed
MaximeBelanger mannequin opened this issue Dec 12, 2017 · 4 comments
Labels
3.8 only security fixes topic-unicode type-feature A feature request or enhancement

Comments

@MaximeBelanger
Mannequin

MaximeBelanger mannequin commented Dec 12, 2017

BPO 32285
Nosy @vstinner, @benjaminp, @ezio-melotti, @stevendaprano
PRs
  • bpo-32285: Add unicodedata.is_normalized to check the current norma… #4806
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-11-04.23:58:27.379>
    created_at = <Date 2017-12-12.01:16:09.856>
    labels = ['type-feature', '3.8', 'expert-unicode']
    title = "In `unicodedata`, it should be possible to check a unistr's normal form without necessarily copying it"
    updated_at = <Date 2018-11-04.23:58:27.377>
    user = 'https://bugs.python.org/MaximeBelanger'

    bugs.python.org fields:

    activity = <Date 2018-11-04.23:58:27.377>
    actor = 'benjamin.peterson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-11-04.23:58:27.379>
    closer = 'benjamin.peterson'
    components = ['Unicode']
    creation = <Date 2017-12-12.01:16:09.856>
    creator = 'Maxime Belanger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 32285
    keywords = ['patch']
    message_count = 4.0
    messages = ['308085', '308122', '308127', '329276']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'benjamin.peterson', 'ezio.melotti', 'steven.daprano', 'Maxime Belanger']
    pr_nums = ['4806']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue32285'
    versions = ['Python 3.8']

    @MaximeBelanger
    Mannequin Author

    MaximeBelanger mannequin commented Dec 12, 2017

    In our deployment of Python 2.7, we've patched unicodedata to introduce a new function: is_normalized can check whether a unistr is in a given normal form. This currently has to be done by creating a normalized copy, then checking whether it is equal to the source string.

    This function uses the internal helper (also called is_normalized) that can "quick check" normalization, but falls back on creating a normalized copy and comparing (when necessary).
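The "quick check, then fall back" shape described above might look roughly like this in pure Python. This is only a sketch under stated assumptions: `QuickCheck`, `quick_check` and `is_normalized_sketch` are hypothetical names, and because the `unicodedata` module does not (as far as I know) expose the Unicode quick-check properties, "pure ASCII" stands in as a deliberately crude YES answer. The patch itself relies on the internal C helper mentioned above, not on this logic.

```python
import enum
import unicodedata


class QuickCheck(enum.Enum):
    YES = "yes"      # definitely normalized
    MAYBE = "maybe"  # cannot tell without more work


def quick_check(text: str) -> QuickCheck:
    """Crude stand-in for a real quick check (hypothetical helper).

    ASCII characters are unchanged by NFC, NFD, NFKC and NFKD, so a pure
    ASCII string is normalized in every form. Anything else is reported
    as MAYBE because this sketch has no access to the property tables.
    """
    return QuickCheck.YES if text.isascii() else QuickCheck.MAYBE


def is_normalized_sketch(form: str, text: str) -> bool:
    if quick_check(text) is QuickCheck.YES:
        return True  # answered without building a normalized copy
    # Fallback: build a normalized copy and compare it with the source.
    return text == unicodedata.normalize(form, text)
```

The fallback is exactly the copy-and-compare idiom, so the sketch is never slower in kind than what callers have to write today; it only skips the copy when the cheap test is conclusive.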

    We're contributing this change in case it can be helpful to others. Feedback is welcome!

    @MaximeBelanger MaximeBelanger mannequin added 3.7 (EOL) end of life topic-unicode labels Dec 12, 2017
    @stevendaprano
    Member

    Python 2.7 is in feature freeze, so this can only go into 3.7.

    I would find this useful, and would like this feature. However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing. That could be expensive, and shouldn't be needed. According to UAX #15:

    http://unicode.org/reports/tr15/#Detecting_Normalization_Forms

    in the worst case, you can incrementally check only the code points in doubt (around the "MAYBE" code points).
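One way to picture the incremental idea is to normalize only the short runs that could possibly change, instead of the whole string. The sketch below is an assumption-laden illustration, not the algorithm from the report: it treats every non-ASCII character as "in doubt" (a much coarser test than the real MAYBE property, which Python's `unicodedata` does not expose), and it relies on the fact that an ASCII character is a starter that never combines with the character before it, so a run can safely begin at an ASCII character. The name `is_normalized_incremental` is hypothetical.

```python
import unicodedata


def is_normalized_incremental(form: str, text: str) -> bool:
    """Check normalization run by run instead of copying the whole string."""
    n = len(text)
    i = 0
    while i < n:
        # An ASCII character followed by another ASCII character (or by the
        # end of the string) cannot be affected by normalization: skip it.
        if text[i] < "\x80" and (i + 1 == n or text[i + 1] < "\x80"):
            i += 1
            continue
        # A run is one optional ASCII base plus the non-ASCII characters
        # that follow it; only this slice is normalized and compared.
        start = i
        i += 1
        while i < n and text[i] >= "\x80":
            i += 1
        run = text[start:i]
        if run != unicodedata.normalize(form, run):
            return False
    return True
```

Small copies are still made for each doubtful run, so this trades one large allocation for several small ones and can stop at the first run that fails.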

    @stevendaprano stevendaprano added the type-feature A feature request or enhancement label Dec 12, 2017
    @vstinner
    Member

    However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing.

    The purpose of the function is to be faster than str == unicodedata.normalize(form, str). So yeah, any optimization is welcome.

    But I wouldn't bother with the suboptimal MAYBE case, which is implemented with: str == unicodedata.normalize(form, str). It can be optimized later, if needed.

    If someone cares about performance, I will require a benchmark, since I only trust numbers :-)
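A benchmark along these lines could be a starting point for producing such numbers. It is a minimal sketch: the helper name `bench`, the sample strings and the iteration count are arbitrary choices made here, and `unicodedata.is_normalized` is only timed when the running interpreter provides it (Python 3.8 and later).

```python
import timeit
import unicodedata


def bench(text: str, form: str = "NFC", number: int = 1000) -> dict:
    """Return total seconds spent by each approach over `number` calls."""
    timings = {
        "copy_and_compare": timeit.timeit(
            lambda: text == unicodedata.normalize(form, text), number=number
        ),
    }
    # unicodedata.is_normalized only exists on Python 3.8 and later.
    if hasattr(unicodedata, "is_normalized"):
        timings["is_normalized"] = timeit.timeit(
            lambda: unicodedata.is_normalized(form, text), number=number
        )
    return timings


if __name__ == "__main__":
    ascii_text = "plain ascii text " * 100
    mixed_text = "caf\u00e9 na\u00efve " * 100
    for label, text in (("ascii", ascii_text), ("mixed", mixed_text)):
        print(label, bench(text))
```

Results will vary with the interpreter version, the string contents and the normal form, so several inputs should be measured before drawing conclusions.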

    @MaximeBelanger MaximeBelanger mannequin added 3.8 only security fixes and removed 3.7 (EOL) end of life labels Oct 25, 2018
    @benjaminp
    Contributor

    New changeset 2810dd7 by Benjamin Peterson (Max Bélanger) in branch 'master':
    closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)
    2810dd7
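On Python 3.8 and later, where this change is available, the check can be written directly. A short usage sketch; the sample strings are made up for illustration:

```python
import unicodedata

composed = "caf\u00e9"      # "é" as one precomposed code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(unicodedata.is_normalized("NFC", composed))    # True
print(unicodedata.is_normalized("NFC", decomposed))  # False
print(unicodedata.is_normalized("NFD", decomposed))  # True
```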

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022