In `unicodedata`, it should be possible to check a unistr's normal form without necessarily copying it #76466

MaximeBelanger · 2017-12-12T01:16:10Z

BPO	32285
Nosy	@vstinner, @benjaminp, @ezio-melotti, @stevendaprano
PRs	bpo-32285: Add `unicodedata.is_normalized` to check the current norma… #4806

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2018-11-04.23:58:27.379>
created_at = <Date 2017-12-12.01:16:09.856>
labels = ['type-feature', '3.8', 'expert-unicode']
title = "In `unicodedata`, it should be possible to check a unistr's normal form without necessarily copying it"
updated_at = <Date 2018-11-04.23:58:27.377>
user = 'https://bugs.python.org/MaximeBelanger'

bugs.python.org fields:

activity = <Date 2018-11-04.23:58:27.377>
actor = 'benjamin.peterson'
assignee = 'none'
closed = True
closed_date = <Date 2018-11-04.23:58:27.379>
closer = 'benjamin.peterson'
components = ['Unicode']
creation = <Date 2017-12-12.01:16:09.856>
creator = 'Maxime Belanger'
dependencies = []
files = []
hgrepos = []
issue_num = 32285
keywords = ['patch']
message_count = 4.0
messages = ['308085', '308122', '308127', '329276']
nosy_count = 5.0
nosy_names = ['vstinner', 'benjamin.peterson', 'ezio.melotti', 'steven.daprano', 'Maxime Belanger']
pr_nums = ['4806']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue32285'
versions = ['Python 3.8']


MaximeBelanger · 2017-12-12T01:16:08Z

In our deployment of Python 2.7, we've patched `unicodedata` to introduce a new function, `is_normalized`, which checks whether a unistr is in a given normal form. Currently, this check has to be done by creating a normalized copy and then comparing it with the source string.

This function uses the internal helper (also called `is_normalized`) that can "quick check" normalization, and falls back on creating a normalized copy and comparing only when necessary.

We're contributing this change in case it can be helpful to others. Feedback is welcome!
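For readers on Python 3.8 or later, where `unicodedata.is_normalized(form, unistr)` eventually landed, the two approaches described above can be compared side by side. This is only a minimal sketch: the helper name `is_normalized_by_copy` and the sample strings are illustrative and do not come from the patch.

```python
import unicodedata


def is_normalized_by_copy(form: str, text: str) -> bool:
    """Older approach: build a normalized copy, then compare (hypothetical helper)."""
    return unicodedata.normalize(form, text) == text


composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Both approaches should agree for every form; only the amount of work differs.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    for text in (composed, decomposed):
        direct = unicodedata.is_normalized(form, text)
        assert direct == is_normalized_by_copy(form, text)
```

The point of the new function is that the direct check may be able to answer without allocating the normalized copy at all.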

stevendaprano · 2017-12-12T12:25:09Z

Python 2.7 is in feature freeze, so this can only go into 3.7.

I would find this useful, and would like this feature. However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing. That could be expensive, and shouldn't be needed. According to here:

http://unicode.org/reports/tr15/#Detecting_Normalization_Forms

in the worst case, you can incrementally check only the code points in doubt (around the "MAYBE" code points).
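As a rough illustration of the quick-check idea from that report, here is a sketch for NFD only. It assumes that, for NFD, a string is normalized when no code point has a canonical decomposition and combining marks appear in non-decreasing combining-class order, so no MAYBE answers arise. NFC and NFKC do have MAYBE code points, and handling those needs the Quick_Check property data, which the `unicodedata` module does not expose. The function name `nfd_quick_check` is hypothetical, and this is not the code from the patch.

```python
import unicodedata

# Precomposed Hangul syllables decompose algorithmically, so they are
# handled as a range rather than through unicodedata.decomposition().
_HANGUL_FIRST, _HANGUL_LAST = 0xAC00, 0xD7A3


def nfd_quick_check(text: str) -> bool:
    """Scan once, without building a normalized copy (illustrative sketch)."""
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: not normalized.
        if ccc != 0 and last_ccc > ccc:
            return False
        # A canonical decomposition (no "<tag>" prefix) means NFD would change it.
        decomposition = unicodedata.decomposition(ch)
        if decomposition and not decomposition.startswith("<"):
            return False
        if _HANGUL_FIRST <= ord(ch) <= _HANGUL_LAST:
            return False
        last_ccc = ccc
    return True


samples = ["abc", "caf\u00e9", "cafe\u0301", "a\u0301\u0323", "a\u0323\u0301",
           "\uac00", "\u212b", "\ufb01"]
for s in samples:
    # Cross-check the sketch against the copy-and-compare definition.
    assert nfd_quick_check(s) == (unicodedata.normalize("NFD", s) == s)
```

The cross-check at the end only covers a handful of samples, so it should be read as a sanity test rather than a proof of correctness.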

vstinner · 2017-12-12T13:10:47Z

> However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing.

The purpose of the function is to be faster than str == unicodedata.normalize(form, str). So yeah, any optimization is welcome.

But I wouldn't bother with the suboptimal MAYBE case, which is implemented as str == unicodedata.normalize(form, str). It can be optimized later, if needed.

If someone cares about performance, I will require a benchmark, since I only trust numbers :-)
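A benchmark along those lines could be sketched with `timeit`, assuming Python 3.8 or later so that `unicodedata.is_normalized` is available. The input string, repeat counts, and helper names below are arbitrary choices for illustration; real measurements would need several input shapes (pure ASCII, already normalized, not normalized).

```python
import timeit
import unicodedata

# Decomposed text: 'e' + U+0301 is not in NFC, so both checks must return False.
text = "cafe\u0301 " * 10_000


def by_copy() -> bool:
    # Baseline: normalize into a new string, then compare.
    return unicodedata.normalize("NFC", text) == text


def by_check() -> bool:
    # Candidate: ask for the answer directly.
    return unicodedata.is_normalized("NFC", text)


# The two approaches must agree before timing them means anything.
assert by_copy() == by_check()

copy_s = min(timeit.repeat(by_copy, number=20, repeat=3))
check_s = min(timeit.repeat(by_check, number=20, repeat=3))
print(f"normalize + compare: {copy_s:.4f}s  is_normalized: {check_s:.4f}s")
```

Taking the minimum over repeats reduces noise from other processes; the printed numbers will vary by machine and Python build.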

benjaminp · 2018-11-04T23:58:27Z

New changeset 2810dd7 by Benjamin Peterson (Max Bélanger) in branch 'master':
closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)

MaximeBelanger mannequin added the 3.7 (EOL) and topic-unicode labels Dec 12, 2017

stevendaprano added the type-feature label Dec 12, 2017

MaximeBelanger mannequin added the 3.8 label and removed the 3.7 (EOL) label Oct 25, 2018

benjaminp closed this as completed Nov 4, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022
