classification
Title: In `unicodedata`, it should be possible to check a unistr's normal form without necessarily copying it
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Maxime Belanger, benjamin.peterson, ezio.melotti, steven.daprano, vstinner
Priority: normal Keywords: patch

Created on 2017-12-12 01:16 by Maxime Belanger, last changed 2018-11-04 23:58 by benjamin.peterson. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 4806 merged maxbelanger, 2017-12-12 01:19
Messages (4)
msg308085 - (view) Author: Maxime Belanger (Maxime Belanger) Date: 2017-12-12 01:16
In our deployment of Python 2.7, we've patched `unicodedata` to introduce a new function: `is_normalized` can check whether a unistr is in a given normal form. This currently has to be done by creating a normalized copy, then checking whether it is equal to the source string.

This function uses the internal helper (also called `is_normalized`) that can "quick check" normalization, but falls back on creating a normalized copy and comparing (when necessary).

We're contributing this change in case it can be helpful to others. Feedback is welcome!
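For context, the pre-3.8 idiom and the function this issue adds answer the same question; a minimal illustration (runnable on Python 3.8+, where `unicodedata.is_normalized` landed per this issue):

```python
import unicodedata

# "café" with a combining acute accent: in NFD, but not in NFC.
s = "cafe\u0301"

# Pre-3.8 idiom: build a normalized copy, then compare.
print(s == unicodedata.normalize("NFC", s))  # False

# Python 3.8+: same answer, copying only when necessary.
print(unicodedata.is_normalized("NFC", s))   # False
print(unicodedata.is_normalized("NFD", s))   # True
```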
msg308122 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-12-12 12:25
Python 2.7 is in feature freeze, so this can only go into 3.7.

I would find this useful, and would like this feature. However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing. That could be expensive, and shouldn't be needed. According to the Unicode report on detecting normalization forms:

http://unicode.org/reports/tr15/#Detecting_Normalization_Forms

in the worst case, you can incrementally check only the code points in doubt (around the "MAYBE" code points).
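A minimal sketch of the UAX #15 quick-check loop referenced above. The two-entry `NFC_QC` dict below is an illustrative stand-in for the full Unicode NFC_Quick_Check property table, not the real data:

```python
import unicodedata

# Illustrative stand-in for the Unicode NFC_Quick_Check property table.
# The real table covers every code point; here: U+0300 (combining grave,
# may compose with a preceding base) and U+0344 (decomposes under NFC).
NFC_QC = {0x0300: "MAYBE", 0x0344: "NO"}

def quick_check_nfc(s):
    """Return "YES", "NO", or "MAYBE" per the UAX #15 quick-check loop."""
    last_ccc = 0  # canonical combining class of the previous code point
    result = "YES"
    for ch in s:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and last_ccc > ccc:
            return "NO"  # combining marks out of canonical order
        qc = NFC_QC.get(ord(ch), "YES")
        if qc == "NO":
            return "NO"
        if qc == "MAYBE":
            result = "MAYBE"  # only these spans need the full comparison
        last_ccc = ccc
    return result

print(quick_check_nfc("abc"))      # YES
print(quick_check_nfc("a\u0300"))  # MAYBE
print(quick_check_nfc("a\u0344"))  # NO
```

Only the spans that come back "MAYBE" would need the expensive normalize-and-compare fallback.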
msg308127 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-12 13:10
> However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing.

The purpose of the function is to be faster than `str == unicodedata.normalize(form, str)`. So yeah, any optimization is welcome.

But I haven't bothered with the suboptimal MAYBE case, which is currently implemented as `str == unicodedata.normalize(form, str)`. It can be optimized later, if needed.

If someone cares about performance, I will require a benchmark, since I only trust numbers :-)
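A microbenchmark along the lines Victor asks for could be sketched with `timeit` (Python 3.8+; the input string and iteration count are arbitrary choices, and absolute numbers vary by build):

```python
import timeit
import unicodedata

s = "plain ascii text " * 1000  # already NFC: the quick check answers fast

copy_and_compare = timeit.timeit(
    's == unicodedata.normalize("NFC", s)',
    globals={"s": s, "unicodedata": unicodedata},
    number=1000,
)
direct_check = timeit.timeit(
    'unicodedata.is_normalized("NFC", s)',
    globals={"s": s, "unicodedata": unicodedata},
    number=1000,
)
print(f"normalize-and-compare: {copy_and_compare:.4f}s")
print(f"is_normalized:         {direct_check:.4f}s")
```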
msg329276 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-11-04 23:58
New changeset 2810dd7be9876236f74ac80716d113572c9098dd by Benjamin Peterson (Max Bélanger) in branch 'master':
closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)
https://github.com/python/cpython/commit/2810dd7be9876236f74ac80716d113572c9098dd
History
Date User Action Args
2018-11-04 23:58:27  benjamin.peterson  set  status: open -> closed; nosy: + benjamin.peterson; messages: + msg329276; resolution: fixed; stage: patch review -> resolved
2018-10-25 00:06:09  Maxime Belanger  set  versions: + Python 3.8, - Python 3.7
2017-12-12 13:10:46  vstinner  set  messages: + msg308127
2017-12-12 12:25:08  steven.daprano  set  versions: - Python 2.7; nosy: + steven.daprano; messages: + msg308122; type: enhancement
2017-12-12 01:19:59  maxbelanger  set  keywords: + patch; stage: patch review; pull_requests: + pull_request4703
2017-12-12 01:16:10  Maxime Belanger  create