Title: In `unicodedata`, it should be possible to check a unistr's normal form without necessarily copying it
PR 4806 merged maxbelanger, 2017-12-12
Messages
In our deployment of Python 2.7, we've patched `unicodedata` to introduce a new function: `is_normalized` can check whether a unistr is in a given normal form. This currently has to be done by creating a normalized copy, then checking whether it is equal to the source string.

This function uses the internal helper (also called `is_normalized`) that can "quick check" normalization, but falls back on creating a normalized copy and comparing (when necessary).
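For readers following along, here is a minimal sketch of the two approaches described above: the existing copy-and-compare idiom and a call to the proposed function. The wrapper names `is_normalized_by_copy` and `is_normalized` are hypothetical helpers written for this illustration, and the `getattr` lookup is an assumption that keeps the sketch runnable on interpreters where `unicodedata.is_normalized` does not exist.

```python
import unicodedata


def is_normalized_by_copy(form, text):
    """Existing idiom: build a normalized copy, then compare it to the source."""
    return text == unicodedata.normalize(form, text)


def is_normalized(form, text):
    """Prefer unicodedata.is_normalized when the interpreter provides it.

    Falls back to the copy-and-compare idiom otherwise (hypothetical wrapper,
    not part of the patch itself).
    """
    checker = getattr(unicodedata, "is_normalized", None)
    if checker is not None:
        return checker(form, text)
    return is_normalized_by_copy(form, text)


composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(is_normalized("NFC", composed))    # True
print(is_normalized("NFC", decomposed))  # False
print(is_normalized("NFD", decomposed))  # True
print(is_normalized("NFD", composed))    # False
```

Both paths should return the same answer for any input; the only intended difference is whether a normalized copy has to be allocated.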

We're contributing this change in case it can be helpful to others. Feedback is welcome!
Python 2.7 is in feature freeze, so this can only go into 3.7.

I would find this useful, and would like this feature. However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing. That could be expensive, and shouldn't be needed. According to here:

in the worst case, you can incrementally check only the code points in doubt (around the "MAYBE" code points).
> However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing.

The purpose of the function is to be faster than str == unicodedata.normalize(form, str). So yeah, any optimization is welcome.

But I'm not worried about the suboptimal MAYBE case, which is implemented as: str == unicodedata.normalize(form, str). It can be optimized later, if needed.

If someone cares about performance, I will require a benchmark, since I only trust numbers :-)
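A benchmark of the kind requested could look roughly like the sketch below. It is only an illustration: the sample strings, the repetition counts, and the helper names `check_by_copy`, `check_builtin`, and `run_benchmark` are assumptions made here, not anything taken from the patch, and no timings are claimed.

```python
import timeit
import unicodedata


def check_by_copy(form, text):
    """Baseline: normalize into a new string and compare."""
    return text == unicodedata.normalize(form, text)


# Assumption: fall back to the baseline if the interpreter lacks is_normalized,
# so the script still runs (both columns then measure the same thing).
check_builtin = getattr(unicodedata, "is_normalized", check_by_copy)

# Hypothetical sample inputs: pure ASCII, already-composed, and decomposed text.
SAMPLES = {
    "ascii": "hello world " * 1000,
    "nfc_composed": "caf\u00e9 " * 1000,
    "nfd_decomposed": "cafe\u0301 " * 1000,
}


def run_benchmark(form="NFC", number=200):
    """Return {sample name: (copy seconds, builtin seconds)} for one form."""
    results = {}
    for name, text in SAMPLES.items():
        # Sanity check: both approaches must agree before timing them.
        assert check_builtin(form, text) == check_by_copy(form, text)
        copy_s = timeit.timeit(lambda: check_by_copy(form, text), number=number)
        builtin_s = timeit.timeit(lambda: check_builtin(form, text), number=number)
        results[name] = (copy_s, builtin_s)
    return results


if __name__ == "__main__":
    for name, (copy_s, builtin_s) in run_benchmark().items():
        print(f"{name:15s} copy={copy_s:.4f}s  builtin={builtin_s:.4f}s")
```

Running it for each of the four forms and on inputs that contain MAYBE code points would show whether the fallback path is worth optimizing.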
New changeset 2810dd7be9876236f74ac80716d113572c9098dd by Benjamin Peterson (Max Bélanger) in branch 'master':
closes bpo-32285: Add unicodedata.is_normalized. (GH-4806)
