This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Fast path for unicodedata.normalize()
Type: performance Stage: patch review
Components: Extension Modules Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: ajaksu2, loewis, pitrou, raulir, vdupras
Priority: normal Keywords: patch

Created on 2007-06-09 22:45 by raulir, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
normalization_fast_path.patch raulir, 2007-06-09 22:45
uninorm.patch pitrou, 2008-12-14 11:57
Messages (6)
msg52743 - (view) Author: Rauli Ruohonen (raulir) Date: 2007-06-09 22:45
Implements quick checking of already normalized forms as 
described in http://unicode.org/reports/tr15/#Annex8

The patch is against 2.6 SVN trunk. Normalization test
passes on both UCS2 and UCS4 builds on Ubuntu Edgy.

API affected:

unicodedata.normalize('NFC', u'a') is u'a' and similar
expressions become true, as the unicode object is not
copied when it is found to be already normalized.
The documentation does not specify either way.

Added memory footprint:

A new 8-bit field is added to _PyUnicode_DatabaseRecord,
and the generated _PyUnicode_Database_Records
array grows from 219 records to 304 records. Each
record looks like this:

typedef struct {
   const unsigned char category;
   const unsigned char combining;
   const unsigned char bidirectional;
   const unsigned char mirrored;
   const unsigned char east_asian_width;
   const unsigned char normalization_quick_check;
} _PyUnicode_DatabaseRecord;

normalization_quick_check is the added field.
msg62437 - (view) Author: Virgil Dupras (vdupras) (Python triager) Date: 2008-02-15 20:13
It's a very interesting patch. I wonder why it fell into oblivion. stuff 
like "unicode.normalize('NFC', u'\xe9')" was more than twice as fast for 
me.

Making sure that all unicode is normalized can be a bottleneck in a lot 
of applications (it somewhat is in my apps).

The downside is that it makes test_codecs and test_unicode_file fail.
msg77790 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-14 11:57
Here is a new patch against trunk, including the modified data files.
All tests pass and I confirm a very healthy speed-up (~ 5x) when trying
to a normalize an already normalized string. The slowdown for
non-normalized strings is so small that it cannot be distinguished from
measurement noise.

Martin, do you think this can be committed?
msg86594 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-04-26 02:37
Should this be considered for 3.1?
msg86600 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-04-26 05:03
The patch looks fine to me, please apply. One change is necessary: the
quick check should only be performed if it is the newest version (i.e.
self is NULL); otherwise, we would need to add a delta list for changed
quickcheck values, as well.

I think it would be possible to fold the NO and MAYBE answers into NO in
the database already, reducing the number of necessary bits to 4, and
then allowing to check with a simple bit test (i.e. no shift). OTOH, the
shift can be avoided already, by changing  quickcheck_shift into a
bitmask. OTTH, perhaps the compiler does that already, anyway.

With a reduction of the number of bits, it would be possible to reclaim
a byte, by merging the bits into one of the other fields. Whether that's
worth it, I don't know.
msg86705 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-04-27 22:32
Committed in r72054, r72055. Thanks for the patch!
History
Date User Action Args
2022-04-11 14:56:24adminsetgithub: 45076
2009-04-27 22:32:48pitrousetstatus: open -> closed
resolution: accepted -> fixed
messages: + msg86705
2009-04-26 05:04:55loewissetassignee: pitrou
resolution: accepted
messages: + msg86600
2009-04-26 02:37:36ajaksu2setnosy: + ajaksu2
messages: + msg86594
2008-12-14 11:57:40pitrousetfiles: + uninorm.patch
versions: + Python 3.1, Python 2.7, - Python 2.6
nosy: + loewis
messages: + msg77790
type: performance
stage: patch review
2008-07-25 12:41:28pitrousetnosy: + pitrou
2008-02-15 20:13:14vduprassetnosy: + vdupras
messages: + msg62437
2007-06-09 22:45:42raulircreate