
Author arigo
Recipients arigo, ezio.melotti, vstinner
Date 2016-05-03.08:48:26
Message-id <1462265307.72.0.382714967199.issue26917@psf.upfronthosting.co.za>
In-reply-to
Content
There is an apparent inconsistency in unicodedata.normalize("NFC", ...), introduced with the switch from Unicode database 5.1.0 to 5.2.0 (in Python 2.7).  Please note that my knowledge of Unicode is limited, so I may be wrong and the behavior described below might be perfectly correct.

>>> from unicodedata import normalize
>>> print(normalize("NFC", "---\uafb8\u11a7---").encode('utf-8'))
b'---\xea\xbe\xb8\xe1\x86\xa7---'    # i.e., the same as the input

>>> print(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1").encode('utf-8'))
b'---\xea\xbe\xb8---\xe3\xa4\xba'

Note how in the second example the initial two-character sequence "\uafb8\u11a7" is replaced with a single character (the first of the two, "\uafb8").  This does not happen in the first example.  In Python 2.6, both inputs would be normalized to the single-character output.
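
To make the pair "\uafb8\u11a7" easier to reason about, here is a small sketch of the index arithmetic from the Unicode Standard's Hangul syllable composition algorithm. The constants (SBase = U+AC00, TBase = U+11A7, TCount = 28) come from the standard; the helper name `describe` is hypothetical and only for illustration. It shows that "\uafb8" is an LV syllable (no trailing consonant) and that "\u11a7" sits exactly at TBase, which the standard defines as one less than the first trailing consonant (U+11A8). This is only an observation about where the pair falls; it does not by itself settle which of the two results above is correct.

```python
# Constants from the Unicode Standard's Hangul composition algorithm.
S_BASE = 0xAC00            # first precomposed Hangul syllable
T_BASE = 0x11A7            # one less than the first trailing consonant (U+11A8)
T_COUNT = 28               # trailing slots per LV syllable, slot 0 = "none"
S_COUNT = 19 * 21 * 28     # total number of precomposed syllables (11172)


def describe(pair):
    """Return (is_lv_syllable, trailing_index) for a two-character string.

    Hypothetical helper: it only does the index arithmetic, it does not
    perform any normalization itself.
    """
    syllable, trailing = (ord(ch) for ch in pair)
    s_index = syllable - S_BASE
    # An LV syllable is a precomposed syllable whose trailing slot is 0.
    is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
    t_index = trailing - T_BASE
    return is_lv, t_index


print(describe("\uafb8\u11a7"))  # (True, 0)
```

A trailing index of 0 is the boundary value, so the pair looks like an edge case of the composition step rather than an ordinary LV + T sequence.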

The new behavior introduced in Python 2.7 is to first run a quick check on the string; if this `is_normalized()` function returns 1, the string is assumed to be already normalized and is returned unmodified.  However, the example "\uafb8\u11a7" shows contradictory behavior: is_normalized() returns 1 for it, yet actual normalization changes it.  We can see in the second example above that if, for an unrelated reason, we force is_normalized() to return 0 (by adding some non-normalized character elsewhere in the string), then the "\uafb8\u11a7" is changed.
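
The check described above can be sketched as a small helper that works on any input. This is only an illustration: the function name `nfc_consistent` and the default trigger character are my own choices, and the comparison assumes that the trigger does not combine with the end of the text (true here, because the text ends in "-"). The result depends on the Python version and its Unicode database, so no expected output is shown.

```python
from unicodedata import normalize


def nfc_consistent(text, trigger="\U0002f8a1"):
    """Return True if NFC treats `text` the same with and without a trigger.

    `trigger` is a character that is not in NFC, so appending it should
    force the full normalization path instead of the quick-check shortcut.
    Assumes `trigger` does not interact with the last character of `text`.
    """
    alone = normalize("NFC", text)             # may take the quick-check path
    forced = normalize("NFC", text + trigger)  # full normalization
    expected = alone + normalize("NFC", trigger)
    return forced == expected


sample = "---\uafb8\u11a7---"
print(nfc_consistent(sample))
```

On an interpreter that shows the behavior reported here, this should print False for the sample, since the two paths disagree about "\uafb8\u11a7".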

This is a bit unexpected, but I don't know if it is officially correct behavior or if the problem is a bug in `is_normalized()`.
History
Date                 User   Action  Args
2016-05-03 08:48:27  arigo  set     recipients: + arigo, vstinner, ezio.melotti
2016-05-03 08:48:27  arigo  set     messageid: <1462265307.72.0.382714967199.issue26917@psf.upfronthosting.co.za>
2016-05-03 08:48:27  arigo  link    issue26917 messages
2016-05-03 08:48:26  arigo  create