Message 264704 - Python tracker


Author	vstinner
Recipients	arigo, ezio.melotti, vstinner
Date	2016-05-03.09:25:45
Message-id	<1462267545.53.0.914258768336.issue26917@psf.upfronthosting.co.za>

Content
Extract of unicodedata_UCD_normalize_impl():

    if (strcmp(form, "NFC") == 0) {
        if (is_normalized(self, input, 1, 0)) {
            Py_INCREF(input);
            return input;
        }
        return nfc_nfkc(self, input, 0);
    }

is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also false for "\uafb8\u11a7\U0002f8a1").

unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged because is_normalized() is true.

unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": U+afb8 is decomposed to {U+1101, U+116e}.
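
The calls above can be replayed with a short script. This is only a sketch for reproducing the observations: the helper name codepoints() is made up for the example, and the NFC results are printed rather than asserted because they depend on the Python version and its bundled Unicode database.

```python
import unicodedata

s = "\uafb8\u11a7"


def codepoints(text):
    """Format a string as a list of U+XXXX code points (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)
roundtrip = unicodedata.normalize("NFC", nfd)

print("input         :", codepoints(s))
print("NFD(input)    :", codepoints(nfd))
print("NFC(input)    :", codepoints(nfc))
print("NFC(NFD(input)):", codepoints(roundtrip))

# NFC should give the same result whether or not the input was decomposed
# first; a mismatch here is the symptom described in this message.
print("consistent    :", nfc == roundtrip)
```

If "consistent" prints False, the quick-check shortcut and the full composition path disagree for this input.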

unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) returns "\uafb8": this is the result of the Hangul Composition, where {U+1101, U+116e, U+11a7} is composed to {U+afb8}.
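
To compare against the arithmetic Hangul algorithm, here is a minimal sketch of syllable decomposition and composition as I read section 3.12 of the Unicode Standard. It is not the CPython implementation; the function names are hypothetical, and only the constants (SBase, LBase, VBase, TBase and the counts) come from the standard.

```python
# Constants from the Unicode Standard, section 3.12 (Conjoining Jamo Behavior).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = L_COUNT * N_COUNT   # 11172


def decompose_syllable(cp):
    """Decompose a precomposed Hangul syllable into L, V and optional T."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]
    l = L_BASE + s_index // N_COUNT
    v = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return [l, v] if t_index == 0 else [l, v, T_BASE + t_index]


def compose_jamo(cps):
    """Compose L+V into LV, then LV+T into LVT, over a list of code points."""
    out = []
    for cp in cps:
        if out:
            last = out[-1]
            l_index, v_index = last - L_BASE, cp - V_BASE
            if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
                out[-1] = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                continue
            s_index, t_index = last - S_BASE, cp - T_BASE
            # T_BASE itself (t_index == 0) is not a trailing consonant,
            # so the comparison on the left is strict.
            if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
                    and 0 < t_index < T_COUNT):
                out[-1] = last + t_index
                continue
        out.append(cp)
    return out


print([hex(c) for c in decompose_syllable(0xAFB8)])
print([hex(c) for c in compose_jamo([0x1101, 0x116E, 0x11A7])])
```

Under this reading, U+11A7 is TBase itself (trailing index 0), so the sketch leaves it after U+AFB8 instead of absorbing it; that may be worth checking against the bounds used in nfc_nfkc().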

It may be an issue in the "quickcheck" property of the Python Unicode database. Format of this field:

    /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
       as described in http://unicode.org/reports/tr15/#Annex8. */
    quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
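
To make the field layout concrete, the mask expression can be mirrored in Python. This is a sketch under assumptions: the function names and the sample field value 0x20 are invented for illustration, and I am assuming that the last two arguments of is_normalized(self, input, 1, 0) map to nfc and k.

```python
# Meaning of the two quickcheck bits, taken from the C comment above.
QC_YES, QC_MAYBE, QC_NO = 0, 1, 2


def quickcheck_shift(nfc, k):
    """Bit offset of the two quickcheck bits for a normalization form."""
    return (4 if nfc else 0) + (2 if k else 0)


def quickcheck_mask(nfc, k):
    """Python mirror of: 3 << ((nfc ? 4 : 0) + (k ? 2 : 0))."""
    return 3 << quickcheck_shift(nfc, k)


def quickcheck_value(field, nfc, k):
    """Extract the 0=Yes / 1=Maybe / 2=No value from a packed field."""
    return (field >> quickcheck_shift(nfc, k)) & 3


for name, nfc, k in [("NFD", 0, 0), ("NFKD", 0, 1), ("NFC", 1, 0), ("NFKC", 1, 1)]:
    print(f"{name:4} mask = {quickcheck_mask(nfc, k):#04x}")

# Hypothetical packed value: only the NFC bits are set, and they say "No".
sample_field = 0x20
print(quickcheck_value(sample_field, nfc=1, k=0) == QC_NO)
```

With this layout, the NFC check uses bits 4-5 of the field, so a character whose NFC bits are 0 passes the shortcut whatever its neighbours are.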

History
Date	User	Action	Args
2016-05-03 09:25:45	vstinner	set	recipients: + vstinner, arigo, ezio.melotti
2016-05-03 09:25:45	vstinner	set	messageid: <1462267545.53.0.914258768336.issue26917@psf.upfronthosting.co.za>
2016-05-03 09:25:45	vstinner	link	issue26917 messages
2016-05-03 09:25:45	vstinner	create