
Author vstinner
Recipients arigo, ezio.melotti, vstinner
Date 2016-05-03.09:25:45
Message-id <1462267545.53.0.914258768336.issue26917@psf.upfronthosting.co.za>
In-reply-to
Content
Extract from unicodedata_UCD_normalize_impl():

    if (strcmp(form, "NFC") == 0) {
        if (is_normalized(self, input, 1, 0)) {
            Py_INCREF(input);
            return input;
        }
        return nfc_nfkc(self, input, 0);
    }

is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also false for "\uafb8\u11a7\U0002f8a1").

unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged because is_normalized() is true.

unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": U+afb8 is decomposed to {U+1101, U+116e}.
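For reference, the NFD result above can be reproduced with the arithmetic Hangul syllable decomposition described in the Unicode Standard. The following is only a minimal sketch: the helper name hangul_decompose is made up for illustration and is not part of the unicodedata module.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = L_COUNT * N_COUNT   # 11172


def hangul_decompose(code_point):
    """Hypothetical helper: split a precomposed Hangul syllable into jamo."""
    s_index = code_point - S_BASE
    if not 0 <= s_index < S_COUNT:
        # Not a precomposed Hangul syllable: leave it alone.
        return [code_point]
    parts = [
        L_BASE + s_index // N_COUNT,              # leading consonant
        V_BASE + (s_index % N_COUNT) // T_COUNT,  # vowel
    ]
    t_index = s_index % T_COUNT
    if t_index:
        # Trailing index 0 means "no trailing consonant".
        parts.append(T_BASE + t_index)
    return parts


print([f"U+{cp:04X}" for cp in hangul_decompose(0xAFB8)])
# ['U+1101', 'U+116E']
```

Note that U+11A7 is equal to T_BASE in this scheme, that is, trailing index 0; the trailing consonants proper start at U+11A8. That may be relevant to how U+11A7 is treated during composition.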

unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) returns "\uafb8": this is the result of the Hangul composition step, which composes {U+1101, U+116e, U+11a7} to {U+afb8}. So NFC of the original string and NFC of its NFD form give different results.
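A small script to compare the two NFC paths on a given interpreter. This is a sketch: nfc_roundtrip is a hypothetical helper name, and what it prints depends on the Python version and its Unicode database, so no expected result is given for the round trip.

```python
import unicodedata


def nfc_roundtrip(text):
    """Hypothetical helper: NFC applied directly, and NFC applied after NFD."""
    direct = unicodedata.normalize("NFC", text)
    via_nfd = unicodedata.normalize(
        "NFC", unicodedata.normalize("NFD", text)
    )
    return direct, via_nfd


direct, via_nfd = nfc_roundtrip("\uafb8\u11a7")
# ascii() shows the code points as escapes instead of rendered glyphs.
print("direct :", ascii(direct))
print("via NFD:", ascii(via_nfd))
print("consistent:", direct == via_nfd)
```

If the two paths disagree, "consistent" is False, which is the inconsistency described in this message.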

It may be an issue in the "quickcheck" property of the Python Unicode database. Format of this field:

    /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
       as described in http://unicode.org/reports/tr15/#Annex8. */
    quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
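To make the bit layout concrete, here is the same mask computation transcribed to Python, plus a decoder for the 2-bit value. This is a sketch based only on the C expression and comment above: the function names are my own, and the sample field value 0x20 is made up for illustration rather than taken from the real database.

```python
def quickcheck_shift(nfc, k):
    # Same arithmetic as the C expression: (nfc ? 4 : 0) + (k ? 2 : 0)
    return (4 if nfc else 0) + (2 if k else 0)


def quickcheck_mask(nfc, k):
    return 3 << quickcheck_shift(nfc, k)


def quickcheck_value(field, nfc, k):
    """Hypothetical decoder: extract the 2-bit value (0=Yes, 1=Maybe, 2=No)."""
    return (field & quickcheck_mask(nfc, k)) >> quickcheck_shift(nfc, k)


forms = {"NFD": (0, 0), "NFKD": (0, 1), "NFC": (1, 0), "NFKC": (1, 1)}
for name, (nfc, k) in forms.items():
    print(f"{name:4} mask = {quickcheck_mask(nfc, k):#010b}")
# NFD  mask = 0b00000011
# NFKD mask = 0b00001100
# NFC  mask = 0b00110000
# NFKC mask = 0b11000000

# Illustrative field value only: NFC bits set to 2, meaning "No".
print(quickcheck_value(0x20, nfc=1, k=0))  # 2
```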