Message264704
Extract of unicodedata_UCD_normalize_impl():
    if (strcmp(form, "NFC") == 0) {
        if (is_normalized(self, input, 1, 0)) {
            Py_INCREF(input);
            return input;
        }
        return nfc_nfkc(self, input, 0);
    }
is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also false for "\uafb8\u11a7\U0002f8a1").
unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged because is_normalized() is true.
unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": U+afb8 is decomposed to {U+1101, U+116e}.
unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) returns "\uafb8". This is the result of Hangul composition: {U+1101, U+116e, U+11a7} is composed to {U+afb8}.
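The code point arithmetic behind these observations can be checked against the Hangul syllable algorithm from the Unicode Standard. The following is only an illustrative sketch: the helper names decompose_syllable() and is_trailing_consonant() are hypothetical and are not part of CPython. It shows that U+afb8 has no trailing consonant of its own and that U+11a7 equals TBase, which the standard does not count as a trailing consonant. That may be relevant to why the two code paths disagree. The final lines print what the running interpreter returns. The round-trip result may differ between Python versions, so no expected value is given for it.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588


def decompose_syllable(cp):
    """Return the jamo code points of a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < L_COUNT * N_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    # t_index == 0 means "no trailing consonant"; T_BASE itself is never emitted.
    return [lead, vowel] + ([T_BASE + t_index] if t_index else [])


def is_trailing_consonant(cp):
    """True only for jamo that can be composed onto an LV syllable."""
    return T_BASE < cp < T_BASE + T_COUNT


jamo = decompose_syllable(0xAFB8)
print([f"U+{cp:04X}" for cp in jamo])
print(is_trailing_consonant(0x11A7))

# Behaviour of the running interpreter for the strings discussed above.
s = "\uafb8\u11a7"
nfd = unicodedata.normalize("NFD", s)
round_trip = unicodedata.normalize("NFC", nfd)
print(ascii(nfd), ascii(round_trip))
```

If round_trip differs from s, the shortcut path and the full composition path disagree for this input.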
The inconsistency may come from the "quickcheck" property of the Python Unicode database. Format of this field:
    /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
       as described in http://unicode.org/reports/tr15/#Annex8. */
    quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
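To make the field layout concrete, here is a small Python sketch that mirrors the C expression and decodes the two-bit value for each normalization form. The function names and the example byte are hypothetical and chosen only for illustration. The meaning of the values 0, 1 and 2 is taken from the comment quoted above.

```python
# Values of the two-bit quickcheck field, per the quoted C comment.
QC_YES, QC_MAYBE, QC_NO = 0, 1, 2


def quickcheck_shift(nfc, k):
    """Bit offset of the field for one form, mirroring the C expression."""
    return (4 if nfc else 0) + (2 if k else 0)


def quickcheck_mask(nfc, k):
    """Mask selecting the two quickcheck bits for one form."""
    return 3 << quickcheck_shift(nfc, k)


def quickcheck_value(quickcheck, nfc, k):
    """Extract the Yes/Maybe/No value for one form from a packed byte."""
    return (quickcheck & quickcheck_mask(nfc, k)) >> quickcheck_shift(nfc, k)


for name, nfc, k in [("NFD", 0, 0), ("NFKD", 0, 1), ("NFC", 1, 0), ("NFKC", 1, 1)]:
    print(name, hex(quickcheck_mask(nfc, k)))

# Hypothetical packed byte: NFC field set to "Maybe", all other fields "Yes".
example = QC_MAYBE << quickcheck_shift(1, 0)
print(quickcheck_value(example, 1, 0))
```

With this layout, each form occupies its own pair of bits in a single byte, so one database lookup per character is enough to answer the quick check for any of the four forms.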

Date | User | Action | Args
2016-05-03 09:25:45 | vstinner | set | recipients: + vstinner, arigo, ezio.melotti
2016-05-03 09:25:45 | vstinner | set | messageid: <1462267545.53.0.914258768336.issue26917@psf.upfronthosting.co.za>
2016-05-03 09:25:45 | vstinner | link | issue26917 messages
2016-05-03 09:25:45 | vstinner | create |