classification
Title: unicodedata.normalize(): bug in Hangul Composition
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.6, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: Ronan.Lamy, arigo, ezio.melotti, malin, vstinner
Priority: normal
Keywords: patch

Created on 2016-05-03 08:48 by arigo, last changed 2018-06-17 17:10 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description Edit
hangul_composition.patch vstinner, 2016-05-03 10:02 review
Messages (10)
msg264697 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2016-05-03 08:48
There is an apparent inconsistency in unicodedata.normalize("NFC"), introduced with the switch from the Unicode DB 5.1.0 to 5.2.0 (in Python 2.7).  First, please note that my knowledge of Unicode is limited, so I may be wrong and the following behavior might be perfectly correct.

>>> from unicodedata import normalize
>>> print(normalize("NFC", "---\uafb8\u11a7---").encode('utf-8'))
b'---\xea\xbe\xb8\xe1\x86\xa7---'    # i.e., the same as the input

>>> print(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1").encode('utf-8'))
b'---\xea\xbe\xb8---\xe3\xa4\xba'

Note how in the second example the initial two-character part is replaced with a single character (actually the first of them).  This does not occur in the first example.  In Python 2.6, both inputs would be normalized to the single-character output.

The new behavior introduced in Python 2.7 is to first do a quick check on the string, and if this `is_normalized()` function returns 1, we know that the string should already be normalized and we return it unmodified.  However, the example "\uafb8\u11a7" shows contradictory behavior: is_normalized() returns 1 for it, yet actual normalization changes it.  We can see in the second example above that if, for an unrelated reason, we force is_normalized() to return 0 (by adding some non-normalized character elsewhere in the string), then the "\uafb8\u11a7" is changed.

This is a bit unexpected, but I don't know if it is officially correct behavior or if the problem is a bug in `is_normalized()`.
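The inconsistency described above can be probed with a small check. This is a minimal sketch, not part of the original report: the helper name `prefix_is_stable` is hypothetical, and it assumes that the ASCII separator `---` prevents any composition across the prefix/suffix boundary. On an interpreter affected by this issue the check should report a mismatch; on one where the issue is fixed, the prefix is expected to normalize the same way in both cases.

```python
from unicodedata import normalize


def prefix_is_stable(prefix, suffix, form="NFC"):
    """Return True if normalizing prefix + suffix leaves the prefix part
    identical to normalizing the prefix alone.

    Hypothetical helper for illustration. It assumes that no composition
    can happen across the prefix/suffix boundary (true here because the
    prefix ends in ASCII hyphens).
    """
    alone = normalize(form, prefix)
    combined = normalize(form, prefix + suffix)
    return combined.startswith(alone)


if __name__ == "__main__":
    prefix = "---\uafb8\u11a7---"
    suffix = "\U0002f8a1"  # not NFC-normalized, so the quick check fails
    print(ascii(normalize("NFC", prefix)))
    print(ascii(normalize("NFC", prefix + suffix)))
    print("prefix stable:", prefix_is_stable(prefix, suffix))
```

The point of the design is that the suffix is unrelated to the Hangul characters, so any change in how the prefix is normalized can only come from the quick-check shortcut being bypassed.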
msg264698 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2016-05-03 09:01
Note: the examples can also be written in this clearer way on Python 3:

>>> from unicodedata import normalize
>>> print(ascii(normalize("NFC", "---\uafb8\u11a7---")))
'---\uafb8\u11a7---'

>>> print(ascii(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1")))
'---\uafb8---\u393a'
msg264704 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-03 09:25
Extract of unicodedata_UCD_normalize_impl():

    if (strcmp(form, "NFC") == 0) {
        if (is_normalized(self, input, 1, 0)) {
            Py_INCREF(input);
            return input;
        }
        return nfc_nfkc(self, input, 0);
    }

is_normalized() is true for "\uafb8\u11a7" but false for "\U0002f8a1" (and also false for "\uafb8\u11a7\U0002f8a1").

unicodedata.normalize("NFC", "\uafb8\u11a7") returns the string unchanged because is_normalized() is true.

unicodedata.normalize("NFD", "\uafb8\u11a7") returns "\u1101\u116e\u11a7": U+afb8 is decomposed to {U+1101, U+116e}.

unicodedata.normalize("NFC", unicodedata.normalize("NFD", "\uafb8\u11a7")) returns "\uafb8": this is the result of the Hangul Composition. {U+1101, U+116e, U+11a7} is composed to {U+afb8}.

It may be an issue in the "quickcheck" property of the Python Unicode database. Format of this field:

    /* The two quickcheck bits at this shift mean 0=Yes, 1=Maybe, 2=No,
       as described in http://unicode.org/reports/tr15/#Annex8. */
    quickcheck_mask = 3 << ((nfc ? 4 : 0) + (k ? 2 : 0));
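To make the mask computation concrete, here is a small Python transcription of the expression above. The function names `quickcheck_mask` and `quickcheck_value` are hypothetical, and the packed bytes passed to them in any example are made up for illustration; the real values come from the generated Unicode database.

```python
def quickcheck_mask(nfc, k):
    """Mirror of the C expression: select the two quick-check bits
    for the requested normalization form."""
    return 3 << ((4 if nfc else 0) + (2 if k else 0))


def quickcheck_value(quickcheck, nfc, k):
    """Extract the 2-bit field (0=Yes, 1=Maybe, 2=No) for one form.

    `quickcheck` is the packed per-character byte. This helper is an
    illustrative assumption, not a function from unicodedata.c.
    """
    shift = (4 if nfc else 0) + (2 if k else 0)
    return (quickcheck & quickcheck_mask(nfc, k)) >> shift


if __name__ == "__main__":
    forms = {"NFD": (False, False), "NFKD": (False, True),
             "NFC": (True, False), "NFKC": (True, True)}
    for name, (nfc, k) in forms.items():
        print(name, hex(quickcheck_mask(nfc, k)))
```

Each form gets its own pair of bits, so a single byte per character can hold the quick-check answer for all four normalization forms.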
msg264706 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-03 09:35
I tested http://minaret.info/test/normalize.msp

(1)

꾸ᆧ (afb8 11a7) --NFC or NFKC--> 꾸ᆧ (afb8, 11a7) === same as Python
꾸ᆧ (afb8 11a7) --NFD or NFKD--> 꾸ᆧ (1101 116e, 11a7) === same as Python

(2)

꾸ᆧ (1101 116e 11a7) --NFC or NFKC--> 꾸 (afb8) === same as Python
꾸ᆧ (1101 116e 11a7) --NFD or NFKD--> 꾸ᆧ (1101 116e, 11a7) === same as Python

(3)

꾸ᆧ㤺 (afb8 11a7 2f8a1) --NFC or NFKC--> 꾸ᆧ㤺 (afb8, 11a7, 393a) === DIFFERENT from Python: Python eats the U+11a7 character
꾸ᆧ㤺 (afb8 11a7 2f8a1) --NFD or NFKD--> 꾸ᆧ㤺 (1101 116e, 11a7, 393a) === same as Python
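The same comparison can be run locally instead of through the web form. This sketch is not from the original message; `codepoints` and `show_forms` are hypothetical helper names. The NFC/NFKC output for the Hangul cases depends on the interpreter version, so no expected results are listed here.

```python
from unicodedata import normalize


def codepoints(text):
    """Return the code points of text as lowercase hex strings."""
    return [format(ord(ch), "x") for ch in text]


def show_forms(text):
    """Map each normalization form to the resulting code points."""
    return {form: codepoints(normalize(form, text))
            for form in ("NFC", "NFKC", "NFD", "NFKD")}


if __name__ == "__main__":
    samples = [
        "\uafb8\u11a7",            # case (1)
        "\u1101\u116e\u11a7",      # case (2)
        "\uafb8\u11a7\U0002f8a1",  # case (3)
    ]
    for sample in samples:
        print(codepoints(sample))
        for form, result in show_forms(sample).items():
            print("   ", form, result)
```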
msg264707 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-03 10:01
Extract of nfc_nfkc():

      /* Hangul Composition. We don't need to check for <LV,T>
         pairs, since we always have decomposed data. */
      code = PyUnicode_READ(kind, data, i);
      if (LBase <= code && code < (LBase+LCount) &&
          i + 1 < len &&
          VBase <= PyUnicode_READ(kind, data, i+1) &&
          PyUnicode_READ(kind, data, i+1) <= (VBase+VCount)) {
          int LIndex, VIndex;
          LIndex = code - LBase;
          VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
          code = SBase + (LIndex*VCount+VIndex)*TCount;
          i+=2;
          if (i < len &&
              TBase <= PyUnicode_READ(kind, data, i) &&
              PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
              code += PyUnicode_READ(kind, data, i)-TBase;
              i++;
          }
          output[o++] = code;
          continue;
      }

With the input string (1101 116e, 11a7), we get:

* LIndex = 1
* VIndex = 13


code = SBase + (LIndex*VCount+VIndex)*TCount + (ch3 - TBase)
= 0xAC00 + (1 * 21 + 13) * 28 + 0
= 0xafb8

Constants:

* LBase = 0x1100, LCount = 19
* VBase = 0x1161, VCount = 21
* TBase = 0x11A7, TCount = 28
* SBase = 0xAC00

The problem may be that we consume the third character even though (ch3 - TBase) is equal to 0.
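For comparison, here is a minimal Python sketch of Hangul syllable composition using the constants above. It is not the attached patch: the function name `compose_hangul` is hypothetical, and the bounds follow the arithmetic in the Unicode Standard's description of Hangul composition, where a trailing consonant only counts when its index is strictly between 0 and TCount. Under that assumption U+11A7 (index 0) is left alone rather than consumed.

```python
# Constants as listed above.
LBase, LCount = 0x1100, 19
VBase, VCount = 0x1161, 21
TBase, TCount = 0x11A7, 28
SBase = 0xAC00


def compose_hangul(codes):
    """Compose <L, V> and <L, V, T> jamo sequences in a list of code points.

    Illustrative sketch only: it handles decomposed input and ignores all
    non-Hangul composition.
    """
    out = []
    i = 0
    while i < len(codes):
        code = codes[i]
        if (LBase <= code < LBase + LCount and i + 1 < len(codes)
                and VBase <= codes[i + 1] < VBase + VCount):
            l_index = code - LBase
            v_index = codes[i + 1] - VBase
            code = SBase + (l_index * VCount + v_index) * TCount
            i += 2
            # Strict lower bound: TBase itself (index 0) is not a
            # trailing consonant, so it must not be consumed.
            if i < len(codes) and TBase < codes[i] < TBase + TCount:
                code += codes[i] - TBase
                i += 1
            out.append(code)
            continue
        out.append(code)
        i += 1
    return out


if __name__ == "__main__":
    result = compose_hangul([0x1101, 0x116E, 0x11A7])
    print([hex(c) for c in result])  # ['0xafb8', '0x11a7']
```

With these bounds the input (1101 116e, 11a7) composes to (afb8, 11a7), which matches what the quick-check path already returns for "\uafb8\u11a7".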
msg264708 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-03 10:02
Attached patch changes Hangul Composition. I'm not sure that it is correct.
msg264711 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2016-05-03 10:29
See also https://bitbucket.org/pypy/pypy/issues/2289/incorrect-unicode-normalization. It seems that you reached the same conclusion as the OP in that issue: the real problem is that normalizing "\uafb8\u11a7" should not drop the second character.  Both Python and PyPy do that, but Python adds the "is_normalized()" check, so in some cases it returns the correct unmodified result.
msg314067 - (view) Author: Ronan Lamy (Ronan.Lamy) * Date: 2018-03-18 23:24
Victor's patch is correct. I implemented the same fix in PyPy in https://bitbucket.org/pypy/pypy/commits/92b4fb5b9e58
msg314076 - (view) Author: Ma Lin (malin) * Date: 2018-03-19 02:45
> Victor's patch is correct.

I'm afraid you are wrong.
Please see PR 1958 in issue29456, IMO this PR can be merged.
msg319803 - (view) Author: Ma Lin (malin) * Date: 2018-06-17 02:50
This issue can be closed; it was already fixed in issue29456.

Also, PyPy's current code is correct.
History
Date User Action Args
2018-06-17 17:10:10 benjamin.peterson set status: open -> closed; resolution: fixed; stage: resolved
2018-06-17 02:50:08 malin set messages: + msg319803
2018-03-19 02:45:21 malin set nosy: + malin; messages: + msg314076
2018-03-18 23:24:10 Ronan.Lamy set nosy: + Ronan.Lamy; messages: + msg314067
2016-05-03 10:29:22 arigo set messages: + msg264711
2016-05-03 10:02:11 vstinner set files: + hangul_composition.patch; keywords: + patch; messages: + msg264708
2016-05-03 10:01:41 vstinner set messages: + msg264707
2016-05-03 09:36:18 vstinner set title: Inconsistency in unicodedata.normalize()? -> unicodedata.normalize(): bug in Hangul Composition
2016-05-03 09:35:40 vstinner set messages: + msg264706
2016-05-03 09:25:45 vstinner set messages: + msg264704
2016-05-03 09:01:37 arigo set messages: + msg264698
2016-05-03 08:48:27 arigo create