classification
Title: bugs in unicodedata.normalize: u1176, u11a7 and u11c3
Type: behavior Stage: patch review
Components: Unicode Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, haypo, lemburg, loewis, pusnow, xiang.zhang
Priority: normal Keywords: patch

Created on 2017-02-06 04:27 by pusnow, last changed 2017-08-28 02:41 by pusnow.

Files
File name Uploaded Description Edit
u1176.patch pusnow, 2017-02-06 04:27 review
u11a7u11c3.patch pusnow, 2017-02-06 05:47 review
Pull Requests
URL Status Linked Edit
PR 1958 open pusnow, 2017-06-05 15:48
Messages (12)
msg287077 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-02-06 04:27
unicodedata.normalize cannot correctly NFC-normalize Hangul strings that contain \u1176 (HANGUL JUNGSEONG A-O).

>>> from unicodedata import normalize
>>> normalize("NFC", "\u1100\u1176\u11a8")
'깍'

=> The result should be "\u1100\u1176\u11a8" (unchanged), not '깍' (\uae4d).

I have attached a patch for this issue. (It fixes the boundary check for modern medial vowels.)
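
For illustration, here is a minimal sketch of the arithmetic behind the wrong result. It assumes the Hangul constants published in the Unicode standard (SBase, LBase, VBase, TBase, VCount, TCount); the helper names `is_modern_vowel` and `unchecked_syllable` are hypothetical and are not taken from the CPython source or the attached patch.

```python
# Hangul composition constants as given in the Unicode standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def is_modern_vowel(code):
    """Return True if code is one of the 21 modern medial vowels (U+1161..U+1175)."""
    return V_BASE <= code < V_BASE + V_COUNT


def unchecked_syllable(lead, vowel, trail):
    """Apply the syllable formula with no range checks (illustration only)."""
    l_index = lead - L_BASE
    v_index = vowel - V_BASE
    t_index = trail - T_BASE
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


print(is_modern_vowel(0x1175))  # last modern vowel
print(is_modern_vowel(0x1176))  # one past the end, must not compose
print(hex(ord(unchecked_syllable(0x1100, 0x1176, 0x11A8))))
```

If the vowel range is off by one and U+1176 is accepted, the formula yields U+AE4D, which matches the incorrect output reported above.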
msg287078 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-02-06 05:21
How about the third character's range? The code seems to assume it is [11a7..11c3], while the spec says [11a8..11c2].

>>> unicodedata.normalize("NFC", "\u1100\u1175\u11a7")
'기'

while it should be '기ᆧ'?
msg287079 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-02-06 05:47
I think you are right. The range of modern final consonants is [11a8..11c2].
I have attached another patch for this issue.
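
As a rough sketch of where those endpoints come from, assuming the TBase and TCount constants from the Unicode standard (the function name `is_modern_trailing_consonant` is hypothetical): TBase itself stands for "no trailing consonant", so both ends of the comparison are exclusive.

```python
# Trailing-consonant constants as given in the Unicode standard.
T_BASE = 0x11A7   # index 0 means "no trailing consonant"
T_COUNT = 28      # 27 modern trailing consonants plus the empty slot


def is_modern_trailing_consonant(code):
    """Return True only for U+11A8..U+11C2 (both bounds exclusive of TBase and TBase+TCount)."""
    return T_BASE < code < T_BASE + T_COUNT


first = T_BASE + 1
last = T_BASE + T_COUNT - 1
print(hex(first), hex(last))  # 0x11a8 0x11c2
```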
msg295123 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-06-04 11:19
Is anything more needed?
msg295171 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-06-05 07:32
We have moved our code hosting to GitHub. Would you mind turning your patch into a GitHub PR first, Wonsup?
msg295172 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-06-05 08:06
Ok, I'll do it.
msg299214 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-07-26 07:54
Any updates? I need this fix for my project.
msg299657 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-08-02 13:25
I added some test cases for this issue. Could someone please review them?
msg300039 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-08-10 03:46
I think it can be merged. Is there anything I need to do?
msg300046 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-08-10 05:00
Hi Wonsup, sorry for the delay. I have been really busy with work these days. If no one else gets involved, I'll try to find time to review your patch this week.
msg300576 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-08-19 09:54
This patch brings the code in line with the changes made in Unicode 4.1.0.
I think it has been well reviewed and it is time to merge.
Who can commit this patch? 

@animalize says:
Let me give a supplement:

Before Unicode 4.1.0 (draft), the condition was: TBase <= code <= TBase+TCount
see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

Since Unicode 4.1.0, the condition is TBase < code < TBase+TCount, which is in line with the latest version (Unicode 10.0)
see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

This change happened in 2005.
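
To make the two conditions concrete, here is a minimal pure-Python sketch of Hangul composition using the strict comparison (TBase < code < TBase+TCount). It assumes the input is already in canonical decomposed order and handles only Hangul jamo, so it is not a full NFC implementation and is not the code from the patch; `compose_hangul` is a hypothetical name.

```python
# Hangul composition constants as given in the Unicode standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT
S_COUNT = L_COUNT * N_COUNT


def compose_hangul(text):
    """Compose L+V and LV+T jamo sequences; leave everything else untouched."""
    if not text:
        return text
    result = [text[0]]
    for ch in text[1:]:
        last = ord(result[-1])
        code = ord(ch)

        # Case 1: leading consonant followed by a modern vowel -> LV syllable.
        l_index = last - L_BASE
        v_index = code - V_BASE
        if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
            result[-1] = chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)
            continue

        # Case 2: LV syllable followed by a modern trailing consonant -> LVT.
        # Both bounds on t_index are strict, as in the post-4.1.0 condition.
        s_index = last - S_BASE
        t_index = code - T_BASE
        if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
                and 0 < t_index < T_COUNT):
            result[-1] = chr(last + t_index)
            continue

        result.append(ch)
    return "".join(result)


print(compose_hangul("\u1100\u1161\u11a8") == "\uac01")        # composes fully
print(compose_hangul("\u1100\u1175\u11a7") == "\uae30\u11a7")  # U+11A7 stays separate
print(compose_hangul("\u1100\u1176\u11a8") == "\u1100\u1176\u11a8")  # unchanged
```

With the older non-strict comparison, U+11A7 (index 0) and U+11C3 (index TCount) would both be absorbed into the preceding syllable, which corresponds to the behavior reported in this issue.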
msg300933 - (view) Author: Wonsup Yoon (pusnow) * Date: 2017-08-28 02:41
Hello?
History
Date User Action Args
2017-08-28 02:41:24 pusnow set messages: + msg300933
2017-08-19 09:54:09 pusnow set messages: + msg300576
2017-08-10 05:00:52 xiang.zhang set messages: + msg300046
2017-08-10 04:59:30 xiang.zhang set files: - 800.jpg
2017-08-10 04:11:28 高可爱 set files: + 800.jpg
2017-08-10 03:46:54 pusnow set messages: + msg300039
2017-08-02 13:25:40 pusnow set messages: + msg299657
2017-07-26 07:54:11 pusnow set messages: + msg299214
2017-06-05 15:48:57 pusnow set pull_requests: + pull_request2029
2017-06-05 15:46:27 pusnow set title: bug in unicodedata.normalize: u1176, u11a7 and u11c3 -> bugs in unicodedata.normalize: u1176, u11a7 and u11c3
2017-06-05 08:06:08 pusnow set messages: + msg295172
2017-06-05 07:32:39 xiang.zhang set messages: + msg295171
2017-06-04 11:19:17 pusnow set messages: + msg295123
2017-03-11 12:55:26 serhiy.storchaka set nosy: + lemburg, loewis
stage: patch review
type: behavior
versions: + Python 3.5, Python 3.7
2017-03-11 12:33:28 pusnow set title: bug in unicodedata.normalize: u1176 -> bug in unicodedata.normalize: u1176, u11a7 and u11c3
2017-02-06 05:47:24 pusnow set files: + u11a7u11c3.patch
messages: + msg287079
2017-02-06 05:21:48 xiang.zhang set nosy: + xiang.zhang
messages: + msg287078
2017-02-06 04:27:52 pusnow create