Issue 43925: Add hangul syllables to unicodedata.decomposition

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/88091

classification

Title:	Add hangul syllables to unicodedata.decomposition
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, frederic.grosshans, terry.reedy
Priority:	normal	Keywords:

Created on 2021-04-23 17:39 by frederic.grosshans, last changed 2022-04-11 14:59 by admin.

Messages (2)
msg391715	Author: Frédéric Grosshans-André (frederic.grosshans)	Date: 2021-04-23 17:39
Currently (Python 3.8.6, unidata_version 12.1.0) unicodedata.decomposition outputs an empty string for Hangul syllables (code points in the AC00..D7A3 range), although their decomposition is not empty: it is always two characters (either an LV syllable and a T jamo, or an L jamo and a V jamo). This decomposition can be derived algorithmically (see §3.12 of the Unicode Standard). A Python version of the algorithm is below (I don’t know C, so I can’t propose a patch). For each Hangul syllable hs, I have used unicodedata.normalize to check that the NFC of the decomposition is indeed hs, that the decomposition is two code points long, and that the NFD of hs and the NFD of the decomposition coincide.

    def hangulsyllabledecomposition(c):
        if not 0xAC00 <= ord(c) <= 0xD7A3:
            raise ValueError('only Hangul syllables allowed')
        dLV, T = divmod(ord(c) - 0xAC00, 28)
        if T != 0:
            # It is an LVT syllable, decomposed into LV (= 0xAC00 + dLV*28) and T
            return f'{0xAC00+dLV*28:04X} {0x11A7+T:04X}'
        else:
            # It is an LV syllable, decomposed into L and V
            L, V = divmod(dLV, 21)
            return f'{0x1100+L:04X} {0x1161+V:04X}'

    # Constants used:
    # ==============
    # 0xAC00 : first syllable == 1st LV syllable
    #          NB: there is one LV syllable every 28 code points
    # 0xD7A3 : last Hangul syllable
    # 0x1100 : first L jamo
    # 0x1161 : first V jamo
    # 0x11A7 : one before the 1st T jamo (0x11A8), since T=0 means no trailing consonant
    #
    # (all numbers below are restricted to modern jamos, where this algorithm is relevant)
    # 19 : number of L jamos (not used here)
    # 21 : number of V jamos
    # 28 : number of T jamos plus one (since there is no T jamo for an LV syllable)
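The three checks described in this message can be sketched as follows. This is an illustrative reconstruction, not the reporter's original script: the names hangul_decomposition, check_all_syllables, S_BASE and S_LAST are hypothetical, and the decomposition function is restated only so that the block runs on its own.

```python
import unicodedata

S_BASE, S_LAST = 0xAC00, 0xD7A3  # first and last precomposed Hangul syllable


def hangul_decomposition(c):
    """Return the two-code-point decomposition as a hex string (hypothetical helper)."""
    if not S_BASE <= ord(c) <= S_LAST:
        raise ValueError('only Hangul syllables allowed')
    d_lv, t = divmod(ord(c) - S_BASE, 28)
    if t != 0:
        # LVT syllable: LV syllable followed by a T jamo
        return f'{S_BASE + d_lv * 28:04X} {0x11A7 + t:04X}'
    # LV syllable: L jamo followed by a V jamo
    l, v = divmod(d_lv, 21)
    return f'{0x1100 + l:04X} {0x1161 + v:04X}'


def check_all_syllables():
    """Run the three checks on every syllable; return how many were checked."""
    checked = 0
    for cp in range(S_BASE, S_LAST + 1):
        hs = chr(cp)
        decomposed = ''.join(chr(int(p, 16)) for p in hangul_decomposition(hs).split())
        # 1. the decomposition is exactly two code points long
        assert len(decomposed) == 2
        # 2. recomposing (NFC) gives back the original syllable
        assert unicodedata.normalize('NFC', decomposed) == hs
        # 3. full decomposition (NFD) agrees for both forms
        assert unicodedata.normalize('NFD', decomposed) == unicodedata.normalize('NFD', hs)
        checked += 1
    return checked
```

Calling check_all_syllables() walks the whole AC00..D7A3 range and raises AssertionError at the first syllable that fails any of the three checks.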
msg391830	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-04-25 01:27
I verified the claim in 3.10.0a7, freshly compiled today.

    >>> import unicodedata as ud
    >>> ud.decomposition('\uac00')
    ''
    >>> for cp in range(0xac00, 0xd7a4):
            if (s := ud.decomposition(chr(cp))) != '':
                print(cp, s)

    >>>
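Until the enhancement is implemented, a caller could get the requested behavior with a thin wrapper. The following is only a sketch: decomposition_with_hangul is a hypothetical name, not part of unicodedata, and it assumes the two-code-point mapping proposed in the first message is the desired output format.

```python
import unicodedata


def decomposition_with_hangul(c):
    """Like unicodedata.decomposition, but non-empty for Hangul syllables.

    Hypothetical wrapper: the Hangul branch applies the arithmetic from the
    first message; everything else is delegated to unicodedata unchanged.
    """
    cp = ord(c)
    if 0xAC00 <= cp <= 0xD7A3:
        d_lv, t = divmod(cp - 0xAC00, 28)
        if t != 0:
            # LVT syllable -> LV syllable + T jamo
            return f'{0xAC00 + d_lv * 28:04X} {0x11A7 + t:04X}'
        # LV syllable -> L jamo + V jamo
        l, v = divmod(d_lv, 21)
        return f'{0x1100 + l:04X} {0x1161 + v:04X}'
    return unicodedata.decomposition(c)
```

Because the Hangul range is handled before delegating, the wrapper does not depend on what unicodedata.decomposition itself returns for those code points.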

History
Date	User	Action	Args
2022-04-11 14:59:44	admin	set	github: 88091
2021-04-27 14:21:34	vstinner	set	nosy: - vstinner
2021-04-25 01:27:57	terry.reedy	set	versions: + Python 3.11, - Python 3.8 nosy: + terry.reedy messages: + msg391830 type: enhancement
2021-04-23 17:39:30	frederic.grosshans	create