Issue 10459: missing character names in unicodedata (CJK...)

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54668

classification

Title:	missing character names in unicodedata (CJK...)
Type:	behavior	Stage:
Components:	Library (Lib), Unicode	Versions:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, lemburg, loewis, vbr
Priority:	normal	Keywords:

Created on 2010-11-19 14:36 by vbr, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (6)
msg121521 - (view)	Author: Vlastimil Brom (vbr)	Date: 2010-11-19 14:36
I just noticed an omission of some character names in the unicodedata module. These are some CJK ideographs:

龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

The names are probably to be generated, e.g. CJK UNIFIED IDEOGRAPH-2A700 etc.

(Tested with a recompiled unicodedata using Unicode 6.0; with the Python 2.7 built-in module (unidata_version: '5.2.0') only the first two ranges are relevant, as CJK Unified Ideographs Extension D is an addition of Unicode 6.)

(Also there are the unprintable ASCII controls, surrogates and private use areas, where the missing names are probably ok.)

I tested with the following rather clumsy code:

# # # # # # # # # # # # # # #
import unicodedata

# wide_unichr = custom unichr emulating unicode ranges beyond FFFF on narrow python build
codepoints_missing_char_names = [[-2, -2],]  # dummy first entry, skipped when printing
for i in xrange(0x10FFFF+1):
    # skip the 'C' categories (controls, surrogates, private use, unassigned)
    if unicodedata.category(wide_unichr(i))[:1] != 'C' and unicodedata.name(wide_unichr(i), u"??noname??") == u"??noname??":
        if codepoints_missing_char_names[-1][1] == i-1:
            # extend the current run of consecutive unnamed code points
            codepoints_missing_char_names[-1][1] = i
        else:
            codepoints_missing_char_names.append([i, i])

for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first), wide_unichr(last), hex(last),)
# # # # # # # # # # # # # # # # # # # # # # # # # #

Unfortunately, I can't provide a fix, as unicodedata involves C code, where my knowledge is near zero.

vbr
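For readers on Python 3, the scan described in this message can be sketched as below. This is only an approximation of the reporter's approach, not his code: the function name `unnamed_ranges` and its `start`/`stop` parameters are assumptions made for illustration, and `chr()` is used directly because Python 3 has no narrow builds that would need a `wide_unichr` helper.

```python
import unicodedata


def unnamed_ranges(start=0, stop=0x110000):
    """Return [first, last] runs of code points that are not in a 'C'
    category (controls, surrogates, private use, unassigned) but for
    which unicodedata.name() finds no name.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    ranges = []
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.category(ch).startswith("C"):
            continue  # missing names are expected here
        if unicodedata.name(ch, None) is not None:
            continue  # this code point has a name
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp  # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return ranges


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for first, last in unnamed_ranges():
        print("U+%04X - U+%04X" % (first, last))
```

What the full scan prints depends on the interpreter and its bundled Unicode data, so no expected output is given here.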
msg121537 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-19 15:29
Vlastimil Brom wrote:
>
> New submission from Vlastimil Brom <vlastimil.brom@gmail.com>:
>
> I just noticed an omission of some character names in unicodedata module.
> These are some CJK - Ideographs:
>
> 龼 (0x9fbc) - 鿋 (0x9fcb)
> (CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])
>
> 𪜀 (0x2a700) - 𫜴 (0x2b734)
> (CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])
>
> 𫝀 (0x2b740) - 𫠝 (0x2b81d)
> (CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])
>
> The names are probably to be generated - e.g. CJK UNIFIED IDEOGRAPH-2A700 ... etc.

I don't think we should fill those rather big ranges with generated names, unless there's a standard for this.

There are quite a few ranges in the Unicode database that are assigned, but don't have a literal name associated with them.
msg121578 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-11-19 23:26
Marc-Andre: Many of the characters you refer to actually do have names assigned, even if the names don't appear in the Unicode character database. Instead, they are specified in section 4.8 of the Unicode standard, and unicodedata.c already implements that (it just wasn't updated when the ranges changed; I will look into this).
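The derived-name rule mentioned here can be illustrated with a short sketch. The naming pattern (the prefix "CJK UNIFIED IDEOGRAPH-" followed by the code point in uppercase hexadecimal) matches the example given in the original report. The table `CJK_RANGES` and the function `derived_cjk_name` are hypothetical names for this illustration; the table lists only the three ranges quoted in this issue, and is not the table used in unicodedata.c, which covers further ranges.

```python
import unicodedata

# Assumption: only the ranges quoted in this issue (Unicode 6.0 end points).
# The real implementation covers more ranges than these.
CJK_RANGES = [
    (0x4E00, 0x9FCB),    # CJK Unified Ideographs
    (0x2A700, 0x2B734),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81D),  # CJK Unified Ideographs Extension D
]


def derived_cjk_name(cp):
    """Return the rule-derived name for a code point in CJK_RANGES, else None.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    for first, last in CJK_RANGES:
        if first <= cp <= last:
            return "CJK UNIFIED IDEOGRAPH-%04X" % cp
    return None


if __name__ == "__main__":
    for cp in (0x9FBC, 0x2A700, 0x2B740):
        # Compare the derived name with what the interpreter reports.
        print(derived_cjk_name(cp), "|", unicodedata.name(chr(cp), "<no name>"))
```

Because the names follow from the code point alone, an implementation only needs the range boundaries, which is why outdated boundaries were enough to cause the missing names reported above.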
msg121584 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-11-20 00:17
Martin v. Löwis wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> Marc-Andre: Many of the characters you refer to actually do have names assigned, even if the names don't appear in the Unicode character database. Instead, they are specified in section 4.8 of the Unicode standard, and unicodedata.c already implements that (it just wasn't updated when the ranges changed; I will look into this).

Thanks for pointing this out. I wasn't aware of there being a standard for constructing names for CJK ideograph ranges.
msg122100 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-11-22 09:00
For 3.2, this is now fixed in r86681.
msg122107 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-11-22 10:54
The patch for 3.1 is r86685. The patch for 2.7 is r86686.
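One way to confirm the fix on a given interpreter is to check the end points of the three reported ranges. The sketch below is written for Python 3; the names `REPORTED` and `check_reported_names` are assumptions for illustration. On an interpreter whose Unicode data predates 6.0, the Extension D code points would be expected to show up as missing.

```python
import unicodedata

# End points of the three ranges listed in the original report.
REPORTED = [0x9FBC, 0x9FCB, 0x2A700, 0x2B734, 0x2B740, 0x2B81D]


def check_reported_names():
    """Return the reported code points that still lack the derived name.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    missing = []
    for cp in REPORTED:
        expected = "CJK UNIFIED IDEOGRAPH-%04X" % cp
        if unicodedata.name(chr(cp), None) != expected:
            missing.append(cp)
    return missing


if __name__ == "__main__":
    missing = check_reported_names()
    if missing:
        print("Still unnamed:", ", ".join("U+%04X" % cp for cp in missing))
    else:
        print("All reported code points have derived names.")
```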

History
Date	User	Action	Args
2022-04-11 14:57:09	admin	set	github: 54668
2010-11-22 10:54:10	loewis	set	status: open -> closed resolution: fixed messages: + msg122107
2010-11-22 09:00:21	loewis	set	messages: + msg122100
2010-11-20 00:17:05	lemburg	set	messages: + msg121584
2010-11-19 23:26:12	loewis	set	nosy: + loewis messages: + msg121578
2010-11-19 15:29:49	lemburg	set	nosy: + lemburg messages: + msg121537
2010-11-19 14:40:43	vbr	set	nosy: vbr, ezio.melotti type: behavior components: + Library (Lib), Unicode
2010-11-19 14:38:01	ezio.melotti	set	nosy: + ezio.melotti
2010-11-19 14:36:21	vbr	create