classification
Title: missing character names in unicodedata (CJK...)
Type: behavior
Components: Library (Lib), Unicode
process
Status: closed
Resolution: fixed
Nosy List: ezio.melotti, lemburg, loewis, vbr
Priority: normal

Created on 2010-11-19 14:36 by vbr, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (6)
msg121521 - Author: Vlastimil Brom (vbr) Date: 2010-11-19 14:36
I just noticed an omission of some character names in the unicodedata module.
These are some CJK ideographs:

龼 (0x9fbc) - 鿋 (0x9fcb)
 (CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
 (CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

The names should probably be generated, e.g. CJK UNIFIED IDEOGRAPH-2A700 etc.

(Tested with a recompiled unicodedata using Unicode 6.0. With the built-in module of Python 2.7 (unidata_version: '5.2.0'), only the first two ranges are relevant, as CJK Unified Ideographs Extension D is an addition in Unicode 6.0.)

(Also there are the unprintable ASCII controls, surrogates and private use areas, where the missing names are probably ok.)
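
As an illustration of why the test below filters on the general category, here is a minimal Python 3 sketch (not part of the original report). The three sample code points are chosen for illustration only; the sketch shows that controls, surrogates and private use characters all fall under a category starting with 'C' and have no name in unicodedata.

```python
import unicodedata

# Illustrative sample code points (an assumption of this sketch, not from the report).
samples = {0x0007: "control", 0xD800: "surrogate", 0xE000: "private use"}

report = {}
for cp, kind in samples.items():
    ch = chr(cp)
    # name() returns the supplied default (None) when no name is available
    report[cp] = (unicodedata.category(ch), unicodedata.name(ch, None))
    print("U+%04X (%s): category=%s name=%s" % ((cp, kind) + report[cp]))
```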


I tested with the following rather clumsy code:

# # # # # # # # # # # # # # #
import unicodedata

# wide_unichr = custom unichr (not shown) emulating unicode ranges beyond FFFF on narrow python build
NO_NAME = u"??noname??"
codepoints_missing_char_names = [[-2, -2]]  # dummy first entry, skipped when printing
for i in xrange(0x10FFFF + 1):
    char = wide_unichr(i)
    # ignore category C* (controls, surrogates, private use, unassigned)
    if unicodedata.category(char)[:1] != 'C' and unicodedata.name(char, NO_NAME) == NO_NAME:
        if codepoints_missing_char_names[-1][1] == i - 1:
            codepoints_missing_char_names[-1][1] = i  # extend the current run
        else:
            codepoints_missing_char_names.append([i, i])  # start a new run

for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first), wide_unichr(last), hex(last))
# # # # # # # # # # # # # # # # # # # # # # # # # #
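
For readers on Python 3, the same scan can be sketched as follows. This is a hedged rewrite, not the reporter's code: chr() covers the full code point range there, so no wide_unichr helper is needed, and the function name unnamed_ranges is made up for this example.

```python
import unicodedata

def unnamed_ranges(limit=0x110000):
    """Return [first, last] runs of code points below `limit` that are not
    in a C* category and for which unicodedata provides no name."""
    ranges = []
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch)[0] == "C":
            continue  # controls, surrogates, private use, unassigned
        if unicodedata.name(ch, None) is not None:
            continue  # a name is available
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp        # extend the current run
        else:
            ranges.append([cp, cp])   # start a new run
    return ranges

for first, last in unnamed_ranges():
    print("U+%04X - U+%04X" % (first, last))
```

The output depends on the Unicode version bundled with the interpreter (see unicodedata.unidata_version).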

Unfortunately, I can't provide a fix, as unicodedata involves C code, where my knowledge is near zero.

vbr
msg121537 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-19 15:29
I don't think we should fill those rather big ranges with generated
names, unless there's a standard for this. There are quite a
few ranges in the Unicode database that are assigned, but don't
have a literal name associated with them.
msg121578 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-19 23:26
Marc-Andre: Many of the characters you refer to actually do have names assigned, even if the names don't appear in the Unicode character database. Instead, they are specified in section 4.8 of the Unicode standard, and unicodedata.c already implements that (it just wasn't updated when the ranges changed; I will look into this).
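
A minimal Python 3 sketch of the naming pattern discussed here, based on the example name given in the report (CJK UNIFIED IDEOGRAPH-2A700): a fixed prefix followed by the code point in uppercase hexadecimal. The helper derived_cjk_name is hypothetical and does no range checking; which ranges qualify is decided by the Unicode standard and by unicodedata.c, not by this sketch.

```python
import unicodedata

def derived_cjk_name(cp):
    # Hypothetical helper: prefix plus the code point in uppercase hex.
    # It does not check that cp lies in a CJK unified ideograph range.
    return "CJK UNIFIED IDEOGRAPH-%04X" % cp

for cp in (0x4E00, 0x2A700):
    ch = chr(cp)
    # Compare the derived name with what the interpreter's unicodedata reports.
    print("U+%04X" % cp, derived_cjk_name(cp), unicodedata.name(ch, "<no name>"))
```

On an interpreter that includes the fix, unicodedata.lookup() should also accept such derived names and return the corresponding character.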
msg121584 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-11-20 00:17
Thanks for pointing this out. I wasn't aware of there being a standard
for constructing names for CJK ideograph ranges.
msg122100 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-22 09:00
For 3.2, this is now fixed in r86681.
msg122107 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-11-22 10:54
The patch for 3.1 is r86685. The patch for 2.7 is r86686.
History
Date                 User          Action  Args
2022-04-11 14:57:09  admin         set     github: 54668
2010-11-22 10:54:10  loewis        set     status: open -> closed; resolution: fixed; messages: + msg122107
2010-11-22 09:00:21  loewis        set     messages: + msg122100
2010-11-20 00:17:05  lemburg       set     messages: + msg121584
2010-11-19 23:26:12  loewis        set     nosy: + loewis; messages: + msg121578
2010-11-19 15:29:49  lemburg       set     nosy: + lemburg; messages: + msg121537
2010-11-19 14:40:43  vbr           set     nosy: vbr, ezio.melotti; type: behavior; components: + Library (Lib), Unicode
2010-11-19 14:38:01  ezio.melotti  set     nosy: + ezio.melotti
2010-11-19 14:36:21  vbr           create