Message 121521 - Python tracker


Author	vbr
Recipients	vbr
Date	2010-11-19.14:36:21
SpamBayes Score	0.0006816065
Marked as misclassified	No
Message-id	<1290177387.08.0.667845269575.issue10459@psf.upfronthosting.co.za>
In-reply-to

Content

I just noticed an omission of some character names in the unicodedata module.
These are some CJK ideographs:

龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

The names should probably be generated from the code point, e.g. CJK UNIFIED IDEOGRAPH-2A700 etc.
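
As a rough illustration of what "generated" could mean here, the sketch below derives a name from the code point alone. The helper name and the range table are assumptions made for this example; the range ends are simply the last code points listed in the report above, and this is not how unicodedata itself is implemented.

```python
# Hypothetical helper, not part of unicodedata: derive a name from the code point.
# Range ends are the last code points listed in the report above (Unicode 6.0).
CJK_RANGES = [
    (0x4E00, 0x9FCB),    # CJK Unified Ideographs
    (0x2A700, 0x2B734),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81D),  # CJK Unified Ideographs Extension D
]

def generated_cjk_name(codepoint):
    """Return the derived name, or None if the code point is outside the table."""
    for first, last in CJK_RANGES:
        if first <= codepoint <= last:
            return "CJK UNIFIED IDEOGRAPH-%X" % codepoint
    return None

print(generated_cjk_name(0x2A700))  # CJK UNIFIED IDEOGRAPH-2A700
```

A table-driven rule like this would need no per-character name data, which is presumably why these names can be generated rather than stored.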

(Tested with a recompiled unicodedata using Unicode 6.0. With the module built into Python 2.7 (unidata_version: '5.2.0'), only the first two ranges are relevant, as CJK Unified Ideographs Extension D is an addition of Unicode 6.)

(There are also the unprintable ASCII controls, surrogates and private use areas, where the missing names are probably OK.)


I tested with the following rather clumsy code:

# # # # # # # # # # # # # # # 
import unicodedata

def wide_unichr(i):
    # Assumed stand-in for the custom helper (the original was not posted):
    # emulates unichr for code points beyond FFFF on a narrow Python build.
    try:
        return unichr(i)
    except ValueError:
        return ("\\U%08x" % i).decode("unicode-escape")

codepoints_missing_char_names = [[-2, -2]]  # dummy sentinel, skipped when printing
for i in xrange(0x10FFFF + 1):
    char = wide_unichr(i)
    # skip the "C*" categories (controls, surrogates, private use, unassigned)
    if unicodedata.category(char)[:1] != 'C' and unicodedata.name(char, None) is None:
        if codepoints_missing_char_names[-1][1] == i - 1:
            codepoints_missing_char_names[-1][1] = i  # extend the current range
        else:
            codepoints_missing_char_names.append([i, i])  # start a new range

for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first), wide_unichr(last), hex(last))
# # # # # # # # # # # # # # # # # # # # # # # # # # 
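
For comparison, here is a minimal sketch of the same scan for Python 3, where chr() covers the full range and no wide_unichr helper is needed. The function names group_ranges and missing_name_ranges are hypothetical, chosen for this example, and the output prints only the code points in hex to avoid console encoding problems.

```python
import unicodedata

def group_ranges(codepoints):
    """Collapse an ascending sequence of integers into [first, last] pairs."""
    ranges = []
    for cp in codepoints:
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return ranges

def missing_name_ranges(start=0, stop=0x110000):
    """Ranges of code points outside the 'C*' categories that have no name."""
    missing = (
        cp for cp in range(start, stop)
        if unicodedata.category(chr(cp))[0] != "C"
        and unicodedata.name(chr(cp), None) is None
    )
    return group_ranges(missing)

if __name__ == "__main__":
    # What this prints depends on the Unicode data shipped with the interpreter.
    for first, last in missing_name_ranges():
        print("%s - %s" % (hex(first), hex(last)))
```

Separating the grouping from the lookup removes the need for the dummy first entry and lets the scan be limited to a single block, e.g. missing_name_ranges(0x2A700, 0x2B740).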

Unfortunately, I can't provide a fix, as unicodedata involves C code, where my knowledge is near zero.

vbr

History
Date	User	Action	Args
2010-11-19 14:36:27	vbr	set	recipients: + vbr
2010-11-19 14:36:27	vbr	set	messageid: <1290177387.08.0.667845269575.issue10459@psf.upfronthosting.co.za>
2010-11-19 14:36:21	vbr	link	issue10459 messages
2010-11-19 14:36:21	vbr	create