Message121521
I just noticed an omission of some character names in the unicodedata module.
These are some CJK ideographs:
龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])
𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])
𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])
The names should probably be generated algorithmically, e.g. CJK UNIFIED IDEOGRAPH-2A700 etc.
(Tested with a recompiled unicodedata using Unicode 6.0; with the built-in module of Python 2.7 (unidata_version: '5.2.0'), only the first two ranges are relevant, as CJK Unified Ideographs Extension D is an addition in Unicode 6.0.)
(Also there are the unprintable ASCII controls, surrogates and private use areas, where the missing names are probably ok.)
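A minimal sketch of how such generated names could look, assuming the name is simply the fixed prefix followed by the uppercase hexadecimal code point. The helper name cjk_ideograph_name is made up for illustration and is not part of unicodedata; the sketch uses Python 3 syntax and does not show how the C module does it.

```python
import unicodedata


def cjk_ideograph_name(codepoint):
    """Build the generated name for a CJK unified ideograph code point.

    Hypothetical helper for illustration only; it does not check that the
    code point actually lies in one of the CJK ideograph ranges.
    """
    # Uppercase hex, padded to at least four digits
    return "CJK UNIFIED IDEOGRAPH-%04X" % codepoint


print(cjk_ideograph_name(0x2A700))
print(cjk_ideograph_name(0x9FBC))

# Cross-check against a code point whose name unicodedata already knows
print(unicodedata.name(chr(0x4E00)) == cjk_ideograph_name(0x4E00))
```

The last line only compares against U+4E00, the first code point of the basic CJK Unified Ideographs block, which already has a name.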
I tested with the following rather clumsy code:
# # # # # # # # # # # # # # #
import unicodedata

# wide_unichr = custom unichr emulating unicode ranges beyond FFFF on narrow python build
# (not shown here; it has to be defined before running this snippet)
codepoints_missing_char_names = [[-2, -2],]  # dummy first entry, skipped when printing
for i in xrange(0x10FFFF + 1):
    # only look at characters outside the "C" (control, surrogate, private use, unassigned) categories
    if unicodedata.category(wide_unichr(i))[:1] != 'C' and unicodedata.name(wide_unichr(i), u"??noname??") == u"??noname??":
        if codepoints_missing_char_names[-1][1] == i - 1:
            # extend the current run of adjacent code points
            codepoints_missing_char_names[-1][1] = i
        else:
            # start a new run
            codepoints_missing_char_names.append([i, i])
for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first), wide_unichr(last), hex(last),)
# # # # # # # # # # # # # # # # # # # # # # # # # #
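For comparison, a rough Python 3 sketch of the same check. It assumes Python 3.3 or later, where chr() covers the whole code point range and no wide_unichr helper is needed. The function name missing_name_ranges and its start/stop parameters are made up for illustration, and the result depends on the Unicode version of the unicodedata module in use.

```python
import unicodedata


def missing_name_ranges(start=0, stop=0x110000):
    """Collect [first, last] runs of non-"C" code points that have no name.

    Hypothetical helper; start/stop only limit the scanned range.
    """
    ranges = []
    for cp in range(start, stop):
        ch = chr(cp)
        # Skip control, format, surrogate, private use and unassigned code points
        if unicodedata.category(ch).startswith("C"):
            continue
        if unicodedata.name(ch, None) is None:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp  # extend the current run
            else:
                ranges.append([cp, cp])  # start a new run
    return ranges


for first, last in missing_name_ranges():
    print("%s (%s) - %s (%s)" % (chr(first), hex(first), chr(last), hex(last)))
```

Returning the runs as a list, instead of printing inside the loop, avoids the dummy first entry and makes it easy to scan a single block, e.g. missing_name_ranges(0x2A700, 0x2B740).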
Unfortunately, I can't provide a fix, as unicodedata involves C code, where my knowledge is near zero.
vbr

Date | User | Action | Args
2010-11-19 14:36:27 | vbr | set | recipients: + vbr
2010-11-19 14:36:27 | vbr | set | messageid: <1290177387.08.0.667845269575.issue10459@psf.upfronthosting.co.za>
2010-11-19 14:36:21 | vbr | link | issue10459 messages
2010-11-19 14:36:21 | vbr | create |