Author vbr
Recipients vbr
Date 2010-11-19.14:36:21
Message-id <1290177387.08.0.667845269575.issue10459@psf.upfronthosting.co.za>
In-reply-to
Content
I just noticed that some character names are missing from the unicodedata module.
The affected code points are CJK ideographs:

龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

The names should probably be generated from the code point, e.g. CJK UNIFIED IDEOGRAPH-2A700 ... etc.
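
To illustrate what "generated" means here, the following is a minimal sketch, assuming the name is simply the prefix CJK UNIFIED IDEOGRAPH- followed by the code point in uppercase hexadecimal. The helper name and the range table are hypothetical (they are not taken from the unicodedata C source), and the sketch is written for Python 3:

```python
# Hypothetical range table: the three gaps listed in this report.
MISSING_RANGES = [
    (0x9FBC, 0x9FCB),
    (0x2A700, 0x2B734),
    (0x2B740, 0x2B81D),
]

def generated_cjk_name(codepoint):
    """Return the generated name for a code point in one of the
    ranges above, or None if the code point is outside all of them."""
    for first, last in MISSING_RANGES:
        if first <= codepoint <= last:
            return "CJK UNIFIED IDEOGRAPH-%X" % codepoint
    return None
```

For example, generated_cjk_name(0x2A700) gives the name quoted above, and a code point outside the table gives None.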

(Tested with a recompiled unicodedata using Unicode 6.0. With the module built into Python 2.7 (unidata_version: '5.2.0'), only the first two ranges are relevant, as CJK Unified Ideographs Extension D was added in Unicode 6.0.)

(Also there are the unprintable ASCII controls, surrogates and private use areas, where the missing names are probably ok.)
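
A small sketch of why those code points are a separate case: they all fall into a general category starting with "C", which is what the test further down filters on. The helper name describe is hypothetical, and the sketch assumes Python 3 (where chr covers the full range):

```python
import unicodedata

def describe(codepoint):
    """Return (general category, name or None) for one code point."""
    ch = chr(codepoint)
    return unicodedata.category(ch), unicodedata.name(ch, None)

# One sample each: ASCII control (BEL), first surrogate, first private use.
for cp in (0x0007, 0xD800, 0xE000):
    print(hex(cp), describe(cp))
```

Each of the three samples reports a "C" category (Cc, Cs, Co) and no name.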


I tested with the following rather clumsy code:

# # # # # # # # # # # # # # #
import unicodedata

# wide_unichr = custom unichr emulating unicode ranges beyond FFFF on narrow python build
NO_NAME = u"??noname??"  # sentinel returned when a character has no name
codepoints_missing_char_names = [[-2, -2]]  # dummy entry so that [-1] always exists
for i in xrange(0x10FFFF + 1):
    char = wide_unichr(i)
    # skip category C* (controls, surrogates, private use, unassigned)
    if unicodedata.category(char)[:1] != 'C' and unicodedata.name(char, NO_NAME) == NO_NAME:
        if codepoints_missing_char_names[-1][1] == i - 1:
            # extend the current run of consecutive code points
            codepoints_missing_char_names[-1][1] = i
        else:
            # start a new run
            codepoints_missing_char_names.append([i, i])

# skip the dummy entry and print each run as "first (hex) - last (hex)"
for first, last in codepoints_missing_char_names[1:]:
    print u"%s (%s) - %s (%s)" % (wide_unichr(first), hex(first), wide_unichr(last), hex(last))
# # # # # # # # # # # # # # # # # # # # # # # # # #
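
For anyone wanting to repeat the check without a wide_unichr helper, here is a rough Python 3 sketch of the same scan. It is an adaptation, not the code that was actually run for this report; the function names are hypothetical, and what it finds depends on the Unicode database shipped with the interpreter.

```python
import unicodedata

def has_missing_name(codepoint):
    """True if the code point is not in a C* category yet has no name."""
    ch = chr(codepoint)
    if unicodedata.category(ch).startswith("C"):
        return False
    return unicodedata.name(ch, None) is None

def group_runs(codepoints):
    """Merge an ascending sequence of ints into [first, last] runs."""
    runs = []
    for cp in codepoints:
        if runs and runs[-1][1] == cp - 1:
            runs[-1][1] = cp       # extend the current run
        else:
            runs.append([cp, cp])  # start a new run
    return runs

def ranges_missing_names(start=0, stop=0x110000):
    """Scan [start, stop) and return the runs of unnamed code points."""
    return group_runs(cp for cp in range(start, stop) if has_missing_name(cp))
```

Calling ranges_missing_names() with no arguments scans the whole code space; passing a narrower start and stop keeps the run time short when only one block is of interest.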

Unfortunately, I can't provide a fix, as unicodedata involves C code, where my knowledge is near zero.

vbr
History
Date                 User  Action  Args
2010-11-19 14:36:27  vbr   set     recipients: + vbr
2010-11-19 14:36:27  vbr   set     messageid: <1290177387.08.0.667845269575.issue10459@psf.upfronthosting.co.za>
2010-11-19 14:36:21  vbr   link    issue10459 messages
2010-11-19 14:36:21  vbr   create