Author vlbrom
Date 2007-04-21.10:52:05
Content
There seems to be incorrect handling of Unicode characters beyond the BMP (code points above 0xFFFF) in the lookup() function of the unicodedata module on narrow Unicode builds of Python (Python 2.5.1, Windows XPh).

>>> import unicodedata
>>> unicodedata.lookup("GOTHIC LETTER FAIHU")
u'\u0346'
(This should be u'\U00010346'. The leading digits of the code point are truncated, which makes the result ambiguous: in this case u'\u0346' is a different character, the combining diacritic "COMBINING BRIDGE ABOVE".)
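
The truncation can be checked with a little arithmetic. The following is only a sketch, assuming the symptom is equivalent to keeping the low 16 bits of the code point (the actual cause in the C source has not been located). It is written for Python 3, where chr() covers the full code point range regardless of build:

```python
import unicodedata

full = 0x10346              # GOTHIC LETTER FAIHU
truncated = full & 0xFFFF   # keep only the low 16 bits (assumed failure mode)

print(hex(truncated))                     # 0x346
# Name of the unrelated character that the truncated value denotes.
print(unicodedata.name(chr(truncated)))   # COMBINING BRIDGE ABOVE
```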

By contrast, \N{name} escapes in unicode string literals work correctly.

>>> u"\N{GOTHIC LETTER FAIHU}"
u'\U00010346'
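
Since the \N{name} escape gives the right result, one possible workaround is to resolve names through the escape codec instead of calling lookup(). This is only a sketch: lookup_via_escape is a hypothetical helper name, not part of unicodedata, and it assumes the codec is reachable as "unicode_escape".

```python
def lookup_via_escape(name):
    """Hypothetical helper: resolve a character name through the
    unicode_escape codec instead of unicodedata.lookup()."""
    # Build the six-character-plus-name escape text, e.g. \N{GOTHIC LETTER FAIHU}
    escape = "\\N{%s}" % name
    # Decode the ASCII bytes of the escape into the actual character.
    return escape.encode("ascii").decode("unicode_escape")

char = lookup_via_escape("GOTHIC LETTER FAIHU")
print(repr(char))
```

Note that an unknown name raises UnicodeDecodeError here, not the KeyError that lookup() raises, so callers may need to adapt their error handling.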

Unfortunately, I haven't been able to find the problematic pieces of source code, so I'm not able to fix it.

It seems that the correct information for the given code point is found initially, but in the end only the last four hex digits of the code point value are used, as if it were a "narrow" unicode literal \uxxxx instead of \Uxxxxxxxx, while the same task is handled correctly by the unicodeescape codec used for unicode string literals.
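
For reference, a narrow build stores a code point above 0xFFFF as a UTF-16 surrogate pair, which is presumably what lookup() would need to return here. The sketch below shows the standard UTF-16 split; to_surrogate_pair is a hypothetical helper name and is not taken from the Python source.

```python
def to_surrogate_pair(code_point):
    """Hypothetical helper: split a code point above 0xFFFF into the
    UTF-16 surrogate pair a narrow build stores it as."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair(0x10346)
print("%04X %04X" % (high, low))  # D800 DF46
```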

vbr

History
Date User Action Args
2007-08-23 14:53:16 admin link issue1704793 messages
2007-08-23 14:53:16 admin create