The behaviour is technically correct, but confusing and unfortunate, and I don't think we can fix it.
Unicode does not define names for the ASCII control characters, but it does define aliases for them, based on the C0 control character standard.
unicodedata.lookup() looks for aliases as well as names (since Python 3.3).
https://www.unicode.org/Public/UNIDATA/UnicodeData.txt
https://www.unicode.org/Public/UNIDATA/NameAliases.txt
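For example, on a recent CPython (the exact messages may differ slightly between versions; the bare lookup/name spelling below assumes from unicodedata import lookup, name):

>>> from unicodedata import lookup, name
>>> lookup('NUL')        # resolved via the abbreviation alias
'\x00'
>>> name('\x00')         # no official name, so this raises
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name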
It is unfortunate that we have only a single function for looking up a Unicode code point by name, alias, alias abbreviation, or named sequence. That keeps the API simple, but in corner cases like this it leads to confusion.
The obvious "fix" is to make name() return the alias if there is no official name to return, but that is a change in behaviour. I have code that assumes that C0 and C1 control characters have no name, and relies on name() raising an exception for them.
Even if we changed the behaviour to return the alias, which one should be returned: the full alias or the abbreviation?
This doesn't fix the problem that name() and lookup() aren't inverses of each other:
lookup('NUL') -> '\0'  # using the abbreviated alias
name('\0') -> 'NULL' # returns the full alias (or vice versa)
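Both spellings of the alias already resolve to the same code point, so whichever one name() returned, the round trip through the other could not be an identity:

>>> lookup('NUL') == lookup('NULL') == '\x00'
True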
It gets worse with named sequences:
>>> c = lookup('LATIN CAPITAL LETTER A WITH MACRON AND GRAVE')
>>> name(c)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: name() argument 1 must be a unicode character, not str
>>> len(c)
2
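The two code points in the sequence do each have a name of their own (per NamedSequences.txt the sequence is U+0100 followed by U+0300), but name() only accepts a single code point:

>>> [name(ch) for ch in c]
['LATIN CAPITAL LETTER A WITH MACRON', 'COMBINING GRAVE ACCENT']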
So we cannot possibly make name() and lookup() inverses of each other.
What we really should have had is separate functions for name and alias lookups, or better still, the raw Unicode tables exposed as mappings so that people could build their own higher-level interfaces.
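As a rough illustration of the second option, a mapping over the alias table is only a few lines. This is a hypothetical sketch, assuming a local copy of NameAliases.txt (each data line has the form "codepoint;alias;type"):

from collections import defaultdict

def load_name_aliases(path="NameAliases.txt"):
    """Return a {character: [(alias, alias_type), ...]} mapping."""
    aliases = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip comments and blank lines
            codepoint, alias, alias_type = line.split(";")
            aliases[chr(int(codepoint, 16))].append((alias, alias_type))
    return dict(aliases)

# aliases = load_name_aliases()
# aliases['\x00']  ->  [('NULL', 'control'), ('NUL', 'abbreviation')]
# Callers can then decide for themselves whether they want the full
# alias or the abbreviation.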