Issue 46947: unicodedata.name gives ValueError for control characters

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/91103

classification

Title:	unicodedata.name gives ValueError for control characters
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.10

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Unicodedata module should provide access to codepoint aliases View: 18234
Assigned To:		Nosy List:	serhiy.storchaka, snoopyjc, steven.daprano
Priority:	normal	Keywords:

Created on 2022-03-07 15:20 by snoopyjc, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg414672 - (view)	Author: Joe Cool (snoopyjc)	Date: 2022-03-07 15:20
unicodedata.name gives ValueError for control characters, for example:

>>> unicodedata.name('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name('\t')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

Where unicodedata.lookup clearly knows the names for these characters:

>>> unicodedata.lookup('NULL')
'\x00'
>>> unicodedata.lookup('TAB')
'\t'
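For readers who only need to avoid the exception, a minimal sketch follows. It relies on the optional second argument of unicodedata.name(), which is returned instead of raising ValueError. The helper name safe_name and the "<unnamed>" placeholder are assumptions made for this illustration, not part of the report.

```python
import unicodedata


def safe_name(ch, default="<unnamed>"):
    """Return the Unicode name of ch, or default if it has no name.

    safe_name and the "<unnamed>" placeholder are hypothetical; only
    unicodedata.name(chr, default) comes from the standard library.
    """
    return unicodedata.name(ch, default)


for ch in ("\x00", "\t", "A"):
    # Control characters fall back to the placeholder instead of raising.
    print(repr(ch), safe_name(ch))
```

This does not recover the alias that lookup() accepts; it only sidesteps the ValueError.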
msg414698 - (view)	Author: Joe Cool (snoopyjc)	Date: 2022-03-07 20:12
Note: This is an issue for all chars in the ordinal range 0 thru 31.
msg414708 - (view)	Author: Steven D'Aprano (steven.daprano) *	Date: 2022-03-07 23:15
The behaviour is technically correct, but confusing and unfortunate, and I don't think we can fix it.

Unicode does not define names for the ASCII control characters. But it does define aliases for them, based on the C0 control char standard. unicodedata.lookup() looks for aliases as well as names (since version 3.3).

https://www.unicode.org/Public/UNIDATA/UnicodeData.txt
https://www.unicode.org/Public/UNIDATA/NameAliases.txt

It is unfortunate that we have only a single function for looking up a unicode code point by name, alias, alias-abbreviation, and named-sequence. That keeps the API simple, but in corner cases like this it leads to confusion.

The obvious "fix" is to make name() return the alias if there is no official name to return, but that is a change in behaviour. I have code that assumes that C0 and C1 control characters have no name, and relies on name() raising an exception for them.

Even if we changed the behaviour to return the alias, which alias should be returned, the full alias or the abbreviation? This doesn't fix the problem that name() and lookup() aren't inverses of each other:

lookup('NUL') -> '\0'  # using the abbreviated alias
name('\0') -> 'NULL'  # returns the full alias (or vice versa)

It gets worse with named sequences:

>>> c = lookup('LATIN CAPITAL LETTER A WITH MACRON AND GRAVE')
>>> name(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: name() argument 1 must be a unicode character, not str
>>> len(c)
2

So we cannot possibly make name() and lookup() inverses of each other.

What we really should have had is separate functions for name and alias lookups, or better still, to expose the raw unicode tables as mappings and let people create their own higher-level interfaces.
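The round-trip failure described above can be checked with a short sketch. The helper round_trips is a hypothetical name introduced here; it only reports whether name(lookup(x)) gives x back, treating both exceptions mentioned in this thread as a failed round trip.

```python
import unicodedata


def round_trips(label):
    """Return True if name(lookup(label)) == label.

    round_trips is a hypothetical helper for illustration. ValueError
    covers code points with no name; TypeError covers named sequences,
    where lookup() returns a string longer than one character.
    """
    chars = unicodedata.lookup(label)
    try:
        return unicodedata.name(chars) == label
    except (ValueError, TypeError):
        return False


for label in (
    "LATIN SMALL LETTER A",                          # ordinary name
    "NULL",                                          # alias of a control character
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE",  # named sequence
):
    print(label, round_trips(label))
```

Only the ordinary character name survives the round trip; the alias and the named sequence both fail, for the two different reasons given in the message.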
msg414710 - (view)	Author: Joe Cool (snoopyjc)	Date: 2022-03-08 01:21
My recommendation would be to add a keyword parameter, defaulting to False, to name(), something like give_full_alias, or maybe errors="give_full_alias" like the IO functions. In the meantime, as the author of perllib, I had to make my own dict to return to the user the same thing Perl does, which is the full alias for these.
msg414736 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-03-08 09:21
This is a duplicate of issue18234.

History
Date	User	Action	Args
2022-04-11 14:59:57	admin	set	github: 91103
2022-03-08 09:21:03	serhiy.storchaka	set	status: open -> closed superseder: Unicodedata module should provide access to codepoint aliases nosy: + serhiy.storchaka messages: + msg414736 resolution: duplicate stage: resolved
2022-03-08 01:21:08	snoopyjc	set	messages: + msg414710
2022-03-07 23:15:43	steven.daprano	set	nosy: + steven.daprano messages: + msg414708
2022-03-07 20:12:43	snoopyjc	set	messages: + msg414698
2022-03-07 15:20:26	snoopyjc	create