This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: unicodedata.name gives ValueError for control characters
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.10
Status: closed Resolution: duplicate
Dependencies: Superseder: Unicodedata module should provide access to codepoint aliases
View: 18234
Assigned To: Nosy List: serhiy.storchaka, snoopyjc, steven.daprano
Priority: normal Keywords:

Created on 2022-03-07 15:20 by snoopyjc, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg414672 - Author: Joe Cool (snoopyjc) Date: 2022-03-07 15:20
unicodedata.name() gives ValueError for control characters, for example:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

Where unicodedata.lookup clearly knows the names for these characters:
>>> unicodedata.lookup('NULL')
>>> unicodedata.lookup('TAB')
msg414698 - Author: Joe Cool (snoopyjc) Date: 2022-03-07 20:12
Note: This is an issue for all chars in the ordinal range 0 thru 31.
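
A minimal sketch to check the claim above on a local interpreter. It assumes only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

# Collect every code point in 0..31 for which name() raises ValueError.
unnamed = []
for cp in range(32):
    try:
        unicodedata.name(chr(cp))
    except ValueError:
        unnamed.append(cp)

print(len(unnamed))

# name() also takes an optional default that is returned instead of raising.
fallback = unicodedata.name("\t", "<no name>")
print(fallback)
```

Passing a default as the second argument is the documented way to avoid the exception without changing what name() considers a name.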
msg414708 - Author: Steven D'Aprano (steven.daprano) (Python committer) Date: 2022-03-07 23:15
The behaviour is technically correct, but confusing and unfortunate, and I don't think we can fix it.

Unicode does not define names for the ASCII control characters. But it does define aliases for them, based on the C0 control char standard.

unicodedata.lookup() looks for aliases as well as names (since version 3.3).

It is unfortunate that we have only a single function for looking up a unicode code point by name, alias, alias-abbreviation, and named-sequence. That keeps the API simple, but in corner cases like this it leads to confusion.
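
As a rough illustration of that single entry point, the sketch below calls lookup() with an official name, a full alias, and an abbreviated alias. The specific strings are my own choices for the example, not taken from the thread.

```python
import unicodedata

# Official character name.
by_name = unicodedata.lookup("LATIN SMALL LETTER A")

# Full alias of a C0 control character (it has no official name).
by_alias = unicodedata.lookup("NULL")

# Abbreviated alias of the same character.
by_abbrev = unicodedata.lookup("NUL")

print(repr(by_name), repr(by_alias), repr(by_abbrev))
```

Both alias forms resolve to the same code point, yet name() has no way to report either of them.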

The obvious "fix" is to make name() return the alias if there is no official name to return, but that is a change in behaviour. I have code that assumes that C0 and C1 control characters have no name, and relies on name() raising an exception for them.

Even if we changed the behaviour to return the alias, which alias should be returned, the full alias or the abbreviation?

This doesn't fix the problem that name() and lookup() aren't inverses of each other:

lookup('NUL') -> '\0'  # using the abbreviated alias
name('\0') -> 'NULL'  # returns the full alias (or vice versa)

It gets worse with named sequences:

>>> name(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: name() argument 1 must be a unicode character, not str
>>> len(c)

So we cannot possibly make name() and lookup() inverses of each other.
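
A small sketch of the named-sequence case. The sequence name used here, LATIN SMALL LETTER R WITH TILDE, is an assumption chosen for illustration and is not necessarily the one used in the session above.

```python
import unicodedata

# A named sequence maps one name to a string of several code points.
c = unicodedata.lookup("LATIN SMALL LETTER R WITH TILDE")
print(len(c))

# name() accepts exactly one character, so it cannot invert this lookup.
try:
    unicodedata.name(c)
    outcome = "returned"
except TypeError:
    outcome = "TypeError"

print(outcome)
```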

What we really should have had is separate functions for name and alias lookups, or better still, to expose the raw unicode tables as mappings and let people create their own higher-level interfaces.
msg414710 - Author: Joe Cool (snoopyjc) Date: 2022-03-08 01:21
My recommendation would be to add a keyword parameter, defaulting to False, to name(), something like give_full_alias, or maybe errors="give_full_alias" like the IO functions.

In the meantime, as the author of perllib, I had to make my own dict to return to the user the same thing Perl does, which is the full alias for these.
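
One way such a dict-based workaround might look, as a sketch only. The helper name_or_alias and the table C0_FULL_ALIASES are hypothetical names, the table is hand-written and deliberately partial, and this is not the perllib implementation.

```python
import unicodedata

# Hypothetical, hand-written and partial table of full aliases for a few
# C0 control characters. A complete table would cover all of 0..31.
C0_FULL_ALIASES = {
    "\x00": "NULL",
    "\x08": "BACKSPACE",
    "\t": "CHARACTER TABULATION",
    "\n": "LINE FEED",
    "\r": "CARRIAGE RETURN",
    "\x1b": "ESCAPE",
}


def name_or_alias(ch, default=None):
    """Return the official name of ch, else an alias from the table above."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        if ch in C0_FULL_ALIASES:
            return C0_FULL_ALIASES[ch]
        if default is not None:
            return default
        raise


print(name_or_alias("\0"))
print(name_or_alias("a"))
```

Because lookup() already accepts these aliases, each table entry can be checked against it, which keeps a hand-maintained table from drifting.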
msg414736 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2022-03-08 09:21
This is a duplicate of issue18234.
History
Date                 User              Action  Args
2022-04-11 14:59:57  admin             set     github: 91103
2022-03-08 09:21:03  serhiy.storchaka  set     status: open -> closed
                                               superseder: Unicodedata module should provide access to codepoint aliases
                                               nosy: + serhiy.storchaka
                                               messages: + msg414736
                                               resolution: duplicate
                                               stage: resolved
2022-03-08 01:21:08  snoopyjc          set     messages: + msg414710
2022-03-07 23:15:43  steven.daprano    set     nosy: + steven.daprano
                                               messages: + msg414708
2022-03-07 20:12:43  snoopyjc          set     messages: + msg414698
2022-03-07 15:20:26  snoopyjc          create