classification
Title: unicodedata.name() doesn't have names for control characters
Type: behavior Stage:
Components: Library (Lib), Unicode Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, ezio.melotti, r.david.murray, zwol
Priority: normal Keywords:

Created on 2016-07-12 13:10 by zwol, last changed 2021-03-08 20:13 by vstinner.

Messages (4)
msg270242 - Author: Zack Weinberg (zwol) * Date: 2016-07-12 13:10
unicodedata.name() doesn't have name information for the C0 and C1 control characters.  To see this, run

import pprint, unicodedata
pprint.pprint(["U+{:04X} {}".format(n, unicodedata.name(chr(n), "<missing>")) for n in range(256)])

and you will observe <missing> printed for U+0000 through U+001F and U+007F through U+009F.  These characters do have official Unicode names and they should be known to name().

I may see if I can come up with a patch for this one, in my copious free time.
msg270245 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-07-12 13:35
That information is programmatically generated from data files obtained from the Unicode project, as far as I know.
msg270247 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-07-12 15:08
Character names are in field 1 of UnicodeData.txt [1][2]. For controls the name is just "<control>". In Tools/unicode/makeunicodedata.py, the makeunicodename function skips names that start with "<". Instead of skipping the character, it could fall back on the Unicode 1.0 name (field 10), if it's defined. For controls, this is the ISO 6429 name:

    (10) Old name as published in Unicode 1.0 or ISO 6429 names 
    for control functions. This field is empty unless it is 
    significantly different from the current name for the 
    character. No longer used in code chart production. See 
    Name_Alias. 
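
A minimal sketch of that fallback, assuming records are split on ";" as in UnicodeData.txt. The helper name effective_name and the three sample records are illustrative only; this is not the code in makeunicodename:

```python
# Illustrative UnicodeData.txt-style records: 15 semicolon-separated fields,
# field 1 is the name and field 10 is the Unicode 1.0 / ISO 6429 name.
SAMPLE_RECORDS = """\
0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0080;<control>;Cc;0;BN;;;;;N;;;;;
"""


def effective_name(record):
    """Return (code point, name), falling back on field 10 for "<...>" names.

    Hypothetical helper: the name is None when field 1 starts with "<"
    and field 10 is empty, which corresponds to skipping the character.
    """
    fields = record.split(";")
    codepoint = int(fields[0], 16)
    name, old_name = fields[1], fields[10]
    if name.startswith("<"):
        name = old_name or None
    return codepoint, name


names = dict(effective_name(r) for r in SAMPLE_RECORDS.splitlines())
for cp, name in sorted(names.items()):
    print("U+{:04X} {}".format(cp, name))
```

In this sample U+0080 still ends up without a name, because its field 10 is empty.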

The names of control characters are also in NameAliases.txt, which gets processed as the unicode.aliases list of (name, char) tuples.
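
Because the aliases are compiled in, the lookup direction should already accept a control name even though name() does not return one. A small check, assuming Python 3.3 or later (where unicodedata.lookup() gained alias support):

```python
import unicodedata

# lookup() resolves names through the alias table as well as the main names.
tab = unicodedata.lookup("CHARACTER TABULATION")
print(repr(tab))

# name() is the reverse direction; the default is returned when no name is stored.
print(unicodedata.name(tab, "<missing>"))
```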

[1]: http://www.unicode.org/reports/tr44/#UnicodeData.txt
[2]: http://www.unicode.org/Public/8.0.0/ucd
msg270254 - Author: Zack Weinberg (zwol) * Date: 2016-07-12 16:04
It looks to me as if NameAliases.txt is the better reference for the C0 and C1 controls.  It matches the UnicodeData.txt field 10 names for most entries where the field 1 name is "<control>", but it has names for U+0080, U+0081, U+0084, and U+0099, which have no field 10 name.  The only catch is that NameAliases may have *several* names for the same character, with the same category tag, e.g.

0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control

It probably makes sense to consistently use the first listed.
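
A rough sketch of the "first listed" rule, assuming the code;name;type layout shown above. The function name first_aliases, its wanted_types parameter, and the abbreviation line in the sample are illustrative, not taken from any existing tool:

```python
# Illustrative NameAliases.txt-style lines: code point; alias; type.
SAMPLE_ALIASES = """\
# comment lines and blank lines are ignored

0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
"""


def first_aliases(lines, wanted_types=("control",)):
    """Map each code point to its first listed alias of a wanted type."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, name, alias_type = line.split(";")
        if alias_type in wanted_types:
            # setdefault keeps the first alias seen and ignores later ones
            result.setdefault(int(code, 16), name)
    return result


control_names = first_aliases(SAMPLE_ALIASES.splitlines())
print(control_names)
```

Which alias types to accept is left as a parameter here, since that choice is not settled in this discussion.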
History
Date                 User            Action  Args
2021-03-08 20:13:44  vstinner        set     nosy: - vstinner
2021-02-26 18:08:26  eryksun         set     versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.5, Python 3.6
2016-07-12 17:30:10  eryksun         set     versions: + Python 2.7, Python 3.6
2016-07-12 17:29:45  eryksun         set     nosy: + ezio.melotti, vstinner
                                             components: + Unicode
2016-07-12 16:04:24  zwol            set     messages: + msg270254
2016-07-12 15:08:29  eryksun         set     nosy: + eryksun
                                             messages: + msg270247
2016-07-12 13:35:01  r.david.murray  set     nosy: + r.david.murray
                                             messages: + msg270245
2016-07-12 13:10:49  zwol            create