Issue 27496: unicodedata.name() doesn't have names for control characters

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/71683

classification

Title:	unicodedata.name() doesn't have names for control characters
Type:	behavior	Stage:
Components:	Library (Lib), Unicode	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	eryksun, ezio.melotti, r.david.murray, zwol
Priority:	normal	Keywords:

Created on 2016-07-12 13:10 by zwol, last changed 2022-04-11 14:58 by admin.

Messages (4)
msg270242 - (view)	Author: Zack Weinberg (zwol) *	Date: 2016-07-12 13:10
unicodedata.name() doesn't have name information for the C0 and C1 control characters. To see this, run

    import pprint, unicodedata
    pprint.pprint(["U+{:04X} {}".format(n, unicodedata.name(chr(n), "<missing>")) for n in range(256)])

and you will observe <missing> printed for U+0000 through U+001F and U+007F through U+009F. These characters do have official Unicode names and they should be known to name(). I may see if I can come up with a patch for this one, in my copious free time.
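A compact way to check the report is to collect only the code points that have no name rather than printing all 256 entries. The following is a minimal sketch; unnamed_code_points is a hypothetical helper name, not part of unicodedata.

```python
import unicodedata


def unnamed_code_points(limit=256):
    """Return the code points below `limit` for which unicodedata.name() has no name.

    Hypothetical helper for illustration only.
    """
    missing = []
    for n in range(limit):
        # With a default supplied, name() returns it instead of raising ValueError.
        if unicodedata.name(chr(n), None) is None:
            missing.append(n)
    return missing


missing = unnamed_code_points()
print(", ".join("U+{:04X}".format(n) for n in missing))
```

On an interpreter showing the behavior described above, the printed list should cover the C0 and C1 ranges and U+007F.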
msg270245 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-07-12 13:35
That information is programmatically generated from data files obtained from the Unicode project, as far as I know.
msg270247 - (view)	Author: Eryk Sun (eryksun) *	Date: 2016-07-12 15:08
Character names are in field 1 of UnicodeData.txt [1][2]. For controls the name is just "<control>". In Tools/unicode/makeunicodedata.py, the makeunicodename function skips names that start with "<". Instead of skipping the character, it could fall back on the Unicode 1.0 name (field 10), if it's defined. For controls, this is the ISO 6429 name:

    (10) Old name as published in Unicode 1.0 or ISO 6429 names
    for control functions. This field is empty unless it is
    significantly different from the current name for the
    character. No longer used in code chart production. See
    Name_Alias.

The names of control characters are also in NameAliases.txt, which gets processed as the unicode.aliases list of (name, char) tuples.

[1]: http://www.unicode.org/reports/tr44/#UnicodeData.txt
[2]: http://www.unicode.org/Public/8.0.0/ucd
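The fallback suggested in this message can be sketched roughly as follows. This is not the code of the generator script; name_with_fallback and SAMPLE_LINES are hypothetical names, and the sample records are written in the semicolon-separated layout of UnicodeData.txt for illustration.

```python
# Sample records in the UnicodeData.txt layout (field 1 = name, field 10 = Unicode 1.0 name).
SAMPLE_LINES = [
    "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0080;<control>;Cc;0;BN;;;;;N;;;;;",
]


def name_with_fallback(line):
    """Return (code point, name), preferring field 1 and falling back to field 10.

    Hypothetical helper: a name of None means neither field supplied a usable name.
    """
    fields = line.split(";")
    code = int(fields[0], 16)
    name = fields[1]
    if name.startswith("<"):
        # Placeholder such as "<control>": try the Unicode 1.0 / ISO 6429 name.
        name = fields[10] or None
    return code, name


for record in SAMPLE_LINES:
    print(name_with_fallback(record))
```

A record whose field 10 is also empty, such as the U+0080 sample, still ends up without a name under this approach.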
msg270254 - (view)	Author: Zack Weinberg (zwol) *	Date: 2016-07-12 16:04
It looks to me as if NameAliases.txt is the better reference for the C0 and C1 controls. It matches the UnicodeData.txt field 10 names for most entries where the field 1 name is "<control>", but it has names for U+0080, U+0081, U+0084, and U+0099, which have no field 10 name. The only catch is that NameAliases may have several names for the same character, with the same category tag, e.g.

    0009;CHARACTER TABULATION;control
    0009;HORIZONTAL TABULATION;control

It probably makes sense to consistently use the first listed.
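The "first listed" rule could look something like the sketch below. It is only an illustration: first_aliases and SAMPLE_ALIASES are hypothetical names, the sample lines follow the code;alias;type layout of NameAliases.txt, and accepting both the "control" and "figment" types by default is an assumption made here so that code points such as U+0080 get a name.

```python
# Sample lines in the NameAliases.txt layout: code point; alias; alias type.
SAMPLE_ALIASES = """\
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
0080;PADDING CHARACTER;figment
0084;INDEX;control
"""


def first_aliases(text, kinds=("control", "figment")):
    """Map each code point to its first listed alias whose type is in `kinds`.

    Hypothetical helper; the default `kinds` is an assumption, not a settled choice.
    """
    chosen = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, name, kind = line.split(";")
        if kind in kinds:
            # setdefault keeps the first alias seen and ignores later ones.
            chosen.setdefault(int(code, 16), name)
    return chosen


aliases = first_aliases(SAMPLE_ALIASES)
for code, name in sorted(aliases.items()):
    print("U+{:04X} {}".format(code, name))
```

Because setdefault never overwrites, the result depends only on the order of lines in the input, which matches the idea of consistently using the first alias listed.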

History
Date	User	Action	Args
2022-04-11 14:58:33	admin	set	github: 71683
2021-03-08 20:13:44	vstinner	set	nosy: - vstinner
2021-02-26 18:08:26	eryksun	set	versions: + Python 3.8, Python 3.9, Python 3.10, - Python 2.7, Python 3.5, Python 3.6
2016-07-12 17:30:10	eryksun	set	versions: + Python 2.7, Python 3.6
2016-07-12 17:29:45	eryksun	set	nosy: + ezio.melotti, vstinner components: + Unicode
2016-07-12 16:04:24	zwol	set	messages: + msg270254
2016-07-12 15:08:29	eryksun	set	nosy: + eryksun messages: + msg270247
2016-07-12 13:35:01	r.david.murray	set	nosy: + r.david.murray messages: + msg270245
2016-07-12 13:10:49	zwol	create