Issue 18234: Unicodedata module should provide access to codepoint aliases


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62434

classification

Title:	Unicodedata module should provide access to codepoint aliases
Type:	enhancement	Stage:	needs patch
Components:		Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, benjamin.peterson, ezio.melotti, flying sheep, lemburg, loewis, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2013-06-17 00:24 by belopolsky, last changed 2022-04-11 14:57 by admin.

Messages (20)
msg191300 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-17 00:24
Python is aware of unicode codepoint aliases, but unicodedata does not provide a way to find aliases of a given codepoint:

>>> ucd.lookup('ESCAPE') == '\N{ESCAPE}'
True
>>> ucd.lookup('RS') == '\N{RS}'
True

but

>>> ucd.name('\N{ESCAPE}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> ucd.name('\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
msg191510 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2013-06-20 12:57
I think the best way would be to provide a function unicodedata.aliases, returning a list of names for a given character or sequence.
msg191538 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-20 20:32
UCD provides more than just a list of aliases: formal name aliases have "type" - control, abbreviation, etc. See <http://www.unicode.org/Public/UNIDATA/NameAliases.txt>.
msg191715 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-23 18:18
Rather than adding a new method to unicodedata, what do you think about adding a type keyword argument to unicodedata.name()? It can default to "canonical" and have possible values "control", "abbreviation", etc. See also #12753.
msg191719 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-06-23 19:02
Can a character or sequence have multiple aliases? What will be a result type of unicodedata.name() with "abbreviation" keyword value?
msg191724 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-23 19:41
> Can a character or sequence have multiple aliases?

Yes, for example, most control characters have two aliases (and no name).

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation

(See <http://www.unicode.org/Public/UNIDATA/NameAliases.txt>)

> What will be a result type of unicodedata.name() with "abbreviation" keyword value?

Under my proposal:

>>> unicodedata.name('\N{ESCAPE}', type='abbreviation')
'ESC'

I would also like to consider changing the default slightly. I find the following behavior rather unhelpful:

>>> unicodedata.name('\N{ESC}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

I think most users would expect 'ESCAPE' instead.

The following is more of a curiosity rather than a genuine problem, but is a good illustration for a general point:

>>> unicodedata.name('\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}')
'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET'

(Note misspelled word "BRACKET" in the output.)

Since "correction" alias is the official method of publishing corrections to unicode names, I think unicodedata.name() should return correct name by default.
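The record format quoted above (code point, alias, alias type, separated by semicolons) is simple to read into a lookup table. The following is only a sketch: `SAMPLE` holds the six records from this message rather than the full NameAliases.txt file, and `parse_name_aliases` is a hypothetical helper name, not part of unicodedata.

```python
from collections import defaultdict

# The six records quoted in the message above; the real file is much longer.
SAMPLE = """\
# lines starting with '#' are comments
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
"""


def parse_name_aliases(text):
    """Map each character to its (alias_type, alias) pairs, in file order."""
    table = defaultdict(list)
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, alias_type = line.split(';')
        table[chr(int(code, 16))].append((alias_type, alias))
    return dict(table)


for ch, entries in parse_name_aliases(SAMPLE).items():
    print('U+%04X' % ord(ch), entries)
```

Keeping the pairs in file order preserves any ordering the data file implies, which matters for the "first alias" and "most recent correction" rules discussed later in the thread.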
msg191729 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-23 20:43
unicodedata.name() was discussed in #12353 (msg144739) where MvL argued that misspelled names are better than corrected because they are more likely to appear misspelled in other sources. I am not sure I buy this argument. Someone googling for 'BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS' will probably just enter BYZANTINE VASIS and find what he or she needs. A more likely scenario is someone trying to get all FTHORA symbols using naive code like this:

[hex(i) for i in range(1114112) if 'FTHORA' in ud.name(chr(i), '')]

An even more likely scenario is someone seeing a fancy symbol on the web and wanting to use it in a Python program. Such a programmer would copy the symbol to the Python prompt, call unicodedata.name() and copy the result into the program. Do we want to encourage people to perpetuate the mistake that Unicode has corrected?

I don't think the issue of control code names was discussed in #12353. I see no downside to returning the first alias in case no name is present.
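The naive scan from this message can be written as a small reusable function. This is a sketch only; `code_points_named` is a hypothetical helper name. It searches the formal names returned by unicodedata.name(), so a character whose formal name carries a misspelling is found only under the misspelled form.

```python
import sys
import unicodedata


def code_points_named(word):
    """Return the code points whose formal Unicode name contains `word`."""
    return [cp for cp in range(sys.maxunicode + 1)
            if word in unicodedata.name(chr(cp), '')]


# Print whatever the local Unicode database yields for each spelling.
for spelling in ('FTHORA', 'FHTORA'):
    for cp in code_points_named(spelling):
        print(spelling, 'U+%04X' % cp, unicodedata.name(chr(cp)))
```

Going by the name quoted in the message, the FHTORA character would be expected to show up only in the second pass, which is the gap the message describes.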
msg191733 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-23 21:49
I mistyped the issue reference above: it should be #12753, not #12353.
msg191747 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-06-24 07:54
On 23.06.2013 22:43, Alexander Belopolsky wrote:
> unicodedata.name() was discussed in #12353 (msg144739) where MvL argued that misspelled names are better than corrected because they are more likely to appear misspelled in other sources. I am not sure I buy this argument. Someone googling for 'BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS' will probably just enter BYZANTINE VASIS and find what he or she needs. A more likely scenario is someone trying to get all FTHORA symbols using a naive code like this: [hex(i) for i in range(1114112) if 'FTHORA' in ud.name(chr(i), '')].
>
> Even more likely scenario is someone seeing a fancy symbol on the web and wanting to use it in a python program. Such programmer would copy the symbol to python prompt, call unicodedata.name() and copy the result in the program. Do we want to encourage people to perpetuate the mistake that Unicode has corrected?
>
> I don't think the issue of control codes names was discussed in #12353. I see no downside with returning the first alias in case no name is present.

We should stick to the rules. Please leave the function as it is, i.e. a 1-1 mapping to the official, non-changing Unicode name reference (including spelling errors, etc). Same with code points that have no name.

If you want to expose the aliases, you can do so in a new function, say .aliases() which then returns the list of aliases of a character (including the original name, if available).

If we change the return values of .name() to whatever we think would be more usable, we'd be modifying how Python programmers see the Unicode database. That's not the purpose of the module.
msg191748 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-06-24 08:05
Perhaps unicodedata.aliases() should return not a list, but an ordered dict. What name should use the "namereplace" error handler? Original or corrected? Should it use first alias if there is no original name?
msg191751 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-06-24 09:24
On 24.06.2013 10:05, Serhiy Storchaka wrote:
> Perhaps unicodedata.aliases() should return not a list, but an ordered dict.
>
> What name should use the "namereplace" error handler? Original or corrected? Should it use first alias if there is no original name?

For compatibility with other tools, it should use .name(), not .aliases() to determine the name. Please note that the aliases are not the official Unicode names of the code points.
msg191771 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-24 14:35
MAL> Please leave the function as it is, i.e. a 1-1 mapping to the
MAL> official, non-changing Unicode name reference (including
MAL> spelling errors, etc). Same with code points that have no name.

Since we have code points with no name - it is not 1-1 mapping but 1 to 0 or 1.

Unicode Standard recommends using "Code Point Labels" "To provide unique, meaningful labels for code points that do not have character names." (Section 4.9.)

These labels are not very useful:

Control: control-NNNN
Reserved: reserved-NNNN
Noncharacter: noncharacter-NNNN
Private-Use: private-use-NNNN
Surrogate: surrogate-NNNN

According to the description in NameAliases.txt:

# The formal name aliases are part of the Unicode character namespace, which
# includes the character names and the names of named character sequences.

I believe this means that formal name aliases are as official as the character names.

If we don't change the default, what is the downside in adding an optional type argument to unicodedata.name()? After all, according to the standard, aliases are names, just a different type of names.
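For comparison, the code point labels listed above can be derived from the General_Category that unicodedata.category() reports. This is a sketch under assumptions: `code_point_label` is a hypothetical helper, and the mapping from categories to label prefixes (Cc, Co, Cs, and Cn split into noncharacter and reserved) is my reading of the label list, not something unicodedata provides.

```python
import unicodedata

# Assumed mapping from General_Category to the label prefixes listed above.
_PREFIX = {'Cc': 'control', 'Co': 'private-use', 'Cs': 'surrogate'}


def code_point_label(ch):
    """Build a label such as 'control-001B' for a code point without a name."""
    cp = ord(ch)
    category = unicodedata.category(ch)
    if category in _PREFIX:
        kind = _PREFIX[category]
    elif category == 'Cn':
        # Noncharacters are U+FDD0..U+FDEF plus the last two code points
        # of every plane; every other unassigned code point is reserved.
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        kind = 'noncharacter' if is_nonchar else 'reserved'
    else:
        raise ValueError('code point has a character name, not a label')
    return '%s-%04X' % (kind, cp)


print(code_point_label('\x1b'))
print(code_point_label('\ue000'))
```

The labels add nothing beyond the category and the hex value, which is consistent with the remark above that they are not very useful as names.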
msg191774 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-24 14:58
Here is an example of "prior art" that is relevant to this discussion:

"""
charnames::viacode(code)
..
As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms or aliases) for some code points, most of which were already available as Perl extensions. All these are accepted by \N{...} and the other functions in this module, but viacode has to choose which one name to return for a given input code point, so it returns the "best" name. To understand how this works, it is helpful to know more about the Unicode name properties. All code points actually have only a single name, which (starting in Unicode 2.0) can never change once a character has been assigned to the code point. But mistakes have been made in assigning names, for example sometimes a clerical error was made during the publishing of the Standard which caused words to be misspelled, and there was no way to correct those. The Name_Alias property was eventually created to handle these situations. If a name was wrong, a corrected synonym would be published for it, using Name_Alias. viacode will return that corrected synonym as the "best" name for a code point. (It is even possible, though it hasn't happened yet, that the correction itself will need to be corrected, and so another Name_Alias can be created for that code point; viacode will return the most recent correction.)

The Unicode name for each of the control characters (such as LINE FEED) is the empty string. However almost all had names assigned by other standards, such as the ASCII Standard, or were in common use. viacode returns these names as the "best" ones available. Unicode 6.1 has created Name_Aliases for each of them, including alternate names, like NEW LINE. viacode uses the original name, "LINE FEED" in preference to the alternate. Similarly the name returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
"""

<http://perldoc.perl.org/charnames.html#charnames%3a%3aviacode(code)>

If .name() cannot be touched, what about implementing .bestname() with the above semantics?
msg191775 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-06-24 14:58
On 24.06.2013 16:35, Alexander Belopolsky wrote:
> MAL> Please leave the function as it is, i.e. a 1-1 mapping to the
> MAL> official, non-changing Unicode name reference (including
> MAL> spelling errors, etc). Same with code points that have no name.
>
> Since we have code points with no name - it is not 1-1 mapping but 1 to 0 or 1.

True, it's not 1-1 in the mathematical sense (bijective), only surjective. However, it is 1-1 for all code points which have a name assigned.

> Unicode Standard recommends using "Code Point Labels" "To provide unique, meaningful labels for code points that do not have character names." (Section 4.9.)
>
> These labels are not very useful:
>
> Control: control-NNNN
> Reserved: reserved-NNNN
> Noncharacter: noncharacter-NNNN
> Private-Use: private-use-NNNN
> Surrogate: surrogate-NNNN

I don't see any advantage of using these over plain \uXXXX codes.

> According to the description in NameAliases.txt:
>
> # The formal name aliases are part of the Unicode character namespace, which
> # includes the character names and the names of named character sequences.
>
> I believe this means that formal name aliases are as official as the character names.

Yes, but they are official aliases, not official code point names :-)

> If we don't change the default, what is the downside in adding an optional type argument to unicodedata.name()? After all, according to the standard, aliases are names, just a different type of names.

The .aliases() function would have to return a list, not a single name, so a parameter would cause the return type to change, which is not a good idea.

A new function also makes the origin of these names clear to the user.
msg191777 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-06-24 15:07
On 24.06.2013 16:58, Alexander Belopolsky wrote:
> Here is an example of "prior art" that is relevant to this discussion:
>
> """
> charnames::viacode(code)
> ..
> As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms or aliases) for some code points, most of which were already available as Perl extensions. All these are accepted by \N{...} and the other functions in this module, but viacode has to choose which one name to return for a given input code point, so it returns the "best" name. To understand how this works, it is helpful to know more about the Unicode name properties. All code points actually have only a single name, which (starting in Unicode 2.0) can never change once a character has been assigned to the code point. But mistakes have been made in assigning names, for example sometimes a clerical error was made during the publishing of the Standard which caused words to be misspelled, and there was no way to correct those. The Name_Alias property was eventually created to handle these situations. If a name was wrong, a corrected synonym would be published for it, using Name_Alias. viacode will return that corrected synonym as the "best" name for a code point. (It is even possible, though it hasn't happened yet, that the correction itself will need to be corrected, and so another Name_Alias can be created for that code point; viacode will return the most recent correction.)
>
> The Unicode name for each of the control characters (such as LINE FEED) is the empty string. However almost all had names assigned by other standards, such as the ASCII Standard, or were in common use. viacode returns these names as the "best" ones available. Unicode 6.1 has created Name_Aliases for each of them, including alternate names, like NEW LINE. viacode uses the original name, "LINE FEED" in preference to the alternate. Similarly the name returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
> """
>
> <http://perldoc.perl.org/charnames.html#charnames%3a%3aviacode(code)>
>
> If .name() cannot be touched, what about implementing .bestname() with the above semantics?

I think it's better to let the programmer decide what the "best" name should be, e.g. some people will like ESC better than ESCAPE or \u001b or \x1b.

unicodedata only provides neutral access to what's in the Unicode database. It doesn't make any decisions on what's good or bad ;-)
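For reference, the viacode-style selection rule under discussion can be sketched in a few lines. Everything here is an assumption for illustration: `best_name` and the `_ALIASES` table are hypothetical, the table holds only a few hand-copied entries, and the order (latest correction, then formal name, then first control alias) is one reading of the Perl documentation quoted above.

```python
import unicodedata

# Looked up by its formal (misspelled) name, as shown earlier in the thread.
BRACKET = unicodedata.lookup(
    'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET')

# Hypothetical, hand-copied alias table; a real one would come from
# NameAliases.txt.
_ALIASES = {
    '\x1b': [('control', 'ESCAPE'), ('abbreviation', 'ESC')],
    BRACKET: [('correction',
               'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET')],
    '\ufeff': [('alternate', 'BYTE ORDER MARK')],
}


def best_name(ch):
    """Pick one name: latest correction, else formal name, else control alias."""
    entries = _ALIASES.get(ch, [])
    corrections = [alias for kind, alias in entries if kind == 'correction']
    if corrections:
        return corrections[-1]  # assumes later entries are newer corrections
    formal = unicodedata.name(ch, None)
    if formal is not None:
        return formal
    for kind, alias in entries:
        if kind == 'control':
            return alias
    raise ValueError('no such name')


for ch in ('\x1b', BRACKET, '\ufeff', 'A'):
    print('U+%04X' % ord(ch), best_name(ch))
```

The fixed preference order is exactly the kind of policy decision the reply above argues should be left to the application.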
msg191781 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-24 16:10
> The .aliases() function would have to return a list, not a single
> name, so a parameter would cause the return type to change, which
> is not a good idea.

You misunderstood my proposal. .name() will still return a single name, but the type parameter will control which name to return:

name(ch[, type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation')])

None - default, same as current behavior.

correction - indicates that the returned name is a corrected form for the original name (which remains valid) for the same code point.

control - return a new name added for a control character.

alternate - return an alternate name for a character

figment - return a name for a character that has been documented but was never in any actual standard.

abbreviation - return a common abbreviation for a character
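A minimal sketch of the proposed signature, written as a wrapper around unicodedata.name(). The wrapper, the `_ALIASES` table and the "first alias of the requested type wins" rule are assumptions made for illustration; the proposal itself does not say what happens when a code point has several aliases of one type.

```python
import unicodedata

# Hypothetical sample data; LINE FEED and NEW LINE are both control aliases
# for the same code point, as the Perl documentation quoted earlier notes.
_ALIASES = {
    '\x1b': [('control', 'ESCAPE'), ('abbreviation', 'ESC')],
    '\n': [('control', 'LINE FEED'), ('control', 'NEW LINE')],
}

_MISSING = object()  # sentinel: distinguishes "no default" from default=None


def name(ch, default=_MISSING, *, type=None):
    """Return the formal name, or one alias of the requested type."""
    if type is None:
        # Unchanged behaviour: defer to unicodedata.name().
        if default is _MISSING:
            return unicodedata.name(ch)
        return unicodedata.name(ch, default)
    matches = [alias for kind, alias in _ALIASES.get(ch, []) if kind == type]
    if matches:
        return matches[0]  # arbitrary: the remaining aliases are dropped
    if default is _MISSING:
        raise ValueError('no such name')
    return default


print(name('\x1b', type='abbreviation'))
print(name('\n', type='control'))
```

With this rule the second control alias of U+000A is unreachable, which illustrates why a single-valued return type is awkward once a type can occur more than once.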
msg191782 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2013-06-24 16:20
But some of these types could still have lists as values, no?
msg191788 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-06-24 17:04
On 24.06.2013 18:10, Alexander Belopolsky wrote:
>> The .aliases() function would have to return a list, not a single
>> name, so a parameter would cause the return type to change, which
>> is not a good idea.
>
> You misunderstood my proposal. .name() will still return a single name, but the type parameter will control which name to return:
>
> name(ch[, type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation')])
>
> None - default, same as current behavior.
>
> correction - indicates that the returned name is a corrected form for the original name (which remains valid) for the same code point.
>
> control - return a new name added for a control character.
>
> alternate - return an alternate name for a character
>
> figment - return a name for a character that has been documented but was never in any actual standard.
>
> abbreviation - return a common abbreviation for a character

How can you be sure that each of those alias types occurs only once? The NameAliases.txt doesn't say anything about this, AFAIK:

http://www.unicode.org/Public/UNIDATA/NameAliases.txt

Also, what would name() return in case no alias of a particular type is defined?

I think it would be easier and more future proof to have a function

aliases(code) -> [(type, alias),...]

which simply returns all defined aliases. Applications could then add helpers to select the type they would like to use.

It may make sense to also add the name(code) value as e.g. ('standard', name(code)) to that list.
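The list-of-pairs interface suggested in this message could look roughly like the sketch below. `_ALIAS_TABLE` and the helper `first_of_type` are hypothetical, and the table contains a single hand-copied entry; the formal name is prepended under the 'standard' label as the message suggests.

```python
import unicodedata

# Hypothetical table keyed by code point; a real one would be generated
# from NameAliases.txt.
_ALIAS_TABLE = {
    0x1B: [('control', 'ESCAPE'), ('abbreviation', 'ESC')],
}


def aliases(ch):
    """Return [(type, alias), ...], led by the formal name when one exists."""
    result = []
    formal = unicodedata.name(ch, None)
    if formal is not None:
        result.append(('standard', formal))
    result.extend(_ALIAS_TABLE.get(ord(ch), []))
    return result


def first_of_type(ch, alias_type, default=None):
    """Application-side helper: pick the first alias of one type."""
    return next((alias for kind, alias in aliases(ch) if kind == alias_type),
                default)


print(aliases('A'))
print(first_of_type('\x1b', 'abbreviation'))
```

Because the function always returns a list, a code point gaining a second alias of an existing type would not change the return type, only the list length.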
msg210811 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-02-10 08:38
See also #20433.
msg229100 - (view)	Author: (flying sheep) *	Date: 2014-10-11 17:33
IDK if it came with unicode 7.0, but there is clarification:

# Note that currently the only instances of multiple aliases of the same
# type for a single code point are either of type "control" or "abbreviation".
# An alias of type "abbreviation" can, in principle, be added for any code
# point, although currently aliases of type "correction" do not have
# any additional aliases of type "abbreviation". Such relationships
# are not enforced by stability policies.

it says “currently”, so it isn’t guaranteed to stay that way, and other types could also be specified multiple times in the future.

so as much as i’d like it if we could follow Alexander’s proposal, i think we shouldn’t extend the function that way if it would either return a name string, a default value, a list of aliases, or raise an exception: too complex.

i think we should create:

unicodedata.aliases(chr, type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation'))

and make aliases(chr) return a dict with all aliases for the character, and make aliases(chr, type) return a list of aliases for that type (possibly empty)

examples:

aliases('\b') == {'control': ['BACKSPACE'], 'abbreviation': ['BS']}
aliases('\b', 'control') == ['BACKSPACE']
aliases('b') == {}
aliases('b', 'control') == []

---

alternative: when specifying a type, it’ll raise an error if no alias of this type exists. but because of the sparse nature of aliases i’m against that.
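The dict-or-list behaviour proposed here is small enough to sketch directly. The `_ALIASES` table is a hypothetical stand-in containing only the BACKSPACE entry from the examples above; a real implementation would be backed by the full alias data.

```python
# Hypothetical alias data, limited to the example character above.
_ALIASES = {
    '\b': {'control': ['BACKSPACE'], 'abbreviation': ['BS']},
}


def aliases(ch, type=None):
    """Return all aliases as a dict, or the list for one type (maybe empty)."""
    by_type = _ALIASES.get(ch, {})
    if type is None:
        # Copy the lists so callers cannot mutate the shared table.
        return {kind: list(names) for kind, names in by_type.items()}
    return list(by_type.get(type, []))


print(aliases('\b'))
print(aliases('\b', 'control'))
print(aliases('b'))
print(aliases('b', 'control'))
```

Returning an empty dict or list for characters without aliases keeps callers free of exception handling, which matches the preference stated above for not raising on missing types.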

History
Date	User	Action	Args
2022-04-11 14:57:47	admin	set	github: 62434
2022-03-08 09:21:03	serhiy.storchaka	link	issue46947 superseder
2014-10-11 17:33:20	flying sheep	set	nosy: + flying sheep messages: + msg229100
2014-02-10 08:38:07	ezio.melotti	set	stage: needs patch messages: + msg210811 versions: + Python 3.5, - Python 3.4
2014-01-29 07:50:26	serhiy.storchaka	link	issue20433 superseder
2013-06-24 17:04:16	lemburg	set	messages: + msg191788
2013-06-24 16:20:57	loewis	set	messages: + msg191782
2013-06-24 16:10:08	belopolsky	set	messages: + msg191781
2013-06-24 15:07:49	lemburg	set	messages: + msg191777
2013-06-24 14:58:44	lemburg	set	messages: + msg191775
2013-06-24 14:58:11	belopolsky	set	messages: + msg191774
2013-06-24 14:35:14	belopolsky	set	messages: + msg191771
2013-06-24 09:24:57	lemburg	set	messages: + msg191751
2013-06-24 08:05:08	serhiy.storchaka	set	messages: + msg191748
2013-06-24 07:54:05	lemburg	set	messages: + msg191747
2013-06-23 21:49:08	belopolsky	set	messages: + msg191733
2013-06-23 20:43:45	belopolsky	set	messages: + msg191729
2013-06-23 19:41:26	belopolsky	set	messages: + msg191724
2013-06-23 19:02:23	serhiy.storchaka	set	messages: + msg191719
2013-06-23 18:18:14	belopolsky	set	messages: + msg191715
2013-06-20 20:32:51	belopolsky	set	messages: + msg191538
2013-06-20 12:57:12	loewis	set	messages: + msg191510
2013-06-17 11:49:47	pitrou	set	nosy: + lemburg, loewis, benjamin.peterson, ezio.melotti, serhiy.storchaka
2013-06-17 00:24:18	belopolsky	create