classification
Title: Unicodedata module should provide access to codepoint aliases
Type: enhancement
Stage: needs patch
Components:
Versions: Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: belopolsky, benjamin.peterson, ezio.melotti, flying sheep, lemburg, loewis, serhiy.storchaka
Priority: normal
Keywords:

Created on 2013-06-17 00:24 by belopolsky, last changed 2014-10-11 17:33 by flying sheep.

Messages (20)
msg191300 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-17 00:24
Python is aware of unicode codepoint aliases, but unicodedata does not provide a way to find aliases of a given codepoint:

>>> ucd.lookup('ESCAPE') == '\N{ESCAPE}'
True
>>> ucd.lookup('RS') == '\N{RS}'
True

but

>>> ucd.name('\N{ESCAPE}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name


>>> ucd.name('\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
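
The asymmetry can be reproduced with a short script. This is only a minimal sketch, assuming Python 3.3 or later (where lookup() accepts aliases); name_or_none is a hypothetical helper, not part of unicodedata.

```python
import unicodedata


def name_or_none(ch):
    """Return unicodedata.name(ch), or None when the code point has no name."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        return None


# lookup() resolves both the "control" alias and the "abbreviation" alias ...
esc = unicodedata.lookup('ESCAPE')
assert esc == unicodedata.lookup('ESC') == '\x1b'

# ... but name() cannot map the character back to either of them.
print(name_or_none(esc))
print(name_or_none('A'))
```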
msg191510 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-06-20 12:57
I think the best way would be to provide a function unicodedata.aliases, returning a list of names for a given character or sequence.
msg191538 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-20 20:32
UCD provides more than just a list of aliases: formal name aliases have "type" - control, abbreviation, etc.  See <http://www.unicode.org/Public/UNIDATA/NameAliases.txt>.
msg191715 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 18:18
Rather than adding a new method to unicodedata, what do you think about adding a type keyword argument to unicodedata.name()?  It can default to "canonical" and have possible values "control", "abbreviation", etc.

See also #12753.
msg191719 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-23 19:02
Can a character or sequence have multiple aliases? What will be a result type of unicodedata.name() with "abbreviation" keyword value?
msg191724 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 19:41
> Can a character or sequence have multiple aliases?

Yes, for example, most control characters have two aliases (and no name).

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation

(See <http://www.unicode.org/Public/UNIDATA/NameAliases.txt>)
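
The records above are simple semicolon-separated triples (code point, alias, type), so they are easy to load. A rough sketch, assuming the file keeps this three-field layout; parse_name_aliases and SAMPLE are hypothetical names used only for illustration:

```python
from collections import defaultdict

# A few records copied from the excerpt above; the real file also contains
# comment lines starting with '#'.
SAMPLE = """\
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
"""


def parse_name_aliases(text):
    """Map each code point to its list of (type, alias) pairs, in file order."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(';')
        aliases[int(code, 16)].append((kind, alias))
    return dict(aliases)


table = parse_name_aliases(SAMPLE)
print(table[0x0001])
```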

> What will be a result type of unicodedata.name() with "abbreviation" keyword value?

Under my proposal:

>>> unicodedata.name('\N{ESCAPE}', type='abbreviation')
'ESC'

I would also like to consider changing the default slightly.  I find the following behavior rather unhelpful:

>>> unicodedata.name('\N{ESC}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

I think most users would expect 'ESCAPE' instead.

The following is more of a curiosity rather than a genuine problem, but is a good illustration for a general point:

>>> unicodedata.name('\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}')
'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET'

(Note that the word "BRACKET" is misspelled as "BRAKCET" in the output.)

Since the "correction" alias is the official method of publishing corrections to Unicode names, I think unicodedata.name() should return the corrected name by default.
msg191729 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 20:43
unicodedata.name() was discussed in #12353 (msg144739) where MvL argued that misspelled names are better than corrected because they are more likely to appear misspelled in other sources.  I am not sure I buy this argument.  Someone googling for 'BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS' will probably just enter BYZANTINE VASIS and find what he or she needs.  A more likely scenario is someone trying to get all FTHORA symbols using a naive code like this: [hex(i) for i in range(1114112) if 'FTHORA' in ud.name(chr(i), '')].
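
The naive search mentioned above can be run as a script to see the effect. A small sketch, using only unicodedata.name() as it exists today; the exact list of hits depends on the Unicode database shipped with the interpreter:

```python
import sys
import unicodedata as ud

# The naive substring search from the paragraph above; the second argument
# to name() is the default returned for code points that have no name.
fthora = [hex(i) for i in range(sys.maxunicode + 1)
          if 'FTHORA' in ud.name(chr(i), '')]

# U+1D0C5 carries the misspelled official name, so the search skips it.
vasis = '\U0001D0C5'
print(ud.name(vasis))
print(hex(ord(vasis)) in fthora)
```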

An even more likely scenario is someone seeing a fancy symbol on the web and wanting to use it in a Python program.  Such a programmer would copy the symbol to the Python prompt, call unicodedata.name(), and copy the result into the program.  Do we want to encourage people to perpetuate the mistake that Unicode has corrected?

I don't think the issue of control codes names was discussed in #12353.  I see no downside with returning the first alias in case no name is present.
msg191733 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-23 21:49
I mistyped the issue reference above: it should be #12753, not #12353.
msg191747 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-24 07:54
On 23.06.2013 22:43, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> unicodedata.name() was discussed in #12353 (msg144739) where MvL argued that misspelled names are better than corrected because they are more likely to appear misspelled in other sources.  I am not sure I buy this argument.  Someone googling for 'BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS' will probably just enter BYZANTINE VASIS and find what he or she needs.  A more likely scenario is someone trying to get all FTHORA symbols using a naive code like this: [hex(i) for i in range(1114112) if 'FTHORA' in ud.name(chr(i), '')].
> 
> An even more likely scenario is someone seeing a fancy symbol on the web and wanting to use it in a Python program.  Such a programmer would copy the symbol to the Python prompt, call unicodedata.name(), and copy the result into the program.  Do we want to encourage people to perpetuate the mistake that Unicode has corrected?
> 
> I don't think the issue of control codes names was discussed in #12353.  I see no downside with returning the first alias in case no name is present.

We should stick to the rules. Please leave the function as it
is, i.e. a 1-1 mapping to the official, non-changing Unicode
name reference (including spelling errors, etc). Same with
code points that have no name.

If you want to expose the aliases, you can do so in a new
function, say .aliases() which then returns the list of
aliases of a character (including the original name,
if available).

If we change the return values of .name() to whatever we think
would be more usable, we'd be modifying how Python programmers
see the Unicode database. That's not the purpose of the module.
msg191748 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-06-24 08:05
Perhaps unicodedata.aliases() should return not a list, but an ordered dict.

Which name should the "namereplace" error handler use: the original or the corrected one? Should it use the first alias if there is no original name?
msg191751 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-24 09:24
On 24.06.2013 10:05, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
> Perhaps unicodedata.aliases() should return not a list, but an ordered dict.
> 
> Which name should the "namereplace" error handler use: the original or the corrected one? Should it use the first alias if there is no original name?

For compatibility with other tools, it should use .name(), not .aliases()
to determine the name. Please note that the aliases are not the official
Unicode names of the code points.
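
For readers who want to try this: a "namereplace" error handler is available in Python 3.5 and later. The snippet below is only a quick way to inspect what it emits on a given interpreter; it does not claim anything about how the handler is implemented internally.

```python
# Encode a string containing one non-ASCII character that has a Unicode name.
text = 'caf\u00e9'
encoded = text.encode('ascii', 'namereplace')
print(encoded)

# U+0080 is a control character with no Unicode name; print the result to
# see what the handler falls back to on this interpreter.
print('\x80'.encode('ascii', 'namereplace'))
```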
msg191771 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-24 14:35
MAL> Please leave the function as it is, i.e. a 1-1 mapping to the
MAL> official, non-changing Unicode name reference (including
MAL> spelling errors, etc). Same with code points that have no name.

Since we have code points with no name - it is not 1-1 mapping but 1 to 0 or 1.

Unicode Standard recommends using "Code Point Labels" "To provide unique, meaningful labels for code points that do not have character names." (Section 4.9.)

These labels are not very useful:

Control: control-NNNN
Reserved: reserved-NNNN
Noncharacter: noncharacter-NNNN
Private-Use: private-use-NNNN
Surrogate: surrogate-NNNN
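
For illustration, labels of this shape can be derived from the general category. A hedged sketch: code_point_label is a hypothetical helper, and the noncharacter test assumes the usual definition (U+FDD0..U+FDEF plus the last two code points of every plane).

```python
import unicodedata

_KIND_BY_CATEGORY = {'Cc': 'control', 'Co': 'private-use', 'Cs': 'surrogate'}


def code_point_label(ch):
    """Return a label such as 'control-001B', or None for other characters."""
    cp = ord(ch)
    category = unicodedata.category(ch)
    if category == 'Cn':
        # Unassigned code points are either noncharacters or reserved.
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        kind = 'noncharacter' if is_nonchar else 'reserved'
    else:
        kind = _KIND_BY_CATEGORY.get(category)
        if kind is None:
            return None  # the character has a regular name; no label needed
    return '%s-%04X' % (kind, cp)


for ch in ('\x1b', '\ue000', '\ufdd0', 'A'):
    print(repr(ch), code_point_label(ch))
```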

According to the description in NameAliases.txt:

# The formal name aliases are part of the Unicode character namespace, which
# includes the character names and the names of named character sequences.

I believe this means that formal name aliases are as official as the character names.

If we don't change the default, what is the downside in adding an optional type argument to unicodedata.name()?  After all, according to the standard, aliases *are* names, just a different *type* of names.
msg191774 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-24 14:58
Here is an example of "prior art" that is relevant to this discussion:

"""
charnames::viacode(code)
..
As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms or aliases) for some code points, most of which were already available as Perl extensions. All these are accepted by \N{...} and the other functions in this module, but viacode has to choose which one name to return for a given input code point, so it returns the "best" name. To understand how this works, it is helpful to know more about the Unicode name properties. All code points actually have only a single name, which (starting in Unicode 2.0) can never change once a character has been assigned to the code point. But mistakes have been made in assigning names, for example sometimes a clerical error was made during the publishing of the Standard which caused words to be misspelled, and there was no way to correct those. The Name_Alias property was eventually created to handle these situations. If a name was wrong, a corrected synonym would be published for it, using Name_Alias. viacode will return that corrected synonym as the "best" name for a code point. (It is even possible, though it hasn't happened yet, that the correction itself will need to be corrected, and so another Name_Alias can be created for that code point; viacode will return the most recent correction.)

The Unicode name for each of the control characters (such as LINE FEED) is the empty string. However almost all had names assigned by other standards, such as the ASCII Standard, or were in common use. viacode returns these names as the "best" ones available. Unicode 6.1 has created Name_Aliases for each of them, including alternate names, like NEW LINE. viacode uses the original name, "LINE FEED" in preference to the alternate. Similarly the name returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
""" <http://perldoc.perl.org/charnames.html#charnames%3a%3aviacode(code)>

If .name() cannot be touched, what about implementing .bestname() with the above semantics?
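
To make the suggestion concrete, here is a rough sketch of such a .bestname() written as a plain function. The _ALIASES table is a hypothetical, hand-copied subset of NameAliases.txt, and treating the last listed correction as the most recent one is an assumption.

```python
import unicodedata

# Illustrative subset only: code point -> [(type, alias), ...] in file order.
_ALIASES = {
    0x000A: [('control', 'LINE FEED'), ('control', 'NEW LINE'),
             ('abbreviation', 'LF')],
    0xFE18: [('correction',
              'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET')],
    0xFEFF: [('alternate', 'BYTE ORDER MARK')],
}


def bestname(ch, default=None):
    """Prefer a correction, then the official name, then a control alias."""
    entries = _ALIASES.get(ord(ch), [])
    corrections = [alias for kind, alias in entries if kind == 'correction']
    if corrections:
        return corrections[-1]  # assumed: later entries are newer corrections
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    for kind, alias in entries:
        if kind == 'control':
            return alias  # first control alias, e.g. LINE FEED over NEW LINE
    return default


print(bestname('\n'))
print(bestname('\ufe18'))
print(bestname('\ufeff'))
```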
msg191775 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-24 14:58
On 24.06.2013 16:35, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> MAL> Please leave the function as it is, i.e. a 1-1 mapping to the
> MAL> official, non-changing Unicode name reference (including
> MAL> spelling errors, etc). Same with code points that have no name.
> 
> Since we have code points with no name - it is not 1-1 mapping but 1 to 0 or 1.

True, it's not 1-1 in the mathematical sense (bijective), only surjective.
However, it is 1-1 for all code points which have a name assigned.

> Unicode Standard recommends using "Code Point Labels" "To provide unique, meaningful labels for code points that do not have character names." (Section 4.9.)
> 
> These labels are not very useful:
> 
> Control: control-NNNN
> Reserved: reserved-NNNN
> Noncharacter: noncharacter-NNNN
> Private-Use: private-use-NNNN
> Surrogate: surrogate-NNNN

I don't see any advantage in using these over plain \uXXXX codes.

> According to the description in NameAliases.txt:
> 
> # The formal name aliases are part of the Unicode character namespace, which
> # includes the character names and the names of named character sequences.
> 
> I believe this means that formal name aliases are as official as the character names.

Yes, but they are official aliases, not official code point names :-)

> If we don't change the default, what is the downside in adding an optional type argument to unicodedata.name()?  After all, according to the standard, aliases *are* names, just a different *type* of names.

The .aliases() function would have to return a list, not a single
name, so a parameter would cause the return type to change, which
is not a good idea.

A new function also makes the origin of these names clear to the
user.
msg191777 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-24 15:07
On 24.06.2013 16:58, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> Here is an example of "prior art" that is relevant to this discussion:
> 
> """
> charnames::viacode(code)
> ..
> As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms or aliases) for some code points, most of which were already available as Perl extensions. All these are accepted by \N{...} and the other functions in this module, but viacode has to choose which one name to return for a given input code point, so it returns the "best" name. To understand how this works, it is helpful to know more about the Unicode name properties. All code points actually have only a single name, which (starting in Unicode 2.0) can never change once a character has been assigned to the code point. But mistakes have been made in assigning names, for example sometimes a clerical error was made during the publishing of the Standard which caused words to be misspelled, and there was no way to correct those. The Name_Alias property was eventually created to handle these situations. If a name was wrong, a corrected synonym would be published for it, using Name_Alias. viacode will return that corrected synonym as the "best" name for a code point. (It is even possible, though it hasn't happened yet, that the correction itself will need to be corrected, and so another Name_Alias can be created for that code point; viacode will return the most recent correction.)
> 
> The Unicode name for each of the control characters (such as LINE FEED) is the empty string. However almost all had names assigned by other standards, such as the ASCII Standard, or were in common use. viacode returns these names as the "best" ones available. Unicode 6.1 has created Name_Aliases for each of them, including alternate names, like NEW LINE. viacode uses the original name, "LINE FEED" in preference to the alternate. Similarly the name returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
> """ <http://perldoc.perl.org/charnames.html#charnames%3a%3aviacode(code)>
> 
> If .name() cannot be touched, what about implementing .bestname() with the above semantics?

I think it's better to let the programmer decide what the "best"
name should be, e.g. some people will like ESC better than ESCAPE or
\u001b or \x1b.

unicodedata only provides neutral access to what's in the Unicode database.
It doesn't make any decisions on what's good or bad ;-)
msg191781 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-24 16:10
> The .aliases() function would have to return a list, not a single
> name, so a parameter would cause the return type to change, which
> is not a good idea.

You misunderstood my proposal.  .name() will still return a single name, but the type parameter will control which name to return:

name(ch[, type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation')])

None - default, same as current behavior.

correction - indicates that the returned name is a corrected form for the original name (which remains valid) for the same code point.

control - return a new name added for a control character.

alternate - return an alternate name for a character

figment - return a name for a character that has been documented but was never in any actual standard.

abbreviation - return a common abbreviation for a character
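
A rough sketch of these semantics, written as a standalone function so it can be tried without patching unicodedata. typed_name and the _ALIASES table are hypothetical; returning the first alias of the requested type and raising ValueError when there is none are assumptions, since the proposal above does not settle either point.

```python
import unicodedata

# Illustrative subset only: code point -> [(type, alias), ...] in file order.
_ALIASES = {
    0x001B: [('control', 'ESCAPE'), ('abbreviation', 'ESC')],
}


def typed_name(ch, type=None):
    """Return the official name, or the first alias of the requested type."""
    if type is None:
        return unicodedata.name(ch)  # default: same as current behavior
    for kind, alias in _ALIASES.get(ord(ch), []):
        if kind == type:
            return alias
    raise ValueError('no such name')


print(typed_name('\N{ESCAPE}', type='abbreviation'))
print(typed_name('\N{ESCAPE}', type='control'))
```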
msg191782 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-06-24 16:20
But some of these types could still have lists as values, no?
msg191788 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-06-24 17:04
On 24.06.2013 18:10, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
>> The .aliases() function would have to return a list, not a single
>> name, so a parameter would cause the return type to change, which
>> is not a good idea.
> 
> You misunderstood my proposal.  .name() will still return a single name, but the type parameter will control which name to return:
> 
> name(ch[, type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation')])
> 
> None - default, same as current behavior.
> 
> correction - indicates that the returned name is a corrected form for the original name (which remains valid) for the same code point.
> 
> control - return a new name added for a control character.
> 
> alternate - return an alternate name for a character
> 
> figment - return a name for a character that has been documented but was never in any actual standard.
> 
> abbreviation - return a common abbreviation for a character

How can you be sure that each of those alias types occurs only
once ?

The NameAliases.txt doesn't say anything about this, AFAIK:

http://www.unicode.org/Public/UNIDATA/NameAliases.txt

Also, what would name() return in case no alias of a particular
type is defined ?

I think it would be easier and more future proof to have a function
aliases(code) -> [(type, alias),...] which simply returns all
defined aliases. Applications could then add helpers to
select the type they would like to use.

It may make sense to also add the name(code) value as
e.g. ('standard', name(code)) to that list.
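
As a sketch of this shape, assuming a small hand-copied alias table (_ALIASES is hypothetical, as is the application-level helper pick):

```python
import unicodedata

# Illustrative subset only: code point -> [(type, alias), ...] in file order.
_ALIASES = {
    0x001B: [('control', 'ESCAPE'), ('abbreviation', 'ESC')],
    0xFE18: [('correction',
              'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET')],
}


def aliases(ch):
    """Return [(type, alias), ...], led by ('standard', name) when one exists."""
    result = []
    name = unicodedata.name(ch, None)
    if name is not None:
        result.append(('standard', name))
    result.extend(_ALIASES.get(ord(ch), []))
    return result


def pick(ch, preference):
    """Application helper: first alias whose type comes earliest in preference."""
    by_type = {}
    for kind, alias in aliases(ch):
        by_type.setdefault(kind, alias)  # keep the first alias of each type
    for kind in preference:
        if kind in by_type:
            return by_type[kind]
    return None


print(aliases('\x1b'))
print(pick('\ufe18', ('correction', 'standard')))
```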

msg210811 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-02-10 08:38
See also #20433.
msg229100 - (view) Author: (flying sheep) * Date: 2014-10-11 17:33
IDK if it came with unicode 7.0, but there is clarification:

# Note that currently the only instances of multiple aliases of the same
# type for a single code point are either of type "control" or "abbreviation".
# An alias of type "abbreviation" can, in principle, be added for any code
# point, although currently aliases of type "correction" do not have
# any additional aliases of type "abbreviation". Such relationships
# are not enforced by stability policies.

it says “currently”, so it isn’t guaranteed to stay that way, and other types could also be specified multiple times in the future.

so as much as i’d like it if we could follow Alexander’s proposal, i think we shouldn’t extend the function that way if it would either return a name string, a default value, a list of aliases, or raise an exception: too complex.

i think we should create:

unicodedata.aliases(chr, type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation'))

and make

aliases(chr) return a dict with all aliases for the character, and make
aliases(chr, type) return a list of aliases for that type (possibly empty)

examples:

aliases('\b') == {'control': ['BACKSPACE'], 'abbreviation': ['BS']}
aliases('\b', 'control') == ['BACKSPACE']
aliases('b') == {}
aliases('b', 'control') == []
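
a minimal sketch of that behaviour, assuming a tiny hand-written table (_ALIASES is hypothetical and holds only the backspace entries used in the examples):

```python
# Illustrative subset only: code point -> [(type, alias), ...] in file order.
_ALIASES = {
    0x0008: [('control', 'BACKSPACE'), ('abbreviation', 'BS')],
}


def aliases(ch, type=None):
    """Return a dict of all aliases, or a list of aliases for one type."""
    grouped = {}
    for kind, alias in _ALIASES.get(ord(ch), []):
        grouped.setdefault(kind, []).append(alias)
    if type is None:
        return grouped
    return grouped.get(type, [])  # possibly empty, never an exception


print(aliases('\b'))
print(aliases('\b', 'control'))
print(aliases('b', 'control'))
```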

---

alternative: when specifying a type, it’ll raise an error if no alias of this type exists. but because of the sparse nature of aliases i’m against that.
History
Date                 User              Action  Args
2014-10-11 17:33:20  flying sheep      set     nosy: + flying sheep
                                               messages: + msg229100
2014-02-10 08:38:07  ezio.melotti      set     stage: needs patch
                                               messages: + msg210811
                                               versions: + Python 3.5, - Python 3.4
2014-01-29 07:50:26  serhiy.storchaka  link    issue20433 superseder
2013-06-24 17:04:16  lemburg           set     messages: + msg191788
2013-06-24 16:20:57  loewis            set     messages: + msg191782
2013-06-24 16:10:08  belopolsky        set     messages: + msg191781
2013-06-24 15:07:49  lemburg           set     messages: + msg191777
2013-06-24 14:58:44  lemburg           set     messages: + msg191775
2013-06-24 14:58:11  belopolsky        set     messages: + msg191774
2013-06-24 14:35:14  belopolsky        set     messages: + msg191771
2013-06-24 09:24:57  lemburg           set     messages: + msg191751
2013-06-24 08:05:08  serhiy.storchaka  set     messages: + msg191748
2013-06-24 07:54:05  lemburg           set     messages: + msg191747
2013-06-23 21:49:08  belopolsky        set     messages: + msg191733
2013-06-23 20:43:45  belopolsky        set     messages: + msg191729
2013-06-23 19:41:26  belopolsky        set     messages: + msg191724
2013-06-23 19:02:23  serhiy.storchaka  set     messages: + msg191719
2013-06-23 18:18:14  belopolsky        set     messages: + msg191715
2013-06-20 20:32:51  belopolsky        set     messages: + msg191538
2013-06-20 12:57:12  loewis            set     messages: + msg191510
2013-06-17 11:49:47  pitrou            set     nosy: + lemburg, loewis, benjamin.peterson, ezio.melotti, serhiy.storchaka
2013-06-17 00:24:18  belopolsky        create