Message 209618 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	jamadagni
Recipients	ezio.melotti, jamadagni, vstinner
Date	2014-01-29.07:42:33
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1390981355.06.0.989685125573.issue20433@psf.upfronthosting.co.za>
In-reply-to

Content
Currently we have unicodedata.name() which returns the formal character name of the character chr as per the second column in UnicodeData.txt from http://www.unicode.org/Public/UNIDATA/. However, there are a few characters where the formal character name has spelling mistakes. Also, the control characters in the Basic Latin and Latin-1 blocks aren't really given meaningful character names. In one case, that of FEFF, the formal name ZERO WIDTH NO-BREAK SPACE refers to a deprecated usage of the character (and the alternate name BYTE ORDER MARK refers to the recommended usage). In all these cases, improved names are provided as stable aliases in NameAliases.txt from the same UNIDATA source. These are also part of the stable standard and are intended to alleviate the naming situation w.r.t. the above issues. For the stability, see: http://www.unicode.org/policies/stability_policy.html#Formal_Name_Alias Hence it would be most useful if the unicodedata module would add an aliasedname() method with the same signature as name() to provide the official aliased name in the case of characters with aliases, and when a character does not have an alias, to provide the same output as name(). As of Py 3.3, unicodedata.lookup() already uses/supports NameAliases.txt for returning the character given the name. The present requirement is to use it for returning the name given the character. Note that NameAliases.txt has abbreviated names for some characters (where the third column reads "abbreviation"). While these would be useful for lookup(), they would not be useful to be returned for aliasedname(). For instance, one would prefer to see "SPACE" returned for 0020 rather than "SP". So these entries should be disregarded for aliasedname(). Also, NameAliases.txt has multiple entries for some characters even after discarding the abbreviation entries. In these cases, the first entry should be used (for want of a better rule). It is presumed that these are provided in some order of preference. It should be noted that discussion on this topic on the "unicore" (Unicode members) mailing list (on the thread "When normative aliases exist..." started 2014-01-21) indicates that the order of entries is subject to change although the entries themselves will not be removed. In this case, the first non-abbreviation entry may change. This is acceptable for the behaviour of aliasedname(). Also note that aliases may be defined in future. Thus the string returned by aliasedname() for a given character is not guaranteed to be the same, but whatever is returned by it will surely be valid to use with lookup(). Those who desire a single immutable name and do not require the improvements provided by the aliases should use name() and not aliasedname(). Finally, for extended support, a namealiases() function should return all the aliases together with their types, allowing the user full choice of the desired but official alias. The attached code should clarify the required behaviour. (It is not a patch, just an illustration.)

Currently we have unicodedata.name() which returns the formal character name of the character chr as per the second column in UnicodeData.txt from http://www.unicode.org/Public/UNIDATA/.

However, there are a few characters where the formal character name has spelling mistakes. Also, the control characters in the Basic Latin and Latin-1 blocks aren't really given meaningful character names. In one case, that of FEFF, the formal name ZERO WIDTH NO-BREAK SPACE refers to a deprecated usage of the character (and the alternate name BYTE ORDER MARK refers to the recommended usage).

In all these cases, improved names are provided as stable aliases in NameAliases.txt from the same UNIDATA source. These are also part of the stable standard and are intended to alleviate the naming situation w.r.t. the above issues. For the stability, see: http://www.unicode.org/policies/stability_policy.html#Formal_Name_Alias

Hence it would be most useful if the unicodedata module would add an aliasedname() method with the same signature as name() to provide the official aliased name in the case of characters with aliases, and when a character does not have an alias, to provide the same output as name().

As of Py 3.3, unicodedata.lookup() already uses/supports NameAliases.txt for returning the character given the name. The present requirement is to use it for returning the name given the character.

Note that NameAliases.txt has abbreviated names for some characters (where the third column reads "abbreviation"). While these would be useful for lookup(), they would not be useful to be returned for aliasedname(). For instance, one would prefer to see "SPACE" returned for 0020 rather than "SP". So these entries should be disregarded for aliasedname().

Also, NameAliases.txt has multiple entries for some characters even after discarding the abbreviation entries. In these cases, the first entry should be used (for want of a better rule). It is presumed that these are provided in some order of preference.

It should be noted that discussion on this topic on the "unicore" (Unicode members) mailing list (on the thread "When normative aliases exist..." started 2014-01-21) indicates that the order of entries is subject to change although the entries themselves will not be removed. In this case, the first non-abbreviation entry may change. This is acceptable for the behaviour of aliasedname(). Also note that aliases may be defined in future. Thus the string returned by aliasedname() for a given character is not guaranteed to be the same, but whatever is returned by it will surely be valid to use with lookup(). Those who desire a single immutable name and do not require the improvements provided by the aliases should use name() and not aliasedname().

Finally, for extended support, a namealiases() function should return all the aliases together with their types, allowing the user full choice of the desired but official alias.

The attached code should clarify the required behaviour. (It is not a patch, just an illustration.)

History
Date	User	Action	Args
2014-01-29 07:42:35	jamadagni	set	recipients: + jamadagni, vstinner, ezio.melotti
2014-01-29 07:42:35	jamadagni	set	messageid: <1390981355.06.0.989685125573.issue20433@psf.upfronthosting.co.za>
2014-01-29 07:42:34	jamadagni	link	issue20433 messages
2014-01-29 07:42:34	jamadagni	create