This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: Unicode property value abbreviated names and long names
Type: enhancement Stage: needs patch
Components: Unicode Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Greg Price, PanderMusubi, benjamin.peterson, ezio.melotti, loewis, terry.reedy
Priority: normal Keywords: patch

Created on 2012-12-14 17:33 by PanderMusubi, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description
create-unicodedata-dicts-prop-value-alias-20121223.py PanderMusubi, 2012-12-23 13:34 Create dictionaries for unicodedata package containing property value aliases in terms of abbreviated names and long names.
bc_ea_gc.py terry.reedy, 2012-12-23 22:11 Refactored 3.3 version
prop-val-aliases.patch Greg Price, 2019-09-20 07:56
Messages (10)
msg177476 - (view) Author: Pander (PanderMusubi) Date: 2012-12-14 17:33
The package unicodedata
  http://docs.python.org/3/library/unicodedata.html
offers lookup of property values (general category, bidirectional class and East Asian width) for Unicode characters:
  unicodedata.category(unichr)
  unicodedata.bidirectional(unichr)
  unicodedata.east_asian_width(chr)

The abbreviated name of the specific category is returned. However, for certain applications it is important to be able to get from the abbreviated name to the long name and vice versa.

The data needed to do this can be found at
  http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
under sections
  # General_Category (gc)
  # Bidi_Class (bc)
  # East_Asian_Width (ea)
Use only the second (abbreviated name) and third (long name) fields, and ignore other fields and possible comments.

For general category, also support translation back and forth of the one-letter abbreviations, which are groups representing the two-letter general category abbreviations with the same initial letter.

Please extend this package with a way of translating back and forth between abbreviated name and long name for property values defined in Unicode for general category, bidirectional class and East Asian width. This functionality should be independent of retrieving the abbreviated names for Unicode characters, as is available now, and should be accessible via separate methods or dictionaries in which developers can perform lookups themselves.

Implementing the functionality requested in this issue allows Python developers to get from an abbreviated property value to a meaningful property value name and vice versa without having to retrieve this information from the Unicode Consortium and/or shipping this information with their code with the risk of using outdated information.
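
As a rough illustration of the extraction described above (this is not the script attached to this issue), the second and third fields can be pulled out as sketched below. The sample text is a short hand-copied excerpt in the layout of PropertyValueAliases.txt, and the name parse_aliases is hypothetical.

```python
# Hand-copied excerpt in the PropertyValueAliases.txt layout (assumption:
# the real file uses the same "prop ; short ; long [; more aliases]" shape).
SAMPLE = """\
# General_Category (gc)
gc ; C                                ; Other                            # Cc | Cf | Cn | Co | Cs
gc ; Nd                               ; Decimal_Number                   ; digit
# Bidi_Class (bc)
bc ; AL                               ; Arabic_Letter
# East_Asian_Width (ea)
ea ; A                                ; Ambiguous
"""


def parse_aliases(text, wanted=("gc", "bc", "ea")):
    """Return {property: {abbreviated name: long name}} for the wanted properties."""
    result = {prop: {} for prop in wanted}
    for line in text.splitlines():
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] not in result:
            continue
        # Keep only the second and third fields; ignore any further aliases.
        prop, short, long_name = fields[:3]
        result[prop][short] = long_name
    return result


aliases = parse_aliases(SAMPLE)
```

The one-letter group abbreviations such as C fall out of the same loop, since they appear in the file as ordinary gc entries.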
msg177479 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-12-14 17:54
> for certain applications it is important to be able to get the from 
> abbreviated name to the long name and vice versa.

What kind of application?  I have a module where I defined my own dict that maps categories to their full names, but I'm not sure this feature is common enough that it should be included and maintained in the stdlib.

If it's added, a dict is probably enough, but a script to parse the file you mentioned and update this dict should also be included.
msg177510 - (view) Author: Pander (PanderMusubi) Date: 2012-12-14 21:20
I myself have a lot of Python applications that process font files and interact with fonttools and FontForge, which are both written in Python too. Since you also have your own dict for this purpose, and other people probably do too, it would be justified to add these three small dicts to the standard lib, especially since this package in the standard lib follows the definitions from the Unicode Consortium.

When this is shipped in one package, developers will always have an in-sync translation from abbreviated names to long names and vice versa. Over the past few years I have regularly needed to adjust my dicts for definitions added by the Unicode Consortium that are supported by the stdlib.

At the moment, translation from Unicode codes such as U+1234 to human-readable Unicode names and vice versa is offered. Providing human-readable names for the property values is a service of the same level and would cater to approximately the same user group.

If you agree that these dicts can be added I am willing to provide a script that will parse the aforementioned file.
msg177909 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-21 23:32
This seems like a plausible request to me. The three dicts comprise 70 code-alias pairs. If unicodedata had a Python version (should it?), the simplest thing would be to add bididict, eawdict, and gcdict to that version (and not to the C version). I don't know how well putting dicts in C code works. A unicodealias module could be added but I do not really like that idea. I would prefer adding data attributes and corresponding docs to the current module.

Pander: submitting a proof-of-concept script that accesses and parses that url and produces ready-to-go python code like below might encourage adoption of your proposal. In any case, it would be here for others to use.

bididict = {
    'AL': 'Arabic_Letter',
...
    'WS': 'White_Space',
}

eawdict = ...
msg177985 - (view) Author: Pander (PanderMusubi) Date: 2012-12-23 13:33
Attached is the requested proof-of-concept script.
msg178018 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-12-23 22:11
I verified that the prototype file works in 2.7.3. I rewrote it for 3.3 using a refactored approach (and discovered that the site sometimes times out).
msg178142 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-12-25 15:48
The script should probably be integrated in Tools/unicode/makeunicodedata.py.
msg285270 - (view) Author: Pander (PanderMusubi) Date: 2017-01-11 20:32
Any updates or ideas on how to move this forward? Meanwhile, should the issue relate to version 3.6? Thanks. Ah, see also https://bugs.python.org/issue6331 please
msg320089 - (view) Author: Pander (PanderMusubi) Date: 2018-06-20 16:09
Unicode version 11.0 has been out since June 2018. Perhaps that could help move this forward.
msg352840 - (view) Author: Greg Price (Greg Price) * Date: 2019-09-20 07:56
I've gone and implemented a version of this that's integrated into Tools/unicode/makeunicodedata.py and into the unicodedata module.  Patch attached.  Demo:

>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
# ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
# ...
 'east_asian_width': {'A': ['Ambiguous'],
# ...
                      'W': ['Wide']}}


Note that the values are lists.  That's because a value can have multiple aliases in addition to its "short name":

>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']


This implementation also provides the reverse mapping, from an alias to the "short name":

>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
# ...


This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:

* This version is in C; at import time some C code builds up the dicts, from static tables in the header generated by makeunicodedata.py .  It's not *that* much code... but it sure would be more convenient to do in Python instead.

  Should the unicodedata module perhaps have a Python part?  I'd be happy to go about that -- rename the existing C module to _unicodedata and add a small unicodedata.py wrapper -- if there's a feeling that it'd be a good idea.  Then this could go there instead of using the C code I've just written.


* Is this API the right one?
  * This version has e.g. unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd' .

  * Perhaps make category/bidirectional/east_asian_width into attributes rather than keys? So e.g. unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd' .

  * Or: the standard says "loose matching" should be applied to these names, so e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but functions?

    So e.g. unicodedata.property_value_by_alias('decimal number') == unicodedata.property_value_by_alias('Decimal_Number') == 'Nd' .

  * There's also room for bikeshedding on the names.
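
To make the function-based option concrete, here is a minimal sketch of loose matching as I understand the rule for symbolic values (ignore case, whitespace, underscores, hyphens, and an initial "is" prefix). The names loose_key and lookup_category, and the small BY_ALIAS table, are hypothetical and not part of the patch.

```python
import re

# Hypothetical stand-in for the 'category' part of the alias-to-short-name mapping.
BY_ALIAS = {"Decimal_Number": "Nd", "digit": "Nd", "Nd": "Nd"}


def loose_key(name):
    """Normalize a property value name for loose comparison."""
    # Remove whitespace, underscores and hyphens, then fold case.
    key = re.sub(r"[\s_\-]", "", name).lower()
    # Drop an initial "is" prefix. The same normalization is applied to the
    # table keys, so names that merely start with "is" still match themselves.
    if key.startswith("is"):
        key = key[2:]
    return key


# Precompute the normalized table once so each lookup is a single dict access.
LOOSE_BY_ALIAS = {loose_key(alias): short for alias, short in BY_ALIAS.items()}


def lookup_category(name):
    """Return the short General_Category name for a loosely matched alias."""
    return LOOSE_BY_ALIAS[loose_key(name)]
```

With this, lookup_category('decimal number'), lookup_category('is-decimal-number') and lookup_category('Decimal_Number') all resolve to the same key. A plain dict cannot offer that without the caller normalizing first, which is the main argument for functions here.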


* How shall we handle ucd_3_2_0 for this feature?

  This implementation doesn't attempt to record the older version of the data.  My reasoning is that because the applications of the old data are quite specific and they haven't needed this information yet, it seems unlikely anyone will ever really want to know from this module just which aliases existed already in 3.2.0 and which didn't yet.

  OTOH, as a convenience I've caused e.g. unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the same object as unicodedata.property_value_by_alias . This allows unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata module itself, while minimizing the complexity it adds to the implementation.

  Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's still easy to get at them -- just get them from the module itself -- and it makes it explicit that you're getting current rather than old data.
History
Date | User | Action | Args
2022-04-11 14:57:39 | admin | set | github: 60888
2019-09-20 07:56:31 | Greg Price | set | files: + prop-val-aliases.patch; versions: + Python 3.9, - Python 3.8; nosy: + Greg Price; messages: + msg352840; keywords: + patch
2018-06-20 16:11:43 | ned.deily | set | nosy: + benjamin.peterson; versions: + Python 3.8, - Python 3.7
2018-06-20 16:09:09 | PanderMusubi | set | messages: + msg320089
2017-01-11 20:38:15 | serhiy.storchaka | set | versions: + Python 3.7, - Python 3.4
2017-01-11 20:32:05 | PanderMusubi | set | messages: + msg285270
2012-12-25 15:48:18 | ezio.melotti | set | messages: + msg178142
2012-12-23 22:11:42 | terry.reedy | set | files: + bc_ea_gc.py; messages: + msg178018
2012-12-23 13:34:00 | PanderMusubi | set | files: + create-unicodedata-dicts-prop-value-alias-20121223.py; messages: + msg177985
2012-12-21 23:32:07 | terry.reedy | set | nosy: + loewis, terry.reedy; messages: + msg177909
2012-12-14 21:20:01 | PanderMusubi | set | messages: + msg177510
2012-12-14 17:54:03 | ezio.melotti | set | stage: needs patch; messages: + msg177479; versions: + Python 3.4, - Python 3.5
2012-12-14 17:33:12 | PanderMusubi | create |