Message 352840 - Python tracker


Author	Greg Price
Recipients	Greg Price, PanderMusubi, benjamin.peterson, ezio.melotti, loewis, terry.reedy
Date	2019-09-20.07:56:30
Message-id	<1568966191.51.0.993224144287.issue16684@roundup.psfhosted.org>

Content

I've implemented a version of this, integrated into Tools/unicode/makeunicodedata.py and into the unicodedata module.  Patch attached.  Demo:

>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
# ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
# ...
 'east_asian_width': {'A': ['Ambiguous'],
# ...
                      'W': ['Wide']}}


Note that the values are lists.  That's because a value can have multiple aliases in addition to its "short name":

>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']


This implementation also provides the reverse mapping, from an alias to the "short name":

>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
# ...
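
For illustration, the relationship between the two mappings can be sketched in pure Python. This is not the attached patch (which builds the dicts in C); the sample table below is a hypothetical subset copied from the demo output, and invert_aliases is a made-up helper name.

```python
# Hypothetical subset of the data shown in the demo above; the real tables
# would come from the header generated by makeunicodedata.py.
property_value_aliases = {
    'bidirectional': {'AL': ['Arabic_Letter'], 'WS': ['White_Space']},
    'category': {'C': ['Other'], 'Nd': ['Decimal_Number', 'digit']},
    'east_asian_width': {'A': ['Ambiguous'], 'W': ['Wide']},
}


def invert_aliases(aliases):
    """Map every alias back to its short name, one inner dict per property."""
    return {
        prop: {alias: short
               for short, names in values.items()
               for alias in names}
        for prop, values in aliases.items()
    }


property_value_by_alias = invert_aliases(property_value_aliases)
print(property_value_by_alias['category']['Decimal_Number'])
```

Because a short name can have several aliases, the inverse has more entries than the forward mapping, but each alias still resolves to exactly one short name within its property.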


This draft doesn't have tests or docs, but it's otherwise complete. I've posted it at this stage for feedback on a few open questions:

* This version is in C; at import time some C code builds up the dicts from static tables in the header generated by makeunicodedata.py.  It's not *that* much code... but it sure would be more convenient to do in Python instead.

  Should the unicodedata module perhaps have a Python part?  I'd be happy to go about that -- rename the existing C module to _unicodedata and add a small unicodedata.py wrapper -- if there's a feeling that it'd be a good idea.  Then this could go there instead of using the C code I've just written.


* Is this API the right one?
  * This version has e.g. unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd'.

  * Perhaps make category/bidirectional/east_asian_width into attributes rather than keys? So e.g. unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd'.

  * Or: the standard says "loose matching" should be applied to these names, so e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but functions?

    So e.g. unicodedata.property_value_by_alias('decimal number') == unicodedata.property_value_by_alias('Decimal_Number') == 'Nd'.

  * There's also room for bikeshedding on the names.


* How shall we handle ucd_3_2_0 for this feature?

  This implementation doesn't attempt to record the older version of the data.  My reasoning: the applications of the old data are quite specific and haven't needed this information so far, so it seems unlikely that anyone will ever want to know from this module which aliases already existed in 3.2.0 and which didn't.

  OTOH, as a convenience I've made e.g. unicodedata.ucd_3_2_0.property_value_by_alias exist, pointing to the same object as unicodedata.property_value_by_alias. This lets unicodedata.ucd_3_2_0 remain a near drop-in substitute for the unicodedata module itself, while keeping the added implementation complexity to a minimum.

  Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's still easy to get at them -- just get them from the module itself -- and it makes it explicit that you're getting current rather than old data.
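
To make the function-based option from the API question concrete, here is a rough sketch of loose matching in Python. It is not part of the patch. It assumes the matching rule is "ignore case, whitespace, underscores, hyphens, and a leading 'is'", which is my reading of the standard's loose-matching rule for symbolic values. The names loose_key and lookup_property_value are hypothetical, and unlike the one-argument call shown above, this version takes the property name explicitly so that values from different properties cannot collide.

```python
import re

# Hypothetical subset of the alias -> short-name data described above.
PROPERTY_VALUE_BY_ALIAS = {
    'category': {'Decimal_Number': 'Nd', 'digit': 'Nd', 'Other': 'C'},
    'bidirectional': {'Arabic_Letter': 'AL', 'White_Space': 'WS'},
}


def loose_key(name):
    """Normalize a name for loose matching (assumed rule, see lead-in)."""
    # Drop whitespace, underscores and hyphens, then fold case.
    key = re.sub(r'[\s_-]+', '', name).lower()
    # Drop a leading "is"; table keys go through the same function,
    # so both sides of the comparison are normalized consistently.
    return key[2:] if key.startswith('is') else key


# Pre-normalize the table keys once so each lookup is a plain dict access.
_LOOSE_INDEX = {
    prop: {loose_key(alias): short for alias, short in table.items()}
    for prop, table in PROPERTY_VALUE_BY_ALIAS.items()
}


def lookup_property_value(prop, name):
    """Return the short name for a loosely matched alias of `prop`.

    Raises KeyError if the property or the alias is unknown.
    """
    return _LOOSE_INDEX[prop][loose_key(name)]


print(lookup_property_value('category', 'is-decimal-number'))
```

A function like this hides the normalization from callers, at the cost of no longer exposing a plain dict that can be iterated or pretty-printed.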

History
Date	User	Action	Args
2019-09-20 07:56:31	Greg Price	set	recipients: + Greg Price, loewis, terry.reedy, benjamin.peterson, ezio.melotti, PanderMusubi
2019-09-20 07:56:31	Greg Price	set	messageid: <1568966191.51.0.993224144287.issue16684@roundup.psfhosted.org>
2019-09-20 07:56:31	Greg Price	link	issue16684 messages
2019-09-20 07:56:31	Greg Price	create