classification
Title: Add globbing to unicodedata.lookup
Type: enhancement
Stage:
Components: Unicode
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, rominf, steven.daprano, vstinner
Priority: normal
Keywords:

Created on 2018-12-21 09:47 by rominf, last changed 2018-12-22 06:14 by rominf.

Messages (4)
msg332283 - Author: Roman Inflianskas (rominf) Date: 2018-12-21 09:47
I propose adding a partial_match: bool = False argument to unicodedata.lookup so that programmers can search for Unicode symbols using partial names.
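
For illustration only, the proposed flag might behave roughly like the sketch below. `lookup_partial` is a hypothetical wrapper, not an existing `unicodedata` function, and returning a list of characters when `partial_match=True` is an assumption, since the proposal does not specify a return type.

```python
import unicodedata

def lookup_partial(name, partial_match=False):
    """Hypothetical wrapper illustrating the proposed partial_match flag."""
    if not partial_match:
        # Current behaviour: exact name lookup, raises KeyError on failure.
        return unicodedata.lookup(name)
    needle = name.upper()
    matches = []
    for cp in range(0x110000):
        ch = chr(cp)
        # name() returns the default '' for code points without a name.
        if needle in unicodedata.name(ch, ''):
            matches.append(ch)
    return matches

visigothic = lookup_partial('visigothic z', partial_match=True)
print([unicodedata.name(ch) for ch in visigothic])
```

With the flag left at its default, the call falls through to the existing exact lookup, so existing callers would be unaffected under this assumed design.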
msg332317 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-22 00:37
I love the idea, but dislike the proposed interface.

As a general rule of thumb, Guido dislikes "constant bool parameters", where you pass a literal True or False to a function parameter to change its behaviour. Obviously this is not a hard rule: there are functions in the stdlib that do this. But like Guido, I think we should avoid them in general.

Instead, I think we should allow the name to include globbing symbols such as * and ?. (I think full-blown re syntax is overkill.) I have an implementation which I use:

lookup(name) -> single character # the current behaviour

lookup(name_with_glob_symbols) -> list of characters

For example lookup('latin * Z') returns:

['LATIN CAPITAL LETTER Z', 'LATIN SMALL LETTER Z', 'LATIN CAPITAL LETTER D WITH SMALL LETTER Z', 'LATIN LETTER SMALL CAPITAL Z', 'LATIN CAPITAL LETTER VISIGOTHIC Z', 'LATIN SMALL LETTER VISIGOTHIC Z']


A straight substring match takes at worst twelve extra characters:

lookup('*' + name + '*')

and only two if the name is a literal:

lookup('*spam*')

This is less than `partial_match=True` (18 characters) and more flexible and powerful. There's no ambiguity between the two styles of call because the globbing symbols * ? and [] are never legal in Unicode names. See section 4.8 of

http://www.unicode.org/versions/Unicode11.0.0/ch04.pdf
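
The no-ambiguity claim can be spot-checked against the names that Python's own `unicodedata` module knows about. The following is a minimal sketch; it only covers the Unicode version bundled with the running interpreter, and the helper name `names_with_glob_chars` is made up for this example.

```python
import unicodedata

GLOB_CHARS = set('*?[]')

def names_with_glob_chars():
    """Return every character name that contains a glob metacharacter."""
    offenders = []
    for cp in range(0x110000):
        s = unicodedata.name(chr(cp), '')  # '' for unnamed code points
        if GLOB_CHARS & set(s):
            offenders.append(s)
    return offenders

offenders = names_with_glob_chars()
print(len(offenders))
```

If the claim holds for the bundled database, the list is empty, so testing for `*`, `?` or `[` is enough to tell a glob from a plain name.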
msg332318 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-22 01:01
Here's my implementation:

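from unicodedata import name
from unicodedata import lookup as _lookup
from fnmatch import translate
from re import compile, I

_NAMES = None  # cache of all canonical character names, built on first use

def getnames():
    """Return the cached list of every canonical Unicode character name."""
    global _NAMES
    if _NAMES is None:
        _NAMES = []
        for i in range(0x110000):  # every code point, U+0000..U+10FFFF
            s = name(chr(i), '')  # '' for code points without a name
            if s:
                _NAMES.append(s)
    return _NAMES

def lookup(name_or_glob):
    # Glob metacharacters never occur in Unicode names, so their presence
    # unambiguously selects the globbing branch.
    if any(c in name_or_glob for c in '*?['):
        match = compile(translate(name_or_glob), flags=I).match
        # Note: this returns the matching names, not the characters.
        return [n for n in getnames() if match(n)]
    else:
        return _lookup(name_or_glob)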
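
The matching step on its own can be tried on a small hand-written list, without scanning the whole database. This is only a sketch of the mechanism (`fnmatch.translate` plus a case-insensitive regex); `glob_matcher` and the `sample` list are invented for the example.

```python
import re
from fnmatch import translate

def glob_matcher(pattern):
    """Compile a shell-style glob into a case-insensitive full-string matcher."""
    # translate() anchors the regex at the end; .match anchors it at the start.
    return re.compile(translate(pattern), flags=re.I).match

sample = [
    'LATIN CAPITAL LETTER Z',
    'LATIN SMALL LETTER Z',
    'LATIN SMALL LETTER EZH',
    'GREEK SMALL LETTER ZETA',
]
match = glob_matcher('latin * Z')
hits = [s for s in sample if match(s)]
print(hits)  # ['LATIN CAPITAL LETTER Z', 'LATIN SMALL LETTER Z']
```

Because the translated pattern must match the whole string, 'latin * Z' only accepts names that end in a standalone Z, which is why the EZH and ZETA entries are rejected.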




The major limitation of my implementation is that it doesn't match name aliases or sequences.

http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt

For example:

lookup('TAMIL SYLLABLE TAA?')  # NamedSequence

ought to return ['தா'] but doesn't.
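
The gap is in the enumeration rather than in exact lookup: since Python 3.3, `unicodedata.lookup` resolves aliases and named sequences, but `unicodedata.name` only reports the canonical name of a single code point, so a name list built from it cannot contain them. A short sketch of that asymmetry:

```python
import unicodedata

# Exact lookup of a named sequence returns a multi-code-point string.
seq = unicodedata.lookup('TAMIL SYLLABLE TAA')
print([f'U+{ord(c):04X}' for c in seq])

# Exact lookup of an alias resolves to the character...
ch = unicodedata.lookup('LATIN CAPITAL LETTER GHA')
# ...but name() reports only the canonical name, never the alias.
canonical = unicodedata.name(ch)
print(canonical)

# name() accepts a single character only, so a sequence cannot be
# discovered by walking code points the way getnames() does.
try:
    unicodedata.name(seq)
    sequence_has_name = True
except TypeError:
    sequence_has_name = False
print(sequence_has_name)
```

Supporting globs over aliases and sequences would therefore need the data from NameAliases.txt and NamedSequences.txt to be made iterable in some way, which the module does not currently expose.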

Parts of the Unicode documentation use the convention that canonical names are in UPPERCASE, aliases are in lowercase, and sequences are in Mixed Case, and I think that we should follow that convention:

http://www.unicode.org/charts/aboutcharindex.html

That makes it easy to see what is the canonical name and what isn't.
msg332325 - Author: Roman Inflianskas (rominf) Date: 2018-12-22 06:14
I like your proposal with globbing, steven.daprano.

I updated the title.
History
Date                 User            Action  Args
2018-12-22 06:14:21  rominf          set     messages: + msg332325; title: Add partial_match: bool = False argument to unicodedata.lookup -> Add globbing to unicodedata.lookup
2018-12-22 01:01:27  steven.daprano  set     messages: + msg332318
2018-12-22 00:37:08  steven.daprano  set     nosy: + steven.daprano; messages: + msg332317
2018-12-21 09:47:37  rominf          create