
Author steven.daprano
Recipients ezio.melotti, rominf, steven.daprano, vstinner
Date 2018-12-22.01:01:27
Message-id <1545440487.14.0.98272194251.issue35549@roundup.psfhosted.org>
In-reply-to
Content
Here's my implementation:

from unicodedata import name
from unicodedata import lookup as _lookup
from fnmatch import translate
from re import compile, I

_NAMES = None

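def getnames():
    # Build the list of canonical names once and cache it at module level.
    global _NAMES
    if _NAMES is None:
        _NAMES = []
        for i in range(0x110000):
            # name() returns the default '' for code points without a name.
            s = name(chr(i), '')
            if s:
                _NAMES.append(s)
    return _NAMES

def lookup(name_or_glob):
    # A glob metacharacter selects pattern matching, which returns a list
    # of matching names; otherwise defer to unicodedata.lookup, which
    # returns the character itself.
    if any(c in name_or_glob for c in '*?['):
        match = compile(translate(name_or_glob), flags=I).match
        return [n for n in getnames() if match(n)]
    else:
        return _lookup(name_or_glob)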
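
To make the matching step concrete, here is a small self-contained sketch of how fnmatch.translate turns a glob into an anchored regular expression. The helper name glob_matcher is only for illustration and is not part of the implementation above.

```python
import re
from fnmatch import translate


def glob_matcher(glob):
    """Return a case-insensitive match function for a shell-style glob.

    Hypothetical helper: it wraps the same compile(translate(...)) call
    used in lookup() above.
    """
    return re.compile(translate(glob), flags=re.I).match


match = glob_matcher('latin small letter a?')

# '?' consumes exactly one character, and the translated pattern is
# anchored at both ends, so only names of the right length match.
print(bool(match('LATIN SMALL LETTER AE')))
print(bool(match('LATIN SMALL LETTER A')))
print(bool(match('LATIN SMALL LETTER AE WITH ACUTE')))
```

Because '?' never matches an empty string, a glob needs '*' if the part after the fixed prefix may be empty.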

The major limitation of my implementation is that it doesn't match name aliases or sequences.

http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt

For example:

lookup('TAMIL SYLLABLE TAA?')  # NamedSequence

ought to return ['தா'] but doesn't.
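
One possible way to close that gap, sketched under assumptions: parse the "name;code points" lines of NamedSequences.txt into a table and glob over its keys. The two sample lines are embedded inline so the sketch runs without downloading the file, and parse_named_sequences and glob_sequences are hypothetical names, not existing unicodedata functions.

```python
import re
import unicodedata
from fnmatch import translate

# Sample lines in the NamedSequences.txt layout: name;space-separated hex code points.
SAMPLE = """\
# comment lines and blank lines are skipped
TAMIL SYLLABLE TAA;0BA4 0BBE
TAMIL SYLLABLE TAI;0BA4 0BC8
"""


def parse_named_sequences(lines):
    """Map each sequence name to the string built from its code points."""
    table = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        seq_name, codepoints = line.split(';')
        table[seq_name.strip()] = ''.join(
            chr(int(cp, 16)) for cp in codepoints.split())
    return table


def glob_sequences(glob, table):
    """Return the strings whose sequence names match the glob, ignoring case."""
    match = re.compile(translate(glob), flags=re.I).match
    return [chars for seq_name, chars in table.items() if match(seq_name)]


table = parse_named_sequences(SAMPLE.splitlines())
print(glob_sequences('tamil syllable taa*', table))

# For an exact name, unicodedata.lookup already resolves named sequences
# (Python 3.3 and later), so only the glob path needs the extra table.
print(unicodedata.lookup('TAMIL SYLLABLE TAA') == table['TAMIL SYLLABLE TAA'])
```

The same approach should carry over to NameAliases.txt, whose lines put the code point first and the alias second.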

Parts of the Unicode documentation use the convention that canonical names are in UPPERCASE, aliases are in lowercase, and sequences are in Mixed Case. I think that we should follow that convention:

http://www.unicode.org/charts/aboutcharindex.html

That makes it easy to see what is the canonical name and what isn't.
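
A minimal sketch of what following that convention could look like when displaying results. The function display_name and the kind labels 'canonical', 'alias' and 'sequence' are assumptions made for this illustration only.

```python
def display_name(name, kind):
    """Format a name so that its case shows what kind of name it is.

    The kind labels are hypothetical; they are not an existing
    unicodedata API.
    """
    if kind == 'canonical':
        return name.upper()   # canonical names in UPPERCASE
    if kind == 'alias':
        return name.lower()   # aliases in lowercase
    if kind == 'sequence':
        return name.title()   # named sequences in Mixed Case
    raise ValueError('unknown kind: %r' % (kind,))


print(display_name('LATIN SMALL LETTER A', 'canonical'))
print(display_name('LINE FEED', 'alias'))
print(display_name('TAMIL SYLLABLE TAA', 'sequence'))
```

Matching would stay case-insensitive; only the returned spelling would signal which kind of name was found.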