Message 332318 - Python tracker


Author	steven.daprano
Recipients	ezio.melotti, rominf, steven.daprano, vstinner
Date	2018-12-22.01:01:27
Message-id	<1545440487.14.0.98272194251.issue35549@roundup.psfhosted.org>
In-reply-to

Content

Here's my implementation:

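from unicodedata import name
from unicodedata import lookup as _lookup
from fnmatch import translate
from re import compile, I

# Cache of every canonical character name, built on first use.
_NAMES = None

def getnames():
    """Return the list of all canonical Unicode character names."""
    global _NAMES
    if _NAMES is None:
        _NAMES = []
        for i in range(0x110000):
            # Unnamed code points yield the default '' and are skipped.
            s = name(chr(i), '')
            if s:
                _NAMES.append(s)
    return _NAMES

def lookup(name_or_glob):
    """Look up a character by name, or list the names matching a glob."""
    if any(c in name_or_glob for c in '*?['):
        # Translate the glob to a regex and match case-insensitively.
        match = compile(translate(name_or_glob), flags=I).match
        return [n for n in getnames() if match(n)]
    else:
        return _lookup(name_or_glob)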
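
Note that the glob branch returns matching names, while the exact branch returns a character. A minimal sketch of a variant that returns the characters themselves might look like the following; `name_table` and `glob_chars` are hypothetical helpers for illustration, not part of `unicodedata`:

```python
import re
import unicodedata
from fnmatch import translate
from functools import lru_cache


@lru_cache(maxsize=None)
def name_table():
    """Return (name, character) pairs for every named code point (built once)."""
    pairs = []
    for cp in range(0x110000):
        ch = chr(cp)
        nm = unicodedata.name(ch, '')  # '' for code points without a name
        if nm:
            pairs.append((nm, ch))
    return tuple(pairs)


def glob_chars(pattern):
    """Hypothetical helper: characters whose canonical name matches the glob."""
    match = re.compile(translate(pattern), flags=re.I).match
    return [ch for nm, ch in name_table() if match(nm)]


# Case-insensitive glob; '?' stands for exactly one character.
print(glob_chars('greek small letter alph?'))
```

Keeping the name and the character together in one table avoids a second `unicodedata.lookup` call per match.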

The major limitation of my implementation is that it only matches canonical character names; it doesn't match name aliases or named sequences, which are listed in:

http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt
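
Both files are semicolon-delimited text with `#` comments. As a rough sketch of how the aliases could be loaded, assuming the `NameAliases.txt` layout of code point, alias, and type, the parser below runs against a few sample lines in that layout rather than the downloaded file; `parse_aliases` and `SAMPLE_ALIASES` are illustrative names only:

```python
# A few lines in the NameAliases.txt layout: code point;alias;type
SAMPLE_ALIASES = """\
# sample entries for illustration
000A;LINE FEED;control
000A;NEW LINE;control
000A;LF;abbreviation
"""


def parse_aliases(lines):
    """Map each alias to its character, skipping comments and blank lines."""
    aliases = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        code, alias, _kind = line.split(';')
        aliases[alias] = chr(int(code, 16))
    return aliases


ALIASES = parse_aliases(SAMPLE_ALIASES.splitlines())
```

The resulting alias names could then be appended to the list that the glob is matched against.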

For example:

lookup('TAMIL SYLLABLE TAA?')  # NamedSequence

ought to return ['தா'] but doesn't.
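
One possible way to cover named sequences is sketched below, assuming the `NamedSequences.txt` layout of name followed by space-separated code points. The sample lines, `parse_sequences`, and `glob_sequences` are assumptions made for illustration. In `fnmatch` syntax `?` matches exactly one character, so the sketch uses `TAMIL SYLLABLE TAA*` to match the name itself:

```python
import re
from fnmatch import translate

# Sample lines in the NamedSequences.txt layout: name;code points
SAMPLE_SEQUENCES = """\
TAMIL SYLLABLE TAA;0BA4 0BBE
TAMIL SYLLABLE TI;0BA4 0BBF
"""


def parse_sequences(lines):
    """Map each sequence name to the string built from its code points."""
    sequences = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        seq_name, codes = line.split(';')
        sequences[seq_name] = ''.join(chr(int(c, 16)) for c in codes.split())
    return sequences


def glob_sequences(pattern, sequences):
    """Hypothetical helper: strings whose sequence name matches the glob."""
    match = re.compile(translate(pattern), flags=re.I).match
    return [s for seq_name, s in sequences.items() if match(seq_name)]


SEQUENCES = parse_sequences(SAMPLE_SEQUENCES.splitlines())
print(glob_sequences('TAMIL SYLLABLE TAA*', SEQUENCES))
```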

Parts of the Unicode documentation use the convention that canonical names are in UPPERCASE, aliases are in lowercase, and sequences are in Mixed Case, and I think that we should follow that convention:

http://www.unicode.org/charts/aboutcharindex.html

That makes it easy to see what is the canonical name and what isn't.
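
A small sketch of how results could be rendered under that convention; `display_name` and its `kind` labels are hypothetical, and `str.title()` is used here as an approximation of Mixed Case:

```python
def display_name(name, kind):
    """Render a name so that its casing shows what kind of name it is."""
    if kind == 'canonical':
        return name.upper()   # canonical names: UPPERCASE
    if kind == 'alias':
        return name.lower()   # aliases: lowercase
    if kind == 'sequence':
        return name.title()   # named sequences: Mixed Case
    raise ValueError(f'unknown kind: {kind!r}')


for name, kind in [('LINE FEED', 'alias'), ('TAMIL SYLLABLE TAA', 'sequence')]:
    print(display_name(name, kind))
```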

History
Date	User	Action	Args
2018-12-22 01:01:29	steven.daprano	set	recipients: + steven.daprano, vstinner, ezio.melotti, rominf
2018-12-22 01:01:27	steven.daprano	set	messageid: <1545440487.14.0.98272194251.issue35549@roundup.psfhosted.org>
2018-12-22 01:01:27	steven.daprano	link	issue35549 messages
2018-12-22 01:01:27	steven.daprano	create