Issue 35549: Add globbing to unicodedata.lookup

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79730

classification

Title:	Add globbing to unicodedata.lookup
Type:	enhancement	Stage:
Components:	Unicode	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, rominf, steven.daprano, vstinner
Priority:	normal	Keywords:

Created on 2018-12-21 09:47 by rominf, last changed 2022-04-11 14:59 by admin.

Messages (4)
msg332283	Author: Roman Inflianskas (rominf)	Date: 2018-12-21 09:47
I propose adding a partial_match: bool = False argument to unicodedata.lookup so that programmers can search for Unicode symbols by partial name.
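
For illustration, the substring search this proposal asks for can be approximated today with a linear scan over all code points. This is only a sketch: the helper name lookup_partial is hypothetical and not part of unicodedata, and returning a list of characters is an assumption about what partial_match=True would produce.

```python
import unicodedata

def lookup_partial(fragment):
    """Return every character whose canonical name contains `fragment`.

    Hypothetical helper, not part of unicodedata. Matching is
    case-insensitive, mirroring unicodedata.lookup().
    """
    needle = fragment.upper()
    if not needle:
        # An empty fragment would match every name; return nothing instead.
        return []
    return [
        chr(cp)
        for cp in range(0x110000)
        # name() returns the default '' for code points without a name.
        if needle in unicodedata.name(chr(cp), '')
    ]
```

For example, lookup_partial('snowman') includes '☃' (U+2603 SNOWMAN) along with any other character whose name contains SNOWMAN.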
msg332317	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-12-22 00:37
I love the idea, but dislike the proposed interface. As a general rule of thumb, Guido dislikes "constant bool parameters", where you pass a literal True or False to a function to change its behaviour. This is not a hard rule, and there are functions in the stdlib that do this, but like Guido I think we should avoid them in general.

Instead, I think we should allow the name to include globbing symbols * ? etc. (I think full-blown re syntax is overkill.)

I have an implementation which I use:

    lookup(name) -> single character  # the current behaviour
    lookup(name_with_glob_symbols) -> list of characters

For example, lookup('latin * Z') returns:

    ['LATIN CAPITAL LETTER Z', 'LATIN SMALL LETTER Z', 'LATIN CAPITAL LETTER D WITH SMALL LETTER Z', 'LATIN LETTER SMALL CAPITAL Z', 'LATIN CAPITAL LETTER VISIGOTHIC Z', 'LATIN SMALL LETTER VISIGOTHIC Z']

A straight substring match takes at worst twelve extra characters:

    lookup('*' + name + '*')

and only two if the name is a literal:

    lookup('*spam*')

This is less than `partial_match=True` (18 characters) and more flexible and powerful.

There's no ambiguity between the two styles of call because the globbing symbols * ? and [] are never legal in Unicode names. See section 4.8 of http://www.unicode.org/versions/Unicode11.0.0/ch04.pdf
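
The two pieces this interface relies on, telling a glob apart from a plain name and matching a glob against whole names, can be sketched with fnmatch.translate and re. The helper names is_glob and glob_matcher are hypothetical and chosen only for this illustration.

```python
import re
from fnmatch import translate

# Characters that signal a glob; none of them may occur in a Unicode name.
GLOB_CHARS = '*?['

def is_glob(name_or_glob):
    """Return True if the argument should be treated as a glob pattern."""
    return any(c in name_or_glob for c in GLOB_CHARS)

def glob_matcher(pattern):
    """Compile a glob into a case-insensitive matcher over whole names."""
    # translate() anchors the regex at the end, and .match anchors the start.
    return re.compile(translate(pattern), flags=re.I).match

match = glob_matcher('latin * Z')
```

With this matcher, match('LATIN SMALL LETTER Z') succeeds, while match('LATIN CAPITAL LETTER Z WITH ACUTE') returns None because the pattern has to cover the whole name.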
msg332318	Author: Steven D'Aprano (steven.daprano) *	Date: 2018-12-22 01:01
Here's my implementation:

    from unicodedata import name
    from unicodedata import lookup as _lookup
    from fnmatch import translate
    from re import compile, I

    _NAMES = None  # lazily built cache of all canonical character names

    def getnames():
        global _NAMES
        if _NAMES is None:
            _NAMES = []
            for i in range(0x110000):
                s = name(chr(i), '')  # '' for code points without a name
                if s:
                    _NAMES.append(s)
        return _NAMES

    def lookup(name_or_glob):
        if any(c in name_or_glob for c in '*?['):
            # Glob pattern: return every matching name, ignoring case.
            match = compile(translate(name_or_glob), flags=I).match
            return [name for name in getnames() if match(name)]
        else:
            # Plain name: keep the current behaviour.
            return _lookup(name_or_glob)

The major limitation of my implementation is that it doesn't match name aliases or sequences.

http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt

For example:

    lookup('TAMIL SYLLABLE TAA?')  # NamedSequence

ought to return ['தா'] but doesn't.

Parts of the Unicode documentation use the convention that canonical names are in UPPERCASE, aliases are lowercase, and sequences are in Mixed Case, and I think that we should follow that convention:

http://www.unicode.org/charts/aboutcharindex.html

That makes it easy to see what is the canonical name and what isn't.
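
The limitation follows from how the name table is built: unicodedata.name reports only canonical names, whereas the built-in unicodedata.lookup (since Python 3.3) also resolves name aliases and named sequences. A small check of that asymmetry, assuming the alias LATIN CAPITAL LETTER GHA and the named sequence LATIN SMALL LETTER R WITH TILDE are present in the interpreter's Unicode database:

```python
import unicodedata

# The built-in lookup() accepts a name alias...
gha = unicodedata.lookup('LATIN CAPITAL LETTER GHA')

# ...but name() reports the canonical name for that character, so a table
# built from name() never contains the alias.
canonical = unicodedata.name(gha)

# A named sequence resolves to a multi-character string, which name()
# cannot map back to a name at all because it accepts one character only.
seq = unicodedata.lookup('LATIN SMALL LETTER R WITH TILDE')
```

A glob-aware lookup that covered aliases and sequences would therefore need a source for those names other than unicodedata.name, such as the two UCD files linked above.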
msg332325	Author: Roman Inflianskas (rominf)	Date: 2018-12-22 06:14
I like your proposal with globbing, steven.daprano. I updated the title.

History
Date	User	Action	Args
2022-04-11 14:59:09	admin	set	github: 79730
2018-12-22 06:14:21	rominf	set	messages: + msg332325 title: Add partial_match: bool = False argument to unicodedata.lookup -> Add globbing to unicodedata.lookup
2018-12-22 01:01:27	steven.daprano	set	messages: + msg332318
2018-12-22 00:37:08	steven.daprano	set	nosy: + steven.daprano messages: + msg332317
2018-12-21 09:47:37	rominf	create