Issue 46410: TypeError when parsing regexp with unicode named character sequence escape


This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/90568

classification

Title:	TypeError when parsing regexp with unicode named character sequence escape
Type:	enhancement	Stage:
Components:	Interpreter Core, Regular Expressions, Unicode	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, jirkamarsik, mrabarnett, serhiy.storchaka, vstinner
Priority:	normal	Keywords:

Created on 2022-01-17 12:31 by jirkamarsik, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg410770 - (view)	Author: Jirka Marsik (jirkamarsik)	Date: 2022-01-17 12:31
re.compile(r"\N{name of Unicode Named Character Sequence}"), e.g. re.compile(r"\N{KEYCAP NUMBER SIGN}"), raises a TypeError. The regular expression parser relies on 'unicodedata' to look up character names. The 'unicodedata' module recently added support for Unicode Named Character Sequences (https://www.unicode.org/Public/13.0.0/ucd/NamedSequences.txt). Using one of these named character sequences in a regular expression leads to a 'TypeError', because the regexp parser calls 'ord' on the looked-up string, which has length > 1.
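The report above can be checked with a short script. This is a minimal sketch, not part of the original message: the helper name `try_compile` is hypothetical, and the exact exception may differ between Python versions, so the sketch catches both `TypeError` and `re.error` and only reports which one occurred.

```python
import re
import unicodedata

NAME = "KEYCAP NUMBER SIGN"

# unicodedata.lookup() returns the whole named sequence, which is
# three code points long, so ord() cannot be applied to it.
seq = unicodedata.lookup(NAME)
print(len(seq), ascii(seq))


def try_compile(pattern):
    """Hypothetical helper: compile `pattern` and report the outcome.

    Returns "ok" on success, otherwise the name of the exception type.
    """
    try:
        re.compile(pattern)
    except (TypeError, re.error) as exc:
        return type(exc).__name__
    return "ok"


# A name that maps to a single character compiles normally.
print(try_compile(r"\N{EM DASH}"))

# A named character sequence is where the reported failure occurs.
outcome = try_compile(r"\N{" + NAME + "}")
print(outcome)
```

On an affected interpreter the last line is expected to print `TypeError`, matching the report.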
msg410874 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2022-01-18 15:52
They're not supported in string literals either:

Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
    "\N{KEYCAP NUMBER SIGN}"
    ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
msg415540 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2022-03-19 11:22
>>> import unicodedata
>>> unicodedata.lookup('KEYCAP NUMBER SIGN')
'#️'
>>> print(ascii(unicodedata.lookup('KEYCAP NUMBER SIGN')))
'#\ufe0f\u20e3'

Support of Unicode Named Character Sequences in the unicodeescape codec and in the RE parser would be a new feature.
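Until such a feature exists, one possible workaround is to expand named sequences before handing the pattern to `re`. The following is only a sketch under stated assumptions: `expand_named_escapes` is a hypothetical helper, not an API of `re` or `unicodedata`, and it does not handle `\N{...}` inside a character class such as `[...]`, where a multi-character sequence has no obvious meaning.

```python
import re
import unicodedata

# Match either a \N{...} escape or any other backslash escape, so that
# an escaped backslash (as in r"\\N{...}") is consumed as a pair and
# the following "N{...}" is left alone.
_ESCAPE = re.compile(r"\\N\{([^}]+)\}|\\.", re.DOTALL)


def expand_named_escapes(pattern):
    """Hypothetical helper: replace \\N{name} escapes that denote a
    multi-character named sequence with an escaped, non-capturing group."""

    def repl(match):
        name = match.group(1)
        if name is None:
            return match.group(0)  # some other escape: keep as written
        text = unicodedata.lookup(name)  # KeyError for unknown names
        if len(text) == 1:
            return match.group(0)  # single character: re handles it
        # Group the sequence so a following quantifier applies to all of it.
        return "(?:" + re.escape(text) + ")"

    return _ESCAPE.sub(repl, pattern)


keycap = unicodedata.lookup("KEYCAP NUMBER SIGN")
compiled = re.compile(expand_named_escapes(r"\N{KEYCAP NUMBER SIGN}+"))
print(bool(compiled.fullmatch(keycap * 2)))
```

Wrapping the expansion in `(?:...)` is a design choice made here so that `+`, `*` and `?` apply to the whole sequence rather than to its last code point.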

History
Date	User	Action	Args
2022-04-11 14:59:54	admin	set	github: 90568
2022-03-19 11:22:19	serhiy.storchaka	set	versions: + Python 3.11, - Python 3.10; nosy: + vstinner, serhiy.storchaka; messages: + msg415540; components: + Interpreter Core, Unicode; type: behavior -> enhancement
2022-01-18 15:52:03	mrabarnett	set	messages: + msg410874
2022-01-17 12:31:30	jirkamarsik	create