This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: TypeError when parsing regexp with unicode named character sequence escape
Type: enhancement Stage:
Components: Interpreter Core, Regular Expressions, Unicode Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, jirkamarsik, mrabarnett, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2022-01-17 12:31 by jirkamarsik, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg410770 - Author: Jirka Marsik (jirkamarsik) Date: 2022-01-17 12:31
re.compile(r"\N{name of Unicode Named Character Sequence}"), e.g. re.compile(r"\N{KEYCAP NUMBER SIGN}"), raises a TypeError. The regular expression parser relies on 'unicodedata' to look up character names, and 'unicodedata.lookup' also resolves Unicode Named Character Sequences (https://www.unicode.org/Public/13.0.0/ucd/NamedSequences.txt). Using such a named sequence in a regular expression leads to a 'TypeError' because the regexp parser calls 'ord' on the lookup result, which is a string of length > 1.
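The failing step described above can be reproduced without the regexp parser. This is a minimal sketch, assuming only that 'unicodedata.lookup' resolves the named sequence as reported in this issue; the variable names 'seq' and 'ord_error' are illustrative and not taken from the 're' sources.

```python
import unicodedata

# A named sequence resolves to several code points, not a single character.
seq = unicodedata.lookup("KEYCAP NUMBER SIGN")
print(ascii(seq), len(seq))

# ord() accepts exactly one character, so calling it on the sequence fails.
try:
    ord(seq)
    ord_error = None
except TypeError as exc:
    ord_error = exc

print(type(ord_error).__name__, ord_error)
```

The exact exception that 're.compile' itself surfaces may differ between Python versions, so the sketch only exercises the 'ord' call.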
msg410874 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2022-01-18 15:52
They're not supported in string literals either:

Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
    "\N{KEYCAP NUMBER SIGN}"
                            ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
msg415540 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-19 11:22
>>> import unicodedata
>>> unicodedata.lookup('KEYCAP NUMBER SIGN')
'#️⃣'
>>> print(ascii(unicodedata.lookup('KEYCAP NUMBER SIGN')))
'#\ufe0f\u20e3'

Support of Unicode Named Character Sequences in the unicodeescape codec and in the RE parser would be a new feature.
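
Until such support exists, one possible workaround is to expand the name before the pattern reaches the parser. The helper below is a hypothetical sketch, not part of 're' or 'unicodedata': 'named_literal' is a name invented here, and it assumes the sequence should be matched literally, code point by code point.

```python
import re
import unicodedata


def named_literal(name: str) -> str:
    """Return a regex fragment matching the character or sequence called `name`.

    Hypothetical helper: it resolves the name with unicodedata.lookup() and
    escapes the result so it is matched literally.
    """
    return re.escape(unicodedata.lookup(name))


# Build the pattern from the expanded sequence instead of using \N{...}.
keycap = re.compile(named_literal("KEYCAP NUMBER SIGN"))

match = keycap.search("press #\ufe0f\u20e3 to continue")
print(match.span() if match else None)
```

Because the fragment is an ordinary escaped literal, it can be concatenated with other pattern text, but it cannot be used inside a character class the way a single-character \N{...} escape can.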
History
Date                 User              Action  Args
2022-04-11 14:59:54  admin             set     github: 90568
2022-03-19 11:22:19  serhiy.storchaka  set     versions: + Python 3.11, - Python 3.10
                                               nosy: + vstinner, serhiy.storchaka
                                               messages: + msg415540
                                               components: + Interpreter Core, Unicode
                                               type: behavior -> enhancement
2022-01-18 15:52:03  mrabarnett        set     messages: + msg410874
2022-01-17 12:31:30  jirkamarsik       create