Message 371638 - Python tracker


Author	matpi
Recipients	ezio.melotti, malin, matpi, mrabarnett
Date	2020-06-16.11:51:43
Message-id	<1592308303.92.0.500837194033.issue40980@roundup.psfhosted.org>
In-reply-to

Content
> So b'\xe9' is mapped to \u00e9, it is `é`.

Yes, but \xe9 on its own is not valid UTF-8. It is the Latin-1 encoding of "é", not its UTF-8 encoding (which is \xc3\xa9). So there is no way to get \xe9 from "é" without leaving UTF-8. Starting with "é" as a group name, I cannot programmatically encode it into a bytes pattern while staying in UTF-8.
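
To make the difference concrete, here is a minimal sketch comparing the two encodings of "é". It uses plain `str.encode()` / `bytes.decode()` and nothing specific to `re`; the variable names are only for illustration.

```python
name = "é"

utf8_bytes = name.encode("utf-8")      # the default codec for str.encode()
latin1_bytes = name.encode("latin-1")  # one byte, equal to the code point

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'

# The lone byte 0xe9 cannot be decoded as UTF-8.
try:
    b"\xe9".decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(lone_byte_is_valid_utf8)  # False
```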

> Of course, characters with Unicode code point greater than 0xff are impossible to appear in `bytes`.

But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) in a group name fails.
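
The group name reported in the error message suggests that the bytes of the pattern are read one byte per character (Latin-1 style) before the name is checked. That is my reading of the message, not something the documentation states. A small sketch under that assumption:

```python
raw = "Δ".encode()                 # UTF-8 bytes of the intended group name
as_latin1 = raw.decode("latin-1")  # assumed interpretation: one byte, one character

print(raw)                         # b'\xce\x94'
print(repr(as_latin1))             # 'Î\x94'

# U+0094 is a control character, so the two-character string is no identifier,
# even though the original one-character name is.
print(as_latin1.isidentifier())            # False
print(raw.decode("utf-8").isidentifier())  # True
```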

According to the documentation, the only constraint on group names is that they must be valid Python identifiers and unique within the pattern. So this should work:

```
# Δ is a valid identifier
>>> "Δ".isidentifier()
True
>>> Δ = 1
>>> Δ
1
>>> import re
>>> name = "Δ"
>>> re.match(b"(?P<" + name.encode() + b">)", b"")
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    re.match(b"(?P<" + name.encode() + b">)", b"")
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
re.match(b'(?P<\xce\x94>)', b'').groupdict()
```
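
For comparison, here is a sketch that runs the same group name through a str pattern and a bytes pattern. The helper `group_name_accepted` is a hypothetical name used only for this illustration, and the exact error text may differ between Python versions, so the sketch only records whether `re.error` is raised.

```python
import re

def group_name_accepted(pattern, subject):
    """Return True if the pattern compiles and matches, False on re.error."""
    try:
        return re.match(pattern, subject) is not None
    except re.error:
        return False

name = "Δ"

str_ok = group_name_accepted("(?P<" + name + ">)", "")
bytes_ok = group_name_accepted(b"(?P<" + name.encode() + b">)", b"")

print(str_ok)    # True
print(bytes_ok)  # False
print(re.match("(?P<" + name + ">)", "").groupdict())  # {'Δ': ''}
```

The str pattern accepts the name, so the identifier itself is not the problem; only the bytes form is rejected.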

History
Date	User	Action	Args
2020-06-16 11:51:43	matpi	set	recipients: + matpi, ezio.melotti, mrabarnett, malin
2020-06-16 11:51:43	matpi	set	messageid: <1592308303.92.0.500837194033.issue40980@roundup.psfhosted.org>
2020-06-16 11:51:43	matpi	link	issue40980 messages
2020-06-16 11:51:43	matpi	create