This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: group names of bytes regexes are strings
Type: behavior Stage:
Components: Regular Expressions Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: malin, matpi
Priority: normal Keywords:

Created on 2020-06-14 21:03 by matpi, last changed 2022-04-11 14:59 by admin.

Messages (27)
msg371516 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-14 21:03
I noticed that match.groupdict() returns string keys, even for a bytes regex:

```
>>> import re
>>> re.match(b"(?P<a>)", b"").groupdict()
{'a': b''}
```

This seems somewhat strange, because string and bytes matching in re are kind of two separate parts, cf. doc:

> Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
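
As a small illustration of the rule quoted from the documentation, the following sketch shows that a bytes pattern and a str subject cannot be mixed. The variable name `mixed_ok` is only there for the demonstration.

```python
import re

# A bytes pattern matches bytes input, and a str pattern matches str input.
assert re.match(b"a", b"abc") is not None
assert re.match("a", "abc") is not None

# Mixing the two types is rejected with a TypeError.
try:
    re.match(b"a", "abc")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)
```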
msg371607 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-15 23:58
This also affects functions/methods expecting a group name as a parameter (e.g. match.group): the group name has to be passed as a string.
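
A minimal sketch of what this looks like in practice, using an ASCII group name so that it does not depend on how non-ASCII names are treated. The names `by_str` and `bytes_lookup_ok` are made up for the example.

```python
import re

m = re.match(b"(?P<a>x)", b"x")

# Looking the group up with a str name works and returns bytes content.
by_str = m.group("a")

# Looking it up with the bytes spelling of the same name does not.
try:
    m.group(b"a")
    bytes_lookup_ok = True
except IndexError:
    bytes_lookup_ok = False

print(by_str, bytes_lookup_ok)
```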
msg371614 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 05:30
Having the group name be `str` is very reasonable. Essentially it is just a name; it has nothing to do with `bytes`.

Other names in Python are also `str` type, such as codec names, hashlib names.
msg371629 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 10:13
Agreed to some extent, but there is the difference that group names are embedded in the pattern, which has to be bytes if the target is bytes.

My use case is in an all-bytes, no-string project where I construct a large regular expression at startup, with semi-dynamic group names.

So it seems natural to have everything in bytes to concatenate the regular expression, incl. the group names.

But then the group names that I receive back are strings, so I cannot look them up directly in the set of group names that I used to create the expression in the first place.

Of course I can live with it by storing them as strings in the first place and encode()'ing them during concatenation, but it does not feel "natural".
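
For illustration, the workaround described above could look roughly like this. The group names and sub-patterns (`names`, `fragments`) are hypothetical placeholders, not taken from the actual project.

```python
import re

# Hypothetical group names, kept as str because that is what re hands back.
names = ["num", "word"]
fragments = {"num": b"[0-9]+", "word": b"[a-z]+"}

# Encode each name only at the moment it is spliced into the bytes pattern.
pattern = b"|".join(
    b"(?P<" + name.encode("ascii") + b">" + fragments[name] + b")"
    for name in names
)
regex = re.compile(pattern)

m = regex.match(b"hello")
# lastgroup is a str, so it can be looked up among the original str names.
print(m.lastgroup in names)
```

It works, but the names live as str while everything else in the pattern is bytes.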

Furthermore, even if it is "just a name", a non-ascii group name will raise an error in bytes, even if encoded...:

```
>>> re.compile("(?P<" + "é" + ">)")
re.compile('(?P<é>)')
>>> re.compile(b"(?P<" + "é".encode() + b">)")
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    re.compile(b"(?P<" + "é".encode() + b">)")
  File "/usr/lib/python3.8/re.py", line 252, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'é' at position 4
```

So no, it's not really "just a name", considering that in Python "é" should is a valid name.
msg371631 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 10:14
should *be a valid name
msg371633 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 10:30
> a non-ascii group name will raise an error in bytes, even if encoded

Looks like this is a language limitation:

    >>> b'é'
      File "<stdin>", line 1
    SyntaxError: bytes can only contain ASCII literal characters.

No problem if you use escaped character:

    >>> re.match(b'(?P<\xe9>)', b'').groupdict()
    {'é': b''}

There may be some inconveniences in your program, but IMO there is nothing wrong, maybe this issue can be closed.
msg371634 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 11:19
Of course an inconvenience in my program is not per se the reason to change the language. I just wanted to motivate that the current situation gives unexpected results.

"\xe9" doesn't look like proper utf-8 to me:

```
>>> "é".encode("latin-1")
b'\xe9'
>>> "é".encode()
b'\xc3\xa9'
```

Let's try another one: how would you go for Δ ("\u0394") as a group name?


```
>>> "Δ".encode()
b'\xce\x94'
>>> "Δ".encode("latin-1")
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    "Δ".encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0394' in position 0: ordinal not in range(256)
>>> re.match(b'(?P<\xce\x94>)', b'').groupdict()
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    re.match(b'(?P<\xce\x94>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
>>> re.match(b'(?P<\u0394>)', b'').groupdict()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    re.match(b'(?P<\u0394>)', b'').groupdict()
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name '\\u0394' at position 4
```
msg371637 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 11:41
`latin1` is the character set covering the Unicode code points from \u0000 to \u00ff, and its characters are mapped directly from/to bytes.

So b'\xe9' is mapped to \u00e9, it is `é`.

Of course, characters with Unicode code point greater than 0xff are impossible to appear in `bytes`.
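
The direct mapping described here can be checked with a short sketch that uses only the standard codecs and no `re` at all. The variable names are just for the example.

```python
# Decoding with latin-1 turns byte value n into the character with code point n.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
identity = all(ord(ch) == n for n, ch in enumerate(decoded))

# The mapping is reversible for every one of the 256 byte values.
round_trip = decoded.encode("latin-1") == all_bytes

# In particular, the single byte 0xe9 becomes U+00E9.
e_acute = b"\xe9".decode("latin-1")

print(identity, round_trip, e_acute)
```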
msg371638 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 11:51
> So b'\xe9' is mapped to \u00e9, it is `é`.

Yes, but \xe9 on its own is not valid utf-8, and in any case it is not the utf-8 representation of "é". So there is no way to get \xe9 starting from é without leaving utf-8. So, starting with é as a group name, I cannot programmatically encode it into a bytes pattern.

> Of course, characters with Unicode code point greater than 0xff are impossible to appear in `bytes`.

But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) in a group name fails.

According to the doc, the sole constraint on group names is that they have to be valid and unique Python identifiers. So this should work:

```
# Δ is a valid identifier
>>> "Δ".isidentifier()
True
>>> Δ = 1
>>> Δ
1
>>> import re
>>> name = "Δ"
>>> re.match(b"(?P<" + name.encode() + b">)", b"")
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    re.match(b"(?P<" + name.encode() + b">)", b"")
  File "/usr/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 703, in _parse
    raise source.error(msg, len(name) + 1)
re.error: bad character in group name 'Î\x94' at position 4
re.match(b'(?P<\xce\x94>)', b'').groupdict()
```
msg371639 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 12:11
In this case, you can only use 'latin1', which directly maps one character (\u0000-\u00FF) to/from one byte.

If you use 'utf-8', it may map one character to multiple bytes, such as 'Δ' -> b'\xce\x94'

'\x94' is not a valid identifier, so it raises an error:

    >>> '\xce'.isidentifier()   # '\xce' is 'Î'
    True
    >>> '\x94'.isidentifier()
    False

You may close this issue (I can't close it); we can still continue the discussion.
msg371643 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 12:37
But Δ has no latin-1 representation. So Δ currently cannot be used as a group name in bytes regex, although it is a valid Python identifier. So that's a bug.

I mean, if you insist on having group names as strings even for bytes regexes, then it is not reasonable to prevent them from going _in_.

b"(??<\xce\x94>)" is a valid utf-8-encoded bytestring, why wouldn't you accept it as a valid re pattern?

IMHO, either

- group names from byte regexes should be returned as bytes
- or any utf-8-encoded representation of a valid Python identifier should be accepted as a group name of a bytes regex pattern.
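
To make the first option concrete, here is a user-side approximation of what it would return. `bytes_groupdict` is a hypothetical helper, not something `re` provides, and it relies on the latin-1 behaviour discussed in this thread.

```python
import re

def bytes_groupdict(match):
    """Hypothetical helper: groupdict() with the keys re-encoded to bytes."""
    return {
        name.encode("latin-1"): value
        for name, value in match.groupdict().items()
    }

m = re.match(b"(?P<key>[a-z]+)=(?P<val>[0-9]+)", b"port=8080")
result = bytes_groupdict(m)
print(result)
```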
msg371644 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 12:38
Sorry, b"(?P<\xce\x94>)"
msg371646 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 12:49
The issue with the second variant is that utf-8 is an arbitrary (although default) choice.

But: re is already making that same arbitrary choice in decoding the group names into a string, which is my original complaint!
msg371652 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 13:03
It seems you don't know some knowledge of encoding yet.

Naturally, `bytes` cannot contain a character whose Unicode code point is greater than \u00ff. So you can only use the "latin1" encoding, which maps from character to byte (or the reverse) directly.

"utf-8", "utf-16" and "utf-32" are all encoding codecs; "utf-8" should not have a special status in this context.
msg371657 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 13:51
> It seems you don't know some knowledge of encoding yet.

I don't have to be ashamed of my knowledge of encoding. Yet you are right that I was missing a subtlety, which is that latin-1 is a strict subset of Unicode rather than a completely arbitrary encoding. Thank you for that.

So what you are saying is that group names in bytes regexes can only be specified directly (without -explicit- encoding), so de facto they are limited to the latin-1 subset.

Very well.

But then, once again:

1) why convert them to string when spitting them out? bytes they were when going in, bytes they should remain... **By converting them you are choosing an arbitrary encoding, even if it is the "natural" one.**
2) this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names. If this were really the case, then I would expect to be able to use any string for which .isidentifier() is true as a group name, programmatically.
msg371660 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 14:34
I prove my point that the decoding to string is arbitrary:

```
>>> import re
>>> orig_name = "Ř"
>>> orig_ch = orig_name.encode("cp1250") # Because why not?
>>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>>> name == orig_name
False
>>> name
'Ø'
>>> name.encode("latin-1") == orig_ch
True
```

For any dynamically-constructed bytes regex pattern, a string group name as output is unusable. Only after latin-1-reencoding can it be safely compared. This latin-1 choice is arbitrary.
msg371672 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 15:33
> this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names.

Not all latin-1 characters are valid identifier, for example:

    >>> '\x94'.encode('latin1')
    b'\x94'
    >>> '\x94'.isidentifier()
    False

There is a workaround, you can convert `bytes` to `str` with "latin-1" decoder before processing, IIRC there will be no extra overhead (memory/speed) during processing, then the name and content are the same type. :)
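
A rough sketch of that workaround, with made-up input data. The `re.ASCII` flag is an addition for the example so that `\w` stays limited to ASCII, which is closer to how a bytes pattern behaves. No claim is made here about memory or speed.

```python
import re

data = b"name=\xe9\x00\xff"

# Decode with latin-1: every byte becomes exactly one character, nothing is lost.
text = data.decode("latin-1")

# A str pattern on the decoded text; names and contents are now both str.
m = re.match(r"(?P<key>\w+)=(?P<value>.*)", text, re.ASCII | re.DOTALL)
groups = m.groupdict()

# Encode the contents back to recover the original bytes.
value_bytes = groups["value"].encode("latin-1")
print(sorted(groups), value_bytes)
```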
msg371676 - (view) Author: Ma Lin (malin) * Date: 2020-06-16 15:51
Please look at these:

    >>> orig_name = "Ř"
    >>> orig_ch = orig_name.encode("cp1250") # Because why not?
    >>> orig_ch
    b'\xd8'
    >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
    >>> name
    'Ø'  # '\xd8'
    >>> name == orig_name
    False
    >>> name.encode("latin-1")
    b'\xd8'
    >>> name.encode("latin-1") == orig_ch
    True

"Ř" (\u0158) --cp1250--> b'\xd8'
"Ø" (\u00d8) --latin-1--> b'\xd8'
msg371681 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 16:17
> > this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names.
> 
> Not all latin-1 characters are valid identifier, for example:
> 
>     >>> '\x94'.encode('latin1')
>     b'\x94'
>     >>> '\x94'.isidentifier()
>     False

True but that's not the point. Δ is a valid Python identifier but not a valid group name in bytes regexes, because it is not in the latin-1 plane. The documentation does not mention this.


> There is a workaround, you can convert `bytes` to `str` with "latin-1" decoder before processing, IIRC there will be no extra overhead (memory/speed) during processing, then the name and content are the same type. :)

I am not searching for a workaround for my current code.

And the simplest workaround is to latin-1-convert back to bytes, because re should not latin-1-convert to string in the first place.

Are you saying that the proper way to use bytes regexes is to use string regexes instead?


> Please look at these:
> 
>     >>> orig_name = "Ř"
>     >>> orig_ch = orig_name.encode("cp1250") # Because why not?
>     >>> orig_ch
>     b'\xd8'
>     >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>     >>> name
>     'Ø'  # '\xd8'
>     >>> name == orig_name
>     False
>     >>> name.encode("latin-1")
>     b'\xd8'
>     >>> name.encode("latin-1") == orig_ch
>     True
> 
> "Ř" (\u0158) --cp1250--> b'\xd8'
> "Ø" (\u00d8) --latin-1--> b'\xd8'

That's no surprise, I carefully crafted this example. :-)

Rather, that is exactly my point: several different strings (which can all be valid Python identifiers) can have the same single-byte representation, simply by means of different encodings (duh).

So why convert group names to strings when outputting them from matches, when you don't know where the bytes come from, or even whether they ever were strings? That should be left to the programmer.
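
The ambiguity can be shown without `re` at all. In this sketch the list of encodings is an arbitrary selection for illustration; only the latin-1 and cp1250 results appear earlier in this thread.

```python
raw = b"\xd8"

# The same single byte, read under a few different single-byte encodings.
encodings = ("latin-1", "cp1250", "cp1251", "iso8859-7")
candidates = {enc: raw.decode(enc) for enc in encodings}

# How many distinct characters could this one byte have started out as?
distinct = set(candidates.values())

for enc, ch in candidates.items():
    print(enc, repr(ch), ch.isidentifier())
```

Given only the byte, nothing tells you which of these characters was meant.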
msg371692 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 19:41
And there's no need for a cryptic encoding like cp1250 for this problem to arise. Here is a simple example with Python's default encoding utf-8:

```
>>> a = "ú"
>>> b = list(re.match(b"(?P<" + a.encode() + b">)", b"").groupdict())[0]
>>> a.isidentifier()
True
>>> b.isidentifier()
True
>>> b
'Ãº'
>>> a.encode() == b.encode("latin1")
True
```

For reference, here is the very source of the issue: https://github.com/python/cpython/blob/master/Lib/sre_parse.py#L228
msg371696 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 20:17
The problem can also be played in reverse, maybe it is more telling:

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# maybe we can try to infer it by decoding the bytestring?
# let's try to do it with the default encoding... that's natural, right?
>>> p.decode()
'(?P<ú>)'

# so we can reasonably expect the group name to be ú, right?
>>> list(re.compile(p).groupindex.keys()).pop()
'Ãº'

# Fail.
```
msg371697 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-16 20:37
You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):

> It does not make sense to have a string without knowing what encoding it uses

So I have that bytestring that comes from somewhere, maybe it was originally utf-8 or cp1250 or ... encoded, but I won't tell or don't know, the only thing I swear is that it originally was a valid Python identifier.
Now I pass it as a group name in re.match (it was a valid Python identifier, so that has to be alright per the docs) and I get back a (unicode) string.
re.match, how dare you give me back a string when _you have no clue what my bytestring originally represented, resp. what it originally was encoded with_?
Maybe re.match will even crash, because it wrongly assumes the bytestring to have been latin-1 encoded!

So: latin-1 is an arbitrary choice that is no better than any other, and the fact that it "naturally" converts bytes to unicode code points is an implementation detail.
If you want to keep it so, it ought (cf. the quote above) to be made clear in the docs that group names come out as latin-1-encoded strings, with all the restrictions that follow from that choice.
But the more logical way would be to renounce this arbitrary encoding altogether.
msg371705 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-17 00:26
I just had an "aha moment": What re claims is that, rather than doing as I suggested:

> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
> 
> # what character does the group name correspond to?
> # maybe we can try to infer it by decoding the bytestring?
> # let's try to do it with the default encoding... that's natural, right?
> >>> p.decode()
> '(?P<ú>)'
> ```

the actual way to know what group name is represented would be to look at the (unicode) string with the same "graphical representation":

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<Ãº>)'

# ok so the group name will be "Ãº"
```

This way of going from bytes to strings _naively_ (which happens to be called latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be the same value, just because they "look the same" in the source code.

This is like throwing away everything we ever learned about Unicode and how a code point is fundamentally different from what is stored in memory.
msg371709 - (view) Author: Ma Lin (malin) * Date: 2020-06-17 03:30
Why do you always want to use a "utf-8" encoded identifier as a group name in a `bytes` pattern?

The direction is: a group name is written in a `bytes` pattern, and is converted to `str`.
Not this direction: `str` group name -(utf8)-> `bytes` pattern -> `str` group name
msg371718 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-17 08:10
Because utf-8 is Python's default encoding, e.g. in source files, decode() and encode(). Literally everywhere.

If you ask around "I have a bytestring, I need a string, what do I do?", using latin-1 will not be the first answer (and moreover, the correct answer should be "it depends on the encoding", which re happily ignores by just asserting one).

Saying "just strip that b prefix, it's fine" cannot be taken seriously.

Yes latin-1 will never give an error on converting a bytestring, because it has full coverage of the 256 byte values, but saying that this is the reason why it should be used instead of another is forgetting why we have Unicode in the first place. **It is just pretending that Unicode never was a thing**. It is not because it can decode any bytestring that it will not return garbage _when the bytestring is not latin-1-encoded in the first place_.
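
To illustrate with nothing but the standard codecs (the variable names are just for the example):

```python
original = "ú"
raw = original.encode("utf-8")       # two bytes: 0xc3 0xba

# latin-1 decoding never raises, whatever the bytes are...
misread = raw.decode("latin-1")

# ...but the result is not the text that was encoded.
same_text = misread == original

# A strict decoder, by contrast, can refuse bytes that are not valid input.
try:
    b"\xe9".decode("utf-8")
    utf8_accepts_e9 = True
except UnicodeDecodeError:
    utf8_accepts_e9 = False

print(repr(misread), same_text, utf8_accepts_e9)
```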

Take a look at the documentation: https://docs.python.org/3/howto/unicode.html
7 references to latin-1, none saying that latin-1 is the way to go because it is so much better than anything else.

latin-1 used to be prominent in the 2.x world; it is about time to recognize that this is over, and that we can no longer ignore that encoding is a thing.
msg371719 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-17 08:17
If I don't have to think about the str -> bytes direction, re should first stop going in the other direction.

When I have bytes regexes I actually don't care about strings and would happily receive group names as bytes. But no, re decides that latin-1 is the way to go, and this way it 1) reduces my freedom in the choice of the group names, 2) makes me need to go read the internals to understand that the encoding it arbitrarily chose is latin-1, so that I can undo it properly and get back what I always wanted - a bytes group name.
msg371720 - (view) Author: Quentin Wenger (matpi) Date: 2020-06-17 08:20
bytes are _not_ Unicode code points, not even in the 256 range. End of the story.
History
Date User Action Args
2022-04-11 14:59:32 admin set github: 85152
2020-06-17 08:20:57 matpi set messages: + msg371720
2020-06-17 08:17:09 matpi set messages: + msg371719
2020-06-17 08:10:17 matpi set messages: + msg371718
2020-06-17 03:30:17 malin set messages: + msg371709
2020-06-17 00:26:30 matpi set messages: + msg371705
2020-06-16 20:37:53 matpi set messages: + msg371697
2020-06-16 20:17:50 matpi set messages: + msg371696
2020-06-16 19:41:48 matpi set messages: + msg371692
2020-06-16 16:17:49 matpi set messages: + msg371681
2020-06-16 15:51:24 malin set messages: + msg371676
2020-06-16 15:33:39 malin set messages: + msg371672
2020-06-16 14:34:12 matpi set messages: + msg371660
2020-06-16 13:51:43 matpi set messages: + msg371657
2020-06-16 13:03:43 malin set nosy: - ezio.melotti, mrabarnett; messages: + msg371652
2020-06-16 12:49:11 matpi set messages: + msg371646
2020-06-16 12:38:23 matpi set messages: + msg371644
2020-06-16 12:37:24 matpi set messages: + msg371643
2020-06-16 12:11:44 malin set messages: + msg371639
2020-06-16 11:51:43 matpi set messages: + msg371638
2020-06-16 11:41:03 malin set messages: + msg371637
2020-06-16 11:19:48 matpi set messages: + msg371634
2020-06-16 10:30:47 malin set messages: + msg371633
2020-06-16 10:14:49 matpi set messages: + msg371631
2020-06-16 10:13:21 matpi set messages: + msg371629
2020-06-16 05:30:20 malin set nosy: + malin; messages: + msg371614
2020-06-15 23:58:51 matpi set messages: + msg371607
2020-06-14 21:03:34 matpi create