Message 371681 - Python tracker



Author	matpi
Recipients	malin, matpi
Date	2020-06-16.16:17:49
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1592324269.32.0.908830086532.issue40980@roundup.psfhosted.org>
In-reply-to

Content

> > this limitation to the latin-1 subset is not compatible with the documentation, which says that valid Python identifiers are valid group names.
> 
> Not all latin-1 characters are valid identifier, for example:
> 
>     >>> '\x94'.encode('latin1')
>     b'\x94'
>     >>> '\x94'.isidentifier()
>     False

True, but that's not the point. Δ is a valid Python identifier but not a valid group name in bytes regexes, because it is outside the latin-1 range. The documentation does not mention this.
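A minimal sketch of that mismatch. The pattern body `x` and the choice of UTF-8 to get Δ into a bytes pattern are assumptions made for illustration; the exact error message differs between Python versions, so only the exception type is checked.

```python
import re

name = "Δ"

# Δ is a valid Python identifier...
is_identifier = name.isidentifier()

# ...and is accepted as a group name in a str pattern.
str_pattern = re.compile("(?P<" + name + ">x)")
str_groups = list(str_pattern.groupindex)

# It has no latin-1 representation at all.
try:
    name.encode("latin-1")
    latin1_encodable = True
except UnicodeEncodeError:
    latin1_encodable = False

# Assumption: UTF-8 is used to put the name into a bytes pattern.
# The bytes pattern is then rejected as having a bad group name.
try:
    re.compile(b"(?P<" + name.encode("utf-8") + b">x)")
    bytes_pattern_ok = True
except re.error:
    bytes_pattern_ok = False
```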


> There is a workaround, you can convert `bytes` to `str` with "latin-1" decoder before processing, IIRC there will be no extra overhead (memory/speed) during processing, then the name and content are the same type. :)

I am not looking for a workaround for my current code.

And the simplest workaround is to convert the group names back to bytes with latin-1, because re should not decode them to strings with latin-1 in the first place.

Are you saying that the proper way to use bytes regexes is to use string regexes instead?
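For illustration, a small sketch of the type mismatch and of that conversion back to bytes. The group name `word` and the input `b"hello"` are made up for the example; an ASCII name is used so the snippet does not depend on how a given Python version treats non-ASCII group names in bytes patterns.

```python
import re

# A bytes pattern matched against bytes input.
match = re.match(b"(?P<word>\\w+)", b"hello")

# The matched values are bytes, but the group names come back as str.
groups = match.groupdict()

# Converting the names back to bytes with latin-1, so that keys and
# values have the same type again.
bytes_groups = {key.encode("latin-1"): value for key, value in groups.items()}
```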


> Please look at these:
> 
>     >>> orig_name = "Ř"
>     >>> orig_ch = orig_name.encode("cp1250") # Because why not?
>     >>> orig_ch
>     b'\xd8'
>     >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
>     >>> name
>     'Ø'  # '\xd8'
>     >>> name == orig_name
>     False
>     >>> name.encode("latin-1")
>     b'\xd8'
>     >>> name.encode("latin-1") == orig_ch
>     True
> 
> "Ř" (\u0158) --cp1250--> b'\xd8'
> "Ø" (\u00d8) --latin-1--> b'\xd8'

That's no surprise, I carefully crafted this example. :-)

Rather, that is exactly my point: several different strings (which can all be valid Python identifiers) can have the same single-byte representation, simply by means of different encodings (duh).

So why convert group names to strings when outputting them from matches, when you don't know where the bytes come from, or even whether they ever were strings? That should be left to the programmer.
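The ambiguity can be sketched without re at all. The two codecs below are the ones from the quoted example; any other single-byte codec could be added to the tuple.

```python
# One byte, two different strings, depending on the codec used to decode it.
raw = b"\xd8"
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp1250")}

# Both results are valid Python identifiers, yet they are different strings.
both_identifiers = all(s.isidentifier() for s in decoded.values())
same_string = decoded["latin-1"] == decoded["cp1250"]
```

Nothing in the byte itself says which of the two strings, if any, it was meant to be.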

History
Date	User	Action	Args
2020-06-16 16:17:49	matpi	set	recipients: + matpi, malin
2020-06-16 16:17:49	matpi	set	messageid: <1592324269.32.0.908830086532.issue40980@roundup.psfhosted.org>
2020-06-16 16:17:49	matpi	link	issue40980 messages
2020-06-16 16:17:49	matpi	create