Message 371705 - Python tracker



Author	matpi
Recipients	malin, matpi
Date	2020-06-17.00:26:29
Message-id	<1592353590.53.0.149226661557.issue40980@roundup.psfhosted.org>
In-reply-to

Content

I just had an "aha moment": What re claims is that, rather than doing as I suggested:

> ```
> # consider the following bytestring pattern
> >>> p = b"(?P<\xc3\xba>)"
> 
> # what character does the group name correspond to?
> # maybe we can try to infer it by decoding the bytestring?
> # let's try to do it with the default encoding... that's natural, right?
> >>> p.decode()
> '(?P<ú>)'
> ```

the actual way to know what group name is represented would be to look at the (unicode) string with the same "graphical representation":

```
# consider the following bytestring pattern
>>> p = b"(?P<\xc3\xba>)"

# what character does the group name correspond to?
# to discover it, we instead consider the string that "looks the same":
>>> "(?P<\xc3\xba>)"
'(?P<Ãº>)'

# ok so the group name will be "Ãº"
```

This _naive_ way of going from bytes to strings (which happens to be called latin-1) makes, IMHO, as much sense as saying that 0x10, 0b10 and 0o10 should be the same value just because they "look the same" in the source code.
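To make the comparison concrete, here is a minimal sketch of the two readings of the same two bytes. It uses only the standard `bytes.decode` method; the variable names are made up for illustration, and it says nothing about what `re` does internally.

```python
# The same two bytes, read in two different ways.
raw = b"\xc3\xba"

as_utf8 = raw.decode("utf-8")      # two bytes -> one code point (U+00FA)
as_latin1 = raw.decode("latin-1")  # each byte -> the code point with the same number

# The str literal that "looks the same" as the bytes literal
# is exactly the latin-1 reading, not the UTF-8 one.
looks_the_same = "\xc3\xba"

print(len(as_utf8), len(as_latin1))   # 1 2
print(as_latin1 == looks_the_same)    # True
print(as_utf8 == looks_the_same)      # False

# The literal analogy: same digits, different values.
print(0x10, 0b10, 0o10)               # 16 2 8
```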

This is like throwing away everything we ever learned about Unicode and how a code point is fundamentally different from what is stored in memory.
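As a rough illustration of that distinction (again just a sketch with standard `str.encode` calls and hypothetical variable names): one code point, several different byte sequences depending on the encoding chosen.

```python
ch = "\u00fa"  # the single character from the group name above

code_point = ord(ch)                  # the abstract number Unicode assigns
utf8_bytes = ch.encode("utf-8")       # one possible stored form
latin1_bytes = ch.encode("latin-1")   # another stored form
utf16_bytes = ch.encode("utf-16-le")  # yet another

print(hex(code_point))  # 0xfa
print(utf8_bytes)       # b'\xc3\xba'
print(latin1_bytes)     # b'\xfa'
print(utf16_bytes)      # b'\xfa\x00'
```

Only under latin-1 do the stored byte value and the code point number coincide, which is why the "looks the same" reading silently picks that encoding.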
