Message 346246 - Python tracker


Author	bup
Recipients	bup
Date	2019-06-21.19:46:08
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1561146368.97.0.686299796922.issue37367@roundup.psfhosted.org>
In-reply-to

Content

At present, the bytecode compiler can generate 512 different unicode characters from octal escapes, one for each integer in the range [0, 512), 512 being the total number of syntactically valid combinations of 3 octal digits preceded by a backslash. However, this does not match the regex compiler, which raises an error regardless of the input type when it encounters an octal escape with a value greater than 255 (0o377). On the other hand... the bytes literal:

>>> b'\407'

is somehow valid, and can lead to bugs that are extremely difficult to track down, such as this nonsense:

>>> re.compile(b'\407').search(b'\a')
<re.Match object; span=(0, 1), match=b'\x07'>

I propose that the regex parser be augmented so that, for unicode patterns, it interprets three-digit octal escapes in range(256, 512), while the bytecode parser be adjusted to match the behavior of the regex parser, raising an error for octal escapes in bytes literals at or above b"\400" rather than truncating the 9th bit.
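
For anyone who wants to reproduce the mismatch, here is a minimal sketch of the behavior described above. It models the literal parser with plain integer arithmetic rather than writing b'\407' in source, so it does not depend on how a given Python version treats that literal; the helper name regex_accepts is mine, not anything from the re module.

```python
import re

# Three octal digits cover 0o000..0o777, i.e. 8**3 == 512 values.
value = 0o407  # 263 decimal, one past the 8-bit range by 0o010 - 1

# A str literal keeps the full value as a code point.
str_char = chr(value)  # U+0107

# A bytes literal keeps only the low 8 bits (the 9th bit is dropped).
truncated = bytes([value & 0xFF])  # b'\x07', the same byte as b'\a'


def regex_accepts(pattern):
    """Return True if re.compile() accepts the pattern (illustrative helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Raw patterns, so the regex parser (not the literal parser) sees \407.
str_ok = regex_accepts(r'\407')
bytes_ok = regex_accepts(rb'\407')

# The surprising match from the report: the truncated byte matches BEL.
match = re.compile(truncated).search(b'\a')

print(repr(str_char), truncated, str_ok, bytes_ok, match)
```

With the behavior described in this report, the regex parser rejects both raw patterns, while the truncated byte pattern matches b'\a'.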

History
Date	User	Action	Args
2019-06-21 19:46:08	bup	set	recipients: + bup
2019-06-21 19:46:08	bup	set	messageid: <1561146368.97.0.686299796922.issue37367@roundup.psfhosted.org>
2019-06-21 19:46:08	bup	link	issue37367 messages
2019-06-21 19:46:08	bup	create