Message 406818 - Python tracker



Author	steven.daprano
Recipients	control-k, ezio.melotti, mrabarnett, steven.daprano
Date	2021-11-23.07:22:45
Message-id	<1637652165.28.0.91922558004.issue45869@roundup.psfhosted.org>

Content

Hi Joran,

I'm not sure why you think that \s should agree between ASCII and Unicode. That seems like an unjustified assumption to me.

You say: "The expectation would be that the re.A (or re.ASCII) flag should not impact the matching behavior of a regular expression on strings consisting only of ASCII characters."

But I'm not sure why you have that expectation. Is it documented somewhere? The docs clearly say that for character classes, "the characters they match depends on whether ASCII or LOCALE mode is in force." I am unable to find anything that says that the differences are limited only to non-ASCII code points.
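To make the difference concrete, here is a minimal sketch that checks \s against the four ASCII separator characters U+001C through U+001F, with and without re.ASCII. The helper name matches_whitespace is made up for this illustration and is not part of the re module; the results in the comments are what I expect from the documented behaviour of \s on str patterns in CPython.

```python
import re


def matches_whitespace(ch, flags=0):
    """Hypothetical helper: does the single character `ch` match \\s?"""
    return re.fullmatch(r"\s", ch, flags) is not None


# U+001C..U+001F: File, Group, Record and Unit Separator (all ASCII).
for ch in "\x1c\x1d\x1e\x1f":
    print(
        f"{ch!r}: "
        f"Unicode mode={matches_whitespace(ch)}, "
        f"ASCII mode={matches_whitespace(ch, re.ASCII)}"
    )

# An ordinary space matches in both modes.
print(matches_whitespace(" "), matches_whitespace(" ", re.ASCII))
```

With re.ASCII, \s is documented as matching only [ \t\n\r\f\v], so the separators match in the default Unicode mode but not in ASCII mode, even though every character involved is ASCII.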

I don't think there is any standard definition of "whitespace" in either the ASCII standard or the many different regex engines (Perl, .NET, Java, ECMAScript, etc.).

Unicode does have an official whitespace character property, and as far as I can see '\x1c' through '\x1f' (File Separator, Group Separator, Record Separator and Unit Separator) are not considered whitespace:

https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace

But the str.isspace() method does consider them whitespace, while bytes.isspace() does not.


>>> '\x1c'.isspace()
True
>>> b'\x1c'.isspace()
False
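As far as I can tell, the reason for the disagreement is in how the two methods are documented: str.isspace() treats a character as whitespace if its general category is Zs or its bidirectional class is WS, B, or S, while bytes.isspace() only accepts the bytes b' \t\n\r\x0b\f'. A rough sketch that prints the relevant properties from the unicodedata module for the four separators (the layout of the output is my own choice):

```python
import unicodedata

# Inspect U+001C..U+001F: category, bidirectional class, and both isspace() results.
for cp in range(0x1C, 0x20):
    ch = chr(cp)
    print(
        f"U+{cp:04X} "
        f"category={unicodedata.category(ch)} "
        f"bidi={unicodedata.bidirectional(ch)} "
        f"str.isspace={ch.isspace()} "
        f"bytes.isspace={bytes([cp]).isspace()}"
    )
```

All four are control characters (category Cc), but their bidirectional classes are B or S, which is enough for str.isspace() to return True even though they do not carry the Unicode White_Space property.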

History
Date	User	Action	Args
2021-11-23 07:22:45	steven.daprano	set	recipients: + steven.daprano, ezio.melotti, mrabarnett, control-k
2021-11-23 07:22:45	steven.daprano	set	messageid: <1637652165.28.0.91922558004.issue45869@roundup.psfhosted.org>
2021-11-23 07:22:45	steven.daprano	link	issue45869 messages
2021-11-23 07:22:45	steven.daprano	create