Author steven.daprano
Recipients control-k, ezio.melotti, mrabarnett, steven.daprano
Date 2021-11-23.07:22:45
Message-id <1637652165.28.0.91922558004.issue45869@roundup.psfhosted.org>
In-reply-to
Content
Hi Joran,

I'm not sure why you think that \s should agree between ASCII and Unicode. That seems like an unjustified assumption to me.

You say: "The expectation would be that the re.A (or re.ASCII) flag should not impact the matching behavior of a regular expression on strings consisting only of ASCII characters."

But I'm not sure why you have that expectation. Is it documented somewhere? The docs clearly say that for character classes, "the characters they match depends on whether ASCII or LOCALE mode is in force." I am unable to find anything that says that the differences are limited only to non-ASCII code points.
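
To make the disagreement concrete, here is a minimal sketch comparing \s with and without re.ASCII on the four ASCII separator characters. The names SEPARATORS, unicode_ws and ascii_ws are just illustrative, and the behaviour described is what I'd expect from CPython's re module rather than anything the docs promise.

```python
import re

# The four ASCII "information separator" control characters.
SEPARATORS = "\x1c\x1d\x1e\x1f"

# Same pattern, compiled in the default (Unicode) mode and in ASCII mode.
unicode_ws = re.compile(r"\s")
ascii_ws = re.compile(r"\s", re.ASCII)

# Collect the characters each pattern accepts as whitespace.
unicode_hits = [hex(ord(c)) for c in SEPARATORS if unicode_ws.fullmatch(c)]
ascii_hits = [hex(ord(c)) for c in SEPARATORS if ascii_ws.fullmatch(c)]

print("Unicode mode matches:", unicode_hits)
print("ASCII mode matches:  ", ascii_hits)
```

If the two lists differ, then the flag does change matching on pure-ASCII input, which is consistent with the docs not limiting the difference to non-ASCII code points.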

I don't think there is any standard definition of "whitespace" in either the ASCII standard or the many different regex engines (Perl, .NET, Java, ECMAScript, etc.).

Unicode does have an official whitespace character property, and as far as I can see '\x1c' through '\x1f' (File Separator, Group Separator, Record Separator and Unit Separator) are not considered whitespace:

https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace

But the str.isspace() method does consider them as whitespace, while bytes.isspace() does not.


>>> '\x1c'.isspace()
True
>>> b'\x1c'.isspace()
False
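
A rough way to see the full extent of the difference is to scan all 128 ASCII code points and list those where the two methods disagree. This is only a sketch: the variable name disagree is mine, and printing the bidirectional class is there because the str.isspace() documentation defines whitespace in terms of general category Zs or bidirectional class WS, B or S, not the Unicode White_Space property.

```python
import unicodedata

# ASCII code points where str.isspace() and bytes.isspace() give different answers.
disagree = [
    cp for cp in range(128)
    if chr(cp).isspace() != bytes([cp]).isspace()
]

# Show the Unicode general category and bidirectional class of each one.
for cp in disagree:
    ch = chr(cp)
    print(
        f"{cp:#04x}",
        "category=" + unicodedata.category(ch),
        "bidi=" + unicodedata.bidirectional(ch),
    )
```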