Title: Unicode and ASCII regular expressions do not agree on ASCII space characters
Type: enhancement Stage:
Components: Regular Expressions Versions: Python 3.11
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: arhadthedev, control-k, ezio.melotti, mrabarnett, steven.daprano
Priority: normal Keywords:

Created on 2021-11-22 13:27 by control-k, last changed 2021-11-23 13:56 by arhadthedev.

File name Uploaded Description
control-k, 2021-11-22 13:27 Script that prints all differences in regex classes for ASCII characters with or without re.A
Messages (6)
msg406773 - (view) Author: Joran van Apeldoorn (control-k) * Date: 2021-11-22 13:27
The expectation would be that the re.A (or re.ASCII) flag should not impact the matching behavior of a regular expression on strings consisting only of ASCII characters. However, for the characters 0x1c through 0x1f, the classes \s and \S differ. In ASCII mode these characters are not considered space characters, while in Unicode mode they are.

Note that Python strings do consider these characters spaces, as '\x1c'.isspace() gives True.

All other classes and characters stay the same for unicode and ASCII matching.
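
The attached script is not reproduced in this thread. As a rough illustration of the kind of check described, the following sketch uses only the standard `re` module; the names `CLASSES` and `ascii_differences` are made up for this example and are not taken from the attachment. It lists every ASCII code point for which a class gives a different answer with and without re.A:

```python
import re

# Character classes whose meaning depends on re.ASCII.
CLASSES = [r"\w", r"\W", r"\d", r"\D", r"\s", r"\S"]


def ascii_differences():
    """Return (class, code point) pairs where re.ASCII changes the result."""
    diffs = []
    for cls in CLASSES:
        unicode_rx = re.compile(cls)
        ascii_rx = re.compile(cls, re.ASCII)
        for cp in range(128):
            ch = chr(cp)
            # Compare match / no-match for a single ASCII character.
            if bool(unicode_rx.fullmatch(ch)) != bool(ascii_rx.fullmatch(ch)):
                diffs.append((cls, cp))
    return diffs


for cls, cp in ascii_differences():
    print(f"{cls} differs on 0x{cp:02x}")
```

On the versions discussed here, any differences reported should be limited to \s and \S on 0x1c through 0x1f.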
msg406787 - (view) Author: Joran van Apeldoorn (control-k) * Date: 2021-11-22 14:53
Small addition, the sre categories CATEGORY_LINEBREAK and CATEGORY_UNI_LINEBREAK also do not agree on ASCII characters.
The first is only '\n' while the second also includes for example '\r' and some others. These do not seem to correspond to anything however and are never used in or
msg406796 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-11-22 17:40
For comparison, the regex module says that 0x1C..0x1F aren't whitespace, and the Unicode property White_Space ("\p{White_Space}" in a pattern, where supported) also says that they aren't whitespace.
msg406818 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-11-23 07:22
Hi Joran,

I'm not sure why you think that \s should agree between ASCII and Unicode. That seems like an unjustified assumption to me.

You say: "The expectation would be that the re.A (or re.ASCII) flag should not impact the matching behavior of a regular expression on strings consisting only of ASCII characters."

But I'm not sure why you have that expectation. Is it documented somewhere? The docs clearly say that for character classes, "the characters they match depends on whether ASCII or LOCALE mode is in force." I am unable to find anything that says that the differences are limited only to non-ASCII code points.

I don't think there is any standard definition of "whitespace" in either the ASCII standard, or the very many different regex engines (Perl, dot-Net, Java, ECMA, etc).

Unicode does have an official whitespace character property, and as far as I can see '\x1c' through '\x1f' (File Separator, Group Separator, Record Separator and Unit Separator) are not considered whitespace:

But the str.isspace() method does consider them as whitespace, while bytes.isspace() does not.

>>> '\x1c'.isspace()
True
>>> b'\x1c'.isspace()
False
msg406819 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2021-11-23 07:26
In any case, any change to this would have to be limited to Python 3.11. It is not clearly a bug, so this would be an enhancement.
msg406837 - (view) Author: Joran van Apeldoorn (control-k) * Date: 2021-11-23 12:43

I was not suggesting that the documentation literally says they should be the same, but it might be unexpected for users if ASCII characters change properties depending on whether they are considered in a Unicode or pure ASCII setting.

The documentation says about re.A: "Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching." The problem might be that there is no clear notion of "ASCII-only matching". I assumed this meant matching ASCII characters only, i.e., the character classes are simply limited to codes below 128.

About \s the documentation says:
"Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.". This heavily implies that there are non-ASCII characters in Unicode that might be considered spaces, but that the ASCII characters are [ \t\n\r\f\v], although again, not stated literally. 

There might be valid reasons to change the definition (even for ASCII characters) depending on re.A, but should it then not follow the unicode standard for white space in the unicode case? (which would coincide with the current ASCII case). There seem to be many different places where python is opinionated about what a space is, but not much consistency behind it.

I am a bit worried about the undocumented nature of the precise definitions of the regex classes in general. How is a user supposed to know that the default behavior of \s, when no flag is passed, is to also match ASCII characters other than those mentioned for the ASCII case? In contrast to this, the \d class is directly defined as the Unicode category [Nd].

It is likely too hard to change, and too many things depend on it, but the following definitions would make more sense to me, and hopefully to others:
- Character classes are defined as a set of unicode properties/categories, following the same definitions as elsewhere in python.
- If re.A is passed, they are this same set but limited to codes below 128. 
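
To make the second point concrete, here is a minimal sketch for \s only. The helper names `restrict_to_ascii` and `current_ascii` are hypothetical and exist only for this illustration. It compares the set the proposal would give (the Unicode class cut off at code 128) with what re.A matches today:

```python
import re


def restrict_to_ascii(pattern):
    """Hypothetical proposed rule: the Unicode class limited to codes below 128."""
    rx = re.compile(pattern)
    return {cp for cp in range(128) if rx.fullmatch(chr(cp))}


def current_ascii(pattern):
    """Codes below 128 that the class matches under re.ASCII today."""
    rx = re.compile(pattern, re.ASCII)
    return {cp for cp in range(128) if rx.fullmatch(chr(cp))}


proposed = restrict_to_ascii(r"\s")
current = current_ascii(r"\s")

# Code points that the proposal would add to the re.A version of \s.
print([hex(cp) for cp in sorted(proposed - current)])
```

Under the current Unicode definition of \s this would add the separator characters to the ASCII class; if \s followed the Unicode White_Space property instead, the two sets would be expected to coincide.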

After some digging in the code I traced the current definitions as follows:
 - For unicode Py_UNICODE_ISSPACE is called, which either does a lookup in the constant table _Py_ascii_whitespace or calls _PyUnicode_IsWhitespace for non ASCII characters. Both of these define a space as "Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'", i.e., this is simply the unicode string isspace() definition. 
 - For ASCII Py_ISSPACE is called which does a lookup in _Py_ctype_table. It is unclear to me how this table was made.

So sre just follows the other python definitions.
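
The quoted rule can be checked from Python with the standard `unicodedata` module. This is only a sketch of the rule as worded above, not the C implementation; the function name `is_space_by_definition` is invented for the example:

```python
import unicodedata


def is_space_by_definition(ch):
    """Apply the quoted rule: bidirectional type WS, B or S, or category Zs."""
    return (
        unicodedata.bidirectional(ch) in ("WS", "B", "S")
        or unicodedata.category(ch) == "Zs"
    )


# Show the relevant properties for the four separator characters.
for cp in range(0x1C, 0x20):
    ch = chr(cp)
    print(
        hex(cp),
        unicodedata.bidirectional(ch),
        unicodedata.category(ch),
        is_space_by_definition(ch),
        ch.isspace(),
    )
```

For 0x1c through 0x1f it is the bidirectional type, not the general category, that makes str.isspace() return True.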
While searching around I found issue #18236, which also considers how the Python definition differs from the Unicode one.
Date User Action Args
2021-11-23 13:56:56 arhadthedev set nosy: + arhadthedev
2021-11-23 12:43:14 control-k set messages: + msg406837
2021-11-23 07:26:40 steven.daprano set type: behavior -> enhancement
messages: + msg406819
versions: - Python 3.8, Python 3.9, Python 3.10
2021-11-23 07:22:45 steven.daprano set nosy: + steven.daprano
messages: + msg406818
2021-11-22 17:40:39 mrabarnett set messages: + msg406796
2021-11-22 14:53:20 control-k set messages: + msg406787
2021-11-22 13:30:38 control-k set type: behavior
2021-11-22 13:27:59 control-k create