classification
Title: re module: strange behaviour of space inside {m, n}
Type: behavior
Stage:
Components: Library (Lib), Regular Expressions
Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: pending
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, mrabarnett, pitrou, roysmith, serhiy.storchaka, sjmachin
Priority: normal
Keywords:

Created on 2011-02-12 23:19 by sjmachin, last changed 2013-10-27 17:27 by serhiy.storchaka.

Messages (5)
msg128472 - Author: John Machin (sjmachin) Date: 2011-02-12 23:19
A pattern like r"b{1,3}\Z" matches "b", "bb", and "bbb", as expected. There is no documentation of the behaviour of r"b{1, 3}\Z" -- it matches the LITERAL TEXT "b{1, 3}" in normal mode and "b{1,3}" in verbose mode.

# paste the following at the interactive prompt:
import re
pat = r"b{1, 3}\Z"
bool(re.match(pat, "bb")) # False
bool(re.match(pat, "b{1, 3}")) # True
bool(re.match(pat, "bb", re.VERBOSE)) # False
bool(re.match(pat, "b{1, 3}", re.VERBOSE)) # False
bool(re.match(pat, "b{1,3}", re.VERBOSE)) # True

Suggested change, in decreasing order of preference:
(1) Ignore leading/trailing spaces when parsing the m and n components of {m,n}
(2) Raise an exception if the exact syntax is not followed
(3) Document the existing behaviour

Note: deliberately matching the literal text would be expected to be done by escaping the left brace:

pat2 = r"b\{1, 3}\Z"
bool(re.match(pat2, "b{1, 3}")) # True

and this is not prevented by the suggested changes.
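
For illustration only, suggestion (1) can be approximated today by normalising the pattern before it reaches re. The sketch below is not part of the re module; the helper name normalize_quantifier is hypothetical, and the approach is deliberately naive: it does not understand character classes or an escaped backslash directly before the brace, so it is an assumption-laden workaround rather than a proposed patch.

```python
import re

# Hypothetical helper (not part of re): matches an unescaped "{m,n}"-style
# quantifier whose numbers may be surrounded by whitespace.
_QUANTIFIER = re.compile(r"(?<!\\)\{\s*(\d*)\s*(?:(,)\s*(\d*)\s*)?\}")


def normalize_quantifier(pattern):
    """Strip whitespace around m and n in {m,n}; leave everything else alone."""
    def repl(match):
        lo = match.group(1)
        comma = match.group(2) or ""
        hi = match.group(3) or ""
        if not lo and not hi:
            # "{}" or "{,}": no number to tidy up, keep the original text.
            return match.group(0)
        return "{" + lo + comma + hi + "}"

    return _QUANTIFIER.sub(repl, pattern)


pat = normalize_quantifier(r"b{1, 3}\Z")   # becomes r"b{1,3}\Z"
print(bool(re.match(pat, "bb")))           # True
```

An explicitly escaped brace such as r"b\{1, 3}\Z" is left untouched by this helper, so deliberate literal matching still works.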
msg176812 - Author: Matthew Barnett (mrabarnett) * Date: 2012-12-02 22:28
Interesting.

In my regex module (http://pypi.python.org/pypi/regex) I have:

bool(regex.match(pat, "bb", regex.VERBOSE)) # True
bool(regex.match(pat, "b{1,3}", regex.VERBOSE)) # False

because I thought that when the VERBOSE flag is turned on it should ignore whitespace except when it's inside a character class, so "b{1, 3}" would be treated as "b{1,3}".

Apparently re has another exception.
msg176813 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-02 22:40
$ echo 'bbbbbaaa' | grep -o 'b\{1,3\}a'
bbba
$ echo 'bbbbbaaa' | grep -o 'b\{1, 3\}a'
grep: Invalid content of \{\}
$ echo 'bbbbbaaa' | egrep -o 'b{1,3}a'
bbba
$ echo 'bbbbbaaa' | egrep -o 'b{1, 3}a'
$ echo 'bbb{1, 3}aa' | LC_ALL=C egrep -o 'b{1, 3}a'
b{1, 3}a

I.e. grep raises an error, while egrep silently falls back to the literal meaning. I don't know what the standards say about this.
msg176819 - Author: Matthew Barnett (mrabarnett) * Date: 2012-12-03 00:10
The question is whether re should always treat 'b{1, 3}a' as a literal, even with the VERBOSE flag.

I've checked with Perl 5.14.2, and it agrees with re: adding a space _always_ makes it a literal, even with the 'x' flag (/b{1, 3}a/x is treated as /b\{1,3}a/).
msg180700 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-26 19:01
Then let's leave everything as is.
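
Since the behaviour stays as it is, code that wants the strictness of suggestion (2) has to check patterns itself. A minimal sketch follows, assuming a hypothetical helper named check_quantifiers; it is not an re API, and it will also reject an intentional literal such as "{1, 3}" inside a character class.

```python
import re

# Hypothetical check (not part of re): an unescaped "{" followed only by
# digits, commas and whitespace, with at least one whitespace character,
# up to the closing "}".
_SUSPECT = re.compile(r"(?<!\\)\{(?=[\d,\s]*\s)[\d,\s]*\}")


def check_quantifiers(pattern):
    """Raise ValueError if a brace quantifier appears to contain whitespace."""
    match = _SUSPECT.search(pattern)
    if match:
        raise ValueError(
            "whitespace inside quantifier %r at position %d"
            % (match.group(0), match.start())
        )
    return pattern


# Accepted: the exact {m,n} syntax and an explicitly escaped brace.
compiled = re.compile(check_quantifiers(r"b{1,3}\Z"))
literal = re.compile(check_quantifiers(r"b\{1, 3}\Z"))
```

Calling check_quantifiers(r"b{1, 3}\Z") raises ValueError instead of silently compiling a pattern that matches the literal text.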
History
Date                 User              Action  Args
2013-10-27 17:27:19  serhiy.storchaka  set     status: open -> pending
2013-02-11 20:02:21  roysmith          set     nosy: + roysmith
2013-01-26 19:01:21  serhiy.storchaka  set     messages: + msg180700
2012-12-03 00:10:46  mrabarnett        set     messages: + msg176819
2012-12-02 22:40:27  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg176813
2012-12-02 22:28:55  mrabarnett        set     messages: + msg176812
2012-12-02 21:53:47  serhiy.storchaka  set     nosy: + mrabarnett; type: behavior; components: + Library (Lib), Regular Expressions; versions: + Python 3.2, Python 3.3, Python 3.4, - Python 3.1
2011-02-18 19:55:44  terry.reedy       set     nosy: + ezio.melotti, pitrou
2011-02-12 23:19:56  sjmachin          create