classification
Title: Regular expressions with multiple repeat codes
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: martin.panter Nosy List: ezio.melotti, martin.panter, mrabarnett, python-dev, r.david.murray, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2016-08-19 12:07 by martin.panter, last changed 2016-10-15 03:24 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
multiple-repeat.patch martin.panter, 2016-09-04 06:50 review
Messages (9)
msg273107 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-19 12:06
In the documentation for the “re” module, it says repetition codes like {4} and “*” operate on the preceding regular expression. But even though “a{4}” is a valid expression, the obvious way to apply a “*” repetition to it fails:

>>> re.compile("a{4}*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/re.py", line 223, in compile
    return _compile(pattern, flags)
  File "/home/proj/python/cpython/Lib/re.py", line 292, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/home/proj/python/cpython/Lib/sre_compile.py", line 555, in compile
    p = sre_parse.parse(p, flags)
  File "/home/proj/python/cpython/Lib/sre_parse.py", line 792, in parse
    p = _parse_sub(source, pattern, 0)
  File "/home/proj/python/cpython/Lib/sre_parse.py", line 406, in _parse_sub
    itemsappend(_parse(source, state))
  File "/home/proj/python/cpython/Lib/sre_parse.py", line 610, in _parse
    source.tell() - here + len(this))
sre_constants.error: multiple repeat at position 4

As a workaround, I found I can wrap the inner repetition in (?:...):

>>> re.compile("(?:a{4})*")
re.compile('(?:a{4})*')

The problems with the workaround are (a) it is far from obvious, and (b) it adds more complicated syntax. Either this limitation should be documented, or if there is no good reason for it, it should be lifted. It is not clear if my workaround is entirely valid, or if I just found a way to bypass some sanity check.

My original use case was scanning a base-64 encoding for Issue 27799:

# Without the second level of brackets, this raises a "multiple repeat" error
chunk_re = br'(?: (?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} )*'
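
A minimal sketch of how the pattern above might be exercised. It assumes the pattern is compiled with re.VERBOSE (the insignificant spaces suggest this, but the flag is not shown in the report), and the sample inputs are made up for illustration:

```python
import re

# The pattern is taken from the report; compiling it with re.VERBOSE is an
# assumption, made because the pattern contains insignificant spaces.
chunk_re = re.compile(
    br'(?: (?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} )*', re.VERBOSE)

# Two complete 4-character chunks separated by a newline: both are consumed.
whole = chunk_re.match(b"QUJD\nREVG").group()

# A trailing partial chunk (only two alphabet characters) is left unmatched.
partial = chunk_re.match(b"QUJD\nRE").group()

# Dropping the outer (?: ... ) puts "*" directly after "{4}".
try:
    re.compile(br'(?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} *', re.VERBOSE)
    flat_error = None
except re.error as exc:
    flat_error = str(exc)

print(whole, partial, flat_error)
```

With the outer group in place, the match always ends on a 4-character boundary, which appears to be the property the scanner needs.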
msg273133 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-08-19 15:03
It seems perfectly logical and consistent to me.  {4} is a repeat count, as is *.  You get the same error if you do 'a?*', and the same bypass if you do '(a?)*' (though I haven't tested if that does anything useful :).  You don't need the ?:, as far as I can tell, you just need to have the * modifying a group, making the group the "preceding regular expression".
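
A quick check of the claims above, as a sketch; the helper name compiles() is made up for illustration:

```python
import re

def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

results = {
    pattern: compiles(pattern)
    for pattern in ["a{4}*", "a?*", "(a?)*", "(a{4})*", "(?:a{4})*"]
}

for pattern, ok in results.items():
    print(pattern, "compiles" if ok else "is rejected")
```

Both capturing and non-capturing groups are accepted here; the difference is only whether the group is recorded in the match object.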
msg273147 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2016-08-19 17:34
"*" and the other quantifiers ("+", "?" and "{...}") operate on the preceding _item_, not the entire preceding expression. For example, "ab*" means "a" followed by zero or more repeats of "b".

You're not allowed to use multiple quantifiers together. The proper way is to use the non-capturing "(?:...)".

It's too late to change that because some of them already have a special meaning when used after another quantifier: "a*?" is a lazy quantifier, as are "a+?", "a??" and "a{1,4}?".

Many other regex implementations, including the "regex" module, use an additional "+" to signify a possessive quantifier: "a*+", "a++", "a?+" and "a{1,4}+".

That just leaves the additional "*", which is treated as an error in all the other regex implementations that I'm aware of.
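
The item-level binding and the lazy forms described above can be illustrated with a short sketch (the sample strings are arbitrary):

```python
import re

# "ab*" is "a" followed by zero or more "b", not zero or more "ab".
binds_to_item = re.fullmatch("ab*", "abbb") is not None
not_whole_expr = re.fullmatch("ab*", "abab") is None

# A "?" after a quantifier makes it lazy: it matches as little as possible.
greedy_star = re.match("a*", "aaa").group()
lazy_star = re.match("a*?", "aaa").group()
lazy_plus = re.match("a+?", "aaa").group()
lazy_optional = re.match("a??", "a").group()
lazy_range = re.match("a{1,4}?", "aaaa").group()

print(repr(greedy_star), repr(lazy_star), repr(lazy_plus),
      repr(lazy_optional), repr(lazy_range))
```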
msg273148 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-08-19 17:44
This appears to be a doc issue to clarify that * cannot directly follow a repetition code.  I believe there have been other (non)bug reports like this before.
msg273178 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-08-20 01:18
Okay so it sounds like my usage is valid if I add the brackets. I will try to come up with a documentation patch at some stage. The reason why it is not supported without brackets is to maintain a bit of consistency with the question mark (?), which modifies the preceding quantifier, and with the plus sign (+), which is also a modifier in other implementations.

For the record, GNU grep does seem to accept my expression (although POSIX says this is undefined, and neither supports lazy or possessive quantifiers):

$ grep -E -o 'a{2}*' <<< "aaaaa"
aaaa

However pcregrep, which supports lazy (?) and possessive (+) quantifiers, doesn’t like my expression:

$ pcregrep -o 'a{2}*' <<< "aaaaa"
pcregrep: Error in command-line regex at offset 4: nothing to repeat
[Exit 2]
$ pcregrep -o '(?:a{2})*' <<< "aaaaa"
aaaa
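
For comparison, a rough Python counterpart of the pcregrep run with the bracketed expression. This is only a sketch: filtering out empty matches is an approximation of what -o prints, not a reimplementation of it.

```python
import re

pattern = re.compile(r'(?:a{2})*')

# The first match consumes as many complete "aa" pairs as possible.
first = pattern.search("aaaaa").group()

# Keep only non-empty matches, roughly what "-o" would print.
non_empty = [m for m in pattern.findall("aaaaa") if m]

print(first, non_empty)
```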
msg274344 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-09-04 06:50
Here is a patch for the documentation.
msg274347 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-04 07:20
LGTM. Thanks Martin.
msg278682 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-15 01:45
New changeset 5f7d7e079e39 by Martin Panter in branch '3.5':
Issue #27800: Document limitation and workaround for multiple RE repetitions
https://hg.python.org/cpython/rev/5f7d7e079e39

New changeset 1f2ca7e4b64e by Martin Panter in branch '3.6':
Issue #27800: Merge RE repetition doc from 3.5 into 3.6
https://hg.python.org/cpython/rev/1f2ca7e4b64e

New changeset 98456ab88ab0 by Martin Panter in branch 'default':
Issue #27800: Merge RE repetition doc from 3.6
https://hg.python.org/cpython/rev/98456ab88ab0

New changeset 94f02193f00f by Martin Panter in branch '2.7':
Issue #27800: Document limitation and workaround for multiple RE repetitions
https://hg.python.org/cpython/rev/94f02193f00f
msg278690 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-10-15 03:24
I committed my patch as it was. I understand Silent Ghost’s objection was mainly that they thought the new paragraph or its positioning wouldn’t be very useful, but hopefully it is better than nothing. Perhaps in the future, the documentation could be restructured with subsections for repetition qualifiers and other kinds of special codes, which may help.
History
Date User Action Args
2016-10-15 03:24:45 martin.panter set status: open -> closed
resolution: fixed
messages: + msg278690
stage: commit review -> resolved
2016-10-15 01:45:34 python-dev set nosy: + python-dev
messages: + msg278682
2016-10-01 14:57:11 serhiy.storchaka set assignee: martin.panter
stage: patch review -> commit review
versions: + Python 3.7
2016-09-04 07:20:17 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg274347
2016-09-04 06:50:36 martin.panter set files: + multiple-repeat.patch
keywords: + patch
messages: + msg274344
stage: patch review
2016-08-20 01:18:28 martin.panter set messages: + msg273178
2016-08-19 17:44:56 terry.reedy set nosy: + terry.reedy
messages: + msg273148
2016-08-19 17:34:56 mrabarnett set messages: + msg273147
2016-08-19 15:03:59 r.david.murray set nosy: + r.david.murray
messages: + msg273133
2016-08-19 12:07:00 martin.panter create