Title: Undocumented feature prevents re module from finding certain matches
Type: behavior Stage: needs patch
Components: Regular Expressions Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: LewisGaul, abacabadabacaba, ezio.melotti, jacksonriley, malin, mrabarnett, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-03-17 18:49 by abacabadabacaba, last changed 2019-11-12 10:33 by jacksonriley.

Messages (10)
msg238330 - (view) Author: Evgeny Kapun (abacabadabacaba) Date: 2015-03-17 18:49
This pattern matches:

    re.match('(?:()|(?(1)()|z)){2}(?(2)a|z)', 'a')

But this doesn't:

    re.match('(?:()|(?(1)()|z)){0,2}(?(2)a|z)', 'a')

The difference is that {2} is replaced by {0,2}. This shouldn't prevent the pattern from matching anywhere where it matched before.

The reason for this misbehavior is a feature which is designed to protect re engine from infinite loops, but in fact it sometimes prevents patterns from matching where they should. I think that this feature should be at least properly documented, by properly I mean that it should be possible to reconstruct the exact behavior from documentation, as the implementation is not particularly easy to understand.
msg355471 - (view) Author: Lewis Gaul (LewisGaul) * Date: 2019-10-27 15:14
Hi there, if anyone's able to provide any guidance on this issue I'd be happy to take a look into it.

Is this a behaviour that is feasible to fix, or should this just be documented in some way as suggested by Evgeny?
msg355493 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-10-27 18:35
Suppose you had a pattern:


It would advance one character on each iteration of the * until the . failed to match. The text is finite, so it would stop matching eventually.

Now suppose you had a pattern:


On each iteration of the * it wouldn't advance, so it would keep matching forever.

A way to avoid that is to stop the * if it hasn't advanced.

The example pattern shows that there's still a problem. It advances if a group has matched, but that group doens't match until the first iteration, after the test, and does not, itself, advance. The * stops because it hasn't advanced, but, in this instance, that doesn't mean it never will.

The solution is for the * to check not only whether it has advanced, but also whether a group has changed. (Strictly speaking, the latter check is needed only if the repeated part tests whether a group also in the repeated part has changed, but it's probably not worth "optimising" for that possibility.)

In the regex module, it increments a "capture changed" counter whenever any group is changed (a group's first match or a change to a group's span). That makes it easier for the * to check. The code needs to save that counter for backtracking and restore it when backtracking.

I've mentioned only the *, but the same remarks apply to + and {...}, except that the {...} should keep repeating until it has reached its prescribed minimum.
msg355646 - (view) Author: Lewis Gaul (LewisGaul) * Date: 2019-10-29 11:18
Thanks for the explanation Matthew, I'll take a further look at some point in the coming weeks.
msg355950 - (view) Author: Jackson Riley (jacksonriley) * Date: 2019-11-04 15:19
Hi Matthew, thank you for your suggestions of where to start.
Could you possibly give a pointer to the place in the code where the 'capture changed' counter is incremented? I had a bit of a hunt and couldn't find it but may have been looking in the wrong places!
msg355953 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-04 15:35
Matthew referred to the code of the regex module (of which he is the author).
msg355955 - (view) Author: Jackson Riley (jacksonriley) * Date: 2019-11-04 15:45
Ah thank you very much Serhiy, that's super helpful!
msg355982 - (view) Author: Jackson Riley (jacksonriley) * Date: 2019-11-04 22:02
I've got a bit confused and am doubting myself - is the below output expected?
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')
>>> m.groups()
('', '')
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')
>>> m.groups()
('', None)

The first pattern doesn't behave as I would (probably naively expect) given Matthew's explanation of this bug - wouldn't the bug cause the match to fail with {1,2} as well?
Also, it seems odd that changing the condition in the if clause should change what gets captured.

Anyone have any thoughts?
msg355989 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-11-05 00:48
It's been many years since I looked at the code, and there have been changes since then, so some of the details might not be correct.

As to have it should behave:

re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 2 matched? No.
Try to match 'z'. Fail and backtrack.
Retry the repeated part.
Iteration 2.
Has group 1 matched? Yes.
Group 2 matches.
Has group 2 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 matched.

re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')

Iteration 1.
Match the repeated part. Group 1 matches.
Iteration 2.
Match the repeated part. Group 1 matches.
Has group 1 matched? Yes.
Try to match 'a'. Success. Group 1 matched and group 2 didn't match.
msg356435 - (view) Author: Jackson Riley (jacksonriley) * Date: 2019-11-12 10:33
Hi Matthew, Serhiy,

I tried to identify the right places in re to fix things but have found it a bit difficult.
I wrote up my attempt (at which includes some examples that behave differently from how I would have expected this bug to manifest itself. 
Any further guidance would be greatly appreciated!
Date User Action Args
2019-11-12 10:33:07jacksonrileysetmessages: + msg356435
2019-11-05 00:48:04mrabarnettsetmessages: + msg355989
2019-11-04 22:02:29jacksonrileysetmessages: + msg355982
2019-11-04 15:45:50jacksonrileysetmessages: + msg355955
2019-11-04 15:35:21serhiy.storchakasetmessages: + msg355953
2019-11-04 15:19:05jacksonrileysetnosy: + jacksonriley
messages: + msg355950
2019-10-29 11:18:52LewisGaulsetmessages: + msg355646
2019-10-28 01:59:10malinsetnosy: + malin
2019-10-27 18:36:00mrabarnettsetmessages: + msg355493
2019-10-27 15:14:56LewisGaulsetnosy: + LewisGaul
messages: + msg355471
2016-10-16 08:54:37serhiy.storchakasetnosy: + serhiy.storchaka
stage: needs patch

versions: + Python 2.7, Python 3.5, Python 3.6, Python 3.7, - Python 3.4
2015-03-17 18:49:44abacabadabacabacreate