Issue23692
Created on 2015-03-17 18:49 by abacabadabacaba, last changed 2019-11-12 10:33 by jacksonriley.
Messages (10) | |||
---|---|---|---|
msg238330 - (view) | Author: Evgeny Kapun (abacabadabacaba) | Date: 2015-03-17 18:49 | |
This pattern matches: re.match('(?:()|(?(1)()|z)){2}(?(2)a|z)', 'a') But this doesn't: re.match('(?:()|(?(1)()|z)){0,2}(?(2)a|z)', 'a') The difference is that {2} is replaced by {0,2}. This shouldn't prevent the pattern from matching anywhere where it matched before. The reason for this misbehavior is a feature which is designed to protect re engine from infinite loops, but in fact it sometimes prevents patterns from matching where they should. I think that this feature should be at least properly documented, by properly I mean that it should be possible to reconstruct the exact behavior from documentation, as the implementation is not particularly easy to understand. |
|||
msg355471 - (view) | Author: Lewis Gaul (LewisGaul) * | Date: 2019-10-27 15:14 | |
Hi there, if anyone's able to provide any guidance on this issue I'd be happy to take a look into it. Is this a behaviour that is feasible to fix, or should this just be documented in some way as suggested by Evgeny? |
|||
msg355493 - (view) | Author: Matthew Barnett (mrabarnett) * ![]() |
Date: 2019-10-27 18:35 | |
Suppose you had a pattern: .* It would advance one character on each iteration of the * until the . failed to match. The text is finite, so it would stop matching eventually. Now suppose you had a pattern: (?:)* On each iteration of the * it wouldn't advance, so it would keep matching forever. A way to avoid that is to stop the * if it hasn't advanced. The example pattern shows that there's still a problem. It advances if a group has matched, but that group doens't match until the first iteration, after the test, and does not, itself, advance. The * stops because it hasn't advanced, but, in this instance, that doesn't mean it never will. The solution is for the * to check not only whether it has advanced, but also whether a group has changed. (Strictly speaking, the latter check is needed only if the repeated part tests whether a group also in the repeated part has changed, but it's probably not worth "optimising" for that possibility.) In the regex module, it increments a "capture changed" counter whenever any group is changed (a group's first match or a change to a group's span). That makes it easier for the * to check. The code needs to save that counter for backtracking and restore it when backtracking. I've mentioned only the *, but the same remarks apply to + and {...}, except that the {...} should keep repeating until it has reached its prescribed minimum. |
|||
msg355646 - (view) | Author: Lewis Gaul (LewisGaul) * | Date: 2019-10-29 11:18 | |
Thanks for the explanation Matthew, I'll take a further look at some point in the coming weeks. |
|||
msg355950 - (view) | Author: Jackson Riley (jacksonriley) * | Date: 2019-11-04 15:19 | |
Hi Matthew, thank you for your suggestions of where to start. Could you possibly give a pointer to the place in the code where the 'capture changed' counter is incremented? I had a bit of a hunt and couldn't find it but may have been looking in the wrong places! |
|||
msg355953 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2019-11-04 15:35 | |
Matthew referred to the code of the regex module (of which he is the author). https://pypi.org/project/regex/ |
|||
msg355955 - (view) | Author: Jackson Riley (jacksonriley) * | Date: 2019-11-04 15:45 | |
Ah thank you very much Serhiy, that's super helpful! |
|||
msg355982 - (view) | Author: Jackson Riley (jacksonriley) * | Date: 2019-11-04 22:02 | |
I've got a bit confused and am doubting myself - is the below output expected? >>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a') >>> m.groups() ('', '') >>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a') >>> m.groups() ('', None) The first pattern doesn't behave as I would (probably naively expect) given Matthew's explanation of this bug - wouldn't the bug cause the match to fail with {1,2} as well? Also, it seems odd that changing the condition in the if clause should change what gets captured. Anyone have any thoughts? |
|||
msg355989 - (view) | Author: Matthew Barnett (mrabarnett) * ![]() |
Date: 2019-11-05 00:48 | |
It's been many years since I looked at the code, and there have been changes since then, so some of the details might not be correct. As to have it should behave: re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a') Iteration 1. Match the repeated part. Group 1 matches. Iteration 2. Match the repeated part. Group 1 matches. Has group 2 matched? No. Try to match 'z'. Fail and backtrack. Retry the repeated part. Iteration 2. Has group 1 matched? Yes. Group 2 matches. Has group 2 matched? Yes. Try to match 'a'. Success. Group 1 matched and group 2 matched. re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a') Iteration 1. Match the repeated part. Group 1 matches. Iteration 2. Match the repeated part. Group 1 matches. Has group 1 matched? Yes. Try to match 'a'. Success. Group 1 matched and group 2 didn't match. |
|||
msg356435 - (view) | Author: Jackson Riley (jacksonriley) * | Date: 2019-11-12 10:33 | |
Hi Matthew, Serhiy, I tried to identify the right places in re to fix things but have found it a bit difficult. I wrote up my attempt (at https://enhackathon.github.io/2019/11/04/JacksonRiley.html) which includes some examples that behave differently from how I would have expected this bug to manifest itself. Any further guidance would be greatly appreciated! |
History | |||
---|---|---|---|
Date | User | Action | Args |
2019-11-12 10:33:07 | jacksonriley | set | messages: + msg356435 |
2019-11-05 00:48:04 | mrabarnett | set | messages: + msg355989 |
2019-11-04 22:02:29 | jacksonriley | set | messages: + msg355982 |
2019-11-04 15:45:50 | jacksonriley | set | messages: + msg355955 |
2019-11-04 15:35:21 | serhiy.storchaka | set | messages: + msg355953 |
2019-11-04 15:19:05 | jacksonriley | set | nosy:
+ jacksonriley messages: + msg355950 |
2019-10-29 11:18:52 | LewisGaul | set | messages: + msg355646 |
2019-10-28 01:59:10 | malin | set | nosy:
+ malin |
2019-10-27 18:36:00 | mrabarnett | set | messages: + msg355493 |
2019-10-27 15:14:56 | LewisGaul | set | nosy:
+ LewisGaul messages: + msg355471 |
2016-10-16 08:54:37 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka stage: needs patch versions: + Python 2.7, Python 3.5, Python 3.6, Python 3.7, - Python 3.4 |
2015-03-17 18:49:44 | abacabadabacaba | create |