classification
Title: Negative lookaround assertions sometimes leak capture groups
Type: behavior
Stage:
Components: Regular Expressions
Versions: Python 3.10, Python 3.9

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, jirkamarsik, mrabarnett
Priority: normal
Keywords:

Created on 2021-10-20 15:52 by jirkamarsik, last changed 2022-04-11 14:59 by admin.

Messages (2)
msg404479 - Author: Jirka Marsik (jirkamarsik) Date: 2021-10-20 15:52
When you have capture groups inside a negative lookaround assertion, the strings captured by those capture groups can sometimes survive the failure of the assertion and feature in the returned Match object.

Here it is illustrated with lookbehinds and lookaheads:

>>> import re
>>> re.search(r"(?<!(a)c)de", "abde").group(1)
'a'
>>> re.search(r"(?!(a)c)ab", "ab").group(1)
'a'

Even though the search for the expression '(a)c' fails when trying to match 'c', the string 'a' is still reported as having been successfully matched by capture group 1. The expected behavior would be for the capture group 1 to not have a match.
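The same check can be written as a small script, which is convenient for running against several interpreter versions. This is only a sketch: `captured_group_1` and `CASES` are hypothetical names chosen for illustration, not part of the `re` module, and no output is claimed for versions other than those in this report.

```python
import re

def captured_group_1(pattern, string):
    """Return group 1 of the first match, or None if it did not participate.

    Hypothetical helper for illustration; raises ValueError if nothing matches.
    """
    match = re.search(pattern, string)
    if match is None:
        raise ValueError(f"{pattern!r} does not match {string!r}")
    return match.group(1)

# The two patterns from the report above.
CASES = [
    (r"(?<!(a)c)de", "abde"),  # negative lookbehind
    (r"(?!(a)c)ab", "ab"),     # negative lookahead
]

for pattern, string in CASES:
    # Expected: None. On the affected versions the report shows 'a' instead.
    print(pattern, "->", repr(captured_group_1(pattern, string)))
```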

For the following reasons, I believe this behavior is unintentional and results from Python not cleaning up after the asserted subexpression fails (e.g. by running the asserted subexpression in a new stack frame).

1) This behavior is not being systematically enforced.
   We can observe this behavior only in certain cases. Modifying the expression to use the branching operator `|` inside the asserted subexpression leads to the expected behavior.

>>> re.search(r"(?<!(a)c|(a)d)de", "abde").group(1) is None
True
>>> re.search(r"(?!(a)c|(a)d)ab", "ab").group(1) is None
True
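One way to look at how the two forms differ internally is to dump what the compiler produces for each, using the `re.DEBUG` flag. The sketch below only collects the dumps for side-by-side reading. `debug_dump` is a hypothetical helper name, and the idea that the difference in leaking is tied to the branch construct is an assumption that this report does not confirm.

```python
import contextlib
import io
import re

def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

# Lookahead variants from the report: without and with alternation.
plain = debug_dump(r"(?!(a)c)ab")
branched = debug_dump(r"(?!(a)c|(a)d)ab")

print(plain)
print(branched)
```

The exact dump format varies between Python versions, so it is best read by eye rather than parsed.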

2) Other languages do not leak capture groups from negative lookarounds.

   Node.js (ECMAScript):

> /(?<!(a)c)de/.exec("abde")[1]
undefined
> /(?!(a)c)ab/.exec("ab")[1]
undefined
> /(?<!(a)c|(a)d)de/.exec("abde")[1]
undefined
> /(?!(a)c|(a)d)ab/.exec("ab")[1]
undefined

   MRI (Ruby):

irb(main):001:0> /(?<!(a)c)de/.match("abde")[1]
<unsupported>
irb(main):002:0> /(?!(a)c)ab/.match("ab")[1]
=> #<MatchData "ab" 1:nil>
irb(main):003:0> /(?<!(a)c|(a)d)de/.match("abde")[1]
<unsupported>
irb(main):004:0> /(?!(a)c|(a)d)ab/.match("ab")[1]
=> #<MatchData "ab" 1:nil 2:nil>

   JShell (Java):

jshell> Matcher m = java.util.regex.Pattern.compile("(?<!(a)c)de").matcher("abde")
jshell> m.find()
jshell> m.group(1)
$3 ==> null
jshell> Matcher m = java.util.regex.Pattern.compile("(?<!(a)c|(a)d)de").matcher("abde")
jshell> m.find()
jshell> m.group(1)
$6 ==> null
jshell> Matcher m = java.util.regex.Pattern.compile("(?!(a)c)ab").matcher("ab")
m ==> java.util.regex.Matcher[pattern=(?!(a)c)ab region=0,2 lastmatch=]
jshell> m.find()
jshell> m.group(1)
$9 ==> null
jshell> Matcher m = java.util.regex.Pattern.compile("(?!(a)c|(a)d)ab").matcher("ab")
m ==> java.util.regex.Matcher[pattern=(?!(a)c|(a)d)ab region=0,2 lastmatch=]
jshell> m.find()
jshell> m.group(1)
$12 ==> null

3) Not leaking capture groups from negative lookarounds is symmetric to how capture groups are treated in failed matches.
   When regular expression engines fail to match a regular expression, they do not provide a partial match object that contains the state of capture groups at the time the matcher failed. Instead, the state of the matcher is discarded and some bottom value is returned (None, null or undefined). Similarly, one would expect nested subexpressions to behave the same way, so that capture groups from failed match attempts are discarded.
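
To make the top-level half of that symmetry concrete, here is a minimal sketch that runs the asserted subexpression from the report as a pattern in its own right. The variable names are illustrative only.

```python
import re

# The asserted subexpression from the report, run as a top-level pattern.
# 'a' is consumed by group 1 before 'c' fails to match, yet no partial
# result is exposed: the whole attempt yields None.
result = re.search(r"(a)c", "ab")
print(result)  # None

# When the same subexpression does match, group 1 is reported as usual.
ok = re.search(r"(a)c", "ac")
print(ok.group(1))  # a
```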

msg404615 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2021-10-21 16:19
It's definitely a bug.

In order for the pattern to match, the negative lookaround must match, which means that its subexpression mustn't match, so none of the groups in that subexpression have captured.

History
Date                 User         Action  Args
2022-04-11 14:59:51  admin        set     github: 89702
2021-10-21 16:19:16  mrabarnett   set     messages: + msg404615
                                          versions: + Python 3.10
2021-10-20 15:52:58  jirkamarsik  create