classification
Title: In re's positive lookbehind assertion repetition works
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, ezio.melotti, mrabarnett, py.user, serhiy.storchaka, tim.peters
Priority: normal
Keywords:

Created on 2012-04-01 08:07 by py.user, last changed 2014-11-01 18:20 by serhiy.storchaka. This issue is now closed.

Messages (11)
msg157264 - (view) Author: py.user (py.user) * Date: 2012-04-01 08:07
>>> import re
>>> re.search(r'(?<=a){100,200}bc', 'abc', re.DEBUG)
max_repeat 100 200 
  assert -1 
    literal 97 
literal 98 
literal 99 
<_sre.SRE_Match object at 0xb7429f38>
>>> re.search(r'(?<=a){100,200}bc', 'abc', re.DEBUG).group()
'bc'
>>>


I expected "nothing to repeat"
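
For context, here is a minimal sketch of what does and does not trigger "nothing to repeat" in the re module. The helper name `compiles` is hypothetical and only used for illustration; the behaviour shown assumes a reasonably recent CPython 3.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# A quantifier with nothing in front of it is rejected ("nothing to repeat").
print(compiles(r'*a'))                 # False
# A quantified lookbehind does have an item to repeat, so it is accepted.
print(compiles(r'(?<=a){100,200}bc'))  # True
```

The parser only checks that there is an item in front of the quantifier, not whether repeating that item is useful.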
msg221588 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-26 01:33
Can someone comment on this regex problem, please? Regexes are just not my cup of tea.
msg221594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-06-26 07:24
Technically this is not a bug.
msg221601 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2014-06-26 10:25
Lookarounds can contain capture groups:

>>> import re
>>> re.search(r'a(?=(.))', 'ab').groups()
('b',)
>>> re.search(r'(?<=(.))b', 'ab').groups()
('a',)

so lookarounds that are optional or can have no repeats might have a use.

I'm not sure whether it's useful to repeat them more than once, but that's another matter.

I'd say that it's not a bug.
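
A small sketch of the "optional lookaround" case mentioned above. The pattern is a hypothetical illustration, not something taken from real code: it matches "b" and records in group 1 whether an "a" came right before it.

```python
import re

# Optional lookbehind containing a capture group: group 1 is set only
# when the character before "b" is "a".
pattern = re.compile(r'(?<=(a))?b')

print(pattern.search('ab').groups())  # ('a',)
print(pattern.search('xb').groups())  # (None,)
```

Here the `?` quantifier on the assertion changes what is captured, so a blanket "nothing to repeat" error for quantified lookarounds would rule out this kind of use.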
msg221631 - (view) Author: py.user (py.user) * Date: 2014-06-26 19:01
>>> m = re.search(r'(?<=(a)){10}bc', 'abc', re.DEBUG)
max_repeat 10 10 
  assert -1 
    subpattern 1 
      literal 97 
literal 98 
literal 99 
>>> m.group()
'bc'
>>>
>>> m.groups()
('a',)
>>>


It works like there are 10 letters "a" before letter "b".
msg221633 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2014-06-26 19:16
Lookarounds can capture, but they don't consume. That lookbehind is matching the same part of the string every time.
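
To illustrate "capture, but don't consume", the sketch below compares a single lookbehind with the same lookbehind repeated 10 times. The variable names are only for illustration.

```python
import re

once = re.search(r'(?<=(a))bc', 'abc')
ten_times = re.search(r'(?<=(a)){10}bc', 'abc')

# Group 1 captures the "a" at index 0, but the overall match still
# starts at index 1: the lookbehind did not consume anything.
print(once.span(), once.span(1))            # (1, 3) (0, 1)
print(ten_times.span(), ten_times.span(1))  # (1, 3) (0, 1)
```

Both patterns report the same spans, because every repetition inspects the same single character before "b".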
msg221635 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-06-26 19:24
I would not call this a bug - it's just usually a silly thing to do ;-)

Note, e.g., that p{N} is shorthand for writing p N times.  For example, p{4} is much the same as pppp (but not exactly so in all cases; e.g., if `p` happens to contain a capturing group, the numbering of all capturing groups will differ between those two spellings).

A successful assertion generally matches an empty string (does not advance the position being looked at in the target string).  So, e.g., if we're at some point in the target string where

(?<=a)

matches, then

(?<=a)(?<=a)

will also match at the same point, and so will

(?<=a)(?<=a)(?<=a)

and

(?<=a)(?<=a)(?<=a)(?<=a)

and so on & so on.  The position in the target string never changes, so each redundant assertion succeeds too.  So (?<=a){N} _should_ match there too.

> It works like there are 10 letters "a" before letter "b".

It's much more like you're asking whether "a" appears before "b", but are rather pointlessly asking the same question 10 times ;-)
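
The equivalence described above can be checked with a short sketch. The sample strings and the helper `span_or_none` are assumptions chosen for illustration; the last two lines show the group-numbering caveat.

```python
import re

repeated = re.compile(r'(?<=a){4}b')
spelled_out = re.compile(r'(?<=a)(?<=a)(?<=a)(?<=a)b')

def span_or_none(pattern, text):
    """Return the span of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.span() if m else None

samples = ['ab', 'aaaab', 'xb', 'b', 'xxab']
for text in samples:
    # Both spellings should report the same result for every sample.
    print(text, span_or_none(repeated, text), span_or_none(spelled_out, text))

# The capturing-group caveat: the two spellings number groups differently.
print(re.compile(r'((?<=a)){2}b').groups)       # 1
print(re.compile(r'((?<=a))((?<=a))b').groups)  # 2
```

Note that a single "a" before "b" is enough for `(?<=a){4}b` to match; four consecutive "a" characters are not required.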
msg221639 - (view) Author: py.user (py.user) * Date: 2014-06-26 19:49
Tim Peters wrote:
> (?<=a)(?<=a)(?<=a)(?<=a)


There are four different points.
If a1 before a2 and a2 before a3 and a3 before a4 and a4 before something.

Otherwise repetition of assertion has no sense. If it has no sense, there should be an exception.
msg221646 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-06-26 20:52
>> (?<=a)(?<=a)(?<=a)(?<=a)

> There are four different points.
> If a1 before a2 and a2 before a3 and a3 before a4 and a4
> before something.

Sorry, that view doesn't make any sense.  A successful lookbehind assertion matches the empty string.  Same as the regexp

()()()()

matches 4 empty strings (and all the _same_ empty string) at any point.

> Otherwise repetition of assertion has no sense.

As I said before, it's "usually a silly thing to do".  It does make sense, just not _useful_ sense - it's "silly" ;-)

> If it has no sense, there should be an exception.

Why?  Code like

    i += 0

is usually pointless too, but it's not up to a programming language to force you to code only useful things.

It's easy to write regexps that are pointless.  For example, the regexp

(?=a)b

can never succeed.  Should that raise an exception?  Or should the regexp

(?=a)a

raise an exception because the (?=a) part is redundant?  Etc.
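
Both of these pointless patterns compile without complaint, as this sketch shows. The test strings are arbitrary examples.

```python
import re

# (?=a)b requires the next character to be both "a" and "b",
# so it cannot match anywhere.
never = re.search(r'(?=a)b', 'ab abba b')
print(never)  # None

# (?=a)a behaves like plain "a": the lookahead is redundant.
with_assert = [m.span() for m in re.finditer(r'(?=a)a', 'banana')]
plain = [m.span() for m in re.finditer(r'a', 'banana')]
print(with_assert == plain)  # True
```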
msg221666 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-06-26 23:42
BTW, note that the idea "successful lookaround assertions match an empty string" isn't just a figure of speech:  it's the literal truth, and - indeed - is key to understanding what happens here.  You can see this by adding some capturing groups around the assertions.  Like so:

m = re.search("((?<=a))((?<=a))((?<=a))((?<=a))b", "xab")

Then

[m.span(i) for i in range(1, 5)]

produces

[(2, 2), (2, 2), (2, 2), (2, 2)]

That is, each assertion matched (the same) empty string immediately preceding "b" in the target string.

This makes perfect sense - although it may not be useful.  So I think this report should be closed with "so if it bothers you, don't do it" ;-)
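
A self-contained version of the experiment above, for anyone who wants to run it directly. It additionally prints the matched text of each group and the span of the whole match.

```python
import re

m = re.search(r'((?<=a))((?<=a))((?<=a))((?<=a))b', 'xab')

# Every group wrapped around an assertion matched the same empty string
# at index 2, immediately before "b".
spans = [m.span(i) for i in range(1, 5)]
print(spans)                               # [(2, 2), (2, 2), (2, 2), (2, 2)]
print([m.group(i) for i in range(1, 5)])   # ['', '', '', '']

# Only "b" itself was consumed by the overall match.
print(m.span())                            # (2, 3)
```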
msg221673 - (view) Author: py.user (py.user) * Date: 2014-06-27 04:48
Tim Peters wrote:
> Should that raise an exception?

>i += 0

>(?=a)b

>(?=a)a


These are different cases. The first is very special. The second and third are special too, but with different assertion contents they can do useful work.

By contrast, "(?=any contents){N}a" never uses the "{N}" part in any useful manner.


> So I think this report should be closed

I looked into Perl's behaviour today; it works like Python's. It's not an error there.
History
Date                 User              Action  Args
2014-11-01 18:20:29  serhiy.storchaka  set     status: open -> closed; resolution: not a bug; stage: resolved
2014-06-27 04:48:54  py.user           set     messages: + msg221673
2014-06-26 23:42:32  tim.peters        set     messages: + msg221666
2014-06-26 20:52:25  tim.peters        set     messages: + msg221646
2014-06-26 19:49:53  py.user           set     messages: + msg221639
2014-06-26 19:24:17  tim.peters        set     nosy: + tim.peters; messages: + msg221635
2014-06-26 19:16:23  mrabarnett        set     messages: + msg221633
2014-06-26 19:01:46  py.user           set     messages: + msg221631
2014-06-26 10:25:14  mrabarnett        set     messages: + msg221601
2014-06-26 07:24:26  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg221594
2014-06-26 01:33:37  BreamoreBoy       set     nosy: + BreamoreBoy; messages: + msg221588; versions: + Python 2.7, Python 3.4, Python 3.5, - Python 3.2
2012-04-01 08:07:50  py.user           create