classification
Title: re.sub substitution match group contains wrong value after unmatched pattern was processed
Type:
Stage: resolved
Components: Regular Expressions
Versions: Python 3.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: William Budd, ezio.melotti, mrabarnett, serhiy.storchaka
Priority: normal
Keywords:

Created on 2017-06-21 02:38 by William Budd, last changed 2017-06-21 06:54 by serhiy.storchaka. This issue is now closed.

Messages (7)
msg296506 - Author: William Budd (William Budd) Date: 2017-06-21 02:38
import re

pattern = re.compile('<div>(<p>.*?</p>)</div>', flags=re.DOTALL)

----------------------------------------------------------------

# This works as expected in the following case:

print(re.sub(pattern, '\\1',
             '<div><p>foo</p></div>\n'
             '<div><p>bar</p>123456789</div>\n'))

# which outputs:

<p>foo</p>
<div><p>bar</p>123456789</div>

----------------------------------------------------------------

# However, it does NOT work as I expect in this case:

print(re.sub(pattern, '\\1',
             '<div><p>foo</p>123456789</div>\n'
             '<div><p>bar</p></div>\n'))

# actual output:

<p>foo</p>123456789</div>
<div><p>bar</p>

# expected output:

<div><p>foo</p>123456789</div>
<p>bar</p>

----------------------------------------------------------------

It seems that pattern matching/substitution iterations only go haywire once the matching iteration immediately prior to it turned out not to be a match. Maybe some internal variable is not cleaned up properly in an edge(?) case triggered by the example above?
msg296508 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-21 03:26
It works correctly. Scanning from left to right, it finds a substring that starts with '<div><p>' and ends with '</p></div>'. The leftmost such substring starts at index 0 and ends just before the final '\n'. Overlapping matches are not found.
msg296509 - Author: William Budd (William Budd) Date: 2017-06-21 03:32
I don't understand... Isn't the "?" in ".*?" supposed to make the ".*" matching non-greedy, hence matching the first "</p>" rather than the last "</p>"?
msg296510 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-21 03:47
Yes, it is non-greedy. But it needs to match not just '</p>', but '</p></div>'. After finding the first '</p>' it doesn't see a '</div>' right after it, so it continues searching until it finds '</p></div>'.
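
To make the explanation above concrete, here is a small sketch that inspects the single match the original pattern produces on the second input. The variable names `text` and `m` are only illustrative; the pattern and input are the ones from the report.

```python
import re

# Pattern and input taken from the original report.
pattern = re.compile('<div>(<p>.*?</p>)</div>', flags=re.DOTALL)
text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')

m = pattern.search(text)

# The match begins at the first '<div>' and ends just before the final '\n'.
print(m.span())

# Group 1 is the shortest run starting at the first '<p>' that is
# immediately followed by '</div>'. Because the first '</p>' is followed
# by '123456789', the lazy '.*?' keeps expanding across the newline
# (re.DOTALL) until it reaches the second '</p>'.
print(repr(m.group(1)))
```

Non-greedy only means "prefer the shortest expansion that still lets the whole pattern succeed", so the group boundary by itself does not stop the expansion.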
msg296511 - Author: William Budd (William Budd) Date: 2017-06-21 04:04
I now see you're right of course. Not a bug after all. Thank you.

I mistakenly assumed that the group boundary ")" would delimit the end of the non-greedy match group. I.e., ".*?</p>" versus ".*?</p></div>".

I don't see a way to accomplish the "even less greedy" variant I'm looking for though...
msg296515 - Author: William Budd (William Budd) Date: 2017-06-21 04:50
Doh! This has a really easy solution, doesn't it; just replace "." with "[^<]": re.compile('<div>(<p>[^<]*?</p>)</div>', flags=re.DOTALL).

Sorry about the noise.
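
A quick check of the "[^<]" workaround, as a sketch rather than a general HTML solution. It assumes the paragraph body never contains a '<' (no nested tags); the `nested` input is a made-up example to show that limitation. Since the pattern no longer contains ".", re.DOTALL has no effect and is left out here, and the lazy "?" is not needed either because "[^<]*" cannot run past "</p>".

```python
import re

# Negated character class instead of '.': the body of <p> may not contain '<'.
pattern = re.compile('<div>(<p>[^<]*</p>)</div>')

text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')
result = pattern.sub(r'\1', text)
print(result)

# Limitation (hypothetical input): a paragraph with nested markup is not
# matched at all, so the string comes back unchanged.
nested = '<div><p>a <b>b</b></p></div>\n'
nested_result = pattern.sub(r'\1', nested)
print(nested_result)
```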
msg296527 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-06-21 06:54
Atomic groups can help you: '<div>((?><p>.*?</p>))</div>'.

But this feature is not supported in the re module yet (see issue433030). You can use the third-party regex module which is compatible with the re module and supports atomic grouping.

>>> import regex as re
>>> pattern = re.compile('<div>((?><p>.*?</p>))</div>', flags=re.DOTALL)
>>> print(re.sub(pattern, '\\1',
...              '<div><p>foo</p>123456789</div>\n'
...              '<div><p>bar</p></div>\n'))
<div><p>foo</p>123456789</div>
<p>bar</p>
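
For readers who want to stay with the standard library: the re module gained atomic groups in Python 3.11, and on older versions a common workaround is to emulate one with a lookahead plus a backreference. The sketch below uses that emulation, which is a different construct from the "(?>...)" syntax shown above. It relies on lookahead assertions not being re-entered on backtracking once they have succeeded.

```python
import re

# Emulated atomic group: the lookahead captures the shortest '<p>...</p>'
# into group 1, and '\1' then consumes exactly that text. If '</div>' does
# not follow, the engine cannot go back and extend the capture, so the
# attempt at this position fails and the search moves on.
pattern = re.compile(r'<div>(?=(<p>.*?</p>))\1</div>', flags=re.DOTALL)

text = ('<div><p>foo</p>123456789</div>\n'
        '<div><p>bar</p></div>\n')
result = pattern.sub(r'\1', text)
print(result)
```

Group 1 still holds the paragraph, so the same '\\1' replacement works unchanged.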
History
Date                 User              Action  Args
2017-06-21 06:54:30  serhiy.storchaka  set     messages: + msg296527
2017-06-21 04:50:57  William Budd      set     messages: + msg296515
2017-06-21 04:04:51  William Budd      set     messages: + msg296511
2017-06-21 03:47:08  serhiy.storchaka  set     messages: + msg296510
2017-06-21 03:32:13  William Budd      set     messages: + msg296509
2017-06-21 03:26:30  serhiy.storchaka  set     status: open -> closed; nosy: + serhiy.storchaka; messages: + msg296508; resolution: not a bug; stage: resolved
2017-06-21 02:38:32  William Budd      create