classification
Title: re module: wrong capturing groups
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Ma Lin, beardypig, ezio.melotti, miss-islington, mrabarnett, serhiy.storchaka, xtreak
Priority: normal Keywords: 3.7regression, patch

Created on 2018-07-31 13:11 by beardypig, last changed 2019-02-18 14:10 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11546 merged Ma Lin, 2019-01-14 01:26
PR 11919 merged miss-islington, 2019-02-18 13:27
Messages (10)
msg322771 - (view) Author: beardypig (beardypig) Date: 2018-07-31 13:11
I am experiencing an issue with the following regex when using finditer.

    (?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)

(I know it's not the best method of dealing with HTML, and this is a simplified version)

For example:

    [m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "<test><foo2/></test>")]

In Python 2.7, 3.5, and 3.6 it returns

    [{'tag': 'test', 'text': '<foo2/>'}, {'tag': 'foo2', 'text': None}]

But starting with 3.7 it returns

    [{'tag': 'test', 'text': '<foo2/>'}, {'tag': 'foo2', 'text': '<foo2/>'}]

The "text" group appears to be a copy of the previous "text" group.


Some other examples:

    "<test>Hello</test><foo/>" => [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': 'Hello'}] (expected: [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': None}])
    "<test>Hello</test><foo/><foo/>" => [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': 'Hello'}, {'tag': 'foo', 'text': None}] (expected: [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': None}, {'tag': 'foo', 'text': None}])
msg322799 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-07-31 16:42
➜  cpython git:(70d56fb525) ✗ ./python.exe
Python 3.7.0a2+ (tags/v3.7.0a2-341-g70d56fb525:70d56fb525, Jul 31 2018, 21:58:10)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜  cpython git:(70d56fb525) ✗ ./python.exe -c 'import re; print([m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "<test><foo2/></test>")])'
[{'tag': 'test', 'text': '<foo2/>'}, {'tag': 'foo2', 'text': '<foo2/>'}]


➜  cpython git:(e69fbb6a56) ✗ ./python.exe
Python 3.7.0a2+ (tags/v3.7.0a2-340-ge69fbb6a56:e69fbb6a56, Jul 31 2018, 22:12:06)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜  cpython git:(e69fbb6a56) ✗ ./python.exe -c 'import re; print([m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "<test><foo2/></test>")])'
[{'tag': 'test', 'text': '<foo2/>'}, {'tag': 'foo2', 'text': None}]

Does this have something to do with 70d56fb52582d9d3f7c00860d6e90570c6259371 (bpo-25054, bpo-1647489)?


Thanks
msg324990 - (view) Author: Ma Lin (Ma Lin) * Date: 2018-09-11 05:47
This bug silently produces wrong results, so I suggest marking it as a release blocker for 3.7.1.
msg333548 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-13 08:10
Here is a simplified test case. It seems the `state` is not reset properly.

Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47)
>>> import re
>>> re.findall(r"(?=(<\w+>)(<\w+>)?)", "<aaa><bbb>")
[('<aaa>', '<bbb>'), ('<bbb>', '')]

Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28)
>>> import re
>>> re.findall(r"(?=(<\w+>)(<\w+>)?)", "<aaa><bbb>")
[('<aaa>', '<bbb>'), ('<bbb>', '<bbb>')]
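
Since `re.findall` reports a group that did not participate as `''`, a variant built on `re.finditer` may make the difference easier to see, because `groups()` keeps `None` for such groups. This is only a sketch; `groups_per_match` is a hypothetical helper name.

```python
import re

PATTERN = r"(?=(<\w+>)(<\w+>)?)"
TEXT = "<aaa><bbb>"


def groups_per_match(pattern, text):
    """Return m.groups() for each match; groups that did not participate stay None."""
    return [m.groups() for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    # With correct behaviour, the optional second group is None for the
    # match starting at "<bbb>", since nothing follows it.
    print(groups_per_match(PATTERN, TEXT))
```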
msg333580 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-14 03:08
I tried to fix it. Feel free to create a new PR if you don't want this one.

I have one small question about PR 11546: should `state->data_stack` be deallocated as well?

FYI, see the function `state_reset(SRE_STATE* state)` in `_sre.c`:
https://github.com/python/cpython/blob/d4f9cf5545d6d8844e0726552ef2e366f5cc3abd/Modules/_sre.c#L340-L352
msg334078 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-20 03:58
Serhiy Storchaka lost his sight.
Please stop any work and rest, because your left eye will have more burden, and your mental burden will make it worse.
Go to hospital ASAP.

If any other core developer wants to review this patch, I would be glad to give a detailed explanation. The logic is not very complicated.
msg334139 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-21 14:34
The bug in the original post was introduced in Python 3.7.0.

While investigating the code, I found another bug involving capturing groups. This one has existed since very early versions.
The regex module doesn't have this bug.

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
>>> import re
>>> re.search(r"\b(?=(\t)|(x))x", "a\tx").groups()
('', 'x')

Expected result: (None, 'x')

Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
>>> import regex
>>> regex.search(r"\b(?=(\t)|(x))x", "a\tx").groups()
(None, 'x')
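
A small sketch for inspecting this second case with the standard `re` module only. The helper `participation` is a hypothetical name; it pairs each group's value with its span, where `(-1, -1)` means the group did not take part in the match.

```python
import re

# Only the second alternative of the lookahead can succeed at the position
# where the overall match is found, so group 1 should not participate.
MATCH = re.search(r"\b(?=(\t)|(x))x", "a\tx")


def participation(m):
    """Map each group number to (value, span) for the given match object."""
    return {i: (m.group(i), m.span(i)) for i in range(1, m.re.groups + 1)}


if __name__ == "__main__":
    for number, (value, span) in participation(MATCH).items():
        print(number, repr(value), span)
```

Printing the span next to the value shows where a suspicious capture came from, which a bare `groups()` tuple does not.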
msg335832 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-18 13:26
New changeset 4a7f44a2ed49ff1e87db062e7177a56c6e4bbdb0 by Serhiy Storchaka (animalize) in branch 'master':
bpo-34294: re module, fix wrong capturing groups in rare cases. (GH-11546)
https://github.com/python/cpython/commit/4a7f44a2ed49ff1e87db062e7177a56c6e4bbdb0
msg335833 - (view) Author: miss-islington (miss-islington) Date: 2019-02-18 13:48
New changeset 0e379d43acc25277f02262212932d3c589a2031b by Miss Islington (bot) in branch '3.7':
bpo-34294: re module, fix wrong capturing groups in rare cases. (GH-11546)
https://github.com/python/cpython/commit/0e379d43acc25277f02262212932d3c589a2031b
msg335836 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-18 14:10
Thank you for your PR Ma Lin!
History
Date User Action Args
2019-02-18 14:10:13 serhiy.storchaka set status: open -> closed
messages: + msg335836

keywords: + 3.7regression
resolution: fixed
stage: patch review -> resolved
2019-02-18 13:48:26 miss-islington set nosy: + miss-islington
messages: + msg335833
2019-02-18 13:27:45 miss-islington set pull_requests: + pull_request11944
2019-02-18 13:26:46 serhiy.storchaka set messages: + msg335832
2019-01-21 14:34:58 Ma Lin set messages: + msg334139
title: re.finditer and lookahead bug -> re module: wrong capturing groups
2019-01-20 03:58:27 Ma Lin set messages: + msg334078
2019-01-14 03:08:06 Ma Lin set messages: + msg333580
2019-01-14 01:27:14 Ma Lin set keywords: + patch
stage: patch review
pull_requests: + pull_request11164
2019-01-14 01:27:04 Ma Lin set keywords: + patch
stage: (no value)
pull_requests: + pull_request11163
2019-01-14 01:26:51 Ma Lin set keywords: + patch
stage: (no value)
pull_requests: + pull_request11162
2019-01-13 08:10:14 Ma Lin set messages: + msg333548
2018-09-11 05:47:21 Ma Lin set nosy: + Ma Lin
messages: + msg324990
2018-07-31 16:42:45 xtreak set messages: + msg322799
2018-07-31 16:17:15 xtreak set nosy: + xtreak
2018-07-31 13:20:51 serhiy.storchaka set assignee: serhiy.storchaka

nosy: + serhiy.storchaka
2018-07-31 13:11:05 beardypig create