classification
Title: re.sub calls repl function one time too many for catch-all regex
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 3.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, mrabarnett, scop, serhiy.storchaka
Priority: normal
Keywords:

Created on 2018-05-20 13:11 by scop, last changed 2018-05-20 15:04 by serhiy.storchaka. This issue is now closed.

Messages (5)
msg317166 - Author: Ville Skyttä (scop) * Date: 2018-05-20 13:11
(I'm fairly certain that the title doesn't describe the actual underlying issue all that well, however it is what I'm seeing so going with that for now.)

Compared to Python 3.6, 3.7 appears to call the repl function for re.sub one time too many, when given a catch-all regex. The extra call is made with a match consisting of an empty string. I think this is quite unexpected, and think it's a bug that I hope could be fixed before 3.7 is out.

Demonstration code:

    import re
    def repl(match):
        print(f"Called with match '{match.group(0)}'")
    re.sub(".*", repl, "foo")

3.6.3 produces the expected output:

    Called with match 'foo'

3.7.0b4+ (current git) demonstrates the extra call:

    Called with match 'foo'
    Called with match ''
msg317168 - Author: Ville Skyttä (scop) * Date: 2018-05-20 13:33
Right, it's not limited to repl functions.

Python 3.6.3:
$ python -c 'import re;print(re.sub(".*", "X", "foo"))'
X

Python 3.7.0b4+:
$ python -c 'import re;print(re.sub(".*", "X", "foo"))'
XX

Poking serhiy.storchaka who, according to the release notes, seems to have done quite a bit of work on re in 3.7.
msg317169 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-20 13:47
This is expected behavior and a documented change in 3.7. The pattern ".*" can match an empty string, and it matches an empty string at the end of the line. This is consistent with the behavior of re.finditer() and with the behavior of all regular expression implementations in other programming languages. Actually, it was an old bug in re.sub() that has been fixed in 3.7.

Compare, in 3.6:

>>> list(re.finditer('.*', 'foo'))
[<_sre.SRE_Match object; span=(0, 3), match='foo'>, <_sre.SRE_Match object; span=(3, 3), match=''>]
>>> re.sub('.*', lambda m: repr(m), 'foo')
"<_sre.SRE_Match object; span=(0, 3), match='foo'>"

In 3.7:

>>> list(re.finditer('.*', 'foo'))
[<re.Match object; span=(0, 3), match='foo'>, <re.Match object; span=(3, 3), match=''>]
>>> re.sub('.*', lambda m: repr(m), 'foo')
"<re.Match object; span=(0, 3), match='foo'><re.Match object; span=(3, 3), match=''>"

If you don't want to find an empty string, change your pattern so that it will not match an empty string: ".+".
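
A minimal sketch of that suggestion, assuming Python 3.7 or later. The `count=1` variant and the empty-input comparison are not from this thread; they are added only to show one alternative and one side effect of switching to ".+".

```python
import re

text = "foo"

# ".*" matches "foo" and then an empty string at the end (3.7+ behavior),
# so two replacements are made.
star_result = re.sub(".*", "X", text)

# ".+" needs at least one character, so there is no trailing empty match.
plus_result = re.sub(".+", "X", text)

# Alternative (an assumption, not mentioned in the thread): keep ".*" but
# cap the number of replacements at one.
capped_result = re.sub(".*", "X", text, count=1)

# Side effect to be aware of: on empty input ".+" replaces nothing,
# while ".*" still replaces the single empty match.
plus_on_empty = re.sub(".+", "X", "")
star_on_empty = re.sub(".*", "X", "")

print(repr(star_result), repr(plus_result), repr(capped_result))
print(repr(plus_on_empty), repr(star_on_empty))
```

Which variant fits depends on whether empty input should be replaced at all: ".+" leaves it untouched, while `count=1` keeps the old single-replacement result for ".*".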
msg317170 - Author: Ville Skyttä (scop) * Date: 2018-05-20 14:03
Oh, I see, sorry about the noise, then. (Only looked at the "Improved Modules" -> re section in the what's new, thus missed the doc.)
msg317175 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-20 15:04
This change is documented in the subsection "Changes in the Python API" of the section "Porting to Python 3.7".
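
To illustrate the consistency with re.finditer() described in this thread, here is a rough sketch for Python 3.7 or later. The helper `sub_via_finditer` is hypothetical (it is not part of the re module) and only handles a plain-string replacement without backslash escapes. The `x*` / `abxd` case is the example I recall from that porting section.

```python
import re

def sub_via_finditer(pattern, repl, string):
    """Hypothetical helper: rebuild `string`, replacing every match that
    re.finditer() reports with the literal string `repl`."""
    pieces = []
    last_end = 0
    for m in re.finditer(pattern, string):
        pieces.append(string[last_end:m.start()])  # text between matches
        pieces.append(repl)                        # one replacement per match
        last_end = m.end()
    pieces.append(string[last_end:])               # trailing unmatched text
    return "".join(pieces)

for pattern, s in [(".*", "foo"), ("x*", "abxd")]:
    # On 3.7+ both columns should agree, including the empty matches.
    print(pattern, s, re.sub(pattern, "-", s), sub_via_finditer(pattern, "-", s))
```

On 3.6 the two results would differ for these inputs, because re.sub() skipped an empty match adjacent to a previous match while re.finditer() reported it.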
History
Date                 User              Action  Args
2018-05-20 15:04:36  serhiy.storchaka  set     messages: + msg317175
2018-05-20 14:03:46  scop              set     messages: + msg317170
2018-05-20 13:47:05  serhiy.storchaka  set     status: open -> closed; resolution: not a bug; messages: + msg317169; stage: resolved
2018-05-20 13:33:40  scop              set     nosy: + serhiy.storchaka; messages: + msg317168
2018-05-20 13:14:27  scop              set     nosy: + ezio.melotti, mrabarnett; components: + Regular Expressions
2018-05-20 13:11:22  scop              create