classification
Title: re.sub[n] doesn't seem to handle /Z replacements correctly in all cases
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 3.4, Python 3.5, Python 2.7
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: Alexander.Schmolck, malin, mrabarnett, serhiy.storchaka
Priority: normal
Keywords:

Created on 2010-11-05 14:33 by Alexander.Schmolck, last changed 2019-04-27 04:41 by serhiy.storchaka. This issue is now closed.

Messages (4)
msg120499 - Author: Alexander Schmolck (Alexander.Schmolck) Date: 2010-11-05 14:33
In certain cases a zero-width \Z match that should be replaced isn't.

An example might help:

 re.compile('(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)').subn(lambda m:next('<'+k+'>' for k,v in m.groupdict().items() if v is not None), 'foobar ')

this gives

 ('foobar<trailing_ws>', 1)

I would have expected

('foobar<trailing_ws><no_final_newline>', 2)

Contrast this with the following behavior:

 [m.span() for m in re.compile('(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)', re.M).finditer('foobar ')]

gives
 
 [(6, 7), (7, 7)]

The matches are clearly not overlapping and the re module docs for sub say "Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.", so I would have expected two replacements.


This seems to be what perl is doing:

 echo -n 'foobar ' | perl -pe 's/(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)/<$&>/g'

gives
 foobar< ><>%
msg120535 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-11-05 21:09
It's a bug caused by the way the re module tries to avoid getting stuck when it finds a zero-width match. Basically, the workaround is to advance one character after a zero-width match, but that doesn't always give the correct result.

There are a number of related issues like issue #1647489 ("zero-length match confuses re.finditer()").
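
The effect described above can be modelled in pure Python. This is only a simplified sketch of the pre-3.7 substitution loop, not the actual C code in the re module; the names old_style_subn, PATTERN and tag are made up for illustration. It assumes that an empty match sitting directly after the previous match is ignored and that the search position then moves forward by one character, which is enough to reproduce the result reported in this issue.

```python
import re

# Same pattern as in the report, written as a raw string.
PATTERN = re.compile(
    r'(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)',
    re.M,
)

def tag(m):
    # Name of the named group that took part in the match, wrapped in <...>.
    return next('<' + k + '>' for k, v in m.groupdict().items() if v is not None)

def old_style_subn(pattern, repl, string):
    """Hypothetical model of the old loop (an assumption, not the real code)."""
    out = []
    pos = 0    # where the next search starts
    last = 0   # end of the text already copied to the output
    n = 0      # number of replacements made
    while pos <= len(string):
        m = pattern.search(string, pos)
        if m is None:
            break
        start, end = m.span()
        if start == end == last and n > 0:
            # Empty match directly after the previous match: ignored.
            pass
        else:
            out.append(string[last:start])
            out.append(repl(m))
            last = end
            n += 1
        # Advance one character after a zero-width match to avoid looping forever.
        pos = end + 1 if start == end else end
    out.append(string[last:])
    return ''.join(out), n

print(old_style_subn(PATTERN, tag, 'foobar '))
# ('foobar<trailing_ws>', 1)
```

In this model the \Z match at (7, 7) is found, but it is thrown away because it touches the end of the trailing_ws match at (6, 7), and the one-character advance then moves the search past the end of the string.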
msg228010 - Author: Mark Lawrence (BreamoreBoy) * Date: 2014-09-30 21:47
@Serhiy can you take a look at this as I recall you've been doing some regex work?
msg340960 - Author: Ma Lin (malin) * Date: 2019-04-27 02:37
This bug was fixed in Python 3.7, see issue32308.

Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 23 2018, 23:31:17) [MSC v.1916 32 bit (Intel)] on win32
>>> re.compile('(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)').subn(lambda m:next('<'+k+'>' for k,v in m.groupdict().items() if v is not None), 'foobar ')
('foobar<trailing_ws>', 1)

Python 3.7.3rc1 (tags/v3.7.3rc1:69785b2127, Mar 12 2019, 22:37:55) [MSC v.1916 64 bit (AMD64)] on win32
>>> re.compile('(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)').subn(lambda m:next('<'+k+'>' for k,v in m.groupdict().items() if v is not None), 'foobar ')
('foobar<trailing_ws><no_final_newline>', 2)
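
For anyone who wants to check their own interpreter, here is a more readable version of the same reproduction. It is a sketch rather than an official test: the names PATTERN and tag are made up, and the pattern is written as a raw string, which should be equivalent for the regex engine and avoids the invalid '\Z' string escape in the one-liner. Based on the two sessions above, the replacement count is expected to be 1 before Python 3.7 and 2 from 3.7 onwards.

```python
import re
import sys

PATTERN = re.compile(
    r'(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)',
    re.M,
)

def tag(m):
    # Name of the named group that took part in the match, wrapped in <...>.
    return next('<' + k + '>' for k, v in m.groupdict().items() if v is not None)

# finditer reports both matches; subn should replace both on a fixed interpreter.
spans = [m.span() for m in PATTERN.finditer('foobar ')]
result = PATTERN.subn(tag, 'foobar ')

print(sys.version_info[:2], spans, result)
```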
History
Date                 User                Action  Args
2019-04-27 04:41:18  serhiy.storchaka    set     status: open -> closed; resolution: out of date; stage: resolved
2019-04-27 02:37:04  malin               set     nosy: + malin; messages: + msg340960
2019-04-26 20:42:37  BreamoreBoy         set     nosy: - BreamoreBoy
2014-09-30 21:47:20  BreamoreBoy         set     nosy: + BreamoreBoy, serhiy.storchaka; messages: + msg228010; versions: + Python 3.4, Python 3.5, - Python 3.1
2010-11-05 23:00:52  terry.reedy         set     versions: + Python 2.7, - Python 2.6
2010-11-05 21:09:58  mrabarnett          set     messages: + msg120535
2010-11-05 15:38:19  r.david.murray      set     nosy: + mrabarnett
2010-11-05 14:33:47  Alexander.Schmolck  create