re.sub[n] doesn't seem to handle /Z replacements correctly in all cases #54537

Closed
AlexanderSchmolck mannequin opened this issue Nov 5, 2010 · 4 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@AlexanderSchmolck

AlexanderSchmolck mannequin commented Nov 5, 2010

BPO 10328
Nosy @serhiy-storchaka, @animalize

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2019-04-27.04:41:18.584>
created_at = <Date 2010-11-05.14:33:47.243>
labels = ['expert-regex', 'type-bug']
title = "re.sub[n] doesn't seem to handle /Z replacements correctly in all cases"
updated_at = <Date 2019-04-27.04:41:18.583>
user = 'https://bugs.python.org/AlexanderSchmolck'

bugs.python.org fields:

activity = <Date 2019-04-27.04:41:18.583>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2019-04-27.04:41:18.584>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2010-11-05.14:33:47.243>
creator = 'Alexander.Schmolck'
dependencies = []
files = []
hgrepos = []
issue_num = 10328
keywords = []
message_count = 4.0
messages = ['120499', '120535', '228010', '340960']
nosy_count = 4.0
nosy_names = ['mrabarnett', 'Alexander.Schmolck', 'serhiy.storchaka', 'malin']
pr_nums = []
priority = 'normal'
resolution = 'out of date'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue10328'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']


In certain cases a zero-width \Z match that should be replaced isn't.

An example might help:

re.compile('(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\\n])\Z)').subn(lambda m:next('<'+k+'>' for k,v in m.groupdict().items() if v is not None), 'foobar ')

this gives

('foobar<trailing_ws>', 1)

I would have expected

('foobar<trailing_ws><no_final_newline>', 2)

Contrast this with the following behavior:

[m.span() for m in re.compile('(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\\n])\Z)', re.M).finditer('foobar ')]

gives

[(6, 7), (7, 7)]

The matches are clearly not overlapping and the re module docs for sub say "Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.", so I would have expected two replacements.

This seems to be what Perl does:

echo -n 'foobar ' | perl -pe 's/(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\\n])\Z)/<$&>/g'

gives
foobar< ><>%

@AlexanderSchmolck AlexanderSchmolck mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Nov 5, 2010
@mrabarnett

mrabarnett mannequin commented Nov 5, 2010

It's a bug caused by trying to avoid getting stuck when a zero-width match is found. Basically, the workaround is to advance one character after a zero-width match, but that doesn't always give the correct result.

There are a number of related issues like issue bpo-1647489 ("zero-length match confuses re.finditer()").
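
To make the effect concrete, here is a rough pure-Python sketch of a substitution loop, not the actual C implementation in `_sre`. The helper names `sub_with_rule` and `group_label` are made up for illustration, and the `skip_adjacent_empty` rule is an assumption modelled on the results reported in this issue: an empty match that starts exactly where the previous match ended is dropped.

```python
import re

# Same pattern as in the report, written as a raw string with re.M.
PATTERN = re.compile(
    r'(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)',
    re.M,
)


def group_label(m):
    # Hypothetical helper: name of the group that matched, in angle brackets.
    return next('<' + k + '>' for k, v in m.groupdict().items() if v is not None)


def sub_with_rule(pattern, repl, text, skip_adjacent_empty):
    # Hypothetical emulation of subn() built on finditer().
    # skip_adjacent_empty=True drops an empty match that starts exactly
    # where the previous replaced match ended (assumed old rule).
    pieces = []
    last_end = 0
    count = 0
    for m in pattern.finditer(text):
        start, end = m.span()
        if skip_adjacent_empty and count > 0 and start == end == last_end:
            continue
        pieces.append(text[last_end:start])
        pieces.append(repl(m))
        last_end = end
        count += 1
    pieces.append(text[last_end:])
    return ''.join(pieces), count


old_style = sub_with_rule(PATTERN, group_label, 'foobar ', skip_adjacent_empty=True)
new_style = sub_with_rule(PATTERN, group_label, 'foobar ', skip_adjacent_empty=False)
print(old_style)
print(new_style)
```

With the rule switched on, the empty \Z match at position 7 is discarded because it sits right after the (6, 7) whitespace match, which is the result the reporter saw. With it switched off, both matches are replaced.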

@BreamoreBoy

BreamoreBoy mannequin commented Sep 30, 2014

@serhiy can you take a look at this as I recall you've been doing some regex work?

@animalize

animalize mannequin commented Apr 27, 2019

This bug was fixed in Python 3.7, see bpo-32308.

Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 23 2018, 23:31:17) [MSC v.1916 32 bit (Intel)] on win32
>>> re.compile('(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)').subn(lambda m:next('<'+k+'>' for k,v in m.groupdict().items() if v is not None), 'foobar ')
('foobar<trailing_ws>', 1)

Python 3.7.3rc1 (tags/v3.7.3rc1:69785b2127, Mar 12 2019, 22:37:55) [MSC v.1916 64 bit (AMD64)] on win32
>>> re.compile('(?m)(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)').subn(lambda m:next('<'+k+'>' for k,v in m.groupdict().items() if v is not None), 'foobar ')
('foobar<trailing_ws><no_final_newline>', 2)
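
For anyone who wants to check a given interpreter, a small script along these lines compares the spans from `finditer()` with what `subn()` actually replaces. It is only a sketch: `check` and `group_label` are names invented here, and the pattern is the one from the report rewritten as a raw string.

```python
import re
import sys

PATTERN = re.compile(
    r'(?P<trailing_ws>[ \t]+\r*$)|(?P<no_final_newline>(?<=[^\n])\Z)',
    re.M,
)


def group_label(m):
    # Hypothetical helper: name of the group that matched, in angle brackets.
    return next('<' + k + '>' for k, v in m.groupdict().items() if v is not None)


def check(text):
    # Return the finditer() spans next to the subn() result for the same input.
    spans = [m.span() for m in PATTERN.finditer(text)]
    replaced, count = PATTERN.subn(group_label, text)
    return spans, replaced, count


spans, replaced, count = check('foobar ')
print(sys.version_info[:2], spans, replaced, count)
# On 3.7 and later, count should equal len(spans); on 3.6 and earlier,
# per the transcripts above, subn() reports one replacement fewer.
```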

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022