re.split fails with lookahead/behind #67006

rexdwyer · 2014-11-07T21:42:01Z

BPO	22817
Nosy	@ezio-melotti, @serhiy-storchaka
Files	re_split_zero_width.patch: Backward incompatible!

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2015-03-02.08:59:16.402>
created_at = <Date 2014-11-07.21:42:01.324>
labels = ['expert-regex', 'type-bug']
title = 're.split fails with lookahead/behind'
updated_at = <Date 2015-03-02.08:59:16.401>
user = 'https://bugs.python.org/rexdwyer'

bugs.python.org fields:

activity = <Date 2015-03-02.08:59:16.401>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2015-03-02.08:59:16.402>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2014-11-07.21:42:01.324>
creator = 'rexdwyer'
dependencies = []
files = ['37147']
hgrepos = []
issue_num = 22817
keywords = ['patch']
message_count = 8.0
messages = ['230831', '230832', '230833', '230834', '230835', '230839', '230841', '237034']
nosy_count = 4.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'rexdwyer']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue22817'
versions = ['Python 3.4']

rexdwyer · 2014-11-07T21:42:01Z

I would like to split a DNA sequence with a restriction enzyme.
A restriction enzyme can be described as, e.g., r'(?<=CA)(?=GCTG)'.
I cannot get re.split to split on this pattern as Perl 5 does.

ezio-melotti · 2014-11-07T21:47:45Z

Can you provide a sample DNA sequence (or part of it), the exact code you used, the output you got, and what you expected?

serhiy-storchaka · 2014-11-07T21:58:27Z

>>> re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
['CAGCTG']

I think expected output is ['CA', 'GCTG'].

rexdwyer · 2014-11-07T22:08:07Z

Sorry if I wasn't clear.

s = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
re.split(r'(?<=CA)(?=GCTG)', s)

expected output is:
acgtCA|GCTGaaacccCA|GCTGacgtacgt
-> ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']

I would also like to be able to split a text on word boundaries:
re.split(r'\b', "the quick, brown fox")
-> ['the', ' ', 'quick', ', ', 'brown', ' ', 'fox']

but that doesn't work either, so maybe it's a problem with all zero-width matches.

serhiy-storchaka · 2014-11-07T22:11:01Z

This looks like one of the existing issues about zero-length matches (bpo-1647489, bpo-10328).

serhiy-storchaka · 2014-11-08T09:11:19Z

It is possible to change this behavior (see example patch). With this patch:

>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

But unfortunately this is a backward-incompatible change and will likely break existing code (it also breaks tests). Consider the following example: re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it is ['', '', 'a', '', 'b', '', ''].

In the third-party regex module [1] there is the V1 flag, which switches on the incompatible behavior change.

>>> regex.split('(:*)', 'ab')
['ab']
>>> regex.split('(?V1)(:*)', 'ab')
['', '', 'a', '', 'b', '', '']
>>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
>>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> regex.split(r'\b', "the quick, brown fox")
['the quick, brown fox']
>>> regex.split(r'(?V1)\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

I don't know how to solve this issue without introducing such a flag (or adding a special boolean argument to re.split()).

As a workaround I suggest you use the regex module.

[1] https://pypi.python.org/pypi/regex
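
For anyone who wants to stay with the standard library, the zero-width split discussed above can be approximated with re.finditer, which does report empty matches. The following is only a minimal sketch: split_at_zero_width is a hypothetical helper name, not part of re or regex, and it assumes you want to drop the boundaries at the very start and end of the string (which gives the output requested in the report rather than the patched output with leading and trailing empty strings).

```python
import re


def split_at_zero_width(pattern, text):
    """Split text at every position where pattern matches an empty string.

    Hypothetical helper for illustration; not part of the re module.
    """
    # Collect the positions of all zero-width matches.
    cuts = sorted({m.start() for m in re.finditer(pattern, text)
                   if m.start() == m.end()})
    parts = []
    prev = 0
    for pos in cuts:
        # Assumption: ignore matches at the string ends so that no empty
        # leading or trailing piece is produced.
        if pos == 0 or pos == len(text):
            continue
        parts.append(text[prev:pos])
        prev = pos
    parts.append(text[prev:])
    return parts


dna = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
print(split_at_zero_width(r'(?<=CA)(?=GCTG)', dna))
# ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
print(split_at_zero_width(r'\b', "the quick, brown fox"))
# ['the', ' ', 'quick', ', ', 'brown', ' ', 'fox']
```

Unlike re.split, this sketch ignores capturing groups and non-empty matches, so it is only suitable for patterns made of zero-width assertions such as lookarounds or \b.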

serhiy-storchaka · 2014-11-08T09:39:13Z

Previous attempts to solve this issue: bpo-852532, bpo-988761, bpo-3262.

serhiy-storchaka · 2015-03-02T08:59:16Z

re.split() with the r'(?<=CA)(?=GCTG)' pattern raises a ValueError in 3.5 (see bpo-22818). In future releases it could be changed to work with zero-width patterns (such as lookaround assertions).
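
Because the outcome depends on the interpreter version, a quick probe can show what a given Python does with this pattern. This is a sketch only: try_split is a hypothetical helper, and the assumption (from my understanding of the re documentation, not from this thread) is that Python 3.7 and later split on empty matches, while 3.5 and 3.6 raise ValueError as described above.

```python
import re
import sys


def try_split(pattern, text):
    """Run re.split and report whether it succeeded on this interpreter.

    Hypothetical helper for illustration; returns ('ok', parts) or
    ('error', message).
    """
    try:
        return "ok", re.split(pattern, text)
    except ValueError as exc:
        # Raised by 3.5/3.6 for patterns that can only match an empty string.
        return "error", str(exc)


status, result = try_split(r'(?<=CA)(?=GCTG)',
                           'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
print(sys.version_info[:2], status, result)
```

On an interpreter that supports empty-match splitting, the result should be the three fragments the reporter expected.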

rexdwyer mannequin added topic-regex type-bug labels Nov 7, 2014

serhiy-storchaka closed this as completed Mar 2, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022
