classification
Title: re.split fails with lookahead/behind
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.4
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, mrabarnett, rexdwyer, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2014-11-07 21:42 by rexdwyer, last changed 2015-03-02 08:59 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
re_split_zero_width.patch serhiy.storchaka, 2014-11-08 09:11 Backward incompatible!
Messages (8)
msg230831 - (view) Author: Rex Dwyer (rexdwyer) Date: 2014-11-07 21:42
I would like to split a DNA sequence with a restriction enzyme.
A restriction enzyme can be described as, e.g., r'(?<=CA)(?=GCTG)'.
I cannot get re.split to split on this pattern as Perl 5 does.
msg230832 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-11-07 21:47
Can you provide a sample DNA sequence (or part of it), the exact code you used, the output you got, and what you expected?
msg230833 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-07 21:58
>>> re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
['CAGCTG']

I think expected output is ['CA', 'GCTG'].
msg230834 - (view) Author: Rex Dwyer (rexdwyer) Date: 2014-11-07 22:08
Sorry if I wasn't clear.

s = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
re.split(r'(?<=CA)(?=GCTG)', s)

expected output is:
acgtCA|GCTGaaacccCA|GCTGacgtacgt
-> ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']

I would also like to be able to split a text on word boundaries:
re.split(r'\b', "the quick, brown fox")
-> ['the', ' ', 'quick', ', ', 'brown', ' ', 'fox']

but that doesn't work either, so maybe it's a problem with all zero-width matches.
msg230835 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-07 22:11
This looks like one of the existing issues about zero-length matches (issue1647489, issue10328).
msg230839 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-08 09:11
It is possible to change this behavior (see the example patch). With this patch:

>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

But unfortunately this is a backward-incompatible change and will likely break existing code (it also breaks tests). Consider the following example: re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it is ['', '', 'a', '', 'b', '', ''].
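
To see where the seven elements come from, here is a rough sketch that lists the empty matches of '(:*)' in 'ab' with re.finditer and rebuilds the list by hand. The helper name emulate_split is made up for illustration; this is not how the patch is implemented, only a way to reproduce its result.

```python
import re

def emulate_split(pattern, text):
    """Rebuild a split that also cuts at empty matches (illustrative only)."""
    parts = []
    last = 0
    for m in re.finditer(pattern, text):
        parts.append(text[last:m.start()])  # text before this match
        parts.extend(m.groups())            # captured groups are kept, as re.split does
        last = m.end()
    parts.append(text[last:])               # tail after the last match
    return parts

# ':*' matches the empty string at every position of 'ab'.
spans = [m.span() for m in re.finditer('(:*)', 'ab')]
result = emulate_split('(:*)', 'ab')
print(spans)   # [(0, 0), (1, 1), (2, 2)]
print(result)  # ['', '', 'a', '', 'b', '', '']
```

Each of the three empty matches contributes the text before it plus one empty captured group, and the tail after the last match adds the final ''.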

The third-party regex module [1] has a V1 flag that switches on the incompatible behavior.

>>> regex.split('(:*)', 'ab')
['ab']
>>> regex.split('(?V1)(:*)', 'ab')
['', '', 'a', '', 'b', '', '']
>>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
>>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> regex.split(r'\b', "the quick, brown fox")
['the quick, brown fox']
>>> regex.split(r'(?V1)\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

I don't know how to solve this issue without introducing such a flag (or adding a special boolean argument to re.split()).

As a workaround I suggest you use the regex module.

[1] https://pypi.python.org/pypi/regex
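
If installing a third-party module is not an option, a standard-library workaround can be sketched as follows: collect the positions of the zero-width matches with re.finditer and slice the string there. The helper name split_at_zero_width is hypothetical, and the sketch assumes that only empty matches should produce cuts (which is the case for a pure lookaround pattern).

```python
import re

def split_at_zero_width(pattern, text):
    """Cut text at every position where pattern matches with zero width."""
    # Keep only empty matches; their start offset is the cut position.
    cuts = [m.start() for m in re.finditer(pattern, text) if m.start() == m.end()]
    bounds = [0] + cuts + [len(text)]
    # Slice between consecutive boundaries.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

dna = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
fragments = split_at_zero_width(r'(?<=CA)(?=GCTG)', dna)
print(fragments)  # ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
```

Because it never calls re.split, this does not depend on how a given Python version treats empty matches in split.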
msg230841 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-11-08 09:39
Previous attempts to solve this issue: issue852532, issue988761, issue3262.
msg237034 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-02 08:59
re.split() with the r'(?<=CA)(?=GCTG)' pattern raises a ValueError in 3.5 (see issue22818). In a future release it could be changed to work with zero-width patterns (such as lookaround assertions).
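
Code that has to run on several Python versions can probe the behavior instead of assuming it. A minimal sketch follows; the function name splits_on_zero_width is made up for illustration. It relies on the ValueError mentioned above and on the fact that Python 3.7 and later do split on empty matches.

```python
import re

def splits_on_zero_width():
    """Return True if this interpreter's re.split cuts at empty matches."""
    try:
        return re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG') == ['CA', 'GCTG']
    except ValueError:
        # Versions that reject zero-width-only patterns end up here.
        return False

supported = splits_on_zero_width()
print(supported)
```

When the probe returns False, a finditer-based slicing helper or the regex module with the V1 flag remains the fallback.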
History
Date User Action Args
2015-03-02 08:59:16 serhiy.storchaka set status: open -> closed; resolution: wont fix; messages: + msg237034; stage: resolved
2014-11-08 09:39:13 serhiy.storchaka set messages: + msg230841
2014-11-08 09:11:19 serhiy.storchaka set files: + re_split_zero_width.patch; keywords: + patch; messages: + msg230839
2014-11-07 22:11:00 serhiy.storchaka set messages: + msg230835
2014-11-07 22:08:07 rexdwyer set messages: + msg230834
2014-11-07 21:58:27 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg230833
2014-11-07 21:47:45 ezio.melotti set messages: + msg230832
2014-11-07 21:42:01 rexdwyer create