re.split fails with lookahead/behind #67006
Comments
I would like to split a DNA sequence with a restriction enzyme.
Can you provide a sample DNA sequence (or part of it), the exact code you used, the output you got, and what you expected?
>>> re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
['CAGCTG']
I think the expected output is ['CA', 'GCTG'].
Sorry if I wasn't clear.
s = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
re.split(r'(?<=CA)(?=GCTG)', s)
expected output is:
I would also like to be able to split a text on word boundaries, but that doesn't work either, so maybe it's a problem with all zero-width matches.
This looks like one of the existing issues about zero-length matches (bpo-1647489, bpo-10328).
It is possible to change this behavior (see example patch). With this patch:
>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
But unfortunately this is a backward-incompatible change and will likely break existing code (and it breaks tests). Consider the following example: re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it is ['', '', 'a', '', 'b', '', '']. In the third-party regex module [1] there is the V1 flag, which switches on the incompatible behavior change.
>>> regex.split('(:*)', 'ab')
['ab']
>>> regex.split('(?V1)(:*)', 'ab')
['', '', 'a', '', 'b', '', '']
>>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
>>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> regex.split(r'\b', "the quick, brown fox")
['the quick, brown fox']
>>> regex.split(r'(?V1)\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
I don't know how to solve this issue without introducing such a flag (or adding a special boolean argument to re.split()). As a workaround, I suggest you use the regex module.
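For illustration only, the two behaviors discussed above can be approximated with the standard library by walking re.finditer() matches. This is a rough sketch, not the attached patch: the function name split_zero_width and its split_empty argument are hypothetical stand-ins for the "special boolean argument" idea, and results for patterns that mix empty and non-empty matches may vary between Python versions.

```python
import re

def split_zero_width(pattern, string, split_empty=False):
    """Approximate re.split(), optionally splitting on empty matches.

    `split_empty` is a hypothetical flag used for illustration; it is not
    a real argument of re.split().
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, string):
        if m.start() == m.end() and not split_empty:
            continue  # old behavior: empty matches are ignored
        pieces.append(string[last:m.start()])
        pieces.extend(m.groups())  # captured groups are kept, as in re.split()
        last = m.end()
    pieces.append(string[last:])
    return pieces

print(split_zero_width('(:*)', 'ab'))
print(split_zero_width('(:*)', 'ab', split_empty=True))
print(split_zero_width(r'(?<=CA)(?=GCTG)',
                       'ACGTCAGCTGAAACCCCAGCTGACGTACGT',
                       split_empty=True))
```

With split_empty left at False the '(:*)' example keeps returning ['ab'], so existing callers would not be affected, while passing True gives the results shown with the patch.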
Previous attempts to solve this issue: bpo-852532, bpo-988761, bpo-3262.
re.split() with the r'(?<=CA)(?=GCTG)' pattern raises a ValueError in 3.5 (see bpo-22818). In future releases it could be changed to work with zero-width patterns (such as lookaround assertions).
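Until then, one way to sidestep re.split() entirely is to let re.sub() drop a marker at every zero-width match and then split on that marker with str.split(). The sketch below assumes a marker character that never occurs in the input; the helper name split_via_sub and the default '|' marker are made up for this example.

```python
import re

def split_via_sub(pattern, string, marker='|'):
    """Split `string` at every match of `pattern` without calling re.split().

    `marker` is an assumed placeholder that must not occur in the input.
    """
    if marker in string:
        raise ValueError('marker must not occur in the input string')
    # re.sub() replaces zero-width matches too, so every cut site gets a marker.
    # A callable replacement avoids backslash processing of the marker.
    marked = re.sub(pattern, lambda m: marker, string)
    return marked.split(marker)

dna = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
print(split_via_sub(r'(?<=CA)(?=GCTG)', dna))
# prints ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
```

For DNA strings over the alphabet A, C, G, T any other character is a safe marker; for general text a character such as '\x00' may be a better choice.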