re.split fails with lookahead/behind #67006

Closed
rexdwyer mannequin opened this issue Nov 7, 2014 · 8 comments
Labels
topic-regex, type-bug (An unexpected behavior, bug, or error)

Comments

@rexdwyer
Mannequin

rexdwyer mannequin commented Nov 7, 2014

BPO 22817
Nosy @ezio-melotti, @serhiy-storchaka
Files
  • re_split_zero_width.patch: Backward incompatible!
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-03-02.08:59:16.402>
    created_at = <Date 2014-11-07.21:42:01.324>
    labels = ['expert-regex', 'type-bug']
    title = 're.split fails with lookahead/behind'
    updated_at = <Date 2015-03-02.08:59:16.401>
    user = 'https://bugs.python.org/rexdwyer'

    bugs.python.org fields:

    activity = <Date 2015-03-02.08:59:16.401>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-03-02.08:59:16.402>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2014-11-07.21:42:01.324>
    creator = 'rexdwyer'
    dependencies = []
    files = ['37147']
    hgrepos = []
    issue_num = 22817
    keywords = ['patch']
    message_count = 8.0
    messages = ['230831', '230832', '230833', '230834', '230835', '230839', '230841', '237034']
    nosy_count = 4.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'rexdwyer']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22817'
    versions = ['Python 3.4']

    @rexdwyer
    Mannequin Author

    rexdwyer mannequin commented Nov 7, 2014

    I would like to split a DNA sequence with a restriction enzyme.
    A restriction enzyme's recognition site can be described as, e.g., r'(?<=CA)(?=GCTG)'.
    I cannot get re.split to split on this pattern as Perl 5 does.

    @rexdwyer rexdwyer mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Nov 7, 2014
    @ezio-melotti
    Member

    Can you provide a sample DNA sequence (or part of it), the exact code you used, the output you got, and what you expected?

    @serhiy-storchaka
    Member

    >>> re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
    ['CAGCTG']

    I think the expected output is ['CA', 'GCTG'].

    @rexdwyer
    Mannequin Author

    rexdwyer mannequin commented Nov 7, 2014

    Sorry if I wasn't clear.

    s = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
    re.split(r'(?<=CA)(?=GCTG)', s)

    expected output is:
    acgtCA|GCTGaaacccCA|GCTGacgtacgt
    -> ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']

    I would also like to be able to split text on word boundaries:
    re.split(r'\b', "the quick, brown fox")
    -> ['the', ' ', 'quick', ', ', 'brown', ' ', 'fox']

    But that doesn't work either, so maybe it's a problem with all zero-width matches.

    @serhiy-storchaka
    Member

    This looks like one of the existing issues about zero-length matches (bpo-1647489, bpo-10328).

    @serhiy-storchaka
    Member

    It is possible to change this behavior (see example patch). With this patch:

    >>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
    ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
    >>> re.split(r'\b', "the quick, brown fox")
    ['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

    Unfortunately, this is a backward-incompatible change that will likely break existing code (it also breaks tests). Consider the following example: re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it is ['', '', 'a', '', 'b', '', ''].

    The third-party regex module [1] has a V1 flag that switches on this incompatible behavior change.

    >>> regex.split('(:*)', 'ab')
    ['ab']
    >>> regex.split('(?V1)(:*)', 'ab')
    ['', '', 'a', '', 'b', '', '']
    >>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
    ['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
    >>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
    ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
    >>> regex.split(r'\b', "the quick, brown fox")
    ['the quick, brown fox']
    >>> regex.split(r'(?V1)\b', "the quick, brown fox")
    ['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

    I don't know how to solve this issue without introducing such a flag (or adding a special boolean argument to re.split()).

    As a workaround, I suggest using the regex module.

    [1] https://pypi.python.org/pypi/regex
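
Where installing a third-party module is not an option, the same splits can be approximated with the standard library alone. The following is a minimal sketch, not part of the patch discussed above: the helper name `split_zero_width` is hypothetical, and it assumes the pattern is meant to match only empty strings (non-empty matches are ignored). It relies on `re.finditer`, which reports zero-width matches, and cuts the string at each match position inside the text.

```python
import re

def split_zero_width(pattern, text):
    """Split text at each interior position where pattern matches an empty string.

    Hypothetical helper for illustration; non-empty matches are ignored.
    """
    pieces = []
    start = 0
    for m in re.finditer(pattern, text):
        # Cut only at empty matches that fall strictly inside the string,
        # so no empty leading or trailing piece is produced.
        if m.start() == m.end() and 0 < m.start() < len(text):
            pieces.append(text[start:m.start()])
            start = m.start()
    pieces.append(text[start:])
    return pieces

dna = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
fragments = split_zero_width(r'(?<=CA)(?=GCTG)', dna)
print(fragments)  # ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']

words = split_zero_width(r'\b', "the quick, brown fox")
print(words)  # ['the', ' ', 'quick', ', ', 'brown', ' ', 'fox']
```

Skipping matches at position 0 and at the end of the string is a design choice made here to reproduce the output requested in the report; dropping that check would yield the leading and trailing empty strings shown in the patched output.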

    @serhiy-storchaka
    Member

    Previous attempts to solve this issue: bpo-852532, bpo-988761, bpo-3262.

    @serhiy-storchaka
    Member

    re.split() with the r'(?<=CA)(?=GCTG)' pattern raises a ValueError in 3.5 (see bpo-22818). In future releases it could be changed to work with zero-width patterns (such as lookaround assertions).
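
For readers on newer interpreters: the `re` module documentation records that support for splitting on a pattern that could match an empty string was added in Python 3.7. The check below is a small sketch under that assumption. The version guard is there so the snippet does nothing on older releases, where the behavior described in this thread still applies.

```python
import re
import sys

dna = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
fragments = None
words = None

# Assumption: Python 3.7 or later, where re.split() accepts empty matches.
if sys.version_info >= (3, 7):
    fragments = re.split(r'(?<=CA)(?=GCTG)', dna)
    words = re.split(r'\b', "the quick, brown fox")
    print(fragments)  # ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
    print(words)      # ['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
```

Note that the word-boundary split keeps the empty strings at both ends, matching the output of the patch shown earlier in the thread rather than the output originally requested.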

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022