classification
Title: re.split() incorrectly splitting on zero-width pattern
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 3.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Elias Tarhini, ezio.melotti, mrabarnett
Priority: normal
Keywords:

Created on 2019-03-22 02:48 by Elias Tarhini, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (4)
msg338581 - Author: Elias Tarhini (Elias Tarhini) Date: 2019-03-22 02:48
I believe I've found a bug in the `re` module -- specifically, in the 3.7+ support for splitting on zero-width patterns. Compare Java's behavior...

    jshell> "1211".split("(?<=(\\d))(?!\\1)(?=\\d)");
    $1 ==> String[3] { "1", "2", "11" }

...with Python's:

    >>> re.split(r'(?<=(\d))(?!\1)(?=\d)', '1211')
    ['1', '1', '2', '2', '11']

(The pattern itself is pretty straightforward in design, but regex syntax can cloud things, so to be totally clear: it finds any point that follows a digit and precedes a *different* digit.)

* Tested on 3.7.1 win10 and 3.7.0 linux.
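
To make the pattern described above concrete, here is a minimal sketch that lists the positions where it matches, using `re.finditer`. The names `BOUNDARY` and `boundary_positions` are illustrative only and do not come from the report.

```python
import re

# Pattern from the report: a position that follows a digit
# and precedes a *different* digit.
BOUNDARY = re.compile(r'(?<=(\d))(?!\1)(?=\d)')

def boundary_positions(text):
    """Return the string indices at which the zero-width pattern matches."""
    return [m.start() for m in BOUNDARY.finditer(text)]

print(boundary_positions('1211'))  # matches between '1'|'2' and between '2'|'1'
```

For `'1211'` the matches fall at indices 1 and 2, which is where Java splits the string.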
msg338582 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-03-22 03:26
From the docs:

"""If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."""

The pattern does contain a capture, so that's why the result has additional '1' and '2'.

Presumably, Java's split doesn't do that.

Not a bug.
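
The documented behavior quoted above can be seen with a simpler pattern. This is a small illustration with made-up input (`'a1b2c'`), not code from the thread: the same split is run once without and once with a capturing group.

```python
import re

text = 'a1b2c'

# No capturing group: only the substrings between matches are returned.
without_group = re.split(r'\d', text)

# Capturing group: each captured digit is inserted between the substrings.
with_group = re.split(r'(\d)', text)

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', '1', 'b', '2', 'c']
```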
msg338704 - Author: Elias Tarhini (Elias Tarhini) Date: 2019-03-23 21:51
Thank you. I was too zeroed-in on the idea that it came from the zero-width pattern, and I forgot to consider the group. It looks like `re.sub(pattern, 'some-delim', s).split('some-delim')` is a way to do this when a non-capturing group can't be used.
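
A minimal sketch of that `re.sub` workaround follows. The helper name `split_via_sub` and the NUL default delimiter are assumptions for illustration; the approach only works if the delimiter never occurs in the input, so the sketch checks for that.

```python
import re

def split_via_sub(pattern, text, delim='\x00'):
    """Split `text` at matches of `pattern`, ignoring capture groups.

    Hypothetical helper: replaces every match with `delim`, then splits
    on `delim`. The delimiter must not already appear in `text`.
    """
    if delim in text:
        raise ValueError('delimiter already occurs in the input')
    # A function replacement avoids backslash processing of `delim`.
    return re.sub(pattern, lambda m: delim, text).split(delim)

print(split_via_sub(r'(?<=(\d))(?!\1)(?=\d)', '1211'))  # ['1', '2', '11']
```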
msg338705 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2019-03-23 22:13
The list alternates between substrings (s, between the splits) and captures (c):

['1', '1', '2', '2', '11']
 -s-  -c-  -s-  -c-  -s--

You can use slicing to extract the substrings:

    >>> re.split(r'(?<=(\d))(?!\1)(?=\d)', '12111')[::2]
    ['1', '2', '111']
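
The slicing idea generalizes to patterns with more than one capturing group: each split contributes one capture per group, so the substrings sit at every `groups + 1`-th position. The helper below is a sketch under that assumption; `split_dropping_captures` is a hypothetical name, while `Pattern.groups` is the compiled pattern's count of capturing groups.

```python
import re

def split_dropping_captures(pattern, text):
    """Split like re.split(), but return only the substrings between matches."""
    compiled = re.compile(pattern)
    # Result layout: s, c1..cn, s, c1..cn, s, ... where n == compiled.groups.
    return compiled.split(text)[::compiled.groups + 1]

print(split_dropping_captures(r'(?<=(\d))(?!\1)(?=\d)', '12111'))  # ['1', '2', '111']
```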
History
Date                 User           Action  Args
2022-04-11 14:59:12  admin          set     github: 80578
2019-03-23 22:13:44  mrabarnett     set     messages: + msg338705
2019-03-23 21:51:18  Elias Tarhini  set     messages: + msg338704
2019-03-22 03:26:31  mrabarnett     set     status: open -> closed; resolution: not a bug; messages: + msg338582; stage: resolved
2019-03-22 02:48:42  Elias Tarhini  create