Title: re.split doesn't split with zero-width regex
Components: Tests Versions: Python 2.7
Created on 2008-07-02 22:07 by mrabarnett, last changed 2022-04-11 14:56 by admin.

split_zero_width.diff mrabarnett, 2008-07-03 00:59
msg69134 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-02 22:07
re.split doesn't split a string when the regex matches zero characters.

For example:

re.split(r'\b', 'a b') returns ['a b'] instead of ['', 'a', ' ', 'b', ''].

re.split(r'(?<!\w)(?=\w)', 'a b') returns ['a b'] instead of ['', 'a ',
msg69139 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-02 22:51
The attached patch appears to work.
msg69146 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-07-02 23:28
Probably by design. There's probably even a unittest for this behavior.
msg69150 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-02 23:57
I've found that this issue has been discussed before: #988761.
msg69157 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-03 00:59
New patch version after studying #988761 and doing more testing.
msg69408 - (view) Author: Mike Coleman (mkc) Date: 2008-07-08 02:36
I don't want to discourage you, but #852532, which is essentially the
same bug report, was closed--without explanation--as 'wont fix' in
April, after four-plus years.  I wish you good luck--this is an
important and irritating bug, in my opinion...
msg69438 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-07-08 16:39
There appear to be 2 opinions on this issue:

1. It's a bug, a corner case that got missed.

2. It's always been like this, so it's probably a design decision,
although no-one can point to where or when the decision was made...

Looking at the code, I think it's a bug.

Expected behaviour: if 'pattern' is a non-capturing regex, then
re.split(pattern, text) == re.sub(pattern, MARKER, text).split(MARKER).
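
A minimal sketch of the right-hand side of that equivalence, for illustration only. MARKER is assumed to be a string that never occurs in the text (a NUL character here), and split_via_sub is a hypothetical helper name, not part of re. It relies only on re.sub replacing zero-width matches, so it does not depend on how re.split itself behaves in a given Python version.

```python
import re

MARKER = '\x00'  # assumption: this character never appears in the input text


def split_via_sub(pattern, text):
    """Split text wherever a non-capturing pattern matches, zero-width or not."""
    # re.sub replaces every match, including empty ones, with MARKER;
    # str.split then cuts the string at each MARKER.
    return re.sub(pattern, MARKER, text).split(MARKER)


print(split_via_sub(r'\b', 'a b'))  # ['', 'a', ' ', 'b', '']
print(split_via_sub(r'(?<!\w)(?=\w)', 'a b'))
```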
msg69852 - (view) Author: Mike Coleman (mkc) Date: 2008-07-16 22:40
I think it's probably both.  The original design was incorrect, though
this probably wasn't apparent to the designer.  But as a significant
user of 're', it really stands out as a problem.
msg70749 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-05 16:08
I think it's better to leave this alone.  Such a subtle change is likely
to trip over more people in worse ways than the alleged "bug".
msg70752 - (view) Author: Mike Coleman (mkc) Date: 2008-08-05 16:18
Okay.  For what it's worth, note that my original 2004 patch for this
(#988761) is completely backward compatible (a flag must be set in the
call to get the new behavior).
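
For illustration, an opt-in flag of this kind could look roughly like the sketch below. This is not the code from the #988761 patch: the function name split_with_flag and the parameter split_empty are hypothetical, and capturing groups are not handled.

```python
import re


def split_with_flag(pattern, text, split_empty=False):
    """Split text on pattern; zero-width matches count only if split_empty is set."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() == m.end() and not split_empty:
            # Default: ignore empty matches, as in the behaviour reported here.
            continue
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces


print(split_with_flag(r'\b', 'a b'))                    # ['a b']
print(split_with_flag(r'\b', 'a b', split_empty=True))  # ['', 'a', ' ', 'b', '']
```

Because the flag defaults to off, existing callers would see no change in results.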
msg73523 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2008-09-21 19:41
I wonder whether it could be put into Python 3 where certain breaks in
backwards compatibility are to be expected.
msg73567 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-22 11:54
I think Mike Coleman's proposal of enabling this behaviour via flag is
probably best and IMHO we should consider it under these circumstances.
 Intuitively, I think your interpretation of what re.split should do
under zero-width conditions is logical, and I almost think this should
be a 2-minor number transition à la from __future__ import
zeroWidthRegexpSplit if we are to consider it as the long-term 'right
thing to do'.  3000 (3.0) seems a good place to also consider it for
true overhaul / reexamination, especially as we are writing 'upgrade'
scripts for many of the other Python features.  However, I would say
this, Guido has spoken and it may be too late for the pebbles to vote.

I would like to add this patch as a new item to the general Regexp
Enhancements thread of issue 2636 though, as I think it is an idea worth
considering when overhauling Regexp.
msg73592 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-22 20:39
The problem with doing this per 3.0 is that it's impossible to write a
conversion script.

I'm okay with adding a flag to enable this behavior though.  Please open
a new bug with a new patch, preferably one that applies cleanly to the
trunk, and a separate patch for the py3k branch unless the trunk patch
merges cleanly.  There should also be unittests and documentation.  The
patches should be marked for Python 2.7 and 3.1 -- it's way too late to
get this into 2.6 and 3.0.
msg104226 - (view) Author: Tim Pietzcker (pietzcker) Date: 2010-04-26 12:29
Sorry to revive this dormant (?) topic - has anybody brought this any further? This "feature" has tripped me up a few times, and I would be all for adding a flag to enable the "split on zero-size matches" behavior, but I myself am not competent enough to code a patch.
msg104257 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-04-26 17:31
You could try the regex module mentioned in issue 2636.
