Issue 988761: re.split emptyok flag (fix for #852532)

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/40540

classification

Title:	re.split emptyok flag (fix for #852532)
Type:		Stage:
Components:	Extension Modules	Versions:	Python 2.6

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:
Assigned To:	effbot	Nosy List:	akuchling, colander_man, effbot, filip, georg.brandl, gregory.p.smith, mkc, rhettinger
Priority:	normal	Keywords:	patch

Created on 2004-07-11 03:25 by mkc, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
emptyok.patch	mkc, 2004-07-11 03:25	emptyok patch against current CVS HEAD 2004-07-10

Messages (12)
msg46352 - (view)	Author: Mike Coleman (mkc)	Date: 2004-07-11 03:25
This patch addresses bug #852532. The underlying problem is that re.split ignores any match it makes that has length zero, which blocks a number of useful possibilities. The attached patch implements a flag 'emptyok', which if set to True, causes re.split to allow zero length matches. My preference would be to just change the behavior of re.split, rather than adding this flag. The old behavior isn't documented (though a couple of cases in test_re.py do depend on it). As a practical matter, though, I realize that there may be some code out there relying on this undocumented behavior. And I'm hoping that this useful feature can be added quickly. Perhaps this new behavior could be made the default in a future version of Python. (Linux 2.6.3 i686)
msg46353 - (view)	Author: Chris King (colander_man)	Date: 2004-07-21 12:46
Practical example where the current behaviour produces undesirable results (splitting on character transitions):
>>> import re
>>> re.split(r'(?<=[A-Z])(?=[^A-Z])','SOMEstring')
['SOMEstring']
# desired is ['SOME','string']
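The case-transition split can be reproduced without relying on how a given Python version treats empty matches in re.split. What follows is only a minimal sketch built on re.finditer: the helper name split_on_empty is hypothetical (it is not part of the re module or of the attached patch), and the lookahead is written as (?=[^A-Z]) so that the only zero-width match falls between 'E' and 's'.

```python
import re

def split_on_empty(pattern, text):
    """Hypothetical helper: split text at every match, including empty ones."""
    pieces = []
    last = 0  # index where the next piece starts
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

# Split where an uppercase letter is followed by a non-uppercase character.
result = split_on_empty(r'(?<=[A-Z])(?=[^A-Z])', 'SOMEstring')
print(result)  # ['SOME', 'string']
```

The pattern matches at exactly one position here, so the result does not depend on how adjacent empty matches are handled.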
msg46354 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2004-07-27 14:08
Overall I like the patch and wouldn't mind seeing the change become the default behaviour. However, I'm nervous about not understanding why the prohibition on zero-length matches was added in the first place. Can you please do some research in the CVS logs and python-dev archives to figure out why the limitation was implemented?
msg46355 - (view)	Author: Mike Coleman (mkc)	Date: 2004-07-28 16:23
I picked through CVS, python-dev and Google and came up with this.
The current behavior was present way back in the earliest regsub.py in CVS (dated Sep 1992); subsequent implementations seem to be mirroring this behavior. The CVS comment back in 1992 described split as modeled on nawk. A check of nawk(1) confirms that nawk only splits on non-null matches. Perl (circa 5.6), on the other hand, appears to split the way this patch does (though I wasn't aware of that when I wrote the patch), so that might argue in the other direction. I would note, too, that re.findall and re.finditer tend in this direction ("Empty matches are included in the result unless they touch the beginning of another match.").
The python-dev archive doesn't seem to go back far enough to be relevant and I'm not sure how to search it. General googling (python "re.split" empty match) found a few hits. Probably the most relevant is Tim Peters saying "Python won't change here (IMO)" and giving the example that he also gives in a comment to bug #852532 (which this patch addresses). He also wonders in his comment about the possibility of a "design constraint", but I think this patch addresses that concern.
As far as I can tell, the current behavior was a design decision made over 10 years ago, between two alternatives that probably didn't matter much at the time. Skipping empty matches probably seemed harmless before lookahead/lookbehind assertions. Now, though, the current behavior seems like a significant hindrance. Furthermore, it ought to be pretty trivial to modify any existing patterns to get the old behavior, should that be desired (e.g., use 'x+' instead of 'x*').
(I didn't notice that re.findall doc when I originally wrote this patch. Perhaps the doc in the patch should be slightly modified to help emphasize the similarity between how re.findall and re.split handle empty matches.)
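The 'x+' versus 'x*' remark can be illustrated with a short sketch. A pattern that cannot match the empty string splits only on real runs of the separator, so its result does not depend on whether re.split honors zero-length matches. The sample string is an assumption chosen for illustration.

```python
import re

text = 'abxxxcdefxxx'

# 'x+' never matches the empty string, so only runs of 'x' act as separators.
old_style = re.split('x+', text)
print(old_style)  # ['ab', 'cdef', '']
```

With 'x*' the outcome depends on how empty matches are treated, which is exactly the behavior under discussion, so no output is shown for it here.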
msg46356 - (view)	Author: Mike Coleman (mkc)	Date: 2004-09-03 20:15
Apparently this patch is stalled, but I'd like to get it in, in some form, for 2.4. The only question, as far as I know, is whether empty matches following non-empty matches "count" or not (they do in the original patch). If I make a patch with the "doesn't count" behavior, could we apply that right away? I'd rather get either behavior in for 2.4 than wait for 2.5...
msg46357 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2004-09-05 22:53
Fred, what do you think of the proposal? Are the backwards compatibility issues worth it?
msg46358 - (view)	Author: Mike Coleman (mkc)	Date: 2005-12-09 17:04
This patch seems to have been stalled now for over a year. Could it be applied? Or, alternatively, could someone provide some sort of reason why it shouldn't be? Thanks.
msg46359 - (view)	Author: Filip Salomonsson (filip)	Date: 2006-01-16 21:56
I agree completely that splitting on zero-length matches should be supported, and that the default behavior should change at some point, but I don't think this patch quite covers it. Taking an example from the python-dev thread back in August of 2004 (http://mail.python.org/pipermail/python-dev/2004-August/047272.html):
>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']
To me, this means there's an empty string, beginning and ending in pos 0, followed by a zero-width divider also beginning and ending in the same position, followed by an 'a', etc. That seems awkward to me. I think a more intuitive result would be (I'm omitting the emptyok argument in the following examples):
>>> re.split('x*', 'abxxxcdefxxx')
['a', 'b', 'c', 'd', 'e', 'f', '']
That is, empty matches cause a split when they are not adjacent to a non-empty match and not at the beginning or the end of the string. Grouping parentheses would, of course, reveal the empty-string boundaries:
>>> re.split('(x*)', 'abxxxcdefxxx')
['', 'a', '', 'b', 'xxx', '', 'c', '', 'd', '', 'e', '', 'f', 'xxx', '']
Using the same approach, these results would also seem perfectly reasonable to me:
>>> re.split('(?m)$', 'foo\nbar\nbaz')
['foo', '\nbar', '\nbaz']
>>> re.split('(?m)^', 'foo\nbar\nbaz')
['foo\n', 'bar\n', 'baz']
Splitting a one-character string should be possible only if the pattern matches that character:
>>> re.split('\w', 'a')
['', '']
>>> re.split('\d*', 'a')
['a']
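The rule proposed in this comment (an empty match splits only when it is not at either end of the string and does not touch a non-empty match) can be sketched on top of re.finditer. This is an illustration under stated assumptions, not the patch or any real re API: the helper name split_skip_adjacent_empty is hypothetical, the separator in the first check is assumed to be 'x*', capturing groups are not handled, and only empty matches touching the end of a preceding non-empty match are filtered (greedy patterns such as these do not produce an empty match directly before a non-empty one).

```python
import re

def split_skip_adjacent_empty(pattern, text):
    """Hypothetical helper sketching the proposed splitting rule."""
    pieces = []
    last = 0                 # index where the next piece starts
    prev_nonempty_end = -1   # end of the most recent non-empty match
    for m in re.finditer(pattern, text):
        start, end = m.span()
        if start == end:
            # Ignore empty matches at the string edges, and empty matches
            # that touch the end of the preceding non-empty match.
            if start in (0, len(text)) or start == prev_nonempty_end:
                continue
        else:
            prev_nonempty_end = end
        pieces.append(text[last:start])
        last = end
    pieces.append(text[last:])
    return pieces

print(split_skip_adjacent_empty('x*', 'abxxxcdefxxx'))
# ['a', 'b', 'c', 'd', 'e', 'f', '']
```

Under these assumptions the helper also reproduces the multiline-anchor and one-character cases listed in the comment.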
msg46360 - (view)	Author: Mike Coleman (mkc)	Date: 2006-01-17 22:37
I think I still agree with my original answer on this (see http://mail.python.org/pipermail/python-dev/2004-August/047321.html). I'm completely worn down on this, though, so I'd happily take any of these options as an improvement over the present situation.
msg46361 - (view)	Author: Mike Coleman (mkc)	Date: 2007-02-22 03:23
Hello from 2004! This is your long-lost bug in re.split--how's it going? I'm still alive and well. I think everyone pretty much agrees that I really am a bug, and at least one guy still writes code just to work around me every few weeks or so. My attempt to keep a low profile is doing well--I'm not even documented in the library reference. This allows me to meet new Python users on a regular basis (whether they like it or not). Well, that's it for now. If I don't hear from you until then, I'll drop you another line in 2009. (Hey I'm a poet, too!) Regards, bug 852532/patch 988761
msg69372 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2008-07-07 04:39
Take a look at the patch being worked on in issue #3262.
msg69851 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2008-07-16 22:35
Closing as a duplicate of #3262, which seems to be active.

History
Date	User	Action	Args
2022-04-11 14:56:05	admin	set	github: 40540
2008-07-16 22:35:16	georg.brandl	set	status: open -> closed nosy: + georg.brandl resolution: duplicate messages: + msg69851
2008-07-07 04:39:50	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg69372
2004-07-11 03:25:40	mkc	create