re.split emptyok flag (fix for #852532) #40540

mkc · 2004-07-11T03:25:40Z

BPO	988761
Nosy	@akuchling, @birkenfeld, @rhettinger, @gpshead
Files	emptyok.patch: emptyok patch against current CVS HEAD 2004-07-10

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2008-07-16.22:35:16.479>
created_at = <Date 2004-07-11.03:25:40.000>
labels = ['extension-modules']
title = 're.split emptyok flag (fix for python/cpython#39646)'
updated_at = <Date 2008-07-16.22:35:16.478>
user = 'https://bugs.python.org/mkc'

bugs.python.org fields:

activity = <Date 2008-07-16.22:35:16.478>
actor = 'georg.brandl'
assignee = 'effbot'
closed = True
closed_date = <Date 2008-07-16.22:35:16.479>
closer = 'georg.brandl'
components = ['Extension Modules']
creation = <Date 2004-07-11.03:25:40.000>
creator = 'mkc'
dependencies = []
files = ['6089']
hgrepos = []
issue_num = 988761
keywords = ['patch']
message_count = 12.0
messages = ['46352', '46353', '46354', '46355', '46356', '46357', '46358', '46359', '46360', '46361', '69372', '69851']
nosy_count = 8.0
nosy_names = ['effbot', 'akuchling', 'georg.brandl', 'rhettinger', 'gregory.p.smith', 'mkc', 'colander_man', 'filip']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue988761'
versions = ['Python 2.6']

mkc · 2004-07-11T03:25:40Z

This patch addresses bug bpo-852532. The underlying
problem is that re.split ignores any match it makes
that has length zero, which blocks a number of useful
possibilities. The attached patch implements a flag
'emptyok', which if set to True, causes re.split to
allow zero length matches.

My preference would be to just change the behavior of
re.split, rather than adding this flag. The old
behavior isn't documented (though a couple of cases in
test_re.py do depend on it). As a practical matter,
though, I realize that there may be some code out there
relying on this undocumented behavior. And I'm hoping
that this useful feature can be added quickly. Perhaps
this new behavior could be made the default in a future
version of Python.

(Linux 2.6.3 i686)

colanderman · 2004-07-21T12:46:03Z

Logged In: YES
user_id=573252

Practical example where the current behaviour produces
undesirable results (splitting on character transitions):

>>> import re
>>> re.split(r'(?<=[A-Z])(?=[a-z])','SOMEstring')
['SOMEstring']    # desired is ['SOME','string']
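
For reference, a minimal sketch of the split this example is after, assuming Python 3.7 or later, where re.split does split on empty matches. The lookahead is written as (?=[a-z]) so that the zero-width match sits between an uppercase and a lowercase letter. The helper name split_case_transition is made up for illustration and is not part of the patch.

```python
import re

def split_case_transition(text):
    # Zero-width separator: preceded by an uppercase letter,
    # followed by a lowercase letter. Nothing is consumed.
    return re.split(r'(?<=[A-Z])(?=[a-z])', text)

# On Python 3.7+ this prints ['SOME', 'string'];
# older versions ignore the empty match and return the input unsplit.
print(split_case_transition('SOMEstring'))
```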

akuchling · 2004-07-27T14:08:28Z

Logged In: YES
user_id=11375

Overall I like the patch and wouldn't mind seeing the change
become the default behaviour. However, I'm nervous about
possibly not understanding the reason the prohibition on
zero-length matches was added in the first place. Can you
please do some research in the CVS logs and python-dev
archives to figure out why the limitation was implemented in
the first place?

mkc · 2004-07-28T16:23:32Z

Logged In: YES
user_id=555

I picked through CVS, python-dev and google and came up with
this. The current behavior was present way back in the
earliest regsub.py in CVS (dated Sep 1992); subsequent
implementations seem to be mirroring this behavior. The CVS
comment back in 1992 described split as modeled on nawk. A
check of nawk(1) confirms that nawk only splits on non-null
matches. Perl (circa 5.6) on the other hand, appears to
split the way this patch does (though I wasn't aware of that
when I wrote the patch) so that might argue in the other
direction. I would note, too, that re.findall and
re.finditer tend in this direction ("Empty matches are
included in the result unless they touch the beginning of
another match.").

The python-dev archive doesn't seem to go back far enough to
be relevant and I'm not sure how to search it. General
googling (python "re.split" empty match) found a few hits.
Probably the most relevant is Tim Peters saying "Python
won't change here (IMO)" and giving the example that he also
gives in a comment to bug bpo-852532 (which this patch
addresses). He also wonders in his comment about the
possibility of a "design constraint", but I think this patch
addresses that concern.

As far as I can tell, the current behavior was a design
decision made over 10 years ago, between two alternatives
that probably didn't matter much at the time. Skipping
empty matches probably seemed harmless before
lookahead/lookbehind assertions. Now, though, the current
behavior seems like a significant hindrance. Furthermore,
it ought to be pretty trivial to modify any existing
patterns to get the old behavior, should that be desired
(e.g., use 'x+' instead of 'x*').
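
A small sketch of that workaround, using the sample string that appears later in this thread. A pattern such as 'x+' can never match the empty string, so its result does not depend on how re.split treats empty matches. The variable names are illustrative only.

```python
import re

text = 'abxxxcdefxxx'

# 'x+' needs at least one 'x', so every separator is non-empty and the
# outcome is the same whether or not empty matches are honoured.
runs_only = re.split('x+', text)
print(runs_only)  # ['ab', 'cdef', '']
```

This is also what 'x*' returns under the behaviour described above, where zero-length matches are ignored.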

(I didn't notice that re.findall doc when I originally wrote
this patch. Perhaps the doc in the patch should be slightly
modified to help emphasize the similarity between how
re.findall and re.split handle empty matches.)

mkc · 2004-09-03T20:15:04Z

Logged In: YES
user_id=555

Apparently this patch is stalled, but I'd like to get it in,
in some form, for 2.4. The only question, as far as I know,
is whether empty matches following non-empty matches "count"
or not (they do in the original patch).

If I make a patch with the "doesn't count" behavior, could
we apply that right away? I'd rather get either behavior in
for 2.4 than wait for 2.5...
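
To make the two candidate behaviours concrete, here is a rough sketch built on re.finditer. It assumes Python 3.7 or later, where finditer reports an empty match directly after a non-empty one. The function split_with_empty and its count_adjacent_empty parameter are hypothetical names for illustration, not the API of the attached patch.

```python
import re

def split_with_empty(pattern, text, count_adjacent_empty=True):
    """Split text at every match, optionally ignoring an empty match
    that starts exactly where the previous non-empty match ended."""
    pieces = []
    last = 0                   # end of the last separator that was used
    prev_nonempty_end = None   # end of the most recent non-empty match
    for m in re.finditer(pattern, text):
        is_empty = m.start() == m.end()
        if is_empty and not count_adjacent_empty and m.start() == prev_nonempty_end:
            continue           # the "doesn't count" variant skips this match
        if not is_empty:
            prev_nonempty_end = m.end()
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

print(split_with_empty('x*', 'abxxxcdefxxx'))
print(split_with_empty('x*', 'abxxxcdefxxx', count_adjacent_empty=False))
```

With the "counts" variant an extra empty string appears after each run of x; with the "doesn't count" variant those extra empty strings disappear.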

rhettinger · 2004-09-05T22:53:40Z

Logged In: YES
user_id=80475

Fred, what do you think of the proposal? Are the backwards
compatibility issues worth it?

mkc · 2005-12-09T17:04:11Z

Logged In: YES
user_id=555

This patch seems to have been stalled now for over a year.
Could it be applied? Or, alternatively, could someone
provide some sort of reason why it shouldn't be? Thanks.

filip · 2006-01-16T21:56:08Z

Logged In: YES
user_id=308203

I agree completely that splitting on zero-length matches should
be supported - and that the default behavior should change
at some point - but I don't think this patch quite covers
it. Taking an example from the python-dev thread back in
August of 2004
(http://mail.python.org/pipermail/python-dev/2004-August/047272.html):

>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']

To me, this means there's an empty string, beginning and
ending in pos 0, followed by a zero-width divider also
beginning and ending in the same position, followed by an
'a', etc. That seems awkward to me. I think a more intuitive
result would be (I'm omitting the emptyok argument in the
following examples):

>>> re.split('x*', 'abxxxcdefxxx')
['a', 'b', 'c', 'd', 'e', 'f', '']

That is, empty matches cause a split when they are not
adjacent to a non-empty match and not at the beginning or
the end of the string. Grouping parentheses would, of
course, reveal the empty-string boundaries:

>>> re.split('(x*)', 'abxxxcdefxxx')
['', 'a', '', 'b', 'xxx', '', 'c', '', 'd', '', 'e', '',
'f', 'xxx', '']

Using the same approach, these results would also seem
perfectly reasonable to me:

>>> re.split('(?m)$', 'foo\nbar\nbaz')
['foo', '\nbar', '\nbaz']
>>> re.split('(?m)^', 'foo\nbar\nbaz')
['foo\n', 'bar\n', 'baz']

Splitting a one-character string should be possible only if
the pattern matches that character:

>>> re.split(r'\w*', 'a')
['', '']
>>> re.split(r'\d*', 'a')
['a']
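
One possible reading of the rule above, sketched with re.finditer for patterns without capturing groups. The name split_filip_rule is invented for this illustration. "Adjacent to a non-empty match" is taken here to mean "starts where the previous non-empty match ended", which is an assumption and covers only the greedy patterns used in the examples.

```python
import re

def split_filip_rule(pattern, text):
    """Split on all non-empty matches, and on empty matches only when they
    are inside the string and do not touch a preceding non-empty match."""
    pieces = []
    last = 0                   # end of the last separator that was used
    prev_nonempty_end = None   # end of the most recent non-empty match
    for m in re.finditer(pattern, text):
        if m.start() == m.end():
            at_edge = m.start() == 0 or m.start() == len(text)
            if at_edge or m.start() == prev_nonempty_end:
                continue       # ignored under the proposed rule
        else:
            prev_nonempty_end = m.end()
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

print(split_filip_rule('x*', 'abxxxcdefxxx'))       # ['a', 'b', 'c', 'd', 'e', 'f', '']
print(split_filip_rule('(?m)$', 'foo\nbar\nbaz'))   # ['foo', '\nbar', '\nbaz']
print(split_filip_rule(r'\w*', 'a'))                # ['', '']
```

The grouping example with '(x*)' is not covered by this sketch.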

mkc · 2006-01-17T22:37:46Z

Logged In: YES
user_id=555

I think I still agree with my original answer on this (see
http://mail.python.org/pipermail/python-dev/2004-August/047321.html).

I'm completely worn down on this, though, so I'd happily
take any of these options as an improvement over the present
situation.

mkc · 2007-02-22T03:23:22Z

Hello from 2004! This is your long-lost bug in re.split--how's it going? I'm still alive and well. I think everyone pretty much agrees that I really am a bug, and at least one guy still writes code just to work around me every few weeks or so. My attempt to keep a low profile is doing well--I'm not even documented in the library reference. This allows me to meet new Python users on a regular basis (whether they like it or not).

Well, that's it for now. If I don't hear from you until then, I'll drop you another line in 2009. (Hey I'm a poet, too!)

Regards,
bug 852532/patch 988761

gpshead · 2008-07-07T04:39:48Z

Take a look at the patch being worked on in issue bpo-3262.

birkenfeld · 2008-07-16T22:35:16Z

Closing as a duplicate of bpo-3262, which seems to be active.

mkc mannequin assigned effbot Jul 11, 2004

mkc mannequin added the extension-modules (C modules in the Modules dir) label Jul 11, 2004

birkenfeld closed this as completed Jul 16, 2008

ezio-melotti transferred this issue from another repository Apr 9, 2022
