Author filip
Date 2006-01-16.21:56:08
Content

I agree completely that splitting on zero-width matches should
be supported, and that the default behavior should change
at some point, but I don't think this patch quite covers
it. Take an example from the python-dev thread of
August 2004
(http://mail.python.org/pipermail/python-dev/2004-August/047272.html):

>>> re.split('x*', 'abxxxcdefxxx', emptyok=True)
['', 'a', 'b', '', 'c', 'd', 'e', 'f', '', '']

To me, this means there's an empty string, beginning and
ending at position 0, followed by a zero-width divider that also
begins and ends at that position, followed by an
'a', and so on. That seems awkward. I think a more intuitive
result would be (I'm omitting the emptyok argument in the
following examples):

>>> re.split('x*', 'abxxxcdefxxx')
['a', 'b', 'c', 'd', 'e', 'f', '']

That is, empty matches cause a split when they are not
adjacent to a non-empty match and not at the beginning or
the end of the string. Grouping parentheses would, of
course, reveal the empty-string boundaries:

>>> re.split('(x*)', 'abxxxcdefxxx')
['', 'a', '', 'b', 'xxx', '', 'c', '', 'd', '', 'e', '',
'f', 'xxx', '']

Using the same approach, these results would also seem
perfectly reasonable to me:

>>> re.split('(?m)$', 'foo\nbar\nbaz')
['foo', '\nbar', '\nbaz']
>>> re.split('(?m)^', 'foo\nbar\nbaz')
['foo\n', 'bar\n', 'baz']

Splitting a one-character string should be possible only if
the pattern matches that character:

>>> re.split(r'\w*', 'a')
['', '']
>>> re.split(r'\d*', 'a')
['a']
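
The rule described above (an empty match splits only when it is not
adjacent to a non-empty match and not at the beginning or end of the
string) can be sketched roughly as follows. This is only an
illustration built on re.finditer: the name proposed_split is
hypothetical, it is not part of the re module or of the patch, and it
does not handle capturing groups.

```python
import re

def proposed_split(pattern, string):
    """Hypothetical helper: split on matches, ignoring some empty ones."""
    matches = list(re.finditer(pattern, string))

    # Positions where a non-empty match starts or ends. An empty match
    # at one of these positions is "adjacent" and must not split.
    touched = set()
    for m in matches:
        if m.end() > m.start():
            touched.add(m.start())
            touched.add(m.end())

    pieces = []
    last = 0
    for m in matches:
        start, end = m.span()
        if start == end:
            # Skip empty matches at either end of the string or next to
            # a non-empty match.
            if start == 0 or start == len(string) or start in touched:
                continue
        pieces.append(string[last:start])
        last = end
    pieces.append(string[last:])
    return pieces

print(proposed_split('x*', 'abxxxcdefxxx'))
print(proposed_split('(?m)$', 'foo\nbar\nbaz'))
print(proposed_split('(?m)^', 'foo\nbar\nbaz'))
```

Under these assumptions the sketch gives the same lists as the
non-grouping examples in this comment, including the one-character
cases.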
History
Date User Action Args
2007-08-23 15:38:36 admin link issue988761 messages
2007-08-23 15:38:36 admin create