Message393871
I have taken a look at the original patch and updated it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (the opposite of the original `prune` flag).
I had to make a few extra changes since there are now some extra catches in things like PyUnicode_Split, which spot that if len(self) < len(sep) they can just return [self]. That shortcut now needs an extra test, since with keepempty=False it can only be used if len(self) > 0. You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty
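As a rough illustration of that extra test, here is a pure-Python model of the short-input shortcut. The function name `short_input_shortcut` is made up for this sketch, and `keepempty` is the proposed flag rather than anything released Python accepts; the real check lives in the C implementation and may be structured differently.

```python
def short_input_shortcut(s, sep, keepempty=True):
    """Hypothetical model of the fast path; returns None when it does not apply."""
    if len(s) >= len(sep):
        return None   # sep could occur in s, so a real search is needed
    if s or keepempty:
        return [s]    # sep cannot occur, so the whole string is the only item
    return []         # empty input with keepempty=False yields no items


print(short_input_shortcut('', ' '))                   # matches ''.split(' ')
print(short_input_shortcut('', ' ', keepempty=False))  # the proposed new case
```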
However in exploring this, I'm not sure that this patch interacts correctly with maxsplit. For example,
'  x y z'.split(maxsplit=1, keepempty=True)
results in
['', '', 'x', 'y z']
since the first two empty strings are "free" and don't count towards maxsplit. I think the length of the result must be <= maxsplit + 1. Is this right?
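For comparison, the released str.split does seem to respect that bound, both with an explicit separator and with whitespace splitting. A quick check, where the helper name `bound_holds` and the sample string are just for this sketch:

```python
def bound_holds(s, sep, limit=10):
    """Check len(s.split(sep, n)) <= n + 1 for every n in 0..limit."""
    return all(len(s.split(sep, n)) <= n + 1 for n in range(limit + 1))


sample = '  a b c  '  # two leading and two trailing spaces
print(bound_holds(sample, ' '))   # explicit separator
print(bound_holds(sample, None))  # whitespace splitting
```

So keeping len(result) <= maxsplit + 1 with keepempty would at least be consistent with what split does today.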
I'm about to rework the logic to avoid this, but before I go too far, could someone double-check my test cases to make sure that I have the correct idea about how this is supposed to work? Only the 8 lines marked "New case" show new behaviour; all the others come from how str.split currently works. Of course the same patterns should apply to bytes and bytearray objects.
''.split() == []
''.split(' ') == ['']
''.split(' ', keepempty=False) == [] # New case
'  '.split(' ') == ['', '', '']
'  '.split(' ', maxsplit=1) == ['', ' ']
'  '.split(' ', maxsplit=1, keepempty=False) == [' '] # New case
'  a b c  '.split() == ['a', 'b', 'c']
'  a b c  '.split(maxsplit=0) == ['a b c  ']
'  a b c  '.split(maxsplit=1) == ['a', 'b c  ']
'  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
'  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
'  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
'  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
'  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
'  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
'  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
'  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']
'  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c'] # New case
'  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  '] # New case
'  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  '] # New case
'  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  '] # New case
'  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' '] # New case
'  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c'] # New case
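The cases that are not marked "New case" can be checked directly against a released Python. A small script for that is below; the inputs use two leading and two trailing spaces, which is what the expected results imply. The "New case" lines are left out on purpose, since `keepempty` is only a proposal and released str.split would reject it with a TypeError.

```python
s = '  a b c  '  # two leading and two trailing spaces

# (actual result on this interpreter, expected result from the list above)
current_cases = [
    (''.split(), []),
    (''.split(' '), ['']),
    ('  '.split(' '), ['', '', '']),
    ('  '.split(' ', maxsplit=1), ['', ' ']),
    (s.split(), ['a', 'b', 'c']),
    (s.split(maxsplit=0), ['a b c  ']),
    (s.split(maxsplit=1), ['a', 'b c  ']),
    (s.split(' '), ['', '', 'a', 'b', 'c', '', '']),
    (s.split(' ', maxsplit=0), ['  a b c  ']),
    (s.split(' ', maxsplit=1), ['', ' a b c  ']),
    (s.split(' ', maxsplit=2), ['', '', 'a b c  ']),
    (s.split(' ', maxsplit=3), ['', '', 'a', 'b c  ']),
    (s.split(' ', maxsplit=4), ['', '', 'a', 'b', 'c  ']),
    (s.split(' ', maxsplit=5), ['', '', 'a', 'b', 'c', ' ']),
    (s.split(' ', maxsplit=6), ['', '', 'a', 'b', 'c', '', '']),
]

# Collect any case where released behaviour differs from the expectation.
failures = [(got, want) for got, want in current_cases if got != want]
print(failures)
```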
Date | User | Action | Args
2021-05-18 13:13:51 | Mark.Bell | set | recipients: + Mark.Bell, gvanrossum, barry, syeberman, mrabarnett, karlcow, serhiy.storchaka, Catherine.Devlin, veky, cheryl.sabella, corona10, ZackerySpytz, Philippe Cloutier
2021-05-18 13:13:51 | Mark.Bell | set | messageid: <1621343631.06.0.324337514234.issue28937@roundup.psfhosted.org>
2021-05-18 13:13:51 | Mark.Bell | link | issue28937 messages
2021-05-18 13:13:50 | Mark.Bell | create |