
Author Mark.Bell
Recipients Catherine.Devlin, Mark.Bell, Philippe Cloutier, ZackerySpytz, barry, cheryl.sabella, corona10, gvanrossum, karlcow, mrabarnett, serhiy.storchaka, syeberman, veky
Date 2021-05-18.13:13:50
Message-id <1621343631.06.0.324337514234.issue28937@roundup.psfhosted.org>
Content
I have taken a look at the original patch and updated it so that it is compatible with the current release. I have also flipped the logic in the wrapping functions so that they take a `keepempty` flag (the opposite of the `prune` flag).

I had to make a few extra changes, since functions like PyUnicode_Split now contain shortcuts: they spot that if len(self) < len(sep) then they can just return [self]. That shortcut now needs an extra test, since with keepempty=False it can only be used if len(self) > 0 (an empty string should give [] rather than ['']). You can find the code here: https://github.com/markcbell/cpython/tree/split-keepempty
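
As a rough illustration of that extra test, the early return can be modelled in Python. This is only a sketch: the helper name `split_shortcut` is hypothetical, and the real check lives in the C code of the branch linked above.

```python
def split_shortcut(s, sep, keepempty=True):
    """Model of the early return; None means "no shortcut, do the full split"."""
    if len(s) < len(sep):
        # sep cannot occur in s, so s itself is the only candidate item.
        if s or keepempty:
            return [s]
        # Empty input with keepempty=False: there is nothing to keep.
        return []
    return None


print(split_shortcut('', ' '))
print(split_shortcut('', ' ', keepempty=False))
```

The first call models the existing behaviour of `''.split(' ')`; the second models the proposed `keepempty=False` case.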

However, in exploring this, I'm not sure that this patch interacts correctly with maxsplit. For example,
    '   x y z'.split(maxsplit=1, keepempty=True)
results in
    ['', '', 'x', 'y z']
since the first two empty string items are "free" and don't count towards maxsplit. I think the length of the returned result must be <= maxsplit + 1; is this right?
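
For comparison, the current str.split (without keepempty) already respects this bound whenever maxsplit is non-negative. A quick check against today's behaviour, where the helper name `within_maxsplit_bound` and the sample strings are only illustrative:

```python
def within_maxsplit_bound(s, sep, maxsplit):
    """Return True if s.split(sep, maxsplit) has at most maxsplit + 1 items."""
    return len(s.split(sep, maxsplit)) <= maxsplit + 1


samples = ['', '  ', '   x y z', '  a b c  ']
bound_holds = all(
    within_maxsplit_bound(s, sep, n)
    for s in samples
    for sep in (None, ' ')  # whitespace splitting and an explicit separator
    for n in range(7)       # non-negative maxsplit values only
)
print(bound_holds)
```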

I'm about to rework the logic to avoid this, but before I go too far, could someone please double-check my test cases to make sure that I have the correct idea about how this is supposed to work? Only the 8 lines marked "New case" show new behaviour; all the others come from how str.split works currently. Of course, the same patterns should apply to bytes and bytearray objects.

    ''.split() == []
    ''.split(' ') == ['']
    ''.split(' ', keepempty=False) == []    # New case

    '  '.split(' ') == ['', '', '']
    '  '.split(' ', maxsplit=1) == ['', ' ']
    '  '.split(' ', maxsplit=1, keepempty=False) == [' ']    # New case

    '  a b c  '.split() == ['a', 'b', 'c']
    '  a b c  '.split(maxsplit=0) == ['a b c  ']
    '  a b c  '.split(maxsplit=1) == ['a', 'b c  ']

    '  a b c  '.split(' ') == ['', '', 'a', 'b', 'c', '', '']
    '  a b c  '.split(' ', maxsplit=0) == ['  a b c  ']
    '  a b c  '.split(' ', maxsplit=1) == ['', ' a b c  ']
    '  a b c  '.split(' ', maxsplit=2) == ['', '', 'a b c  ']
    '  a b c  '.split(' ', maxsplit=3) == ['', '', 'a', 'b c  ']
    '  a b c  '.split(' ', maxsplit=4) == ['', '', 'a', 'b', 'c  ']
    '  a b c  '.split(' ', maxsplit=5) == ['', '', 'a', 'b', 'c', ' ']
    '  a b c  '.split(' ', maxsplit=6) == ['', '', 'a', 'b', 'c', '', '']

    '  a b c  '.split(' ', keepempty=False) == ['a', 'b', 'c']    # New case
    '  a b c  '.split(' ', maxsplit=0, keepempty=False) == ['  a b c  ']    # New case
    '  a b c  '.split(' ', maxsplit=1, keepempty=False) == ['a', 'b c  ']    # New case
    '  a b c  '.split(' ', maxsplit=2, keepempty=False) == ['a', 'b', 'c  ']    # New case
    '  a b c  '.split(' ', maxsplit=3, keepempty=False) == ['a', 'b', 'c', ' ']    # New case
    '  a b c  '.split(' ', maxsplit=4, keepempty=False) == ['a', 'b', 'c']    # New case
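
One possible reading of the new cases, sketched in pure Python for an explicit separator: a separator that would produce an empty item is skipped without counting towards maxsplit. The function name `split_keepempty` and this rule are assumptions made for illustration, not the logic of the patch. Under this reading the `''` case and the six `'  a b c  '` new cases come out as listed, but `'  '.split(' ', maxsplit=1, keepempty=False)` would give `[]` rather than `[' ']`, so that case may need a different rule.

```python
def split_keepempty(s, sep, maxsplit=-1, keepempty=True):
    """Pure-Python model of split with an explicit, non-empty separator."""
    if not sep:
        raise ValueError("empty separator")
    items = []
    start = 0
    splits = 0
    while maxsplit < 0 or splits < maxsplit:
        end = s.find(sep, start)
        if end < 0:
            break
        item = s[start:end]
        if item or keepempty:
            items.append(item)
            splits += 1  # only a kept item uses up one of the allowed splits
        start = end + len(sep)
    rest = s[start:]
    if rest or keepempty:
        items.append(rest)
    return items


print(split_keepempty('  a b c  ', ' ', maxsplit=3, keepempty=False))
```

With keepempty=True the model follows the existing str.split behaviour for an explicit separator, which makes it easy to compare against the unmarked cases above.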