classification
Title: str.split(sep=None, maxsplit=-1,any=False)
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.9
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: docs@python Nosy List: docs@python, hcoin, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2019-07-18 13:59 by hcoin, last changed 2019-07-19 23:40 by rhettinger. This issue is now closed.

Messages (5)
msg348116 - Author: Harry Coin (hcoin) Date: 2019-07-18 13:59
When I first read the str.split documentation, I parsed it to mean
'ab\t cd ef'.split(sep=' \t') --> ['ab','cd','ef']
That reading seemed natural because the example in the docs that uses '<>' as the separator produces the same result when read that way.

I suggest adding a parameter 'any=False', which by default gives the current behavior but, when True, treats each character in the sep string as a delimiter and drops any combination of them from the resulting list.

The use cases are many, for example parsing the /etc/hosts file where we see an address, some white space that could be any combination of \t and ' ' followed by more text. 
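
For reference, a minimal sketch of the /etc/hosts case, assuming the fields are separated only by spaces and tabs. Calling str.split() with no separator already splits on any run of whitespace. The sample line and variable names are made up for illustration.

```python
# Hypothetical /etc/hosts entry mixing tabs and spaces between fields.
line = "127.0.0.1\t  localhost \t loopback"

# With no sep argument, str.split() splits on runs of any whitespace
# and drops empty strings, so the mix of '\t' and ' ' does not matter.
fields = line.split()
address, names = fields[0], fields[1:]

print(address)  # 127.0.0.1
print(names)    # ['localhost', 'loopback']
```

This only covers whitespace; separators such as commas still need another approach.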

One could imagine 'abc  \tdef, hgi,jlk'.split(', \t',any=True) -> ['abc','def','hgi','jlk'] being used quite often.
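
The requested semantics can be approximated today with a small helper. This is only a sketch: split_any is a hypothetical name, not part of str or the standard library, and it assumes that runs of separator characters should collapse and that empty strings should be dropped, as in the example above.

```python
import re

def split_any(text, seps):
    """Split text on runs of any character in seps (hypothetical helper)."""
    if not seps:
        # Mirror str.split(''), which also rejects an empty separator.
        raise ValueError("empty separator")
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    pattern = "[" + re.escape(seps) + "]+"
    # Leading or trailing separators yield empty strings; filter them out.
    return [part for part in re.split(pattern, text) if part]

print(split_any('abc  \tdef, hgi,jlk', ', \t'))
# ['abc', 'def', 'hgi', 'jlk']
```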
msg348147 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-07-19 07:18
This API is already overloaded with two distinct algorithms -- see https://stackoverflow.com/questions/16645083 .  If new functionality is truly needed, it should be in a separate method rather than feature creeping in additional variants.

Also, the OP's post seems to be grounded in an initial misreading of the docs rather than in compelling use cases for a new option.  So, there may be room for improving the documentation, especially the examples.
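
The two algorithms mentioned above can be seen side by side in a short example. The sample strings are illustrative only.

```python
text = "a  b\t c"

# Algorithm 1: sep omitted (or None). Runs of whitespace act as one
# separator, and no empty strings are produced.
print(text.split())       # ['a', 'b', 'c']

# Algorithm 2: explicit sep. The whole string is matched as one literal
# separator, and adjacent separators produce empty strings.
print(text.split(" "))    # ['a', '', 'b\t', 'c']

# A multi-character sep is a sequence, not a set of characters.
print("1<>2<>3".split("<>"))  # ['1', '2', '3']
print("1<2>3".split("<>"))    # ['1<2>3']
```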
msg348189 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-07-19 18:34
An alternative is to use regular expressions. 

>>> import re
>>> re.split('[\t ]+', 'ab\t cd ef')
['ab', 'cd', 'ef']
msg348192 - Author: Harry Coin (hcoin) Date: 2019-07-19 18:59
I suspect the number of times the str.split builtin was examined for use 
and rejected in favor of the much more complex and 'heavy' re module 
far, far exceeds the number of times it found use with more than one 
character in the split string.

The str.split documentation 'feels like' the Python equivalent of the
Linux 'tr' utility, which treats the separator characters as a set
instead of a sequence. Notice that the default behavior and the
help(str.split) text tend to encourage that intuition, because omitting
sep= behaves very differently: with no argument it 'removes any
whitespace and discards empty strings from the result'. That leads one
to suspect each character in a sep string would be treated the same way.

Mostly it's use-case-driven obviousness: you'd think Python would
naturally do that in str.split. So very many cases seek to resolve a
string into a list of the interesting bits without regard to any mix of
separators (tabs, spaces, etc., used to increase the readability of the file).

I think it would be a heavily used enhancement to add the 'any=True' 
parameter.

Or,  in the alternative, allow the argument to sep to be an iterable so 
that:

'ab, cd'.split(sep=' ,') -->  ['ab, cd']

but

'ab, cd'.split(sep=[' ',',']) -> ['ab', 'cd']
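
The iterable form can be sketched the same way. split_on_each is a hypothetical helper, not an existing str method, and it assumes that adjacent separators collapse into one and that empty strings are dropped, which is what the example result above implies.

```python
import re

def split_on_each(text, seps):
    """Split text on any of the separator strings in seps (hypothetical)."""
    seps = list(seps)
    if not seps or any(not s for s in seps):
        raise ValueError("empty separator")
    # Try longer separators first so that, e.g., '<>' wins over '<'.
    alternatives = sorted(seps, key=len, reverse=True)
    pattern = "(?:" + "|".join(re.escape(s) for s in alternatives) + ")+"
    # Drop the empty strings produced by leading or trailing separators.
    return [part for part in re.split(pattern, text) if part]

print('ab, cd'.split(sep=' ,'))             # ['ab, cd']
print(split_on_each('ab, cd', [' ', ',']))  # ['ab', 'cd']
```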

msg348203 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-07-19 23:40
Harry, thanks for the suggestion but we're going to pass for now.

For the most part, str.split() has stood the test of time and we use regexes for more customized or complicated variants.  In general, adding API options increases complexity for users, making the basic methods harder to learn and remember.  Fewer choices tend to make for easier programming. Guido has repeatedly given guidance to prefer separate functions and methods over putting multiple algorithmic flags in a single function or method.

If you want to pursue this further, consider posting to the python-ideas list.  You'll find more traction if you're able to find real-world code that would benefit from the new API.

One other thought: the API for the strip/lstrip/rstrip methods has the equivalent of the *any* option when *chars* are specified.  That has not worked out well -- people get surprised when the methods strip more than the exact string specified: 'there'.rstrip('re') --> 'th'.
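
The rstrip surprise is easy to reproduce. In the sketch below, strip_suffix is a hypothetical helper written out for illustration; Python 3.9 and later provide str.removesuffix for the same exact-string case.

```python
def strip_suffix(text, suffix):
    """Remove suffix only if text ends with that exact string."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

# rstrip treats 're' as the set {'r', 'e'} and strips trailing characters
# until it meets one outside the set, so 'e', 'r', 'e' are all removed.
print('there'.rstrip('re'))         # th

# Exact-suffix removal only drops the final 're'.
print(strip_suffix('there', 're'))  # the
```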
History
Date User Action Args
2019-07-19 23:40:54 rhettinger set status: open -> closed; resolution: rejected; messages: + msg348203; stage: resolved
2019-07-19 18:59:56 hcoin set messages: + msg348192
2019-07-19 18:34:56 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg348189
2019-07-19 07:18:58 rhettinger set versions: + Python 3.9; nosy: + rhettinger, docs@python; messages: + msg348147; assignee: docs@python; components: + Documentation, - Library (Lib)
2019-07-18 13:59:00 hcoin create