Issue 37620: str.split(sep=None, maxsplit=-1,any=False)


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/81801

classification

Title:	str.split(sep=None, maxsplit=-1,any=False)
Type:	enhancement	Stage:	resolved
Components:	Documentation	Versions:	Python 3.9

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, hcoin, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2019-07-18 13:59 by hcoin, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (5)
msg348116 - (view)	Author: Harry Coin (hcoin)	Date: 2019-07-18 13:59
When I first read the str.split documentation, I took it to mean 'ab\t cd ef'.split(sep=' \t') --> ['ab', 'cd', 'ef'], especially since the example in the docs that uses <> would give the documented result under that reading too. I suggest adding a parameter 'any=False' which, by default, keeps the current behavior. When True, it treats each character in the sep string as a delimiter and drops any run of them from the resulting list. The use cases are many, for example parsing the /etc/hosts file, where an address is followed by white space that could be any combination of \t and ' ', followed by more text. One could imagine 'abc \tdef, hgi,jlk'.split(', \t', any=True) -> ['abc', 'def', 'hgi', 'jlk'] being used quite often.
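To make the request concrete, here is a minimal sketch of the difference between what str.split does today and what the proposed any=True option would do. The helper name split_any is hypothetical and is not part of Python; it only illustrates the semantics described in the message above.

```python
def split_any(text, chars):
    """Hypothetical helper: split on runs of ANY character in chars.

    Empty pieces are dropped, mirroring what the proposed any=True
    option is described as doing. This is not a real str method.
    """
    pieces = []
    current = []
    for ch in text:
        if ch in chars:
            # A separator ends the current piece, if there is one.
            if current:
                pieces.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append(''.join(current))
    return pieces


# Today, sep is matched as one whole substring (space followed by tab),
# which does not occur in this input, so nothing is split.
current_behavior = 'ab\t cd ef'.split(' \t')

# The behavior the reporter expected.
wanted = split_any('ab\t cd ef', ' \t')
mixed = split_any('abc \tdef, hgi,jlk', ', \t')
```

With these inputs, current_behavior is the whole string in a one-element list, while wanted and mixed contain only the non-separator pieces.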
msg348147 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-07-19 07:18
This API is already overloaded with two distinct algorithms -- see https://stackoverflow.com/questions/16645083 . If new functionality is truly needed, it should be in a separate method rather than feature creeping in additional variants. Also, the OP's post seems to be grounded in an initial misreading of the docs rather than in compelling use cases for a new option. So, there may be room for improving the documentation, especially the examples.
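The "two distinct algorithms" mentioned above can be seen by comparing a call without sep to a call with an explicit sep. The following is a small illustration using only documented str.split behavior; the variable names are chosen here for the example.

```python
text = ' a  b '

# Algorithm 1: no sep. Runs of whitespace act as one separator and
# leading/trailing whitespace produces no empty strings.
no_sep = text.split()

# Algorithm 2: explicit sep. Every occurrence of the exact substring
# splits, so adjacent separators produce empty strings.
with_sep = text.split(' ')

# A multi-character sep is matched as a whole substring, not as a set.
multi_char = '1<>2<>3'.split('<>')

# The two algorithms also disagree on the empty string.
empty_no_sep = ''.split()
empty_with_sep = ''.split(' ')
```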
msg348189 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-07-19 18:34
An alternative is to use regular expressions.

>>> re.split('[\t ]+', 'ab\t cd ef')
['ab', 'cd', 'ef']
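The regular-expression approach can be generalized to an arbitrary set of separator characters. The sketch below is one way to do that, not something proposed in the thread: the function name split_on_chars is hypothetical, and it assumes chars is a non-empty string. It uses re.escape so that characters such as ']' or '-' are safe inside the character class, and it filters out the empty strings that re.split returns when the input starts or ends with a separator.

```python
import re


def split_on_chars(text, chars):
    """Hypothetical helper: split text on runs of any character in chars.

    Assumes chars is non-empty; an empty character class is not a
    valid pattern.
    """
    pattern = '[' + re.escape(chars) + ']+'
    # re.split keeps empty strings at the edges, so drop them.
    return [piece for piece in re.split(pattern, text) if piece]


fields = split_on_chars('abc \tdef, hgi,jlk', ', \t')

# Without the filter, a leading separator yields an empty first item.
raw = re.split('[\t ]+', ' ab cd')

# For whitespace-only separators, such as an /etc/hosts-style line,
# plain str.split() with no arguments already handles mixed tabs
# and spaces.
hosts_fields = '127.0.0.1\tlocalhost  loopback\n'.split()
```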
msg348192 - (view)	Author: Harry Coin (hcoin)	Date: 2019-07-19 18:59
I suspect the number of times the str.split builtin was considered and then rejected in favor of the much more complex and 'heavy' re module far, far exceeds the number of times it was used with more than one character in the separator string. The str.split documentation 'feels like' the Python equivalent of the Linux 'tr' utility, which treats the separator characters as a set instead of a sequence. Notice that the default behavior and the help(str.split) documentation tend to encourage that intuition, since passing no sep= behaves very differently: no argument 'removes any whitespace and discards empty strings from the result'. That leads one to suspect each character in a sep string would be treated the same way.

Mostly it's use-case-driven obviousness: you'd think Python would naturally do that in str.split. So very many cases seek to resolve a string into a list of the interesting bits without regard to any mix of separators (tabs, spaces, etc., used to increase the readability of the file). I think adding the 'any=True' parameter would be a heavily used enhancement. Or, in the alternative, allow the argument to sep to be an iterable, so that:

'ab, cd'.split(sep=' ,') --> ['ab, cd']

but

'ab, cd'.split(sep=[' ', ',']) -> ['ab', 'cd']
msg348203 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2019-07-19 23:40
Harry, thanks for the suggestion, but we're going to pass for now.

For the most part, str.split() has stood the test of time, and we use regexes for more customized or complicated variants. In general, adding API options increases complexity for users, making the basic methods harder to learn and remember. Fewer choices tend to make for easier programming. Guido has repeatedly given guidance to prefer separate functions and methods over putting multiple algorithmic flags in a single function or method.

If you want to pursue this further, consider posting to the python-ideas list. You'll find more traction if you're able to find real-world code that would benefit from the new API.

One other thought: the API for the strip/lstrip/rstrip methods has the equivalent of the any option when chars are specified. That has not worked out well -- people get surprised when the methods strip more than the exact string specified: 'there'.rstrip('re') --> 'th'.
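The rstrip surprise mentioned above is easy to reproduce. The short sketch below contrasts the set-of-characters behavior of rstrip with removing an exact suffix; the slicing expression is just one way to do the latter, and the variable names are chosen here for illustration.

```python
word = 'there'
suffix = 're'

# rstrip treats its argument as a SET of characters: it keeps removing
# trailing 'r' and 'e' characters until it hits something else.
as_char_set = word.rstrip(suffix)

# Removing the exact trailing string, which is what people usually
# expect. (Python 3.9 and later also offer str.removesuffix for this.)
as_exact_string = word[:-len(suffix)] if word.endswith(suffix) else word
```

Here as_char_set loses three characters while as_exact_string loses only the two-character suffix.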

History
Date	User	Action	Args
2022-04-11 14:59:18	admin	set	github: 81801
2019-07-19 23:40:54	rhettinger	set	status: open -> closed resolution: rejected messages: + msg348203 stage: resolved
2019-07-19 18:59:56	hcoin	set	messages: + msg348192
2019-07-19 18:34:56	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg348189
2019-07-19 07:18:58	rhettinger	set	versions: + Python 3.9 nosy: + rhettinger, docs@python messages: + msg348147 assignee: docs@python components: + Documentation, - Library (Lib)
2019-07-18 13:59:00	hcoin	create