Author rhettinger
Recipients abarry, barry, mrabarnett, rhettinger, serhiy.storchaka
Date 2016-12-11.19:00:22
Message-id <1481482822.84.0.045866357928.issue28937@psf.upfronthosting.co.za>
Content
A few randomly ordered thoughts about splitting:

* The best general purpose text splitter I've ever seen is in MS Excel and is called "Text to Columns".  It has a boolean flag, "treat consecutive delimiters as one" which is off by default.

* There is a nice discussion on the complexities of the current design on StackOverflow:  http://stackoverflow.com/questions/16645083  In addition, there are many other SO questions about the behavior of str.split().

* The learning curve for str.split() is already high.  The doc entry for it has been revised many times to try to explain what it does.  I'm concerned that adding another algorithmic option to it may make it more difficult to learn and use in the common cases (API design principle:  giving users more options can impair usability).  Usually in Python courses, I recommend using str.split() for the simple, common cases and using regex when you need more control.

* What I do like about the proposal is that it addresses a real gap: there is currently no clean way to take the default whitespace splitting algorithm and customize it to a particular subset of whitespace (e.g. tabs only).

* A tangential issue is that it was a mistake to expose the maxsplit=-1 implementation detail.  In Python 2.7, the help was "S.split([sep [,maxsplit]])".  But folks implementing Argument Clinic had no way of coping with optional arguments that don't have a default value (like dict.pop), so they changed the API so that the implementation detail was exposed, "S.split(sep=None, maxsplit=-1)".  IMO, this is an API regression.  We really don't want people passing in -1 to indicate that there are no limits.  The Python way would have been to use None as a default or to stick with the existing API where the number of arguments supplied is part of the API (much like type() has two different meanings depending on whether it has an arity of 1 or 3).
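
To make the points above concrete, here is a small sketch of the behaviors being discussed.  The helper name split_limit is hypothetical; it only illustrates what a None-based default for maxsplit could look like and is not an existing or proposed API.

```python
text = "a  b\t\tc"

# Default algorithm (sep=None): a run of any whitespace acts as one
# separator, and no empty strings are produced.
default_fields = text.split()        # ['a', 'b', 'c']

# Explicit separator: every occurrence delimits a field, so adjacent
# separators yield empty strings.
space_fields = text.split(" ")       # ['a', '', 'b\t\tc']

# The exposed implementation detail: -1 means "no limit".
same_as_default = text.split(None, -1) == text.split()


def split_limit(s, sep=None, maxsplit=None):
    """Hypothetical wrapper where None, rather than -1, means 'no limit'."""
    if maxsplit is None:
        return s.split(sep)
    return s.split(sep, maxsplit)


limited = split_limit(text, maxsplit=1)   # ['a', 'b\t\tc']
```

Note that the default algorithm and the explicit-separator algorithm differ in more than the choice of delimiter, which is part of why the method is hard to teach.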

Overall, I'm +0 on the proposal but there should be good consideration given to 1) whether there is a sufficient need to warrant increasing API complexity, making split() more difficult to learn and remember, 2) whether "prune" is the right word (can someone who didn't write the code read it clearly afterwards), and 3) whether this could instead be addressed through documentation (e.g. showing the simple regexes needed for cases not covered by str.split).
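
As a rough idea of what such documentation recipes might look like, here is a sketch using re.split() for two cases that str.split() does not cover directly.  The sample strings and variable names are made up for illustration.

```python
import re

row = "name\t\tage\tcity  state\t"

# Split on runs of tabs only; spaces inside a field are preserved.
# A leading or trailing tab leaves an empty string, so filter those out
# to mimic how the default whitespace algorithm drops empty fields.
tab_fields = [field for field in re.split(r"\t+", row) if field]
# ['name', 'age', 'city  state']

# Treat any run of commas or semicolons as a single delimiter.
mixed_fields = re.split(r"[,;]+", "a,,b;c")
# ['a', 'b', 'c']
```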