Message 282936 - Python tracker

Author	rhettinger
Recipients	abarry, barry, mrabarnett, rhettinger, serhiy.storchaka
Date	2016-12-11.19:00:22
Message-id	<1481482822.84.0.045866357928.issue28937@psf.upfronthosting.co.za>

Content

A few randomly ordered thoughts about splitting:

* The best general-purpose text splitter I've ever seen is in MS Excel and is called "Text to Columns". It has a boolean flag, "treat consecutive delimiters as one", which is off by default.

* There is a nice discussion on the complexities of the current design on StackOverflow:  http://stackoverflow.com/questions/16645083  In addition, there are many other SO questions about the behavior of str.split().

* The learning curve for str.split() is already high. The doc entry for it has been revised many times to try to explain what it does. I'm concerned that adding another algorithmic option to it may make it more difficult to learn and use in the common cases (API design principle: giving users more options can impair usability). Usually in Python courses, I recommend using str.split() for the simple, common cases and using regex when you need more control.

* What I do like about the proposal is that it fills a gap: there is currently no clean way to take the default whitespace splitting algorithm and customize it to a particular subset of whitespace (e.g. tabs only).

* A tangential issue is that it was a mistake to expose the maxsplit=-1 implementation detail. In Python 2.7, the help was "S.split([sep [,maxsplit]])". But the folks implementing the argument clinic had no way of coping with optional arguments that don't have a default value (like dict.pop), so they changed the API so that the implementation detail was exposed, "S.split(sep=None, maxsplit=-1)". IMO, this is an API regression. We really don't want people passing in -1 to indicate that there is no limit. The Python way would have been to use None as a default or to stick with the existing API where the number of arguments supplied is part of the API (much like type() has two different meanings depending on whether it has an arity of 1 or 3).
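
To make the maxsplit point concrete, here is a minimal sketch. The wrapper split_unlimited() is hypothetical and not part of the str API; it only illustrates what a None default could look like, and the type() calls show the arity-dependent behavior mentioned above.

```python
# Hypothetical wrapper (an assumption for illustration, not a real str method):
# None means "no limit", so callers never need to know about -1.
def split_unlimited(text, sep=None, maxsplit=None):
    """Split text, treating maxsplit=None as 'no limit'."""
    if maxsplit is None:
        return text.split(sep)
    return text.split(sep, maxsplit)


s = "a b c d"

# Today the exposed implementation detail works: -1 means "no limit".
via_minus_one = s.split(None, -1)           # ['a', 'b', 'c', 'd']

# With the wrapper, omitting maxsplit (or passing None) means the same thing.
via_none = split_unlimited(s)               # ['a', 'b', 'c', 'd']
limited = split_unlimited(s, maxsplit=1)    # ['a', 'b c d']

# type() is an API whose meaning depends on the number of arguments:
queried = type(s)                           # one argument: returns the type
created = type("Point", (), {"x": 0})       # three arguments: creates a class
```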

Overall, I'm +0 on the proposal, but good consideration should be given to 1) whether there is a sufficient need to warrant increasing API complexity, making split() more difficult to learn and remember, 2) whether "prune" is the right word (can someone who didn't write the code read it clearly afterwards), and 3) whether this could instead be addressed through documentation (e.g. showing the simple regexes needed for cases not covered by str.split).
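
As a rough sketch of what such documentation examples might look like, the snippet below contrasts the existing str.split() behaviors with the regex equivalents for the "treat consecutive delimiters as one" case. The helper split_on_runs() is a hypothetical name used only for illustration; it is not the API proposed in this issue.

```python
import re

line = "a\t\tb \tc"

# Explicit separator: consecutive delimiters yield empty fields.
explicit = line.split("\t")             # ['a', '', 'b ', 'c']

# Default (sep=None): runs of any whitespace are merged, ends are trimmed.
default = line.split()                  # ['a', 'b', 'c']

# Tabs only, with consecutive tabs treated as one delimiter.
tabs_merged = re.split(r"\t+", line)    # ['a', 'b ', 'c']

# The same idea for a non-whitespace delimiter.
commas_merged = re.split(r",+", "a,,b")  # ['a', 'b']


def split_on_runs(text, chars):
    """Hypothetical helper: split on runs of any character in chars.

    Leading and trailing runs are ignored, mirroring the default
    whitespace algorithm. Assumes chars is a non-empty string.
    """
    pattern = "[" + re.escape(chars) + "]+"
    # Empty strings can only appear at the ends, so filtering them out
    # drops the leading/trailing pieces without touching real fields.
    return [field for field in re.split(pattern, text) if field]


trimmed = split_on_runs("\ta\t\tb \tc\t", "\t")   # ['a', 'b ', 'c']
```

Note that the tabs-only results keep the space in 'b ', which is the point: only the chosen subset of whitespace acts as a delimiter.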
