Message 340943 - Python tracker

Author	rhettinger
Recipients	mark.dickinson, rhettinger, steven.daprano
Date	2019-04-26.21:14:25
Message-id	<1556313266.36.0.372847048885.issue36546@roundup.psfhosted.org>

Content

Thanks for propelling this forward :-)  I'm really happy to have an easy-to-reach tool that readily summarizes the shape of data and that can be used to compare how distributions differ.


> Octave and Maple call their parameter "method", so if we 
> stick with  "method" we're in good company.

The Langford paper also uses the word "method", so that is likely just the right word.


> I'm more concerned about the values taken by the method parameter.
> "Inclusive" and "Exclusive" have a related but distinct meaning 
> when it comes to quartiles which is different from the Excel usage

Feel free to change it to whatever communicates best.  The meaning I was going for is closer to the notions of an open interval or a closed interval.  In terms of use cases, one is for describing population data where the minimum input really is the 0th percentile and the maximum is the 100th percentile.  The other is for sample data where the underlying population will have values outside the range of the empirical samples.  I'm not sure what words best describe the distinction.  The words "inclusive" and "exclusive" approximated that idea but maybe you can do better.
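
To make the population versus sample distinction concrete, here is a small sketch. It assumes the API as it later shipped in Python 3.8, `statistics.quantiles(data, *, n=4, method='exclusive')`, and not necessarily the draft under discussion here. The data values are made up for illustration.

```python
from statistics import quantiles

# Made-up data, already sorted for readability (quantiles() sorts internally).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# "inclusive": treat the data as the whole population, so the minimum
# is the 0th percentile and the maximum is the 100th percentile.
q_incl = quantiles(data, n=4, method="inclusive")

# "exclusive": treat the data as a sample from a population that may
# have values beyond the observed minimum and maximum.
q_excl = quantiles(data, n=4, method="exclusive")

print(q_incl)  # [3.0, 5.0, 7.0]
print(q_excl)  # [2.5, 5.0, 7.5]
```

With this data the two methods agree on the median, while the "exclusive" outer quartiles sit farther from the center because the method allows for unseen values outside the sample range.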


> I have a working version of quantiles() which supports cutpoints 
> and all nine calculation methods supported by R.

My recommendation is to not do this.  Usually, it's better to start simple, focusing on core use cases (i.e. sample and population), then let users teach us what additions they really need (this is a YAGNI argument).  Once a feature is offered, it can never be taken away even if it proves to be not helpful in most situations or is mostly unused.

In his 20-year retrospective, Hyndman expressed dismay that his paper had the opposite effect of what was intended (he had hoped for standardization on a single approach rather than a proliferation of all nine methods).  My experience in API design is that offering users too many choices complicates their lives, leading to suboptimal and incorrect choices and creating confusion.  That is likely why most software packages other than R only offer one or two options.

If you hold off, you can always add these options later.  We might just find that what we've got suffices for most everyday uses.  Also, I thought the spirit of the statistics module was to offer a few core statistical tools aimed at non-experts, deferring to external packages for richer collections of optimized, expert tools that cover every option.  For me, the best analogy is my two cameras. One is a point-and-shoot that is easy to use and does a reasonable job. The other is a professional SLR with hundreds of settings that I had to go to photography school to learn to use.

FWIW, I held off on adding "cut_points" because the normal use case is to get equally spaced quantiles.  It would be unusual to want 0.25 and 0.50 but not 0.75.  The other reason is that user-provided cut points conflict with the core concept of "Divide *dist* into *n* continuous intervals with equal probability."  User-provided cut points provide other ways to go wrong as well (not being sorted, 0.0 or 1.0 not being valid for some methods, values outside the range 0.0 to 1.0).  The need for cut_points makes more sense for numpy or scipy, where it is common to pass around a linspace. Everyday Python isn't like that.
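
As a rough illustration of the failure modes listed above, the sketch below contrasts equally spaced cut points, which follow from `n` alone, with the checks that user-supplied cut points would need. Both helpers (`equal_cut_points`, `check_cut_points`) and the `allow_endpoints` flag are hypothetical names invented for this example; neither is part of the statistics module.

```python
def equal_cut_points(n):
    """Probabilities of the n - 1 equally spaced cut points for n intervals."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [i / n for i in range(1, n)]


def check_cut_points(points, allow_endpoints=False):
    """Hypothetical validator: list the problems found in user-supplied cut points."""
    points = list(points)
    problems = []
    if points != sorted(points):
        problems.append("not sorted")
    if any(p < 0.0 or p > 1.0 for p in points):
        problems.append("outside the range 0.0 to 1.0")
    # Assumption: some methods cannot evaluate the exact endpoints.
    if not allow_endpoints and any(p in (0.0, 1.0) for p in points):
        problems.append("0.0 or 1.0 not valid for this method")
    return problems


print(equal_cut_points(4))                # [0.25, 0.5, 0.75]
print(check_cut_points([0.5, 0.25]))      # ['not sorted']
print(check_cut_points([0.0, 0.5, 1.2]))  # two problems reported
```

With only `n` as input, none of these checks is needed, which is part of the appeal of the simpler signature.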

History
Date	User	Action	Args
2019-04-26 21:14:26	rhettinger	set	recipients: + rhettinger, mark.dickinson, steven.daprano
2019-04-26 21:14:26	rhettinger	set	messageid: <1556313266.36.0.372847048885.issue36546@roundup.psfhosted.org>
2019-04-26 21:14:26	rhettinger	link	issue36546 messages
2019-04-26 21:14:25	rhettinger	create