Message 334075 - Python tracker


Author	steven.daprano
Recipients	mark.dickinson, remi.lapeyre, rhettinger, steven.daprano
Date	2019-01-19.23:27:32
Message-id	<1547940452.3.0.168782150517.issue35775@roundup.psfhosted.org>
In-reply-to

Content
Rémi, I've read over your patch and have some comments:

(1) You call sorted() to produce a list, but then instead of retrieving the item using ``data[i-1]`` you use ``itertools.islice``. That seems unnecessary to me. Do you have a reason for using ``islice``?
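To make point (1) concrete, here is a minimal sketch. The patch itself is not reproduced in this message, so both function names and the 1-based rank ``i`` are assumptions chosen for illustration, not the patch's actual code.

```python
from itertools import islice


def select_islice(data, i):
    """Hypothetical reconstruction: sort, then step through with islice."""
    ordered = sorted(data)
    # islice walks the list from the start until it reaches position i-1.
    return next(islice(ordered, i - 1, None))


def select_index(data, i):
    """Same result using plain indexing on the sorted list."""
    ordered = sorted(data)
    # sorted() already returns a list, so direct indexing is available.
    return ordered[i - 1]


data = [7, 1, 5, 3, 9]
assert select_islice(data, 2) == select_index(data, 2) == 3
```

Both versions return the same item for a valid rank. They differ for an out-of-range rank: indexing raises ``IndexError``, while ``next()`` on an exhausted ``islice`` raises ``StopIteration``.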

(2) select is not very useful on its own; we actually want it so that we can calculate quantiles, e.g. percentiles, deciles, quartiles. If we want the k-quantiles (e.g. k=100 for percentiles), then there are k+1 of them in total, including the minimum and maximum. E.g. quartiles divide the data set into four equal sections, so there are five boundary values including the min and max.

So the caller is likely to call select repeatedly on the same data set, making and sorting a fresh copy of the data on every call. If the data set is small, repeatedly making sorted copies is still cheap enough, but for large data sets it will be expensive.

Do you have any thoughts on how to deal with that?
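One possible direction, sketched under assumptions: sort the data once and read several ranks from the same sorted copy. The helper name ``select_many`` and its 1-based ``ranks`` argument are hypothetical and do not come from the patch or the statistics module.

```python
def select_many(data, ranks):
    """Return the items at the given 1-based ranks, sorting only once."""
    ordered = sorted(data)  # a single O(n log n) sort, shared by every rank
    return [ordered[r - 1] for r in ranks]


# Nine data points: the five quartile boundaries (min, Q1, median, Q3, max)
# fall exactly on ranks 1, 3, 5, 7 and 9, so no interpolation is needed here.
data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
boundaries = select_many(data, [1, 3, 5, 7, 9])
assert boundaries == [1, 3, 5, 7, 9]
```

For data sizes where the boundaries do not land on whole ranks, some interpolation rule would be needed, which this sketch deliberately leaves out.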
