Author thomasahle
Recipients thomasahle
Date 2014-05-28.12:11:31
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1401279098.1.0.387904478686.issue21592@psf.upfronthosting.co.za>
In-reply-to
Content
The statistics module currently contains the following comment:

"FIXME: investigate ways to calculate medians without sorting? Quickselect?"

This is important, because users expect standard library functions to use state of the art implementations, and median by sorting has never been that.

There are many good linear time alternatives, the classical median-of-k algorithm was posted to the mailing list in a nice version by David Eppstein in 2002 [1]. The fastest method in practice is probably Quick Select in the Floyd-Rivest version [2] which is similar to quick sort.

These algorithms also have the feature of letting us select any k-th order statistic, not just the median. This seems conceptually a lot simpler than the current median/median_low/median_high split.

However, sticking with the current api, a new implementation would have to support calculating a median as the average of n//2 and (n+1)//2'th order statistics. This could be implemented either by calling the select function twice, or by implementing a multi-select algorithm, which is also a well studied problem [3].

I'll be happy to contribute code, or help out in any other way.

[1]: https://mail.python.org/pipermail/python-list/2002-May/132170.html
[2]: https://en.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorithm
[3]: https://domino.mpi-inf.mpg.de/intranet/ag1/ag1publ.nsf/0/59C74289A2291143C12571C30017DEA8/$file/Mehlhorn_a_2005_o.pdf
History
Date User Action Args
2014-05-28 12:11:38thomasahlesetrecipients: + thomasahle
2014-05-28 12:11:38thomasahlesetmessageid: <1401279098.1.0.387904478686.issue21592@psf.upfronthosting.co.za>
2014-05-28 12:11:37thomasahlelinkissue21592 messages
2014-05-28 12:11:32thomasahlecreate