classification
Title: Make statistics.median run in linear time
Type: performance
Stage: patch review
Components: Library (Lib)
Versions: Python 3.4, Python 3.5

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: alex, corona10, ezio.melotti, jtaylor, rhettinger, scoder, steven.daprano, terry.reedy, thomasahle, tim.peters, upendra-k14, vajrasky
Priority: normal
Keywords: patch

Created on 2014-05-28 12:11 by thomasahle, last changed 2018-05-07 13:25 by corona10.

Files
File name Uploaded Description Edit
select.py steven.daprano, 2014-06-01 04:13 timing tests for various selection implementations
select.pxd scoder, 2014-06-01 07:27 external static type declarations for select.py
select.html scoder, 2014-06-01 07:28 annotated sources generated by Cython
select.pxd scoder, 2014-06-01 10:54 external static type declarations for select.py, allows arbitrary input containers
select.html scoder, 2014-06-01 10:55 annotated sources generated by Cython (allowing arbitrary input containers)
select2.py thomasahle, 2014-06-02 11:54
effmedian.py upendra-k14, 2016-01-03 03:28 Alternative method for finding median using grouping of 5 elements
Pull Requests
URL Status Linked Edit
PR 3170 closed samnaughtonb, 2017-08-21 18:44
PR 5177 open python-dev, 2018-01-14 00:00
PR 6715 closed corona10, 2018-05-06 06:30
Messages (24)
msg219265 - (view) Author: Thomas Dybdahl Ahle (thomasahle) * Date: 2014-05-28 12:11
The statistics module currently contains the following comment:

"FIXME: investigate ways to calculate medians without sorting? Quickselect?"

This is important, because users expect standard library functions to use state of the art implementations, and median by sorting has never been that.

There are many good linear time alternatives, the classical median-of-k algorithm was posted to the mailing list in a nice version by David Eppstein in 2002 [1]. The fastest method in practice is probably Quick Select in the Floyd-Rivest version [2] which is similar to quick sort.

These algorithms also have the feature of letting us select any k-th order statistic, not just the median. This seems conceptually a lot simpler than the current median/median_low/median_high split.

However, sticking with the current API, a new implementation would have to support calculating the median as the average of the n//2'th and (n+1)//2'th order statistics. This could be implemented either by calling the select function twice, or by implementing a multi-select algorithm, which is also a well-studied problem [3].

I'll be happy to contribute code, or help out in any other way.

[1]: https://mail.python.org/pipermail/python-list/2002-May/132170.html
[2]: https://en.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorithm
[3]: https://domino.mpi-inf.mpg.de/intranet/ag1/ag1publ.nsf/0/59C74289A2291143C12571C30017DEA8/$file/Mehlhorn_a_2005_o.pdf
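As a hedged sketch of what this proposal describes (expected linear time via quickselect, with the even-length median taken as the average of two order statistics), not the code from the linked pastebin or the mailing-list post:

```python
import random

def nth_smallest(data, k):
    """Return the k-th smallest element (0-based) of data in expected O(n) time."""
    data = list(data)
    lo, hi = 0, len(data) - 1
    while True:
        if lo == hi:
            return data[lo]
        pivot = data[random.randint(lo, hi)]
        # Three-way partition of the active slice around the pivot.
        lt = [x for x in data[lo:hi + 1] if x < pivot]
        eq = [x for x in data[lo:hi + 1] if x == pivot]
        gt = [x for x in data[lo:hi + 1] if x > pivot]
        data[lo:hi + 1] = lt + eq + gt
        if k < lo + len(lt):
            hi = lo + len(lt) - 1
        elif k < lo + len(lt) + len(eq):
            return pivot
        else:
            lo = lo + len(lt) + len(eq)

def median(data):
    """Median via selection: one select for odd n, two selects for even n."""
    data = list(data)
    n = len(data)
    if n == 0:
        raise ValueError("no median for empty data")
    if n % 2 == 1:
        return nth_smallest(data, n // 2)
    return (nth_smallest(data, n // 2 - 1) + nth_smallest(data, n // 2)) / 2
```

A multi-select would avoid the second full pass for even n; this sketch just calls the select function twice, as the message suggests.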
msg219271 - (view) Author: Thomas Dybdahl Ahle (thomasahle) * Date: 2014-05-28 14:43
I have written some proof of concept code here [1], I would appreciate you commenting on it, before I turn it into a patch, as I haven't contributed code to Python before.

I have tried to write it as efficiently as possible, but it is of course possible that the c-implemented `sorted()` code will be faster than even the smartest python-implemented select.

[1]: http://pastebin.com/30x0j39a
msg219324 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-05-28 23:49
On Wed, May 28, 2014 at 02:43:29PM +0000, Thomas Dybdahl Ahle wrote:

> I have written some proof of concept code here [1], I would appreciate 
> you commenting on it, before I turn it into a patch, as I haven't 
> contributed code to Python before.

Thanks Thomas! I will try to look at this over the weekend (today is 
Thursday my time). If I haven't responded by Monday your time, please 
feel free to send me a reminder.

> I have tried to write it as efficiently as possible, but it is of 
> course possible that the c-implemented `sorted()` code will be faster 
> than even the smartest python-implemented select.

My quick-and-dirty tests suggest that you need at least 10 million items 
in the list before a pure Python median-of-median algorithm is as fast 
as the median algorithm based on sorting in C.

> [1]: http://pastebin.com/30x0j39a
msg219329 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2014-05-29 03:25
Yeah, I remember I tried to improve the performance of the median in the past using a median-of-k algorithm. But too bad, I could not beat the efficiency of the C-implemented sorted(). Maybe you'll have better luck!
msg219387 - (view) Author: Thomas Dybdahl Ahle (thomasahle) * Date: 2014-05-30 12:43
If you have a good, realistic test set, we can try testing quick-select vs sorting. If it's still not good, I can also reimplement it in C.
msg219419 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-05-30 19:44
To accept contributions, we need a signed (possibly electronically) contribution form. 
https://www.python.org/psf/contrib/contrib-form/
msg219437 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2014-05-31 02:53
I suggest this needs a measurable goal, like "minimize expected-case time" or "minimize worst-case time".  "Use a state of the art implementation" isn't a measurable goal - it's an abstract wish with no measurable merit on its own ;-)

Note that I wrote the "median-of-k" code referenced in the first post here (it was in reply to David Eppstein).  Even then it was hard to beat sort() + index.

It's harder now, and for an important reason:  Python's _current_ sort can exploit many kinds of partial order, and can often complete in linear time.

This makes picking "typical" input for timing crucial.  If you want to skew it to put sort() + index at maximum disadvantage, use only shuffled (random order) pairwise-unequal inputs.  But most streams of data do have some kinds of partial order (which is why the newer sort() implementation has been so successful), and "typical" is a much tougher thing to capture than "shuffled".
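A sketch of what "shuffled" versus "partially ordered" benchmark inputs might look like; these generators are illustrative assumptions, not the harness actually used in this issue:

```python
import random

def shuffled_input(n):
    """Pairwise-unequal items in random order: the input that puts
    sort() + index at maximum disadvantage."""
    data = list(range(n))
    random.shuffle(data)
    return data

def partially_ordered_input(n, runs=10):
    """A few internally sorted runs concatenated: the kind of partial
    order that Timsort detects and exploits."""
    data = []
    per_run = max(1, n // runs)
    while len(data) < n:
        run = sorted(random.random() for _ in range(min(per_run, n - len(data))))
        data.extend(run)
    return data
```

Timing both kinds of input (and varying the run count) would give a rough picture of how much partial order shifts the balance toward sort().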
msg219448 - (view) Author: Thomas Dybdahl Ahle (thomasahle) * Date: 2014-05-31 09:15
I think "minimize expected-case time" is a good goal. If we wanted "minimize worst-case time" we would have to use k-means rather than quickselect.

My trials on random data, where sort arguably has a disadvantage, suggest sorting is about twice as fast for most input sizes. With PyPy, quickselect is easily 5-10 times faster, which I take as a suggestion that a C implementation might be worth a try.

For designing a realistic test suite, I suppose we need to look at what tasks medians are commonly used for. I'm thinking median filters from image processing, k-medians clustering, robust regression; anything else?
msg219458 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-05-31 16:18
> I have written some proof of concept code here [1], I would appreciate
> you commenting on it, before I turn it into a patch, as I haven't 
> contributed code to Python before.

I would encourage you to consult the devguide, prepare a patch, and upload it here so that we can use rietveld to review it and add inline comments.
msg219483 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-06-01 04:13
I've run some performance tests on six variations of the O(N) select algorithm, based on Tim Peters' and Thomas Ahle's code, comparing them to the naive O(N log N) "sort first" algorithm, and sorting is consistently faster up to the limit I tested.

About the tests I ran:

- I tested four versions of Tim's median-of-median-of-k 
  algorithm, for k = 7, 23, 47 and 97.

- Thomas' "select" function, which is a median-of-median-of-3.

- Thomas' "select2" function, which uses two pivots.

- Data was randomly shuffled.

- Functions were permitted to modify the data in place, and 
  were not required to make a copy of the data first. E.g. 
  I used alist.sort() rather than sorted(alist).

- I ran two separate sets of tests. The first tested individual
  calls to the various selection functions, on random data. Each
  function got its own copy of the shuffled data. 

- The second set of tests called the selection function three
  times in a row, using different ranks, and used the average
  of the three times.
  
My test suite is attached if anyone wants to critique it or run it themselves.

Results:


== Single call mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.001    0.027    0.004    0.003    0.003    0.005    0.002
   10000    0.002    0.008    0.006    0.005    0.005    0.007    0.006
   50000    0.014    0.041    0.029    0.027    0.028    0.039    0.035
  100000    0.035    0.088    0.069    0.065    0.067    0.132    0.067
  500000    0.248    0.492    0.352    0.349    0.345    0.378    0.433
 1000000    0.551    1.008    0.768    0.669    0.723    1.007    0.627
 2000000    1.173    2.004    1.791    1.335    1.376    3.049    1.108
 3000000    1.992    3.282    2.291    2.256    2.299    2.451    1.756
 4000000    2.576    4.135    3.130    2.960    2.937    5.022    3.318
 5000000    3.568    5.233    3.914    3.504    3.629    4.912    4.458
 6000000    4.237    6.233    4.710    4.323    4.514    5.066    3.876
 7000000    4.962    7.403    5.447    5.037    5.129    7.053    7.774
 8000000    5.854    8.696    6.151    5.963    5.908    8.704    5.836
 9000000    6.749    9.540    7.078    6.869    6.985    6.354    3.834
10000000    7.667   10.944    7.621    7.322    7.439   10.092    7.112
11000000    8.400   11.966    8.566    8.284    8.112   10.511    8.184
Total elapsed time: 23.84 minutes


My conclusions from single calls:

Thomas' select() and Tim's select7() as pure Python functions are too slow for serious contention. [Aside: I wonder how PyPy would go with them?] 

There's not much difference in performance between the various median-of-median-of-k functions for larger k, but it seems to me that overall k=47 is marginally faster than either k=23 or k=97.

Overall, sorting is as good or better (and usually *much better*) than any of the pure-Python functions for the values of N tested, at least on my computer. C versions may be worth testing, but I'm afraid that is beyond me. Thomas' select2 using dual pivots seems like the most promising.

There are a couple of anomalous results where select2 unexpectedly (to me!) does much, much better than sorting, e.g. for N=9 million. Pure chance perhaps?

The overall trend seems to me to suggest that a pure-Python version of select2 may become reliably faster than sorting from N=10 million or so, at least with random data on my computer. YMMV, and I would expect that with non-random, partially sorted data, the results may be considerably different.


== Average of three calls mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.001    0.012    0.007    0.008    0.007    0.022    0.007
   10000    0.002    0.022    0.015    0.015    0.015    0.041    0.016
   50000    0.016    0.125    0.086    0.080    0.085    0.259    0.073
  100000    0.037    0.258    0.181    0.155    0.156    0.650    0.137
  500000    0.242    1.374    0.950    0.963    1.075    4.828    1.135
 1000000    0.564    2.892    1.998    1.952    2.100    5.055    1.721
 2000000    1.227    5.822    4.084    3.876    4.070   18.535    3.379
 3000000    2.034    8.825    6.264    6.256    5.798   29.206    4.851
 4000000    2.761   12.275    8.209    7.767    9.111   38.186    8.899
 5000000    3.587   14.829   10.289   10.385   10.685   53.101    8.149
 6000000    4.320   17.926   12.925   12.455   12.639   73.876   10.336
 7000000    5.237   21.504   15.221   14.740   16.167   87.315   12.254
 8000000    6.145   24.503   16.918   15.761   18.430  103.394   16.923
 9000000    6.947   26.801   19.993   18.755   20.676  106.303   16.444
10000000    8.113   30.933   21.352   20.341   20.417  102.421   16.987
11000000    9.031   33.912   24.676   23.624   22.448  114.279   18.698
Total elapsed time: 81.39 minutes


In this set of tests, each function is called three times on the same set of data. As expected, once the list is sorted on the first call, sorting it again on the second call is very fast, and so the "sort" column is quite similar to the previous set of tests.

What I didn't expect is just how badly the various other selection functions cope with being called three times on the same list with different ranks. The extreme case is Thomas' select() function. Total time to call it three times on a list of 11 million items is 342 seconds (3*114), compared to 10 seconds to call it once. I expected that having partially ordered the data on the first call, the second and third calls would take less time rather than more. Was I ever wrong. Unless my analysis is wrong, something bad is happening here, and I don't know what it is.

[Aside: this suggests that, unlike sort() which can take advantage of partially ordered data to be more efficient, the other selection functions are hurt by partially ordered data. Is this analogous to simple versions of Quicksort which degrade to O(N**2) if the data is already sorted?]
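One possible mechanism behind that aside (an assumption on my part, not a diagnosis of the attached functions): a selection routine whose pivot choice correlates with the data's existing order degrades toward O(N**2) on sorted or partially partitioned input, exactly like naive quicksort, whereas a randomized pivot avoids that particular failure mode. A minimal sketch:

```python
import random

def quickselect(data, k, randomize=True):
    """In-place expected-O(n) selection using Lomuto partitioning.

    With randomize=False the pivot is always the first element of the
    active slice, which degrades to O(n**2) on already-sorted input;
    a random pivot keeps the expected cost linear regardless of order.
    """
    lo, hi = 0, len(data) - 1
    while lo < hi:
        p = random.randint(lo, hi) if randomize else lo
        data[lo], data[p] = data[p], data[lo]
        pivot = data[lo]
        i = lo
        for j in range(lo + 1, hi + 1):
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]  # pivot lands at its final index
        if k == i:
            return data[k]
        if k < i:
            hi = i - 1
        else:
            lo = i + 1
    return data[lo]
```

Both variants return the same answer; only the amount of partitioning work differs, which is consistent with the slowdowns reported above being algorithmic rather than a bug.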

What is abundantly clear is that if you want to make more than one selection from a list, you ought to sort it first.


Given these results, I do not believe that a pure-Python implementation of any of these selection algorithms can be justified on performance grounds for CPython.

Thanks to Tim Peters and Thomas Ahle for their valuable assistance in writing the selection functions in the first place.
msg219485 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2014-06-01 04:39
I ran the script (modified very slightly for python 2 support) under PyPy 2.3:

$ pypy  select.py
== Single call mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.000    0.010    0.000    0.000    0.000    0.003    0.003
   10000    0.000    0.001    0.001    0.001    0.001    0.000    0.000
   50000    0.002    0.007    0.004    0.002    0.002    0.000    0.000
  100000    0.004    0.010    0.004    0.004    0.005    0.000    0.001
  500000    0.026    0.030    0.019    0.020    0.024    0.004    0.004
 1000000    0.057    0.052    0.037    0.039    0.044    0.007    0.004
 2000000    0.113    0.092    0.069    0.078    0.087    0.017    0.014
 3000000    0.176    0.135    0.109    0.119    0.136    0.024    0.013
 4000000    0.243    0.180    0.137    0.162    0.177    0.024    0.022
 5000000    0.298    0.225    0.176    0.196    0.221    0.035    0.024
 6000000    0.373    0.266    0.207    0.236    0.266    0.051    0.038
 7000000    0.439    0.313    0.248    0.277    0.311    0.054    0.038
 8000000    0.506    0.360    0.282    0.317    0.356    0.039    0.039
 9000000    0.566    0.404    0.315    0.352    0.406    0.055    0.068
10000000    0.626    0.445    0.349    0.395    0.444    0.065    0.046
11000000    0.697    0.492    0.387    0.439    0.490    0.059    0.086
Total elapsed time: 0.96 minutes



$ pypy select.py
== Average of three calls mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.000    0.010    0.001    0.001    0.004    0.003    0.002
   10000    0.000    0.005    0.001    0.001    0.002    0.000    0.000
   50000    0.002    0.014    0.006    0.006    0.008    0.002    0.001
  100000    0.005    0.018    0.012    0.011    0.016    0.002    0.001
  500000    0.026    0.071    0.051    0.060    0.076    0.019    0.007
 1000000    0.055    0.135    0.102    0.124    0.148    0.046    0.013
 2000000    0.115    0.257    0.208    0.244    0.291    0.092    0.027
 3000000    0.181    0.396    0.301    0.347    0.383    0.097    0.044
 4000000    0.243    0.530    0.417    0.485    0.559    0.127    0.050
 5000000    0.312    0.656    0.522    0.570    0.688    0.172    0.072
 6000000    0.377    0.789    0.610    0.688    0.772    0.215    0.072
 7000000    0.448    0.927    0.715    0.850    0.978    0.315    0.087
 8000000    0.510    1.049    0.812    0.967    1.193    0.403    0.108
 9000000    0.575    1.191    0.929    1.088    1.241    0.462    0.107
10000000    0.641    1.298    1.070    1.217    1.310    0.470    0.128
11000000    0.716    1.464    1.121    1.343    1.517    0.401    0.147
Total elapsed time: 2.21 minutes
msg219492 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-06-01 07:27
I tried the same with a Cython-compiled version of select.py on the latest CPython 3.5 build. It shows pretty clearly that select2 is almost always faster than sorting, by a factor of 2-5x or so. I'll also attach the annotated source file that Cython generates.

*** CPython 3.5 (de01f7c37b53)

== Single call mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.000    0.001    0.001    0.001    0.001    0.001    0.001
   10000    0.001    0.003    0.002    0.002    0.002    0.003    0.002
   50000    0.005    0.015    0.010    0.010    0.010    0.013    0.008
  100000    0.012    0.032    0.023    0.023    0.023    0.027    0.017
  500000    0.085    0.174    0.131    0.129    0.155    0.167    0.103
 1000000    0.190    0.375    0.300    0.272    0.311    0.456    0.292
 2000000    0.422    0.828    0.588    0.579    0.689    0.865    0.560
 3000000    0.680    1.187    0.918    0.906    0.915    1.427    0.801
 4000000    0.948    1.574    1.180    1.146    1.177    1.659    1.004
 5000000    1.253    2.027    1.684    1.523    1.598    1.874    1.085
 6000000    1.577    2.441    1.892    1.754    1.787    2.659    1.055
 7000000    1.934    2.870    2.128    2.062    2.093    3.289    1.274
 8000000    2.279    3.304    2.430    2.421    2.471    2.569    2.449
 9000000    2.560    3.767    2.835    2.768    2.771    3.089    2.348
10000000    2.790    4.123    3.153    3.044    3.097    4.366    3.764
11000000    3.199    4.605    3.658    3.467    3.383    3.867    4.599
Total elapsed time: 9.13 minutes


*** Cython / CPython 3.5

== Single call mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.000    0.001    0.000    0.000    0.001    0.000    0.000
   10000    0.001    0.001    0.001    0.001    0.001    0.000    0.000
   50000    0.006    0.006    0.005    0.005    0.006    0.001    0.001
  100000    0.013    0.014    0.011    0.012    0.013    0.004    0.004
  500000    0.089    0.091    0.073    0.075    0.079    0.048    0.049
 1000000    0.200    0.192    0.156    0.158    0.194    0.081    0.073
 2000000    0.451    0.417    0.324    0.355    0.404    0.210    0.135
 3000000    0.678    0.614    0.496    0.501    0.530    0.291    0.277
 4000000    0.980    0.835    0.720    0.680    0.718    0.402    0.441
 5000000    1.276    1.197    0.846    0.857    0.905    0.491    0.425
 6000000    1.535    1.274    1.067    1.040    1.087    0.534    0.451
 7000000    1.842    1.500    1.226    1.214    1.279    0.549    0.507
 8000000    2.168    1.726    1.384    1.398    1.491    0.557    0.535
 9000000    2.438    1.987    1.566    1.582    1.660    0.966    0.544
10000000    2.768    2.187    1.747    1.773    1.911    1.116    0.640
11000000    3.116    2.417    1.922    1.950    2.063    1.283    1.024
Total elapsed time: 5.48 minutes
msg219493 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-06-01 07:57
Here's also the pathological "average of three calls" case. As Steven suggests, it shows that select() suffers quite heavily (algorithmically), but select2() also suffers enough to jump up to about 2/3 of the runtime of sorting (so it's still 1/3 faster even here). This suggests select2 should be much faster on random data (run 1) and about as fast as sorting on already-sorted data (runs 2+3). Not unexpected, given the algorithmic characteristics of Timsort.

== Average of three calls mode ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.000    0.002    0.001    0.001    0.002    0.001    0.000
   10000    0.001    0.004    0.003    0.003    0.003    0.001    0.001
   50000    0.006    0.018    0.017    0.016    0.016    0.011    0.003
  100000    0.013    0.037    0.029    0.031    0.037    0.016    0.009
  500000    0.091    0.246    0.204    0.216    0.227    0.218    0.057
 1000000    0.205    0.535    0.443    0.434    0.459    0.530    0.156
 2000000    0.458    1.137    0.917    0.922    1.052    2.124    0.328
 3000000    0.734    1.743    1.448    1.510    1.607    2.805    0.500
 4000000    1.010    2.400    1.888    2.029    2.157    3.039    0.655
 5000000    1.278    3.021    2.458    2.404    2.717    4.789    0.853
 6000000    1.571    3.629    2.873    3.094    3.279    4.136    1.030
 7000000    1.884    4.258    3.520    3.621    3.530    7.788    1.312
 8000000    2.198    4.977    4.042    4.175    4.080    9.035    1.446
 9000000    2.525    5.555    4.539    4.723    4.633   10.933    1.608
10000000    2.844    6.345    4.929    5.035    5.588   10.398    1.852
11000000    3.194    7.081    5.822    5.973    6.183    8.291    2.111
Total elapsed time: 13.33 minutes
msg219496 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2014-06-01 10:54
Updating the type declaration file to remove the dependency on the list builtin and allow arbitrary containers. The test code has this dependency (calls a.sort()), but the current statistics module in the stdlib does not (calls sorted()). Timings do not change, at least not more than you would expect by randomisation (i.e. repeated runs go up and down within the same bounds). Note that the timings are still specific to lists and would be higher for other containers.
msg219498 - (view) Author: Julian Taylor (jtaylor) Date: 2014-06-01 12:20
in the case of the median you can achieve performance similar to a multiselect by simply calling min() on the partition left above the first selected element, which gives you the second order statistic you need for averaging an even number of elements.
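The minimum trick described above can be sketched like this; the in-place select helper is an illustrative assumption (a plain Lomuto quickselect), not numpy's implementation or any of the attached patches:

```python
import random

def _select_inplace(data, k):
    """Minimal in-place quickselect: afterwards data[k] is the k-th
    smallest, and every element of data[k+1:] is >= data[k]."""
    lo, hi = 0, len(data) - 1
    while lo < hi:
        p = random.randint(lo, hi)
        data[lo], data[p] = data[p], data[lo]
        pivot = data[lo]
        i = lo
        for j in range(lo + 1, hi + 1):
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]
        if k == i:
            break
        if k < i:
            hi = i - 1
        else:
            lo = i + 1
    return data[k]

def median_with_min_trick(data):
    """Median using one selection plus min() instead of a second selection."""
    data = list(data)
    n = len(data)
    if n % 2 == 1:
        return _select_inplace(data, n // 2)
    lower = _select_inplace(data, n // 2 - 1)  # partitions data in place
    upper = min(data[n // 2:])                 # cheaper than a second select
    return (lower + upper) / 2
```

The min() pass is a single O(n/2) scan over the upper partition, so the even-length case stays close to the cost of one selection.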

Maybe an interesting datapoint would be to compare with numpy's selection algorithm, which is an introselect-based multiselect (implemented in C for native datatypes).
It uses a standard median-of-3 quickselect with a cutoff in recursion depth, falling back to median-of-medians with groups of 5.
The multiselect is implemented by keeping a sorted list of the requested kth order statistics and reducing the search space for each one by maintaining a stack of all visited pivots.
E.g. if you search for 30 and 100, and during the search for 30 you visited pivots 70 and 110, the search for 100 only needs to select in l[70:110].
The not particularly readable implementation is in: ./numpy/core/src/npysort/selection.c.src
Unfortunately, for object types it currently falls back to quicksort, so you can't directly compare its performance with the pure Python variants.
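The pivot-stack idea could be sketched in pure Python like this; this is a simplified illustration of the technique Julian describes, not numpy's C implementation. Every pivot finalized during partitioning is remembered in a sorted list, so each later search is confined to the gap between the nearest already-finalized positions:

```python
import bisect
import random

def multiselect(data, ks):
    """Return {k: k-th smallest} for several ranks, narrowing each search
    to the gap between previously finalized pivot positions."""
    data = list(data)
    finalized = [-1, len(data)]          # sentinel "pivot" positions
    result = {}
    for k in sorted(ks):
        # Confine [lo, hi] to the gap between the nearest finalized pivots.
        i = bisect.bisect_left(finalized, k)
        lo, hi = finalized[i - 1] + 1, finalized[i] - 1
        while lo < hi:
            p = random.randint(lo, hi)
            data[lo], data[p] = data[p], data[lo]
            pivot = data[lo]
            m = lo
            for j in range(lo + 1, hi + 1):
                if data[j] < pivot:
                    m += 1
                    data[m], data[j] = data[j], data[m]
            data[lo], data[m] = data[m], data[lo]
            bisect.insort(finalized, m)  # remember every visited pivot
            if k == m:
                break
            if k < m:
                hi = m - 1
            else:
                lo = m + 1
        result[k] = data[k]
    return result
```

After the first rank is found, later ranks typically touch only a small remaining slice, which is the whole point of the pivot stack.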
msg219569 - (view) Author: Thomas Dybdahl Ahle (thomasahle) * Date: 2014-06-02 11:54
I don't know if it's worth the overhead to implement a multiselect, given we only expose a median function.

I've rewritten select2 to be intro, just falling back on sorting. This doesn't appear to degrade the performance.

I also added np.median to the test suite. And it is indeed pretty snappy. Though not more so than select2 under PyPy. There is a discussion here: https://github.com/numpy/numpy/issues/1811
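An "intro" select of the kind described, falling back on sorting when pivoting makes too little progress, might look roughly like this; a sketch under that assumption, not the attached select2b:

```python
import random

def introselect(data, k, max_depth=None):
    """Quickselect with a pivoting-round cap: past the cap, fall back to
    sorting the remaining slice, bounding the worst case at O(m log m)."""
    data = list(data)
    if max_depth is None:
        max_depth = 2 * len(data).bit_length()
    lo, hi = 0, len(data) - 1
    depth = 0
    while lo < hi:
        if depth >= max_depth:
            # Escape hatch: sort only the still-active slice.
            data[lo:hi + 1] = sorted(data[lo:hi + 1])
            return data[k]
        depth += 1
        p = random.randint(lo, hi)
        data[lo], data[p] = data[p], data[lo]
        pivot = data[lo]
        i = lo
        for j in range(lo + 1, hi + 1):
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]
        if k == i:
            return data[k]
        if k < i:
            hi = i - 1
        else:
            lo = i + 1
    return data[lo]
```

Falling back to sort() rather than median-of-medians keeps the fallback path on CPython's fastest primitive, which matches the observation that this doesn't degrade performance.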

== Single call mode ==
N        sort     select7  select23 select47 select97 select   select2  select2b np.median
-------- -------- -------- -------- -------- -------- -------- -------- -------- ---------
    5000    0.002    0.006    0.004    0.004    0.004    0.008    0.003    0.003    0.000
   10000    0.004    0.011    0.008    0.008    0.008    0.014    0.007    0.007    0.001
   50000    0.025    0.057    0.044    0.041    0.043    0.054    0.028    0.028    0.005
  100000    0.055    0.117    0.087    0.085    0.089    0.137    0.079    0.080    0.014
  500000    0.366    0.635    0.474    0.467    0.485    0.534    0.445    0.446    0.105
 1000000    0.802    1.321    1.001    0.985    1.012    1.392    0.936    0.920    0.216
 2000000    1.833    2.666    2.020    1.989    2.040    3.039    1.815    1.821    0.468
 3000000    2.829    4.039    3.034    2.980    3.116    3.191    2.622    2.634    0.704
 4000000    4.013    5.653    4.275    4.284    4.209    6.200    3.715    3.755    0.998
 5000000    5.192    6.888    5.137    5.029    5.201    5.826    5.047    5.084    1.271
msg219934 - (view) Author: Julian Taylor (jtaylor) Date: 2014-06-07 13:02
for the median alone a multiselect is probably overkill (that's why I mentioned the minimum trick)

but a selection algorithm is useful on its own for all of Python, and then a multiselect should be considered.
Of course, that means it would need to be implemented in C like sorted(), so you actually get a performance gain significant enough to make adding a new Python function worthwhile.

Also, just to save numpy's honor: in your script's numpy comparison you are benchmarking the Python list -> numpy array conversion, not the actual selection. The conversion is significantly slower than the selection itself. Also, select2b is in-place while np.partition is out-of-place. Repeated in-place selection typically gets faster as the number of required swaps goes down, and can even become constant-time if the requested value does not change.
With that fixed, numpy outperforms PyPy by about a factor of 2 (but PyPy's performance is indeed quite impressive, as it is far more generic).
msg219940 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2014-06-07 14:17
On Sat, Jun 07, 2014 at 01:02:52PM +0000, Julian Taylor wrote:
> but a selection algorithm is useful on its own for all of python and 
> then a multiselect should be considered.

I like the idea of a select and/or multiselect for 3.5. As a 
new feature, it cannot go into 3.4.
msg257394 - (view) Author: Upendra Kumar (upendra-k14) * Date: 2016-01-03 03:28
This is my first attempt to contribute. I have used median-of-medians with n/5 groups of 5 elements each. It has linear time complexity, but I am still not sure about it, because the constants may be high. Could anyone please benchmark it against the other methods, or check whether it is a viable way to find the median more efficiently?

I would like it to be checked before I attempt to convert it into a patch.
msg257395 - (view) Author: Upendra Kumar (upendra-k14) * Date: 2016-01-03 03:33
This method can also be further used to implement multiselect functionality with slight changes in the nth-order function.
msg257409 - (view) Author: Julian Taylor (jtaylor) Date: 2016-01-03 11:58
Median-of-medians-of-5 is quite significantly slower than a quickselect.
numpy implements an introselect, which uses quickselect but falls back to median-of-medians-of-5 if not enough progress is made.
In the numpy implementation, for a 100000-element median (a multiselect with 2 selections: one median, one min), quickselect is around 3 times faster than median-of-medians-of-5.
msg309909 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-01-14 01:33
Everyone here should heed Tim's comments.  The statistics module already has a history of suboptimal decisions made in the service of theoretical perfection (e.g. mean(seq) is over 100x slower than fsum(seq)/len(seq)).

While variants of quickselect have a nice O(n) theoretical time, the variability is very high and the worst cases are really bad. The existing sort() is unbelievably fast, has a reasonable worst case, exploits existing order to great advantage, has nice cache performance, and has become faster still with the recently added type-specialized comparisons.  This sets a very high bar for any proposed patches.
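For reference, the fsum-based mean Raymond compares against is a one-liner; math.fsum tracks partial sums exactly for float inputs, which is what makes it a fair fast path (statistics.mean pays extra to handle Fractions, Decimals, and mixed types exactly):

```python
from math import fsum

def float_mean(seq):
    """Fast, accurately-rounded mean for float data: fsum() avoids the
    accumulated rounding error of a plain sum()."""
    seq = list(seq)
    if not seq:
        raise ValueError("mean of empty data")
    return fsum(seq) / len(seq)
```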
msg316233 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-05-06 10:43
How does the performance change with this patch?

Quickselect is a nice idea in theory, but unless it is written in C, it is unlikely to beat sorting the list unless you have HUGE data sets. It's been nearly four years since I last did some benchmarks, but at the time there was no comparison: sorting was clearly much better (although Stefan found that a Cython-compiled select was faster than sorting).

In particular, all the quickselect versions I tested suffered catastrophic performance slowdowns if the data was already sorted: anything from double to ten times the time.
msg316270 - (view) Author: Dong-hee Na (corona10) * Date: 2018-05-07 13:25
For CPython, I agree that finding the median with Timsort (the C version) is faster than a quickselect algorithm in pure Python; otherwise we would need to implement quickselect via the C API.

The advantage of implementing this method with quickselect in pure Python would be for PyPy or other compatible Python implementations.

Considering that CPython's standard library is also used by those implementations, the approach is worth considering from their performance point of view. However, for CPython itself, we would have to take a performance hit.

Here is the benchmark result with pypy3 on MacOS

N        sort     select7  select23 select47 select97 select   select2  PR6715
-------- -------- -------- -------- -------- -------- -------- -------  -------
    5000    0.000    0.003    0.000    0.000    0.000    0.001    0.001    0.001
   10000    0.000    0.001    0.001    0.001    0.001    0.000    0.000    0.001
   50000    0.002    0.005    0.004    0.002    0.002    0.000    0.000    0.000
  100000    0.005    0.008    0.004    0.004    0.005    0.001    0.001    0.001
  500000    0.027    0.021    0.016    0.019    0.021    0.004    0.003    0.003
 1000000    0.054    0.035    0.030    0.037    0.041    0.006    0.007    0.008
 2000000    0.104    0.069    0.063    0.074    0.084    0.013    0.013    0.015
 3000000    0.162    0.105    0.091    0.109    0.122    0.017    0.018    0.017
 4000000    0.219    0.137    0.122    0.148    0.167    0.024    0.016    0.024
 5000000    0.275    0.170    0.156    0.182    0.204    0.030    0.028    0.017
 6000000    0.336    0.210    0.181    0.220    0.247    0.030    0.039    0.040
 7000000    0.447    0.283    0.242    0.292    0.329    0.056    0.055    0.043
 8000000    0.503    0.309    0.248    0.313    0.338    0.051    0.049    0.052
 9000000    0.592    0.359    0.323    0.353    0.421    0.065    0.047    0.082
10000000    0.643    0.398    0.357    0.412    0.433    0.061    0.070    0.047
11000000    0.700    0.387    0.365    0.454    0.459    0.087    0.075    0.077
History
Date User Action Args
2018-05-07 13:25:03  corona10        set     nosy: + corona10; messages: + msg316270
2018-05-06 10:43:45  steven.daprano  set     messages: + msg316233
2018-05-06 06:30:25  corona10        set     pull_requests: + pull_request6408
2018-01-14 01:33:08  rhettinger      set     nosy: + rhettinger; messages: + msg309909
2018-01-14 00:00:54  python-dev      set     keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request5029
2017-08-21 18:44:28  samnaughtonb    set     pull_requests: + pull_request3207
2016-01-03 11:58:03  jtaylor         set     messages: + msg257409
2016-01-03 03:33:01  upendra-k14     set     messages: + msg257395
2016-01-03 03:28:47  upendra-k14     set     files: + effmedian.py; nosy: + upendra-k14; messages: + msg257394
2014-06-07 14:17:15  steven.daprano  set     messages: + msg219940
2014-06-07 13:02:52  jtaylor         set     messages: + msg219934
2014-06-02 11:54:09  thomasahle      set     files: + select2.py; messages: + msg219569
2014-06-01 12:20:30  jtaylor         set     nosy: + jtaylor; messages: + msg219498
2014-06-01 10:55:55  scoder          set     files: + select.html
2014-06-01 10:54:37  scoder          set     files: + select.pxd; messages: + msg219496
2014-06-01 07:57:19  scoder          set     messages: + msg219493
2014-06-01 07:28:32  scoder          set     files: + select.html
2014-06-01 07:27:54  scoder          set     files: + select.pxd; nosy: + scoder; messages: + msg219492
2014-06-01 04:39:07  alex            set     nosy: + alex; messages: + msg219485
2014-06-01 04:13:55  steven.daprano  set     files: + select.py; messages: + msg219483
2014-05-31 16:18:20  ezio.melotti    set     nosy: + ezio.melotti; messages: + msg219458
2014-05-31 09:15:27  thomasahle      set     messages: + msg219448
2014-05-31 02:53:09  tim.peters      set     nosy: + tim.peters; messages: + msg219437
2014-05-30 19:44:14  terry.reedy     set     nosy: + terry.reedy; messages: + msg219419
2014-05-30 12:43:38  thomasahle      set     messages: + msg219387
2014-05-29 03:25:38  vajrasky        set     nosy: + vajrasky; messages: + msg219329
2014-05-28 23:49:25  steven.daprano  set     messages: + msg219324
2014-05-28 14:43:29  thomasahle      set     messages: + msg219271
2014-05-28 13:10:29  zach.ware       set     nosy: + steven.daprano; stage: needs patch
2014-05-28 12:11:37  thomasahle      create