Message 219483 - Python tracker



Author	steven.daprano
Recipients	ezio.melotti, steven.daprano, terry.reedy, thomasahle, tim.peters, vajrasky
Date	2014-06-01.04:13:44
Message-id	<1401596035.13.0.754766209333.issue21592@psf.upfronthosting.co.za>

Content

I've run some performance tests on six variations of the O(N) select algorithm, based on Tim Peters' and Thomas Ahle's code, comparing them to the naive O(N log N) "sort first" algorithm, and sorting is consistently faster up to the limit I tested.

About the tests I ran:

- I tested four versions of Tim's median-of-median-of-k 
  algorithm, for k = 7, 23, 47 and 97.

- Thomas' "select" function, which is a median-of-median-of-3.

- Thomas' "select2" function, which uses two pivots.

- Data was randomly shuffled.

- Functions were permitted to modify the data in place, and 
  were not required to make a copy of the data first. E.g. 
  I used alist.sort() rather than sorted(alist).

- I ran two separate sets of tests. The first tested individual
  calls to the various selection functions, on random data. Each
  function got its own copy of the shuffled data. 

- The second set of tests called the selection function three
  times in a row, using different ranks, and used the average
  of the three times.
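
A minimal sketch of the two timing modes described above. This is not the attached test suite; the names select_by_sorting, time_single_call and time_three_calls are hypothetical, and only the "sort first" baseline is included so that the example stays self-contained.

```python
import random
import time


def select_by_sorting(data, rank):
    """The "sort first" baseline: sort in place, then index."""
    data.sort()
    return data[rank]


def time_single_call(func, data, rank):
    """Time one call; func works on its own copy of the shuffled data."""
    work = list(data)
    start = time.perf_counter()
    result = func(work, rank)
    return time.perf_counter() - start, result


def time_three_calls(func, data, ranks):
    """Call func three times on the same list and average the time."""
    work = list(data)  # copied once, then mutated by every call
    start = time.perf_counter()
    for rank in ranks:
        func(work, rank)
    return (time.perf_counter() - start) / len(ranks)


n = 100_000  # far smaller than the sizes reported below
data = list(range(n))
random.shuffle(data)
single, value = time_single_call(select_by_sorting, data, n // 2)
average = time_three_calls(select_by_sorting, data, [n // 4, n // 2, 3 * n // 4])
print(f"single call: {single:.3f}s, average of three: {average:.3f}s")
```

Any other selection function with the same (data, rank) signature could be passed in place of select_by_sorting.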
  
My test suite is attached if anyone wants to critique it or run it themselves.

Results:


== Single call mode (times in seconds) ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.001    0.027    0.004    0.003    0.003    0.005    0.002
   10000    0.002    0.008    0.006    0.005    0.005    0.007    0.006
   50000    0.014    0.041    0.029    0.027    0.028    0.039    0.035
  100000    0.035    0.088    0.069    0.065    0.067    0.132    0.067
  500000    0.248    0.492    0.352    0.349    0.345    0.378    0.433
 1000000    0.551    1.008    0.768    0.669    0.723    1.007    0.627
 2000000    1.173    2.004    1.791    1.335    1.376    3.049    1.108
 3000000    1.992    3.282    2.291    2.256    2.299    2.451    1.756
 4000000    2.576    4.135    3.130    2.960    2.937    5.022    3.318
 5000000    3.568    5.233    3.914    3.504    3.629    4.912    4.458
 6000000    4.237    6.233    4.710    4.323    4.514    5.066    3.876
 7000000    4.962    7.403    5.447    5.037    5.129    7.053    7.774
 8000000    5.854    8.696    6.151    5.963    5.908    8.704    5.836
 9000000    6.749    9.540    7.078    6.869    6.985    6.354    3.834
10000000    7.667   10.944    7.621    7.322    7.439   10.092    7.112
11000000    8.400   11.966    8.566    8.284    8.112   10.511    8.184
Total elapsed time: 23.84 minutes


My conclusions from single calls:

Thomas' select() and Tim's select7() as pure Python functions are too slow for serious contention. [Aside: I wonder how PyPy would go with them?] 

There's not much difference in performance between the various median-of-median-of-k functions for larger k, but it seems to me that overall k=47 is marginally faster than either k=23 or k=97.
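
For readers unfamiliar with the idea, here is a rough sketch of selection with a median-of-medians-of-k pivot. It is an illustration only, not Tim's code: the names approx_median and select_mom are hypothetical, and unlike the functions tested it builds new lists rather than working in place.

```python
def _middle(chunk):
    """Return the middle element of a small chunk by sorting it."""
    chunk = sorted(chunk)
    return chunk[len(chunk) // 2]


def approx_median(values, k):
    """Repeatedly replace the data by the medians of chunks of size k."""
    while len(values) > k:
        values = [_middle(values[i:i + k]) for i in range(0, len(values), k)]
    return _middle(values)


def select_mom(data, rank, k=7):
    """Return the item that would be at index rank if data were sorted."""
    if not 0 <= rank < len(data):
        raise IndexError("rank out of range")
    while True:
        pivot = approx_median(data, k)
        smaller = [x for x in data if x < pivot]
        equal = sum(1 for x in data if x == pivot)
        if rank < len(smaller):
            data = smaller  # answer lies below the pivot
        elif rank < len(smaller) + equal:
            return pivot  # answer is the pivot itself
        else:
            rank -= len(smaller) + equal
            data = [x for x in data if x > pivot]  # answer lies above
```

A larger k means fewer but more expensive chunk sorts per round, which is the trade-off the k = 7, 23, 47 and 97 columns explore.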

Overall, sorting is as good as or better than (and usually *much better* than) any of the pure-Python functions for the values of N tested, at least on my computer. C versions may be worth testing, but I'm afraid that is beyond me. Thomas' select2 using dual pivots seems like the most promising.
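
To illustrate what selecting with two pivots can look like, here is a hedged sketch. It is not Thomas' select2: the name select_two_pivots, the random choice of pivots and the cutoff of 16 are all assumptions made for the example, and it copies data instead of partitioning in place.

```python
import random


def select_two_pivots(data, rank):
    """Return the item that would be at index rank if data were sorted."""
    if not 0 <= rank < len(data):
        raise IndexError("rank out of range")
    while len(data) > 16:  # assumed cutoff; small inputs are just sorted
        p, q = sorted(random.sample(data, 2))  # two pivots with p <= q
        low = [x for x in data if x < p]
        mid = [x for x in data if p < x < q]
        high = [x for x in data if x > q]
        count_p = data.count(p)
        count_q = data.count(q) if q != p else 0
        # Walk through the five regions in sorted order.
        if rank < len(low):
            data = low
            continue
        rank -= len(low)
        if rank < count_p:
            return p
        rank -= count_p
        if rank < len(mid):
            data = mid
            continue
        rank -= len(mid)
        if rank < count_q:
            return q
        rank -= count_q
        data = high
    return sorted(data)[rank]
```

Two pivots split the data three ways, so on random input each pass tends to discard more of the list than a single pivot would.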

There are a couple of anomalous results where select2 unexpectedly (to me!) does much, much better than sorting, e.g. for N=9 million. Pure chance perhaps?

The overall trend seems to me to suggest that a pure-Python version of select2 may become reliably faster than sorting from N=10 million or so, at least with random data on my computer. YMMV, and I would expect that with non-random, partially sorted data, the results may be considerably different.


== Average of three calls mode (times in seconds) ==
N        sort     select7  select23 select47 select97 select   select2
-------- -------- -------- -------- -------- -------- -------- --------
    5000    0.001    0.012    0.007    0.008    0.007    0.022    0.007
   10000    0.002    0.022    0.015    0.015    0.015    0.041    0.016
   50000    0.016    0.125    0.086    0.080    0.085    0.259    0.073
  100000    0.037    0.258    0.181    0.155    0.156    0.650    0.137
  500000    0.242    1.374    0.950    0.963    1.075    4.828    1.135
 1000000    0.564    2.892    1.998    1.952    2.100    5.055    1.721
 2000000    1.227    5.822    4.084    3.876    4.070   18.535    3.379
 3000000    2.034    8.825    6.264    6.256    5.798   29.206    4.851
 4000000    2.761   12.275    8.209    7.767    9.111   38.186    8.899
 5000000    3.587   14.829   10.289   10.385   10.685   53.101    8.149
 6000000    4.320   17.926   12.925   12.455   12.639   73.876   10.336
 7000000    5.237   21.504   15.221   14.740   16.167   87.315   12.254
 8000000    6.145   24.503   16.918   15.761   18.430  103.394   16.923
 9000000    6.947   26.801   19.993   18.755   20.676  106.303   16.444
10000000    8.113   30.933   21.352   20.341   20.417  102.421   16.987
11000000    9.031   33.912   24.676   23.624   22.448  114.279   18.698
Total elapsed time: 81.39 minutes


In this set of tests, each function is called three times on the same set of data. As expected, once the list is sorted on the first call, sorting it again on the second call is very fast, and so the "sort" column is quite similar to the previous set of tests.

What I didn't expect is just how badly the various other selection functions cope with being called three times on the same list with different ranks. The extreme case is Thomas' select() function. Total time to call it three times on a list of 11 million items is about 342 seconds (3*114), compared to about 10.5 seconds to call it once. I expected that having partially ordered the data on the first call, the second and third calls would take less time rather than more. Was I ever wrong. Unless my analysis is wrong, something bad is happening here, and I don't know what it is.

[Aside: this suggests that, unlike sort() which can take advantage of partially ordered data to be more efficient, the other selection functions are hurt by partially ordered data. Is this analogous to simple versions of Quicksort which degrade to O(N**2) if the data is already sorted?]
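
The Quicksort analogy in the aside can be made concrete with a deliberately naive selection function that always uses the first item as its pivot. This is only an illustration of that textbook degradation; it is not a diagnosis of the functions measured above, which choose their pivots differently. The name quickselect_first_pivot is hypothetical.

```python
import random


def quickselect_first_pivot(data, rank):
    """Naive selection; returns (value, total number of items scanned)."""
    data = list(data)
    scanned = 0
    while True:
        pivot = data[0]  # worst possible choice on already-sorted input
        scanned += len(data)
        smaller = [x for x in data if x < pivot]
        equal = sum(1 for x in data if x == pivot)
        if rank < len(smaller):
            data = smaller
        elif rank < len(smaller) + equal:
            return pivot, scanned
        else:
            rank -= len(smaller) + equal
            data = [x for x in data if x > pivot]


n = 2000
ordered = list(range(n))
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

# Ask for the largest item: sorted input strips off one item per pass.
_, sorted_cost = quickselect_first_pivot(ordered, n - 1)
_, shuffled_cost = quickselect_first_pivot(shuffled, n - 1)
print(sorted_cost, shuffled_cost)
```

On the sorted list every pass removes only the pivot, so the work is n + (n-1) + ... + 1 = n*(n+1)/2, which is quadratic; on shuffled input the list shrinks much faster.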

What is abundantly clear is that if you want to make more than one selection from a list, you ought to sort it first.
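
In code, that advice amounts to something like the following sketch (select_many is a hypothetical helper name): pay for one sort, then every further selection is a plain index lookup.

```python
def select_many(data, ranks):
    """Sort once, then read off the item at each requested rank."""
    ordered = sorted(data)  # use data.sort() instead if in-place is fine
    return [ordered[rank] for rank in ranks]


values = [9, 3, 7, 1, 8, 2, 6]
n = len(values)
smallest, middle, largest = select_many(values, [0, n // 2, n - 1])
print(smallest, middle, largest)
```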


Given these results, I do not believe that a pure-Python implementation of any of these selection algorithms can be justified on performance grounds for CPython.

Thanks to Tim Peters and Thomas Ahle for their valuable assistance in writing the selection functions in the first place.

History
Date	User	Action	Args
2014-06-01 04:13:56	steven.daprano	set	recipients: + steven.daprano, tim.peters, terry.reedy, ezio.melotti, thomasahle, vajrasky
2014-06-01 04:13:55	steven.daprano	set	messageid: <1401596035.13.0.754766209333.issue21592@psf.upfronthosting.co.za>
2014-06-01 04:13:55	steven.daprano	link	issue21592 messages
2014-06-01 04:13:49	steven.daprano	create