Message 40672 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients
Date	2002-07-31.06:02:48
SpamBayes Score
Marked as misclassified
Message-id
In-reply-to

Content
Logged In: YES user_id=31435 ~sort gets more mysterious all the time: the mystery now is why it's not much slower everywhere! Here are the exact # of compares ~sort does: i n sort msort %ch lg(n!) -- ------ ------- ------- ----- -------- 15 32768 130484 188720 44.63 444255 16 65536 260019 377634 45.23 954037 17 131072 555035 755476 36.11 2039137 18 262144 1107826 1511174 36.41 4340409 19 524288 2218562 3022584 36.24 9205096 20 1048576 4430616 6045418 36.45 19458756 The last column is the information-theoretic lower bound for sorting random arrays of this size (no comparison-based algorithm can do better than than on average), showing that sort() and msort() are both getting a lot of good out of the duplicates. But sort()'s special case for equal elements is extremely effective on ~sort's specific data pattern, and msort just isn't going to get close to that (it does better than sort() on skewed distributions with lots of duplicates, though). The only thing I can think of that could transform what "should be" highly significant slowdowns into highly significant speedups on some boxes are catastrophic cache effects in samplesort. But knowing something about how both algorithms work <wink>, that's not screaming "oh, of course".

Logged In: YES 
user_id=31435

~sort gets more mysterious all the time:  the mystery now 
is why it's *not* much slower everywhere!  Here are the 
exact # of compares ~sort does:

i        n     sort    msort    %ch    lg(n!)
--  ------  -------  -------  -----  --------
15   32768   130484   188720  44.63    444255
16   65536   260019   377634  45.23    954037
17  131072   555035   755476  36.11   2039137
18  262144  1107826  1511174  36.41   4340409
19  524288  2218562  3022584  36.24   9205096
20 1048576  4430616  6045418  36.45  19458756

The last column is the information-theoretic lower bound for 
sorting random arrays of this size (no comparison-based 
algorithm can do better than than on average), showing 
that sort() and msort() are both getting a lot of good out of 
the duplicates.  But sort()'s special case for equal 
elements is extremely effective on ~sort's specific data 
pattern, and msort just isn't going to get close to that (it 
does better than sort() on skewed distributions with lots of 
duplicates, though).

The only thing I can think of that could transform 
what "should be" highly significant slowdowns into highly 
significant speedups on some boxes are catastrophic 
cache effects in samplesort.  But knowing something about 
how both algorithms work <wink>, that's not screaming "oh, 
of course".

History
Date	User	Action	Args
2007-08-23 15:14:17	admin	link	issue587076 messages
2007-08-23 15:14:17	admin	create