Issue 9915: speeding up sorting with a key

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Unsupported provider

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54124

classification

Title:	speeding up sorting with a key
Type:	performance	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.2

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	stutzbach	Nosy List:	amaury.forgeotdarc, collinwinter, eric.smith, mark.dickinson, pitrou, rhettinger, stutzbach, terry.reedy, tim.peters
Priority:	normal	Keywords:	patch

Created on 2010-09-21 20:03 by stutzbach, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
sort-key-locality.diff	stutzbach, 2010-09-21 20:03
speed_test.zip	stutzbach, 2010-09-22 17:14
sort-speed-test.patch	stutzbach, 2010-11-22 23:50	Patch to add a script for comparing sort-speed
sort-faster.patch	stutzbach, 2010-11-22 23:51	Revised patch to Objects/listobject.c
detailed-speed-results.txt	stutzbach, 2010-11-22 23:52	Detailed performance results for sorting random integers without a key
sort-faster-3.patch	stutzbach, 2010-12-01 20:21	Minimalist version of the patch

Messages (35)
msg117097 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-09-21 20:03
(I've made an educated guess about who to add to the Nosy list) The attached patch substantially speeds up sorting using the "key" parameter. It is purely a performance patch; the language and libraries are not changed in any other way from the users point of view. I measured a reduction in execution time of at least 15% in many cases and more than 40% for large n. I performed measurements on an Intel 64-bit Linux system using gcc 4.3.2 and on an Intel 32-bit Windows XP system using Visual C Express Edition 2009. Previously, "key" was implemented by a creating a sortwrapperobject, which is a PyObject storing a key and a value and using only the key for comparison. With the patch, sortwrapperobject is removed entirely. Instead, the sort uses two arrays: one for keys and one for values. Comparisons use the keys array. Whenever a swap is performed, the swap is performed on both arrays. If the "keys" parameter is not provided to the sort, the second swap is skipped (since the keys are also the values). Compared to the sortwrapperobject approach, speed is improved by: - Requiring only 1 memory allocation for an array of PyObject *'s, instead of n memory allocations for n sortwrapperobjects - Removes the need to dereference through sortwrapperobjects, which were scattered about in memory (i.e., had poor cache locality) - Substantially smaller memory overhead, further improving cache performance When the "key" parameter is not used, the code still needs to check to see if it should be performing a second swap. However, the additional instructions are cache and branch-prediction friendly and do not have a noticeable impact on performance. I conducted enough experiments to establish a 95% confidence interval that excluded a slowdown of more than 1% (but did not exclude a slowdown of 0% - which is good). A handful of results: # No key, same speed otto:~$ py3k-git-base/python -m timeit -s 'x = list(range(5))' -s 'f = x.sort' 'f()' 1000000 loops, best of 3: 0.276 usec per loop otto:~$ py3k-git/python -m timeit -s 'x = list(range(5))' -s 'f = x.sort' 'f()' 1000000 loops, best of 3: 0.276 usec per loop # With a key, patched version is faster otto:~$ py3k-git-base/python -m timeit -s 'x = list(range(5))' -s 'f = x.sort' 'f(key=int)' 1000000 loops, best of 3: 1.76 usec per loop otto:~$ py3k-git/python -m timeit -s 'x = list(range(5))' -s 'f = x.sort' 'f(key=int)' 1000000 loops, best of 3: 1.5 usec per loop # Results are more dramatic with large n otto:~$ py3k-git-base/python -m timeit -s 'x = list(range(100000))' -s 'f = x.sort' 'f(key=int)' 10 loops, best of 3: 35.2 msec per loop otto:~$ py3k-git/python -m timeit -s 'x = list(range(100000))' -s 'f = x.sort' 'f(key=int)' 10 loops, best of 3: 22.4 msec per loop I have been using a script for running a large battery of experiments with different values of n and different conditions (random data, sorted data, reverse-sorted data, key, no key, etc.). The script works, but it's clunky to use. I'm working on cleaning that up and hope to attach it to this issue within the next few days.
msg117098 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-21 20:37
Looks rather nice. I don't think there's any point in micro-optimizations such as stack_keys.
msg117099 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-09-21 20:43
Conceptually, this is a reasonable approach. I originally put in the sortwrapper as a straight-forward technique of tackling the 2.x API which allowed a key-function, or a cmp-function, or both, or neither. IOW, the original motivation is now gone. The only remaining advantage of the sortwrapper is that it is independent of the main code for the timsort, so both are more readable and maintainable in their current form. I've only had a cursory look at the patch. A couple of thoughts: * The memmove, memcpy functions are tricky to time because of varying performance across various architectures and libraries. Code like "dest++ = pb++;" is hard to beat, especially for short runs. * Is the code slower for the common case where a key function is not provided? The patch seems to add a level of indirection throughout the code (lots of "lo.values" instead of just "lo"). * A more extensive timing suite would be helpful; the ones listed in the first post are simplistic.
msg117104 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-09-21 21:40
Antoine said: > I don't think there's any point in micro-optimizations such as > stack_keys. Good point. I'll try taking that out and see how it impacts the timing. Raymond said: > The memmove, memcpy functions are tricky to time because of varying > performance across various architectures and libraries. Code like > "dest++ = pb++;" is hard to beat, especially for short runs. Anything that was a memmove or memcpy remains a memmove or memcpy. Anything that was equivalent but implemented "by hand" remains that way. (Although in either case the details have been moved into a sortslice_* static inline function.) I have avoided changing any memcpy/memmove to "by hand" (or vise versa), because I know that it might be faster on my system but slower on someone else's. > Is the code slower for the common case where a key function is not > provided? The short answer is "no". The long answer is that I conducted enough experiments for the 95% confidence interval of the ratio of execution times (patched/trunk) to be 0.994735131656 +/- 0.00540792612332, for the case where no key function is provided. > The patch seems to add a level of indirection throughout the code > (lots of "lo.values" instead of just "lo"). The compiler can exactly compute the stack offset of "lo.values", so it should require the same number and type of instructions to access as just "lo". > A more extensive timing suite would be helpful; the ones listed in > the first post are simplistic. I have one, but it's clunky and requires modifying the script to change important parameters. I'm refactoring it to use argparse and will attach it to this issue when it's ready.
msg117147 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-09-22 17:14
Attached is my script for running a more comprehensive battery of speed tests. The script itself requires Python 2.6 with argparse installed or Python 2.7 (which includes argparse). For obvious reasons, please make sure that your unpatched and patched versions of python are otherwise identical (same source, same compiler, same configure and compiler settings, etc.). The script will create subdirectories to store data, so please run it in the speed_test directory. Example usage: otto:~/speed_test$ ./speed_test.py ../py3k-unpatched/ ../py3k-patched/ 'sort random' otto:~/speed_test$ ./speed_test.py --help usage: speed_test.py [-h] [--minn MINN] [--maxn MAXN] [-r, --repetitions REPETITIONS] [--graphs] control experiment [tests [tests ...]] Compare the speed of two list implementations positional arguments: control control/python experiment experiment/python tests Names of tests to conduct optional arguments: -h, --help show this help message and exit --minn MINN Minimum list size --maxn MAXN Maximum list size -r, --repetitions REPETITIONS Repetitions; how many times to repeat each experiment --graphs Generate performance graphs as a function of n; requires gnuplot Available experiments: sort random tuples sort random key sort reversed sort random objects sort sorted sort sorted key sort random sort sorted objects sort reversed key
msg117332 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-09-24 21:18
The posted experiments on sorted data do not do any sorting. They only test the difference in setup and comparison speed and not sorting/swapping speed. Please post something with large arrays of random data. Since the patch is intended to speed up 3.2 and your posted experiments were run on that, I am puzzled that you would post a test script to run under late 2.x instead of 3.1+.
msg117334 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-09-24 21:30
> Since the patch is intended to speed up 3.2 and your posted > experiments were run on that, I am puzzled that you would post a > test script to run under late 2.x instead of 3.1+. I had originally written the test script was originally written for something else and repurposed it. Actually, it's been repurposed at least twice so it has a long history and an unfortunate amount of cruft. I will work on porting it to run on py3k. My original examples did not use random data because timeit is challenging to use to time sorting of random data and I wanted to post something relatively concise. Sometime next week I plan to post extensive timing results on random data. I have been working on some fine-tuning of my patch as well as to my timing script.
msg117335 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-09-24 21:49
Does this help any? >>> import random >>> from timeit import timeit >>> testlist = list(range(100000)) >>> random.shuffle(testlist) >>> timeit('t=list(testlist); t.sort()', 'from __main__ import testlist', number=1) 0.1467957519740679 I just learned the import trick from the tracker issue suggesting that it be made more prominent.
msg117336 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-09-24 22:02
> Does this help any? No :-) The problem is that the random data you run in interpreter 1 won't be the same data you run in interpreter 2, so the results are not directly comparable. One of the sets of random data may be more easily sortable than the other. That leaves two options: 1. save the random data to a file and use it in both interpreters, or 2. run a sufficiently large number of tests, with new random data for each test, such that you get a good measurement of the time required to sort average random data I have been using approach #2.
msg117337 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-09-24 22:05
> the random data you run in interpreter 1 won't be the same data > you run in interpreter 2 what about adding a simple "random.seed(12345)"
msg117338 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-09-24 22:12
> what about adding a simple "random.seed(12345)" That's an excellent suggestion! In fact, I'm embarrassed that it never occurred to me (especially since I have used it in other projects). I will have a revised speed_test along with more extensive measurement results sometime next week.
msg122178 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-22 23:50
I'm starting to get settled in here at Google and finding time to follow up on everything that got put on hold while moving. Based on the feedback everyone gave me (thanks!), I greatly revised my script for comparing the speed of sort(). It's now compares speeds more quickly and more accurately. I discovered a few things along the way: 1) Don't try to run timings on a virtual server with crummy clock resolution. 2) Make sure that CPU frequency scaling is turned off. It increases the standard deviation of timings by an order of magnitude or two. On Linux systems, the new version of my script will check to see if CPU frequency scaling is turned on and provide suggestions on how to turn it off. On other operating systems, you are on your own. ;) Usage: filfre:~/py3k-patched$ ./python Tools/sort_speed/sort_speed.py -r 3 --maxn 100000 ../py3k-pristine . sort_random Using my new-and-improved timing script and running on real hardware with CPU frequency scaling turned off, I was able to detect a slight slowdown on sorting random integers. Here's a summary for sorting random integers: Original patch (increasing execution time is bad): n in [700 .. 100,000]: 1% to 2% increase in execution time n in [70 to 700]: 2% to 3% increase n in [10 to 70]: 3% to 4% increase n in [5 to 7]: 7% to 9% decrease n in [0 to 3]: 18+% decrease I went back to the code and managed to squeeze out a bit more performance. Here's a summary of the new patch: New patch (increasing execution time is bad): n in [30,000 .. 100,000]: < 1% increase in execution time n in [35 .. 30,000]: 1% to 2% increase n in [20 .. 35]: about the same n in [10 .. 20]: 1 to 2% decrease n in [4 .. 7]: 15+% decrease n in [0 .. 3]: 22+% decrease All of the above results are for sorting without a key. For sorting with a key, there's a big performance boost across the board (15% to 45% decrease in execution time, when the key function is simply "int"). Overall, the patch shrinks Objects/listobject.c by 10 lines. Executive Summary ----------------- Good: - Dramatically cuts execution time of sorting with a key (15% to 45%) - Significantly cuts execution time of sorting small lists without a key (15+%) Bad: - Very slightly increases execution of sorting large lists without a key (1% to 2%; less than 1% for very large lists) Worthwhile trade?
msg122179 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-22 23:58
> Worthwhile trade? +1 obviously. Why don't you contribute a list sorting benchmark to the suite in http://hg.python.org/benchmarks/?
msg122180 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-23 00:10
Antoine Pitrou <report@bugs.python.org> wrote: > Why don't you contribute a list sorting benchmark to the suite in http://hg.python.org/benchmarks/? I considered that, but I want to separately benchmark sorting different kinds and quantities of data. AFAIK, there isn't an easy way to do that with perf.py.
msg122181 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-23 00:11
Le mardi 23 novembre 2010 à 00:10 +0000, Daniel Stutzbach a écrit : > Daniel Stutzbach <stutzbach@google.com> added the comment: > > Antoine Pitrou <report@bugs.python.org> wrote: > > Why don't you contribute a list sorting benchmark to the suite in http://hg.python.org/benchmarks/? > > I considered that, but I want to separately benchmark sorting > different kinds and quantities of data. AFAIK, there isn't an easy > way to do that with perf.py. Right, that wouldn't suit your present purposes. But apparently you are proposing to add a list sorting benchmark to the Tools directory, with lots of duplicated code from that repo...
msg122182 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-23 00:23
Antoine Pitrou <report@bugs.python.org> wrote: > Right, that wouldn't suit your present purposes. But apparently you > are proposing to add a list sorting benchmark to the Tools directory, > with lots of duplicated code from that repo... Oh, I just stuck that under Tools because it was convenient for testing. The timing script is in a separate patch because I'm -0 on committing the timing script itself. The perf.py that my timing script uses is based on the one from benchmarks/, but with 90% of the code removed, so the amount of duplicated code is less than it may at first appear. My script could go in benchmarks/ if there was a convenient way for it to import perf.py. I wouldn't want to clutter the top-level directory with my script though. Any suggestions?
msg122183 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-11-23 01:16
Thanks for the revisions and timing updates. I'm heartened that the common-case of sorting without a key function isn't negatively impacted. That result is surprising though -- I thought the concept was manipulate the key and value arrays at the same time instead of just the keys -- did you do more than this, perhaps changing the logic or pattern of comparisons? If so, I would be much more comfortable if Tim reviewed this. The one part of the current code that would be missed is that it cleanly separates/decouples the key-function logic from the Timsort logic. Now those are heavily intertwined -- I find the code harder to follow. Why did the variable names change, pa/pb to ssa/ssb, etc.? I'm hoping that I'll have a chance to go through the details of the patch in the next couple of weeks. Unfortunately, the patch is huge and it looks like it mixes in a number of optimizations beyond just moving keys and values in parallel, variable names are changing, comment lines are rewrapped, etc.
msg122186 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-23 01:53
Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment: > That result is surprising though -- I thought the concept was > manipulate the key and value arrays at the same time instead of just > the keys If the "key" parameter was not used, then the values pointer is a null pointer. Each time elements must be moved, the code needs to check if the values pointer is NULL. In other words, the overhead is one conditional branch per move or group of moves (we only have to check once for memmove()). Since the branch will always be the same throughout any given call to sort(), CPU branch prediction is effective making the branches fairly inexpensive. > -- did you do more than this, perhaps changing the logic or > pattern of comparisons? In the original version of the patch, I tried hard to avoid unrelated changes. In the new version, I made many small not-strictly-related optimizations. However, I have not changed the order in which comparisons occur. > Why did the variable names change, pa/pb to ssa/ssb, etc.? I took "pa" to mean "pointer to array A", but it's now a sortslice structure instead of a pointer. > comment lines are rewrapped The comment rewrapping aren't completely spurious. The comments really have changed and in some cases the constraints on the inputs to a function have changed slightly. For what it's worth, my methodology when working on this patch went like this: 1) Make an improvement 2) Measure the performance relative to the previous reference point 3) If it was a statistically significant an improvement, commit the change to my local DVCS 4) Goto 1 I can upload the patch to Rietveld to facilitate review. If I can get Rietveld to show each of my local commits, would it be helpful to see the patch in incremental pieces?
msg122187 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-11-23 02:12
> If the "key" parameter was not used, then the values pointer > is a null pointer. . . . > Since the branch will always be the same throughout any given > call to sort(), CPU branch prediction is effective making the > branches fairly inexpensive. I see how the branch can be optimized away but not how it gets cheaper than when there was no branch at all (and no lo.keys and lo.values extra layer of direction). How did it get faster than the original (in the case with no key-function)? >> Why did the variable names change, pa/pb to ssa/ssb, etc.? > I took "pa" to mean "pointer to array A", but it's now > a sortslice structure instead of a pointer. That makes sense. > If I can get Rietveld to show each of my local commits, > would it be helpful to see the patch in incremental pieces? At this point, it may be best for me to read the patch as-is. Unfortunately, it's a big patch and will likely take a lot of time to read in detail. Is there any chance you can persuade Uncle Timmy to review this? This is all his code and he's likely to have some good insights. Also, is there anyone else who is knowledgeable about Python on less common platforms? ISTM part of the optimization is dependent on the branch-prediction and other nuances that vary from environment to environment. That being said, doing fewer memory allocations is always a win.
msg122190 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-23 04:32
> How did it get faster than the original (in the case with no > key-function)? I was able to shave off some instructions in countrun(), binarysort(), and the setup and cleanup code in listsort() proper. For small n, these made a difference. > Is there any chance you can persuade Uncle Timmy to review this? This > is all his code and he's likely to have some good insights. Any suggestions on how he might be persuaded? ;-) > Also, is there anyone else who is knowledgeable about Python on less > common platforms? ISTM part of the optimization is dependent on the > branch-prediction and other nuances that vary from environment to > environment. That being said, doing fewer memory allocations is > always a win. I do have a Windows machine I can test on. It's not exactly a "less common" platform, but at least it's a completely different compiler. I'll post the results once I have them. The Rietveld issue is here: http://codereview.appspot.com/3269041 I ended up loading my incremental patches in, but it's easy enough to diff the base with the last patch. If for some reasons it doesn't work as conveniently as I expect, let me know and I will upload it to Rietveld again as one big patch. If there's anything else I can do to make reviewing easier, please let me know. For that matter, if there's other code you'd like me to review in exchange or straightforward bugs you'd like to pawn off, I would be happy to help out.
msg122245 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-23 21:30
> The Rietveld issue is here: > http://codereview.appspot.com/3269041 > > I ended up loading my incremental patches in, but it's easy enough to > diff the base with the last patch. If for some reasons it doesn't > work as conveniently as I expect, let me know and I will upload it to > Rietveld again as one big patch. I've started reviewing, and I must say that incremental patches would have made more sense if they didn't mutate each other's changes. Still reviewing though, thanks for the upload.
msg122247 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-23 21:52
> > I ended up loading my incremental patches in, but it's easy enough to > > diff the base with the last patch. If for some reasons it doesn't > > work as conveniently as I expect, let me know and I will upload it to > > Rietveld again as one big patch. > > I've started reviewing, and I must say that incremental patches would > have made more sense if they didn't mutate each other's changes. Still > reviewing though, thanks for the upload. Ok, things are even worse: comments I've made to intermediate patches have wrong URLs in the summary e-mail, so they don't point to the right line numbers. Certainly a bug in Rietveld, but I'm not willing to do it, sorry. Bottom line is that you're making too many gratuitous changes, including style changes, and probably useless changes (Py_LOCAL_INLINE everywhere, nitpicking over ISLT / IFLT...). Also, there's some strange complication in some places which deserves comments and justification. Next time, please upload a single patch. Really.
msg122248 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-23 22:22
Antoine Pitrou wrote: > Next time, please upload a single patch. Really. I haven't used Rietveld that much yet, and I'm still learning best-practices. I apologize for the painful experience. For anyone else planning to take a look at this, here it is as one big patch: http://codereview.appspot.com/3290041
msg122249 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-11-23 22:29
Antoine, My original patch was much more focused, but had a slightly larger performance penalty for sorting random keys (see http://bugs.python.org/msg122178). Do you think the performance tradeoff there was still worthwhile? Ihave uploaded my original patch (as one patch :) ) here: http://codereview.appspot.com/3291041 If you think the original performance looks like a worthwhile tradeoff, could you spend about 1 minute looking over that version of patch and give your general impression?
msg122250 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-11-23 23:03
> My original patch was much more focused, but had a slightly larger > performance penalty for sorting random keys (see > http://bugs.python.org/msg122178). Do you think the performance > tradeoff there was still worthwhile? I am not objecting against the performance tweaks but the other more or less gratuitous changes (IFLT vs ISLT, Py_LOCAL_INLINE sprinkled all over, weird complicated static "keys" instead of the initial stack_keys, "else" style change...). You can still try to salvage my comments from the review I've posted, but it seems Rietveld makes the thing tediously annoying to navigate (best thing may be to browse all 14 side-by-side diffs to look for the comments).
msg122941 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-11-30 22:22
+1 on the basic idea of moving elements in the keys and values arrays at the same time thereby eliminating the fragmented memory overhead of the sortwrapper indirection. I would like the patch to be restricted to just that change. The other tweaks are not convincing (relying on compiler and processor specific optimizations that vary across platforms). Instead, try to minimize the patch, making as few changes as possible to manipulation both arrays. Also, try to follow the C style of the other code in the standard library -- the current patch has a different flavor to say the least ;-) Ideally, the code will have the same look and feel as it does now (so that the Timsort remains recognizable to Tim ;-). This code doesn't just need to run fast, it needs to serve as a readable model for anyone else trying to implement the algorithm. If you still want the other tweaks, I recommend putting them in a separate patch afterwards and consider deferring them to 3.3. It's a little late in the dev cycle to make lots of microscopic changes that may introduce a bug or unexpected behavior or perform weirdly on one of the less used platforms.
msg123008 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-12-01 20:21
Antoine and Raymond, thank you for the valuable feedback. Attached is a revised version of the patch, which restricts changes to those directly related to moving elements in the keys and values arrays at the same time. I apologize for having gotten a little carried away with optimizing. I think I eliminated all of the significant style differences as well. If you still see glaring style mismatches, please let me know. It's possible that the differences aren't visible to my eyes. ;-) http://codereview.appspot.com/3369042/diff/1/Objects/listobject.c Please let me know what you think of the revised version.
msg123021 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-01 23:16
Thanks. This nice, clean diff is much more reviewable and it looks like what I expected. The use of Py_LOCAL_INLINE is new to me since we usually use #define instead, but this has a cleaner look to it. I am unclear on whether all the our target compilers support an inline keyword. If you're sure it works everywhere, that's great. If not, consider going back to ugly defines -- those reliably work everywhere. Also note that this patch puts a lot of faith in branch prediction. If some target processor doesn't support it, or has limited ability to remember predictions, or mispredicts, then the code will be slower. That being said, I'm happy with the patch. You have a +1 from me.
msg123025 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-12-01 23:42
> The use of Py_LOCAL_INLINE is new to me since we usually use #define > instead, but this has a cleaner look to it. I am unclear on whether > all the our target compilers support an inline keyword. If you're > sure it works everywhere, that's great. I fixed ./configure to properly set up Py_LOCAL_INLINE in Issue5553. :-) It will expand to "static inline" under both MSVC and gcc. On older compilers, it may expand to "static __inline__", "static __inline", or whatever else is needed to get the job done. As a last resort, it will expand to simply "static", but I don't know of any 32-bit (or 64-bit) compilers where that would actually happen. > Also note that this patch puts a lot of faith in branch prediction. > If some target processor doesn't support it, or has limited ability > to remember predictions, or mispredicts, then the code will be slower. I think even a limited amount of memory dedicated to branch prediction should be sufficient. There are two cases: 1) Sorting a simple type, like an int: the comparison is lightweight, and the CPU should have plenty of memory to remember which branch to take in the sorting code. 2) Sorting a complex type (i.e., calling a __lt__ method written in Python): the processor might not be able to remember which branch to take, but the performance impact will be small (as a percentage) since most of the CPU is being consumed by the comparisons. Thanks for taking the time to review this.
msg123027 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-02 00:38
Just for the record, I wanted to highlight how little room there is for optimization here. The sort wrapper is very thin: sortwrapper_richcompare(sortwrapperobject a, sortwrapperobject b, int op) { if (!PyObject_TypeCheck(b, &PySortWrapper_Type)) { PyErr_SetString(PyExc_TypeError, "expected a sortwrapperobject"); return NULL; } return PyObject_RichCompare(a->key, b->key, op); } When a key function is defined, this is all you can possibly shave off the time for a comparison. When a key function is not defined, there was no overhead at all. With the patch, we're relying on branch prediction to minimize the cost to the regular case and adding a little indirection in the form of lo variables becoming lo.keys, etc. And the number of memmoves is doubled. To me, the main advantage of the patch is that it saves a little memory for each key: typedef struct { PyObject_HEAD PyObject key; PyObject value; } sortwrapperobject; Just wanted to post this so there weren't any illusions about the patch being a big win.
msg123028 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-12-02 01:06
> Just wanted to post this so there weren't any illusions about the > patch being a big win. > When a key function is defined, this is all you can possibly shave > off the time for a comparison. I don't want to argue whether the patch is a big win or not (I recognize that it is a tradeoff), but when using a key it does shave off more than the call to sortwrapper_richcompare. Stack with sortwrapper: long_richcompare do_richcompare PyObject_RichCompare sortwrapper_richcompare do_richcompare PyObject_RichCompare PyObject_RichCompareBool count_run list_sort Stack without: long_richcompare do_richcompare PyObject_RichCompare PyObject_RichCompareBool count_run list_sort
msg123070 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-12-02 13:25
> Just wanted to post this so there weren't any illusions about the > patch being a big win. Daniel has already posted benchmark numbers, I would trust them rather than any theoretical speculation about whether the patch is interesting or not.
msg123092 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-12-02 18:11
AP: I've already given my blessing to the patch. Just wanted to note what the existing code did. I also trust timings but recognize that they reflect a particular build configuration (compiler/processor/o.s)and the usage pattern for a particular benchmark.
msg123128 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-12-02 22:02
Committed as r86937. Thanks again for reviewing! Although I do not anticipate any problems, I will keep an eye on the buildbots just in case. Antoine, regarding "ms->alloced = (list_size + 1) / 2;", I ended up adding an extensive comment after all.
msg123132 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-12-02 22:08
Thank you!

History
Date	User	Action	Args
2022-04-11 14:57:06	admin	set	github: 54124
2010-12-02 22:08:25	pitrou	set	messages: + msg123132
2010-12-02 22:03:10	stutzbach	set	status: open -> closed nosy: tim.peters, collinwinter, rhettinger, terry.reedy, amaury.forgeotdarc, mark.dickinson, pitrou, eric.smith, stutzbach resolution: accepted components: + Interpreter Core, - Library (Lib) stage: patch review -> resolved
2010-12-02 22:02:16	stutzbach	set	messages: + msg123128
2010-12-02 18:11:46	rhettinger	set	messages: + msg123092
2010-12-02 13:25:43	pitrou	set	messages: + msg123070
2010-12-02 01:06:17	stutzbach	set	messages: + msg123028
2010-12-02 00:38:49	rhettinger	set	messages: + msg123027
2010-12-01 23:42:11	stutzbach	set	messages: + msg123025
2010-12-01 23:16:43	rhettinger	set	messages: + msg123021
2010-12-01 20:21:42	stutzbach	set	files: + sort-faster-3.patch messages: + msg123008
2010-11-30 22:22:06	rhettinger	set	messages: + msg122941
2010-11-25 12:54:11	mark.dickinson	set	nosy: + mark.dickinson
2010-11-23 23:03:57	pitrou	set	messages: + msg122250
2010-11-23 22:29:48	stutzbach	set	messages: + msg122249
2010-11-23 22:22:02	stutzbach	set	messages: + msg122248
2010-11-23 21:52:16	pitrou	set	messages: + msg122247
2010-11-23 21:30:45	pitrou	set	messages: + msg122245
2010-11-23 04:32:00	stutzbach	set	messages: + msg122190
2010-11-23 02:12:10	rhettinger	set	messages: + msg122187
2010-11-23 01:53:28	stutzbach	set	messages: + msg122186
2010-11-23 01:16:38	rhettinger	set	messages: + msg122183
2010-11-23 00:23:27	stutzbach	set	messages: + msg122182
2010-11-23 00:11:52	pitrou	set	messages: + msg122181
2010-11-23 00:10:44	stutzbach	set	messages: + msg122180
2010-11-22 23:58:24	pitrou	set	messages: + msg122179
2010-11-22 23:52:10	stutzbach	set	files: + detailed-speed-results.txt
2010-11-22 23:51:08	stutzbach	set	files: + sort-faster.patch
2010-11-22 23:50:38	stutzbach	set	files: + sort-speed-test.patch messages: + msg122178
2010-09-24 22:12:22	stutzbach	set	messages: + msg117338
2010-09-24 22:05:09	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg117337
2010-09-24 22:02:17	stutzbach	set	messages: + msg117336
2010-09-24 21:49:49	terry.reedy	set	messages: + msg117335
2010-09-24 21:30:33	stutzbach	set	messages: + msg117334
2010-09-24 21:18:49	terry.reedy	set	nosy: + terry.reedy messages: + msg117332
2010-09-22 17:14:31	stutzbach	set	files: + speed_test.zip messages: + msg117147
2010-09-21 22:24:57	eric.smith	set	nosy: + eric.smith
2010-09-21 21:40:35	stutzbach	set	messages: + msg117104
2010-09-21 20:43:55	rhettinger	set	messages: + msg117099
2010-09-21 20:37:51	pitrou	set	nosy: + pitrou messages: + msg117098
2010-09-21 20:03:31	stutzbach	create