Author elliot.gorokhovsky
Recipients elliot.gorokhovsky
Date 2016-11-13.21:19:39
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1479071980.11.0.232140584241.issue28685@psf.upfronthosting.co.za>
In-reply-to
Content
When Python compares two objects, there are many safety checks that have to be performed before the actual comparison can take place. These include checking the types of the two objects, as well as checking various type-specific properties, like character width for strings or the number of machine words that represent a long.

Obviously, the vast majority of the work done during a list.sort() is spent in comparisons, so even small optimizations can be important. What I noticed is that since we have n objects but perform O(n log n) comparisons, it pays off to do all the safety checks in a single pass and then select an optimized comparison function that leverages the information gained to avoid as many sort-time checks as possible. I made the following assumptions:

1. In practice, lists to be sorted are almost always type-homogeneous.
2. Most lists to be sorted consist of homogeneous elements of the following types:
    a. Latin strings (i.e. strings whose characters are all one byte wide)
    b. Longs that fit in a single machine word
    c. Floats
    d. Tuples of the above
3. The above assumptions also hold for lists constructed by taking a list of tuples to be sorted and extracting the first elements.
4. The vast majority of tuple comparisons never get past the first element (i.e. the first elements of tuples are rarely equal).
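To make the n-versus-O(n log n) arithmetic concrete, here's a small illustrative Python snippet (not part of the patch) that counts how many comparisons sorted() actually performs. The pre-pass touches each of the n elements once, while the sort compares far more often, so a small per-comparison saving is multiplied many times over:

```python
import random

# Illustrative only: count how many comparisons sorted() performs on n
# elements. A single pre-pass costs n type checks, while the sort performs
# on the order of n*log2(n) comparisons.

class Counting:
    calls = 0  # total number of __lt__ invocations across all instances

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counting.calls += 1
        return self.value < other.value

random.seed(0)
n = 1000
data = [Counting(random.random()) for _ in range(n)]
sorted(data)
print(Counting.calls)  # several times larger than n
```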

My patch adds the following routine to list.sort():
1. Go through the list and see which of the assumptions, if any, hold (except (4), which can't be checked in advance).
2. Replace PyObject_RichCompareBool() with an optimized compare function that can skip some safety checks by relying on the assumptions verified in step (1).
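In very loose Python pseudocode, the dispatch might look like the sketch below. All names here are invented for illustration; the actual patch implements this in C inside listobject.c, where the "unsafe" compares operate on raw doubles, machine words, or byte buffers rather than calling back into Python:

```python
# Hypothetical sketch of the pre-pass and dispatch; all names are invented.

def generic_compare(a, b):
    # Stand-in for PyObject_RichCompareBool(a, b, Py_LT): full safety checks.
    return a < b

def unsafe_float_compare(a, b):
    # Both operands are known to be floats; type checks can be skipped.
    return a < b

def unsafe_latin_compare(a, b):
    # Both operands are known to be one-byte-per-character strings.
    return a < b

def choose_compare(keys):
    """One O(n) pre-pass over the list; returns the compare to use while sorting."""
    if len(keys) < 2:
        return generic_compare
    first_type = type(keys[0])
    if any(type(k) is not first_type for k in keys[1:]):
        return generic_compare  # heterogeneous list: no fast path
    if first_type is float:
        return unsafe_float_compare
    if first_type is str and all(max(map(ord, k), default=0) < 256 for k in keys):
        return unsafe_latin_compare
    return generic_compare  # homogeneous, but no specialized compare available

print(choose_compare([0.5, 1.5]).__name__)   # unsafe_float_compare
print(choose_compare([0.5, "x"]).__name__)   # generic_compare
```

Note that the heterogeneous case falls through to the generic compare, which is exactly why the only cost in the pathological cases below is the pre-pass itself.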

There are then two questions: when none of the assumptions hold, how expensive is (1)? And when we do get a "hit", how much time do we save by applying (2)?

Those questions, of course, can only be answered by benchmarks. So here they are, computed on an isolated CPU core, on two python interpreters, both compiled with ./configure --with-optimizations:

# This is a runnable script. Just build the reference interpreter in python-ref and the patched interpreter in python-dev.

# Pathological cases (where we pay for (1) but don't get any savings from (2)): what kind of losses do we suffer?

# How expensive is (1) for empty lists?
python-dev/python -m perf timeit -s "l=[]" "sorted(l)" --rigorous 
python-ref/python -m perf timeit -s "l=[]" "sorted(l)" --rigorous  
# Median +- std dev: 212 ns +- 9 ns
# Median +- std dev: 216 ns +- 10 ns
# (difference is within std dev)

# How expensive is (1) for singleton lists?
python-dev/python -m perf timeit -s "l=[None]" "sorted(l)" --rigorous 
python-ref/python -m perf timeit -s "l=[None]" "sorted(l)" --rigorous
# Median +- std dev: 235 ns +- 16 ns
# Median +- std dev: 238 ns +- 15 ns
# (difference is within std dev)

# How expensive is (1) for non-type-homogeneous lists?
# (we use small lists because as n gets large, the fraction of time spent in (1) gets smaller)
python-dev/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,10)]+[1]" "sorted(l)" --rigorous 
python-ref/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,10)]+[1]" "sorted(l)" --rigorous 
# Median +- std dev: 784 ns +- 35 ns
# Median +- std dev: 707 ns +- 51 ns
# 10% slower. While this is unfortunate, this case almost never occurs in practice, and the losses are certainly offset by the gains we'll see below.
# Also, note that for large lists, the difference would be *much* smaller, since the time spent in (1) gets smaller like 1/log(n).
# So basically, we only pay a penalty for non-type-homogeneous small lists.

# OK, now for the cases that actually occur in practice:

# What kind of gains do we get for floats?
python-dev/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[random.uniform(-1,1) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 3.63 us +- 0.20 us
# Median +- std dev: 8.81 us +- 0.37 us
# Wow! 59% faster! And if we used a large list, we would've gotten even better numbers.

# What kind of gains do we get for latin strings?
python-dev/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1)) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 9.51 us +- 0.28 us
# Median +- std dev: 15.9 us +- 0.7 us
# 40% faster! 

# What kind of gains do we get for non-latin strings (which I imagine aren't that common to sort in practice anyway)?
python-dev/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1))+'\uffff' for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[str(random.uniform(-1,1))+'\uffff' for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 12.2 us +- 0.4 us
# Median +- std dev: 14.2 us +- 0.6 us
# 14%. Not as impressive, but again, this is a bit of a pathological case.

# What kind of gains do we get for ints that fit in a machine word? (I'll keep them in (-2^15,2^15) to be safe)
python-dev/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**15) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 4.92 us +- 0.35 us
# Median +- std dev: 9.59 us +- 0.75 us
# 49% faster. Pretty cool stuff!

# What kind of gains do we get for pathologically large ints?
python-dev/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**100) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[int(random.uniform(-1,1)*2**100) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 6.93 us +- 0.26 us
# Median +- std dev: 8.93 us +- 0.23 us
# 22% faster. The same comparison function is used as in the non-latin string case, but the numbers differ because comparing strings
# is more expensive than comparing ints, so skipping the type checks saves proportionally more for ints. Regardless, 22% is still pretty cool. And I can't
# imagine it's that common to sort huge ints anyway, unless you're doing some sort of math conjecture testing or something.

# What kind of gains do we get for tuples whose first elements are floats?
python-dev/python -m perf timeit -s "import random; l=[(random.uniform(-1,1), 'whatever') for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[(random.uniform(-1,1), 'whatever') for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 5.83 us +- 0.50 us
# Median +- std dev: 21.5 us +- 0.5 us
# WOW! 73% faster!
# A note: I know of at least one application where this is relevant. The DEAP evolutionary algorithms library, which is very popular, has to sort
# tuples of floats when it's selecting individuals for the next generation. It spends a non-trivial amount of time sorting those tuples of floats,
# and this patch cuts that non-trivial time by 73%! Pretty cool! Now we can evolve more individuals, faster!

# What kind of gains do we get for tuples whose first elements are latin strings?
python-dev/python -m perf timeit -s "import random; l=[(str(random.uniform(-1,1)), 42) for _ in range(0,100)]" "sorted(l)" --rigorous
python-ref/python -m perf timeit -s "import random; l=[(str(random.uniform(-1,1)), 42) for _ in range(0,100)]" "sorted(l)" --rigorous
# Median +- std dev: 14.9 us +- 0.8 us
# Median +- std dev: 31.2 us +- 1.3 us
# 52%. Not too shabby!

# What kind of gains do we get for tuples of other stuff?
# Obviously there are lots of combinations possible, I won't waste your time documenting all of them.
# Suffice it to say that the gains for lists of tuples of x are always greater than the gains for lists of x, since we bypass
# two layers of safety checks: checks on the tuples, and then checks on the x.

# What kind of gains do we get for arbitrary lists of objects of the same type?
# See the benchmarks for large ints and non-latin strings. The gain will differ for each type, since it depends on the ratio between
# the cost of the type check and the cost of the comparison. Regardless, it is always non-trivial, unless the cost of comparison is huge.

# End of script #
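As a footnote to the tuple numbers above: the fast path justified by assumption (4) can be sketched in hypothetical Python as follows (invented names; the patch implements this in C). It decides on the first elements using the specialized compare, and only falls back to a full lexicographic comparison in the rare case where they are equal:

```python
# Hypothetical sketch of the tuple fast path; names are invented.

def unsafe_tuple_compare(a, b, first_elem_lt):
    """a < b for tuples, where first_elem_lt is the compare specialized for
    the (homogeneous) type of the tuples' first elements."""
    if first_elem_lt(a[0], b[0]):
        return True
    if first_elem_lt(b[0], a[0]):
        return False
    # First elements compared equal -- rare, by assumption (4). Fall back to
    # the generic lexicographic tuple comparison for the remaining elements.
    return a < b

lt = lambda x, y: x < y  # stands in for, e.g., the float fast path
print(unsafe_tuple_compare((0.1, 'a'), (0.5, 'b'), lt))  # True
print(unsafe_tuple_compare((0.1, 'c'), (0.1, 'a'), lt))  # False (fell back)
```

This is why tuple sorts benefit twice: the per-element type checks on the tuples themselves are skipped, and the first-element comparison uses the specialized compare.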

TL;DR: This patch makes common list.sort() cases 40-75% faster, and makes very uncommon pathological cases at worst 15% slower.
It accomplishes this by performing the endless, but necessary, safety checks involved in comparison up front, and then using
optimized compares that ignore the already-performed checks during the actual sort.