Title: Experiment usage of likely/unlikely in CPython core
Type: performance Stage: resolved
Components: Versions: Python 3.7
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, haypo, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-02-06 09:41 by haypo, last changed 2017-06-28 00:59 by haypo. This issue is now closed.

File name Uploaded Description Edit
likely.patch haypo, 2017-02-06 09:41 review
Messages (8)
msg287112 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 09:41
Attached patch is an attempt to use the GCC/Clang __builtin_expect() attribute to provide the compiler with branch prediction information, it adds _Py_likely() and _Py_unlikely() macros, similar to the Linux kernel macros.

I have always wanted to experiment with this change. I opened this issue to collect information and benchmark results before making a decision. I doubt that the change will really make Python faster: there is a risk of modifying a lot of code for no gain, or worse, of making Python slower. Extract of the GCC doc:

  "as programmers are notoriously bad at predicting how their programs actually perform."

My patch marks error cases as unlikely. A profiler like Linux perf should be used to check which branches are actually less likely, but I used my own guesses to experiment with a first change.

Since there is a risk that using such macros makes the code slower, or has no effect, I tried to restrict the changes to hot code and the most common functions/macros:

* Py_EnterRecursiveCall()
* _PyArg_NoKeywords()
* Py_DECREF()... not sure about this case: is it really more likely that refcnt != 0 than refcnt == 0?
* Functions to call functions: _PyObject_FastCallKeywords(), fast_function(), etc.
* etc.

I'm not sure about the methodology to benchmark such a patch either. Is it worth benchmarking a Python compiled without PGO?

By the way, do GCC/Clang use __builtin_expect() information when Python is compiled with PGO? If PGO ignores this information, it would be better not to use it, since I now strongly recommend using PGO. Otherwise, Python performance is too random because of the joys of code placement.

Some links on __builtin_expect():

msg287113 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 09:45
Copy of msg287111 (issue #29460), Serhiy Storchaka (serhiy.storchaka):

> The next question might it: would it be worth it to try using unlikely()
> macro on (kwargs == NULL) test in the macro? ;-)

Perhaps this may help in some critical tight loops (in the implementations of dict, string operations or long integer arithmetic), but I doubt it can have any measurable effect when used once per call of a function that calls PyArg_Parse* and does many other things.
msg287115 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 10:01

Using __builtin_expect() in a very long loop of 10^9 (1,000,000,000) iterations makes the loop 15% faster (2.67 sec => 2.28 sec), *but* using PGO avoids the need to use __builtin_expect() explicitly and makes the code 27% faster (2.67 sec => 1.95 sec):

"This optimized version runs significantly faster (1.95 versus 2.28 seconds) than our version that used __builtin_expect(). This is because, in addition to the branching in the if statement, the branching in the for loops was also optimized."

The article also confirms that if __builtin_expect() is misused, it makes the code 5% slower (2.67 sec => 2.79 sec).


Another story related to micro-optimization in the Linux kernel.

The Linux kernel used explicit prefetches in some tiny loops. After many benchmarks, it was concluded that letting the developer use prefetch hints made the code slower, and so the micro-optimization was removed:

“So the conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.”
msg287118 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 10:40
Benchmarks with Python compiled by gcc -O3 (without PGO).

haypo@smithers$ ./python -m perf timeit 'len("abc")' --duplicate=1000 --compare-to=../default-ref/python
Median +- std dev: [ref] 40.4 ns +- 0.8 ns -> [likely] 40.8 ns +- 2.1 ns: 1.01x slower (+1%)

haypo@smithers$ ./python -m perf timeit 'sum(())' --duplicate=1000 --compare-to=../default-ref/python -p3
Median +- std dev: [ref] 86.4 ns +- 2.8 ns -> [likely] 86.3 ns +- 0.3 ns: 1.00x faster (-0%)
Not significant!

haypo@smithers$ ./python -m perf timeit -s 's=list("abc")' 'sorted(s)' --duplicate=100 --compare-to=../default-ref/python -p3
Median +- std dev: [ref] 224 ns +- 3 ns -> [likely] 222 ns +- 1 ns: 1.01x faster (-1%)
msg287144 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-06 15:51
I had read that likely/unlikely were hard to use well and often not worth the trouble.   IIRC, they are used extensively in the Linux kernel but people were finding zero measurable effect when enabling or disabling the constructs.
msg287145 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2017-02-06 15:58
I would expect that PGO/LTO builds give better improvements with less code bloat. Did you compare a PGO likely/unlikely build against a standard PGO build?
msg287148 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-06 16:04
I have also always wanted to experiment with this change. But I was afraid that providing such macros could encourage overusing them. I don't want every test for NULL or -1 to be wrapped in these macros.

If we expect some benefit from using these macros, I would play with them in hot loops. For example in the dict lookup function (it is unlikely that comparing keys raises an exception, or that the dict is modified during the search), in codecs (an illegal sequence is unlikely), etc.
msg297094 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-06-28 00:59
So, I played with likely/unlikely and... nothing, nothing interesting. Please use PGO compilation ;-)
Date User Action Args
2017-06-28 00:59:55hayposetstatus: open -> closed
resolution: out of date
messages: + msg297094

stage: resolved
2017-02-06 16:04:11serhiy.storchakasetmessages: + msg287148
2017-02-06 15:58:17christian.heimessetmessages: + msg287145
2017-02-06 15:51:24rhettingersetnosy: + rhettinger
messages: + msg287144
2017-02-06 10:40:01hayposetmessages: + msg287118
2017-02-06 10:01:13hayposetmessages: + msg287115
2017-02-06 09:52:52christian.heimessetnosy: + christian.heimes
2017-02-06 09:45:46hayposetmessages: + msg287113
2017-02-06 09:41:28haypocreate