2017-02-06
Using __builtin_expect() in a very long loop of 10^9 iterations (1,000,000,000) makes the loop 15% faster (2.67 sec => 2.28 sec), *but* using PGO (profile-guided optimization) avoids the need to call __builtin_expect() explicitly and makes the code 27% faster (2.67 sec => 1.95 sec):

"This optimized version runs significantly faster (1.95 versus 2.28 seconds) than our version that used __builtin_expect(). This is because, in addition to the branching in the if statement, the branching in the for loops was also optimized."

The article also confirms that if __builtin_expect() is misused, it makes the code 5% slower (2.67 sec => 2.79 sec).
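
For readers who have not used the builtin, here is a minimal sketch of how such a hint is usually written with GCC or Clang. The macro names LIKELY/UNLIKELY and the function sum_non_negative() are illustrative assumptions, not code from the article, and the claim that negative values are rare is exactly the kind of guess that costs time when it is wrong:

```c
#include <stddef.h>

/* __builtin_expect() is a GCC/Clang builtin. The macro names below are
   illustrative; the !! normalizes any truthy value to 0 or 1. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sum the non-negative entries of data[0..n-1].
   Assumption for this sketch: negative entries are rare. */
long sum_non_negative(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(data[i] < 0)) {
            continue;           /* cold path: skip the rare negative value */
        }
        total += data[i];       /* hot path */
    }
    return total;
}
```

The hint only affects code layout and branch ordering, never the result. With GCC, the PGO alternative needs no annotation at all: build with -fprofile-generate, run a representative workload, then rebuild with -fprofile-use so the compiler takes branch probabilities from the recorded profile.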


Another story about micro-optimization, this time in the Linux kernel.

The Linux kernel used explicit prefetch in some tiny loops. After many benchmarks, it was concluded that letting developers insert prefetches by hand made the code slower, so the micro-optimization was removed:

“So the conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.”
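
To make the pattern concrete, here is a rough sketch of a tiny loop with a hand-written prefetch next to the plain version. It is not the kernel's code: struct node and both functions are hypothetical, and __builtin_prefetch() (a GCC/Clang builtin) stands in for the kernel's own prefetch helper.

```c
#include <stddef.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    struct node *next;
    int value;
};

/* Traversal with an explicit prefetch of the next node. */
long sum_with_prefetch(const struct node *head)
{
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        /* On the last node p->next is NULL: this is the "NULL prefetch"
           case. The builtin does not fault on an invalid address. */
        __builtin_prefetch(p->next);
        total += p->value;
    }
    return total;
}

/* Plain traversal: prefetching is left to the hardware. */
long sum_plain(const struct node *head)
{
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        total += p->value;
    }
    return total;
}
```

Both functions return the same sum, so the prefetch is purely a performance bet. Whether it pays off depends on the CPU and the workload, which is why it has to be benchmarked rather than assumed.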
