Title: Use __builtin_clzl for bits_in_digit if available
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: louielu, mark.dickinson, niklasf, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords:

Created on 2017-03-10 10:53 by niklasf, last changed 2017-04-22 21:57 by vstinner. This issue is now closed.

File name Uploaded Description Edit vstinner, 2017-03-27 15:36
Pull Requests
URL Status Linked Edit
PR 594 closed niklasf, 2017-03-10 10:58
Messages (16)
msg289353 - (view) Author: Niklas Fiekas (niklasf) * Date: 2017-03-10 10:53
Baseline performance (9e6ac83acae):
$ ./python -m timeit "12345678 == 12345678.0"
5000000 loops, best of 5: 40 nsec per loop
$ ./python -m timeit "1 == 1.0"
10000000 loops, best of 5: 38.8 nsec per loop
$ ./python -m timeit "(1234578987654321).bit_length()"
10000000 loops, best of 5: 39.4 nsec per loop

Upcoming PR:

$ ./python -m timeit "12345678 == 12345678.0"
10000000 loops, best of 5: 34.3 nsec per loop
$ ./python -m timeit "1 == 1.0"
10000000 loops, best of 5: 34.4 nsec per loop
$ ./python -m timeit "(1234578987654321).bit_length()"
10000000 loops, best of 5: 36.4 nsec per loop
msg289400 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2017-03-10 18:25
Thanks for the idea, and the PR!

To be useful, this would need a bit of tweaking: we can't assume that a digit is always `unsigned long` (in fact, I'd expect that to be rare on 64-bit non-Windows systems, where `unsigned long` typically has 64 bits, and `digit` should be using a 32-bit type), so we'd need to identify and use the most appropriate variant of clz/clzl/clzll somehow.

I think it would also make sense to move the detection of the existence of `__builtin_clz` and friends into the configuration machinery: these builtins aren't just restricted to GCC (clang supports them as well), and that would allow other platforms to provide their own substitutes by modifying `pyport.h`. Ideally, the configuration machinery #defines a HAVE_CLZ variable, and then in longobject.c we do an #ifdef HAVE_CLZ ...

We also need to trade-off the additional complication in the implementation against the benefits: though we do certainly care about the speed, speed at all costs isn't the target here; readability, portability and maintainability of the source counts, too. It'll probably be easier to weigh those factors once there's an implementation that addresses the above points.
msg289401 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-10 18:52
I think we can assume that digit is no larger than unsigned long, otherwise PyLong_AsLong() and like wouldn't work.
msg289404 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2017-03-10 19:49
True enough. It looks as though someone (*cough*) already documented that restriction, too:
msg289406 - (view) Author: Niklas Fiekas (niklasf) * Date: 2017-03-10 20:09
Thanks for the review.

I pushed a change to check if clz can be used (`sizeof(digit) <= sizeof(unsigned int)`). Otherwise use clzl. I believe the latter should be the most common, since unsigned long has 32bits. As you say unsigned long long should never be needed.

Btw. mathmodule.c currently duplicates the function: It might be worth factoring it out, but I don't know where to put it.
msg289407 - (view) Author: Niklas Fiekas (niklasf) * Date: 2017-03-10 20:12
Oops. Actually clz should commonly be enough. And platforms where clz and clzl are different (<=> unsigned int and unsigned long are different) should be rare.
msg290620 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-27 15:36
The change is an optimization, so it requires a benchmark, think you provided. Ok. But the speedup is a few nanoseconds on a function which takes currently 40 nanoseconds. It's not really what I would call significant:

$ ./python -m perf compare_to ref.json patch.json  --table
| Benchmark           | ref     | patch                       |
| int==float 8 digits | 39.2 ns | 37.9 ns: 1.03x faster (-3%) |
| int==float 1 digit  | 38.0 ns | 37.9 ns: 1.00x faster (-0%) |
| int.bit_length()    | 42.0 ns | 41.2 ns: 1.02x faster (-2%) |

(See attached script.)

So I'm not really convinced that the change is useful. Is bit_length() used in hot loops?
msg290621 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-27 15:41
> $ ./python -m timeit "12345678 == 12345678.0"
> 5000000 loops, best of 5: 40 nsec per loop

By the way, I check the bytecode to make sure that the compiler doesn't optimize that. I'm suprised that it's not replaced with True!

Is there a reason to perform the test at runtime?
msg290622 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-27 15:45
There are not good reasons to optimize this case at compile time. This is very obscure way of writing True.
msg290973 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-04-01 09:26
I don't think a 3% improvement on a micro-benchmark is worth it (this will translate into 0% improvement on real-world workloads).  Are there other uses of _Py_bit_length() that show bigger benefits?
msg292090 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-21 23:58
I concur with Antoine and suggest to reject this issue.
msg292105 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-22 05:32
In some microbenchmarks this can give up to 15%.

$ ./python.patched -m perf timeit -q --compare-to=./python.default -s "a = list(map(float, range(10000)))" "12345 in a"
Mean +- std dev: [python.default] 1.28 ms +- 0.11 ms -> [python.patched] 1.12 ms +- 0.07 ms: 1.15x faster (-13%)
msg292106 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-22 05:40
But even 15% speed up in particular microbenchmarks looks too small to me for such complex change.
msg292114 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2017-04-22 12:14
msg292115 - (view) Author: Louie Lu (louielu) * Date: 2017-04-22 12:19
If this issue is closed by "not a big performance improvement", maybe the XXX in mathmoudle.c should be take off?

/* XXX: This routine does more or less the same thing as
 * bits_in_digit() in Objects/longobject.c.  Someday it would be nice to
 * consolidate them.  On BSD, there's a library function called fls()
 * that we could use, and GCC provides __builtin_clz().
msg292139 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-04-22 21:57
I'm in favor of documenting in comments decisions to not micro-optimize
such code. I did something similar in ceval.c for 1+1.
Date User Action Args
2017-04-22 21:57:14vstinnersetmessages: + msg292139
2017-04-22 12:19:45louielusetnosy: + louielu
messages: + msg292115
2017-04-22 12:14:11mark.dickinsonsetstatus: open -> closed
resolution: rejected
messages: + msg292114

stage: resolved
2017-04-22 05:40:58serhiy.storchakasetmessages: + msg292106
2017-04-22 05:32:49serhiy.storchakasetmessages: + msg292105
2017-04-21 23:58:49vstinnersetmessages: + msg292090
2017-04-01 09:26:05pitrousetnosy: + pitrou
messages: + msg290973
2017-03-27 15:45:40serhiy.storchakasetmessages: + msg290622
2017-03-27 15:41:13vstinnersetmessages: + msg290621
2017-03-27 15:36:58vstinnersetfiles: +
nosy: + vstinner
messages: + msg290620

2017-03-10 20:12:03niklasfsetmessages: + msg289407
2017-03-10 20:09:01niklasfsetmessages: + msg289406
2017-03-10 19:49:59mark.dickinsonsetmessages: + msg289404
2017-03-10 18:52:27serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg289401
2017-03-10 18:25:57mark.dickinsonsetmessages: + msg289400
2017-03-10 11:10:48xiang.zhangsetnosy: + mark.dickinson
2017-03-10 10:58:59niklasfsetpull_requests: + pull_request490
2017-03-10 10:53:57niklasfcreate