Message409360
> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may well depend on platform.
I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using timeit to time calls to a function that looked like this:
def f(comb):
    for k in range(68):
        for _ in range(256):
            comb(k, 67)
            comb(k, 67)
            ...  # 64 repetitions of comb(k, 67) in all
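The driver around that was timeit; schematically it was along these lines (a sketch only -- the repeat and number values below are placeholders, not the exact ones from my runs):

import timeit
from math import comb

# Time many calls of f(comb) and take the best of several repeats to
# reduce noise; `repeat` and `number` here are illustrative placeholders.
best = min(timeit.repeat(lambda: f(comb), repeat=5, number=10))
print(best)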
Based on 200 timings of this script with each of the popcount approach and the uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount approach won, but just barely, at around 1.3% faster. The result was statistically significant (SciPy gave me a result of Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)).
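The comparison itself was just SciPy's two-sample t-test; as a sketch, with the two lists of 200 measured timings passed in as arguments (the lists themselves aren't reproduced here):

from scipy import stats

def compare(popcount_times, table_times):
    # Independent two-sample t-test between the timings for the popcount
    # build and the timings for the lookup-table build.
    return stats.ttest_ind(popcount_times, table_times)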
Interestingly, the default build on macOS/Intel is _not_ using the dedicated POPCNT instruction that arrived with the Nehalem architecture, presumably because it wants to produce builds that will still be usable on pre-Nehalem machines. It uses Clang's __builtin_popcount, but that gets translated to the same SIMD-within-a-register approach that we have already in pycore_bitutils.h.
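For reference, that SIMD-within-a-register trick is the classic parallel bit count; transliterated into Python for a 32-bit value it looks roughly like this (a sketch of the technique, not a copy of the C code in pycore_bitutils.h):

def popcount32(x):
    # Fold pairs of bits, then 4-bit groups, then use a multiply to sum the
    # per-byte counts into the top byte.
    x = x - ((x >> 1) & 0x55555555)
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F
    return ((x * 0x01010101) & 0xFFFFFFFF) >> 24

assert popcount32(0b101101) == 4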
If I recompile with -msse4.2, then the POPCNT instruction *is* used, and the improvement grows, but only marginally: a 1.7% speedup over the lookup-table-based version.
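And for completeness, the general shape of the uint8_t-table-of-trailing-zero-counts idea, again as a Python sketch rather than the actual patch (the real thing uses a static uint8_t table in C, and the names here are my own):

# Trailing-zero counts for each byte value; entry 0 is never consulted.
TRAILING_ZEROS = [0] * 256
for i in range(1, 256):
    TRAILING_ZEROS[i] = (i & -i).bit_length() - 1

def trailing_zero_count(n):
    # Count trailing zero bits of a positive integer one byte at a time,
    # finishing with a single table lookup.
    count = 0
    while n & 0xFF == 0:
        count += 8
        n >>= 8
    return count + TRAILING_ZEROS[n & 0xFF]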