Author mark.dickinson
Recipients PedanticHacker, Stefan Pochmann, mark.dickinson, mcognetta, rhettinger, serhiy.storchaka, tim.peters
Date 2021-12-30.11:24:05
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1640863445.78.0.928280710492.issue37295@roundup.psfhosted.org>
In-reply-to
Content
> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may well depend on platform.

I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using timeit to time calls to a function that looked like this:

def f(comb):
    for k in range(68):
        for _ in range(256):
            comb(k, 67)
            comb(k, 67)
            ... # 64 repetitions of comb(k, 67) in all

Based on 200 timings of this script with each of the popcount approach and the uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount approach won, but just barely, at around 1.3% faster. The result was statistically significant (SciPy gave me a result of Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)).

Interestingly, the default build on macOS/Intel is _not_ using the dedicated POPCNT instruction that arrived with the Nehalem architecture, presumably because it wants to produce builds that will still be useable on pre-Nehalem machines. It uses Clang's __builtin_popcount, but that gets translated to the same SIMD-within-a-register approach that we have already in pycore_bitutils.h.

If I recompile with -msse4.2, then the POPCNT instruction *is* used, and I get an even more marginal improvement: a 1.7% speedup over the lookup-table-based version.
History
Date User Action Args
2021-12-30 11:24:05mark.dickinsonsetrecipients: + mark.dickinson, tim.peters, rhettinger, serhiy.storchaka, PedanticHacker, mcognetta, Stefan Pochmann
2021-12-30 11:24:05mark.dickinsonsetmessageid: <1640863445.78.0.928280710492.issue37295@roundup.psfhosted.org>
2021-12-30 11:24:05mark.dickinsonlinkissue37295 messages
2021-12-30 11:24:05mark.dickinsoncreate