Message 409360 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	mark.dickinson
Recipients	PedanticHacker, Stefan Pochmann, mark.dickinson, mcognetta, rhettinger, serhiy.storchaka, tim.peters
Date	2021-12-30.11:24:05
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1640863445.78.0.928280710492.issue37295@roundup.psfhosted.org>
In-reply-to

Content
> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may well depend on platform. I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using timeit to time calls to a function that looked like this: def f(comb): for k in range(68): for _ in range(256): comb(k, 67) comb(k, 67) ... # 64 repetitions of comb(k, 67) in all Based on 200 timings of this script with each of the popcount approach and the uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount approach won, but just barely, at around 1.3% faster. The result was statistically significant (SciPy gave me a result of Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)). Interestingly, the default build on macOS/Intel is _not_ using the dedicated POPCNT instruction that arrived with the Nehalem architecture, presumably because it wants to produce builds that will still be useable on pre-Nehalem machines. It uses Clang's __builtin_popcount, but that gets translated to the same SIMD-within-a-register approach that we have already in pycore_bitutils.h. If I recompile with -msse4.2, then the POPCNT instruction is used, and I get an even more marginal improvement: a 1.7% speedup over the lookup-table-based version.

> So which of xor-popcount and add-up-up-trailing-zero-counts is faster may well depend on platform.

I ran some timings for comb(k, 67) on my macOS / Intel MacBook Pro, using timeit to time calls to a function that looked like this:

def f(comb):
    for k in range(68):
        for _ in range(256):
            comb(k, 67)
            comb(k, 67)
            ... # 64 repetitions of comb(k, 67) in all

Based on 200 timings of this script with each of the popcount approach and the uint8_t-table-of-trailing-zero-counts approach (interleaved), the popcount approach won, but just barely, at around 1.3% faster. The result was statistically significant (SciPy gave me a result of Ttest_indResult(statistic=19.929941828072433, pvalue=8.570975609117687e-62)).

Interestingly, the default build on macOS/Intel is _not_ using the dedicated POPCNT instruction that arrived with the Nehalem architecture, presumably because it wants to produce builds that will still be useable on pre-Nehalem machines. It uses Clang's __builtin_popcount, but that gets translated to the same SIMD-within-a-register approach that we have already in pycore_bitutils.h.

If I recompile with -msse4.2, then the POPCNT instruction *is* used, and I get an even more marginal improvement: a 1.7% speedup over the lookup-table-based version.

History
Date	User	Action	Args
2021-12-30 11:24:05	mark.dickinson	set	recipients: + mark.dickinson, tim.peters, rhettinger, serhiy.storchaka, PedanticHacker, mcognetta, Stefan Pochmann
2021-12-30 11:24:05	mark.dickinson	set	messageid: <1640863445.78.0.928280710492.issue37295@roundup.psfhosted.org>
2021-12-30 11:24:05	mark.dickinson	link	issue37295 messages
2021-12-30 11:24:05	mark.dickinson	create