Issue 31834: BLAKE2: the (pure) SSE2 impl forced on x86_64 is slower than reference


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/76015

classification

Title:	BLAKE2: the (pure) SSE2 impl forced on x86_64 is slower than reference
Type:	performance	Stage:	resolved
Components:	Extension Modules	Versions:	Python 3.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, christian.heimes, mgorny, vstinner
Priority:	normal	Keywords:	patch

Created on 2017-10-21 07:57 by mgorny, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 4066	merged	mgorny, 2017-10-21 09:18

Messages (6)
msg304696 - (view)	Author: Michał Górny (mgorny) *	Date: 2017-10-21 07:57
The setup.py file for Python states:

    if (not cross_compiling and
            os.uname().machine == "x86_64" and
            sys.maxsize > 2**32):
        # Every x86_64 machine has at least SSE2.  Check for sys.maxsize
        # in case that kernel is 64-bit but userspace is 32-bit.
        blake2_macros.append(('BLAKE2_USE_SSE', '1'))

While the assertion about having SSE2 is true, it doesn't mean that SSE2 is worthwhile to use. I've tested the pure SSE2 implementation (i.e. without SSSE3 and so on) on three different machines, getting the following results:

Athlon64 X2 (SSE2 is the best supported variant), 540 MiB of data:

    SSE2: [5.189988004000043, 5.070812243997352]
    ref:  [2.0161159170020255, 2.0475422790041193]

Core i3, same data file:

    SSE2: [1.924425926999902, 1.92461746999993, 1.9298037500000191]
    ref:  [1.7940209749999667, 1.7900855569999976, 1.7835538760000418]

Xeon E5630 server, 230 MiB data file:

    SSE2: [0.7671358410007088, 0.7797677099879365, 0.7648976119962754]
    ref:  [0.5784736709902063, 0.5717909929953748, 0.5717219939979259]

So in all the tested cases, the pure SSE2 implementation is slower than the reference implementation. SSSE3 and other variants are faster, and AFAIU they are enabled automatically based on CFLAGS, so it doesn't matter for most systems. However, for old CPUs that do not support SSSE3, the choice of SSE2 makes the algorithm prohibitively slow -- on the Athlon64 X2 it's 2.5 times slower than the reference implementation!
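The result lists above look like repeated wall-clock timings of hashing one file. The measurement script is not shown in this report, so the following is only a minimal sketch of how such numbers could be gathered, assuming `hashlib.blake2b` and `timeit.repeat`; the helper name `time_blake2b` and the sample buffer are hypothetical. Which BLAKE2 implementation runs is decided when the interpreter is built, so comparing SSE2 against the reference code means running the same script under two separately built interpreters.

```python
import hashlib
import timeit


def time_blake2b(data, repeat=3):
    """Return `repeat` wall-clock timings, each covering one BLAKE2b pass over `data`."""
    def hash_once():
        hashlib.blake2b(data).digest()

    # number=1 so that every timing covers exactly one full pass over the buffer.
    return timeit.repeat(hash_once, number=1, repeat=repeat)


if __name__ == "__main__":
    # Hypothetical stand-in for the 540 MiB and 230 MiB data files used in the report.
    sample = bytes(16 * 1024 * 1024)
    print(time_blake2b(sample))
```

The buffer here is much smaller than the files in the report, so absolute numbers are not comparable; only the ratio between two builds on the same machine is meaningful.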
msg304865 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2017-10-24 06:54
New changeset 1aa00ff383c43335e4a5044274617dbf59bc839e by Benjamin Peterson (Michał Górny) in branch 'master':
fixes bpo-31834: Use optimized code for BLAKE2 only with SSSE3+ (#4066)
https://github.com/python/cpython/commit/1aa00ff383c43335e4a5044274617dbf59bc839e
msg304867 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2017-10-24 07:25
I'm pretty sure that your PR has disabled all SSE optimizations. AFAIK gcc does not enable SSE3 and SSE4 on X86_64 by default.

    $ gcc -dM -E - < /dev/null | grep SSE
    #define __SSE2_MATH__ 1
    #define __SSE_MATH__ 1
    #define __SSE2__ 1
    #define __SSE__ 1

You have to set a compiler flag like -msse4:

    $ gcc -msse4 -dM -E - < /dev/null | grep SSE
    #define __SSE4_1__ 1
    #define __SSE4_2__ 1
    #define __SSE2_MATH__ 1
    #define __SSE_MATH__ 1
    #define __SSE2__ 1
    #define __SSSE3__ 1
    #define __SSE__ 1
    #define __SSE3__ 1
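The same check can be scripted. The sketch below wraps the `gcc -dM -E -` invocation from this comment in Python; the function names `parse_sse_macros` and `compiler_sse_macros` are hypothetical, and it assumes a `gcc`-compatible compiler is on `PATH` and Python 3.7 or later (for `capture_output`).

```python
import subprocess


def parse_sse_macros(preprocessor_output):
    """Return the names of predefined macros containing "SSE" from `-dM -E` output."""
    names = set()
    for line in preprocessor_output.splitlines():
        parts = line.split()
        # Relevant lines look like: #define __SSE2__ 1
        if len(parts) >= 2 and parts[0] == "#define" and "SSE" in parts[1]:
            names.add(parts[1])
    return names


def compiler_sse_macros(cc="gcc", flags=()):
    """Run the preprocessor on empty input and return the SSE macros it predefines."""
    result = subprocess.run(
        [cc, *flags, "-dM", "-E", "-"],
        input="",  # equivalent to redirecting stdin from /dev/null
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sse_macros(result.stdout)
```

Going by the transcript above, `compiler_sse_macros()` would be expected to omit `__SSSE3__` on a default x86_64 gcc, while `compiler_sse_macros(flags=["-msse4"])` would include it; the actual result depends on the compiler and its configured defaults.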
msg304870 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-24 08:04
> AFAIK gcc does not enable SSE3 and SSE4 on X86_64 by default.

Linux now supports multiple variants of the same function, one variant per CPU type; the binding is done when a library is loaded. But I don't know how to implement that :-(

For example, GCC has the target_clones("sse4.1,avx") function attribute. It compiles a function multiple times, for example once for a generic CPU and once for SSE4.1.

See also ifunc: "indirect function", "CPU dispatch" or "function resolver".
msg304911 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2017-10-24 14:34
On Tue, Oct 24, 2017, at 00:25, Christian Heimes wrote:
>
> Christian Heimes <lists@cheimes.de> added the comment:
>
> I'm pretty sure that your PR has disabled all SSE optimizations. AFAIK
> gcc does not enable SSE3 and SSE4 on X86_64 by default.
>
> $ gcc -dM -E - < /dev/null | grep SSE
> #define __SSE2_MATH__ 1
> #define __SSE_MATH__ 1
> #define __SSE2__ 1
> #define __SSE__ 1

Before this patch, this would cause blake2b.c to use slow SSE2-only instructions, though, right?

It seems to me this represents an improvement or the status quo in all cases. With no extra GCC flags, the reference implementation is used rather than a slow SSE2 implementation. If extra -m flags are in CFLAGS, the fastest implementation for the target is used.
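The behaviour described in this comment can be modelled as a small decision function. This is only an illustration written in Python: the real selection happens in the C preprocessor, the name `choose_blake2_impl` and the labels it returns are hypothetical, and the set of macros treated as "SSSE3 or newer" is an assumption limited to macros that appear in the gcc output quoted earlier in this issue.

```python
# Assumption: these macros stand for "SSSE3 or newer"; the actual C sources
# may test a different or larger set.
SSSE3_OR_NEWER = {"__SSSE3__", "__SSE4_1__", "__SSE4_2__"}


def choose_blake2_impl(predefined_macros):
    """Model which BLAKE2 code path a build would pick for a set of compiler macros."""
    if SSSE3_OR_NEWER & set(predefined_macros):
        # Extra -m flags in CFLAGS enabled SSSE3 or later: use the optimized code.
        return "optimized"
    # Plain SSE2 (the x86_64 baseline) no longer selects the slow SSE2 path.
    return "reference"
```

With the default macro set from the quoted output (`__SSE__`, `__SSE2__`, `__SSE_MATH__`, `__SSE2_MATH__`) this model picks the reference implementation, and with the `-msse4` set it picks the optimized one.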
msg305808 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2017-11-08 05:51
@tiran, can we close this again?

History
Date	User	Action	Args
2022-04-11 14:58:53	admin	set	github: 76015
2017-11-20 07:32:35	christian.heimes	set	status: open -> closed resolution: fixed stage: resolved
2017-11-08 05:51:49	benjamin.peterson	set	messages: + msg305808
2017-10-24 14:34:24	benjamin.peterson	set	messages: + msg304911
2017-10-24 08:04:06	vstinner	set	nosy: + vstinner messages: + msg304870
2017-10-24 07:25:04	christian.heimes	set	status: closed -> open nosy: + christian.heimes messages: + msg304867 resolution: fixed -> (no value) stage: resolved -> (no value)
2017-10-24 06:54:21	benjamin.peterson	set	status: open -> closed nosy: + benjamin.peterson messages: + msg304865 resolution: fixed stage: patch review -> resolved
2017-10-21 09:18:49	mgorny	set	keywords: + patch stage: patch review pull_requests: + pull_request4036
2017-10-21 07:57:11	mgorny	create