Author mgorny
Recipients mgorny
Date 2017-10-21.07:57:10
Message-id <1508572631.92.0.213398074469.issue31834@psf.upfronthosting.co.za>
Content
The setup.py file for Python states:

        if (not cross_compiling and
                os.uname().machine == "x86_64" and
                sys.maxsize >  2**32):
            # Every x86_64 machine has at least SSE2.  Check for sys.maxsize
            # in case that kernel is 64-bit but userspace is 32-bit.
            blake2_macros.append(('BLAKE2_USE_SSE', '1'))

While the assertion about having SSE2 is true, that doesn't mean SSE2 is worthwhile to use. I've tested the pure SSE2 implementation (i.e. without SSSE3 and so on) against the reference implementation on three different machines, getting the following results:

Athlon64 X2 (SSE2 is the best variant this CPU supports), 540 MiB of data:

SSE2: [5.189988004000043, 5.070812243997352]
ref:  [2.0161159170020255, 2.0475422790041193]

Core i3, same data file:

SSE2: [1.924425926999902, 1.92461746999993, 1.9298037500000191]
ref:  [1.7940209749999667, 1.7900855569999976, 1.7835538760000418]

Xeon E5630 server, 230 MiB data file:

SSE2: [0.7671358410007088, 0.7797677099879365, 0.7648976119962754]
ref:  [0.5784736709902063, 0.5717909929953748, 0.5717219939979259]

So in all the tested cases, the pure SSE2 implementation is *slower* than the reference implementation. SSSE3 and other variants are faster, and AFAIU they are enabled automatically based on CFLAGS, so this doesn't matter on most systems.
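
For anyone who wants to reproduce this kind of comparison, timings like the ones above could be collected roughly as follows. This is only a sketch: the helper name `time_blake2b` and the small in-memory buffer are made up for illustration, and it is not the script that produced the numbers in this report. To compare SSE2 against ref, the same script would have to be run on two separate Python builds, one with BLAKE2_USE_SSE defined and one without.

```python
import hashlib
import timeit


def time_blake2b(data, repeat=3):
    """Return `repeat` wall-clock timings (in seconds) for hashing `data` once.

    Hypothetical helper for illustration only.
    """
    return timeit.repeat(
        lambda: hashlib.blake2b(data).digest(),
        number=1,       # hash the whole buffer once per measurement
        repeat=repeat,  # take several measurements to see the spread
    )


# Assumption: a 1 MiB zero-filled buffer stands in for the real data files,
# which were 230-540 MiB in the measurements reported above.
sample = bytes(1024 * 1024)
results = time_blake2b(sample)
print(results)
```

The absolute numbers depend on the machine and on the buffer size, so only the ratio between the two builds on the same machine is meaningful.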

However, for old CPUs that do not support SSSE3, the choice of SSE2 makes the algorithm prohibitively slow: on the Athlon64 X2 it is about 2.5 times slower than the reference implementation!
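
The slowdown factors can be checked directly from the timings quoted above. The snippet below is a small sketch that divides the best SSE2 time by the best ref time for each machine; taking the minimum of each run is my choice here, not something stated in the measurements, and the `slowdown` helper is hypothetical.

```python
# Timings copied from the results quoted in this report.
timings = {
    "Athlon64 X2": {
        "SSE2": [5.189988004000043, 5.070812243997352],
        "ref": [2.0161159170020255, 2.0475422790041193],
    },
    "Core i3": {
        "SSE2": [1.924425926999902, 1.92461746999993, 1.9298037500000191],
        "ref": [1.7940209749999667, 1.7900855569999976, 1.7835538760000418],
    },
    "Xeon E5630": {
        "SSE2": [0.7671358410007088, 0.7797677099879365, 0.7648976119962754],
        "ref": [0.5784736709902063, 0.5717909929953748, 0.5717219939979259],
    },
}


def slowdown(sse2, ref):
    """Ratio of best SSE2 time to best ref time (> 1 means SSE2 is slower)."""
    return min(sse2) / min(ref)


ratios = {cpu: slowdown(t["SSE2"], t["ref"]) for cpu, t in timings.items()}
for cpu, ratio in ratios.items():
    print(f"{cpu}: SSE2 takes {ratio:.2f}x the time of ref")
```

Every ratio comes out above 1, with the Athlon64 X2 showing by far the largest gap.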
History
Date                 User    Action  Args
2017-10-21 07:57:11  mgorny  set     recipients: + mgorny
2017-10-21 07:57:11  mgorny  set     messageid: <1508572631.92.0.213398074469.issue31834@psf.upfronthosting.co.za>
2017-10-21 07:57:11  mgorny  link    issue31834 messages
2017-10-21 07:57:10  mgorny  create