BLAKE2: the (pure) SSE2 impl forced on x86_64 is slower than reference #76015

Closed
mgorny mannequin opened this issue Oct 21, 2017 · 6 comments
Labels
extension-modules C modules in the Modules dir performance Performance or resource usage

Comments

@mgorny
Mannequin

mgorny mannequin commented Oct 21, 2017

BPO 31834
Nosy @vstinner, @tiran, @benjaminp, @mgorny
PRs
  • bpo-31834: Use optimized code for BLAKE2 only with SSSE3+ #4066
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-11-20.07:32:35.182>
    created_at = <Date 2017-10-21.07:57:11.872>
    labels = ['extension-modules', 'performance']
    title = 'BLAKE2: the (pure) SSE2 impl forced on x86_64 is slower than reference'
    updated_at = <Date 2017-11-20.07:32:35.181>
    user = 'https://github.com/mgorny'

    bugs.python.org fields:

    activity = <Date 2017-11-20.07:32:35.181>
    actor = 'christian.heimes'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-11-20.07:32:35.182>
    closer = 'christian.heimes'
    components = ['Extension Modules']
    creation = <Date 2017-10-21.07:57:11.872>
    creator = 'mgorny'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31834
    keywords = ['patch']
    message_count = 6.0
    messages = ['304696', '304865', '304867', '304870', '304911', '305808']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'christian.heimes', 'benjamin.peterson', 'mgorny']
    pr_nums = ['4066']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue31834'
    versions = ['Python 3.6']

    @mgorny
    Mannequin Author

    mgorny mannequin commented Oct 21, 2017

    The setup.py file for Python states:

        if (not cross_compiling and
                os.uname().machine == "x86_64" and
                sys.maxsize >  2**32):
            # Every x86_64 machine has at least SSE2.  Check for sys.maxsize
            # in case that kernel is 64-bit but userspace is 32-bit.
            blake2_macros.append(('BLAKE2_USE_SSE', '1'))
    

    While the assertion about having SSE2 is true, it doesn't mean that SSE2 is worthwhile to use. I've tested the pure SSE2 implementation (i.e. without SSSE3 and so on) on three different machines, getting the following results:

    Athlon64 X2 (SSE2 is the best supported variant), 540 MiB of data:

    SSE2: [5.189988004000043, 5.070812243997352]
    ref: [2.0161159170020255, 2.0475422790041193]

    Core i3, same data file:

    SSE2: [1.924425926999902, 1.92461746999993, 1.9298037500000191]
    ref: [1.7940209749999667, 1.7900855569999976, 1.7835538760000418]

    Xeon E5630 server, 230 MiB data file:

    SSE2: [0.7671358410007088, 0.7797677099879365, 0.7648976119962754]
    ref: [0.5784736709902063, 0.5717909929953748, 0.5717219939979259]

    So in all the tested cases, the pure SSE2 implementation is *slower* than the reference implementation. The SSSE3 and other variants are faster and, AFAIU, they are enabled automatically based on CFLAGS, so this doesn't matter for most systems.

    However, for old CPUs that do not support SSSE3, the choice of SSE2 makes the algorithm prohibitively slow -- it's 2.5 times slower than the reference implementation!
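
The report does not say how the timings above were collected. The list-of-floats format resembles the output of `timeit.repeat`, so a minimal sketch of such a measurement, assuming that approach, might look like this. The function name `time_blake2b` and the zero-filled payload are illustrative and not taken from the report.

```python
import hashlib
import timeit


def time_blake2b(data, repeat=3):
    """Return a list of wall-clock timings (seconds) for hashing `data`."""
    # number=1: each measurement hashes the whole buffer exactly once.
    return timeit.repeat(
        lambda: hashlib.blake2b(data).digest(),
        repeat=repeat,
        number=1,
    )


if __name__ == "__main__":
    # 16 MiB of zero bytes stands in for the data files used in the report.
    payload = bytes(16 * 1024 * 1024)
    print(time_blake2b(payload))
```

To compare the two implementations this way, the same script would have to be run against two builds of Python, one with `BLAKE2_USE_SSE` defined and one without.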

    @mgorny mgorny mannequin added extension-modules C modules in the Modules dir performance Performance or resource usage labels Oct 21, 2017
    @benjaminp
    Contributor

    New changeset 1aa00ff by Benjamin Peterson (Michał Górny) in branch 'master':
    fixes bpo-31834: Use optimized code for BLAKE2 only with SSSE3+ (#4066)
    1aa00ff

    @tiran
    Member

    tiran commented Oct 24, 2017

    I'm pretty sure that your PR has disabled all SSE optimizations. AFAIK gcc does not enable SSE3 and SSE4 on X86_64 by default.

    $ gcc -dM -E - < /dev/null | grep SSE
    #define __SSE2_MATH__ 1
    #define __SSE_MATH__ 1
    #define __SSE2__ 1
    #define __SSE__ 1

    You have to set a compiler flag like -msse4:

    $ gcc -msse4 -dM -E - < /dev/null | grep SSE
    #define __SSE4_1__ 1
    #define __SSE4_2__ 1
    #define __SSE2_MATH__ 1
    #define __SSE_MATH__ 1
    #define __SSE2__ 1
    #define __SSSE3__ 1
    #define __SSE__ 1
    #define __SSE3__ 1
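
The same check can be scripted, for example from a build script. The following is only a sketch: `parse_sse_macros` and `compiler_sse_macros` are hypothetical helper names, and the second one assumes a `gcc`-compatible compiler on `PATH`.

```python
import subprocess


def parse_sse_macros(dump):
    """Return the SSE-related macro names found in `gcc -dM -E` output."""
    names = set()
    for line in dump.splitlines():
        parts = line.split()
        # Lines look like: "#define __SSE2__ 1"
        if len(parts) >= 2 and parts[0] == "#define" and "SSE" in parts[1]:
            names.add(parts[1])
    return names


def compiler_sse_macros(cc="gcc", flags=()):
    """Run the equivalent of `cc <flags> -dM -E - < /dev/null`."""
    result = subprocess.run(
        [cc, *flags, "-dM", "-E", "-"],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sse_macros(result.stdout)
```

With the two dumps shown above, `"__SSSE3__" in parse_sse_macros(...)` is false for the default invocation and true for the `-msse4` one.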

    @tiran tiran reopened this Oct 24, 2017
    @vstinner
    Member

    > AFAIK gcc does not enable SSE3 and SSE4 on X86_64 by default.

    Linux now supports multiple variants of the same function, one variant per CPU type; the binding is done when the library is loaded. But I don't know how to implement that :-(

    GCC, for example, has the target_clones function attribute: target_clones("default,sse4.1") compiles a function twice, once for a generic CPU and once for SSE4.1.

    See also ifunc: "indirect function", "CPU dispatch" or "function resolver".

    @benjaminp
    Contributor

    On Tue, Oct 24, 2017, at 00:25, Christian Heimes wrote:

    > Christian Heimes <lists@cheimes.de> added the comment:
    >
    > I'm pretty sure that your PR has disabled all SSE optimizations. AFAIK
    > gcc does not enable SSE3 and SSE4 on X86_64 by default.
    >
    > $ gcc -dM -E - < /dev/null | grep SSE
    > #define __SSE2_MATH__ 1
    > #define __SSE_MATH__ 1
    > #define __SSE2__ 1
    > #define __SSE__ 1

    Before this patch, this would cause blake2b.c to use the slow SSE2-only
    instructions, though, right?

    It seems to me this represents either an improvement or the status quo in
    all cases. With no extra GCC flags, the reference implementation is used
    rather than a slow SSE2 implementation. If extra -m flags are in CFLAGS,
    the fastest implementation for the target is used.
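
The "improvement or status quo" argument can be checked with a small model of the two selection policies as they are described in this thread. This is a sketch, not the actual build or preprocessor logic: the function names are hypothetical, and the exact set of macros that enables the optimized code (`SSSE3_PLUS`) is an assumption.

```python
# Assumed set of macros that select an SSSE3-or-better code path.
SSSE3_PLUS = {"__SSSE3__", "__SSE4_1__", "__AVX__", "__XOP__"}


def impl_before_patch(macros):
    """Old policy: SSE code is forced on x86_64, even with SSE2 only."""
    if macros & SSSE3_PLUS:
        return "ssse3+"
    if "__SSE2__" in macros:
        return "sse2"  # the slow path measured in the report
    return "reference"


def impl_after_patch(macros):
    """New policy: optimized code only with SSSE3 or better."""
    return "ssse3+" if macros & SSSE3_PLUS else "reference"


if __name__ == "__main__":
    default = {"__SSE__", "__SSE2__", "__SSE_MATH__", "__SSE2_MATH__"}
    with_msse4 = default | {"__SSE3__", "__SSSE3__", "__SSE4_1__", "__SSE4_2__"}
    for name, macros in [("default", default), ("-msse4", with_msse4)]:
        print(name, impl_before_patch(macros), "->", impl_after_patch(macros))
```

Under this model, only the plain SSE2 case changes, from the slow SSE2 path to the reference implementation; builds with extra -m flags select the same path as before.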

    @benjaminp
    Contributor

    @tiran, can we close this again?

    @tiran tiran closed this as completed Nov 20, 2017
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022