Use calloc() instead of malloc() for int << int (lshift) #65618

Closed
vstinner opened this issue May 2, 2014 · 12 comments
Labels
performance Performance or resource usage

Comments


vstinner commented May 2, 2014

BPO 21419
Nosy @pitrou, @vstinner, @MojoVampire
Files
  • long_lshift.patch
  • bench_long_rshift.py
  • long_lshift2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-05-12.20:48:01.569>
    created_at = <Date 2014-05-02.20:51:47.608>
    labels = ['performance']
    title = 'Use calloc() instead of malloc() for int << int (lshift)'
    updated_at = <Date 2014-05-12.20:48:01.568>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2014-05-12.20:48:01.568>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-05-12.20:48:01.569>
    closer = 'vstinner'
    components = []
    creation = <Date 2014-05-02.20:51:47.608>
    creator = 'vstinner'
    dependencies = []
    files = ['35135', '35136', '35137']
    hgrepos = []
    issue_num = 21419
    keywords = ['patch']
    message_count = 12.0
    messages = ['217787', '217788', '217789', '217790', '217791', '217792', '217793', '217795', '217796', '217798', '217799', '218361']
    nosy_count = 3.0
    nosy_names = ['pitrou', 'vstinner', 'josh.r']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue21419'
    versions = ['Python 3.5']


    vstinner commented May 2, 2014

    Attached patch modifies long_lshift() to allocate the result using calloc() instead of malloc(). The goal is to avoid allocating physical memory for the least significant digits (zeros). According to my tests in issue bpo-21233 (calloc), Linux and Windows have optimized calloc() functions, and at least Linux avoids allocating physical memory for zero-initialized pages until the pages are modified.

    @vstinner vstinner added the performance Performance or resource usage label May 2, 2014

    vstinner commented May 2, 2014

    bench_long_rshift.py: Microbenchmark for int << int (lshift) operation. Results on Linux:

    Common platform:
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Python unicode implementation: PEP-393
    Timer: time.perf_counter
    CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
    Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
    CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

    Platform of campaign orig:
    SCM: hg revision=5b0fda8f5718 tag=tip branch=default date="2014-05-02 22:31 +0200"
    Python version: 3.5.0a0 (default:5b0fda8f5718, May 2 2014, 22:47:06) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    Date: 2014-05-02 22:47:12
    Timer precision: 46 ns

    Platform of campaign calloc:
    Timer precision: 49 ns
    Python version: 3.5.0a0 (default:5b0fda8f5718+, May 2 2014, 22:45:58) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    Date: 2014-05-02 22:46:42
    SCM: hg revision=5b0fda8f5718+ tag=tip branch=default date="2014-05-02 22:31 +0200"

    ----------------+-------------+----------------
    Tests           |        orig |          calloc
    ----------------+-------------+----------------
    1 << (2 ** 0)   |   39 ns (*) |           38 ns
    1 << (2 ** 1)   |   38 ns (*) |           37 ns
    1 << (2 ** 3)   |   38 ns (*) |           37 ns
    1 << (2 ** 5)   |   42 ns (*) |     39 ns (-8%)
    1 << (2 ** 10)  |   43 ns (*) |    37 ns (-14%)
    1 << (2 ** 15)  |  157 ns (*) |    90 ns (-43%)
    1 << (2 ** 20)  | 3.91 us (*) |    74 ns (-98%)
    1 << (2 ** 25)  |  206 us (*) |   86 ns (-100%)
    1 << (2 ** 28)  | 4.06 ms (*) |  4.1 us (-100%)
    123 << 1        |   11 ns (*) |           12 ns
    12345678 << 1   |   11 ns (*) |           12 ns
    12345678 << 100 |   11 ns (*) |           11 ns
    ----------------+-------------+----------------
    Total           | 4.27 ms (*) | 4.57 us (-100%)
    ----------------+-------------+----------------


    MojoVampire mannequin commented May 2, 2014

    Looks like you forgot to actually use the use_calloc parameter you put in the long_alloc prototype. It accepts it, but it's never used or passed to another function. So right now, the only difference in behavior is that there is an extra layer of function calls involved in _PyLong_New.

    On that note, might it make sense to change all other calls to _PyLong_New within longobject.c to use long_alloc(size, 0), to at least let long objects themselves avoid the overhead of the extra function call? Maybe I'm overestimating how much of a difference an extra C-level function call makes, but it seems like PyLongs make new PyLongs an awful lot.


    MojoVampire mannequin commented May 2, 2014

    Given your benchmarks show improvements (you posted while I was typing my last comment), I'm guessing it's just the posted patch that's wrong, and your local changes actually use use_calloc?


    vstinner commented May 2, 2014

    Without the patch, 1 << (2**29) allocates 69.9 MB. With the patch, 1 << (2**29) allocates 0.1 MB (104 KB).

    Without the patch,

    $ ./python
    Python 3.5.0a0 (default:5b0fda8f5718, May  2 2014, 22:47:06) 
    [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux
    >>> import os
    >>> os.system("grep RSS /proc/%s/status" % os.getpid())
    VmRSS:	    6164 kB
    >>> x=1 << (2**29)
    >>> os.system("grep RSS /proc/%s/status" % os.getpid())
    VmRSS:	   76064 kB

    With the patch:

    haypo@selma$ ./python
    Python 3.5.0a0 (default:5b0fda8f5718+, May  2 2014, 22:55:47) 
    [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.system("grep RSS /proc/%s/status" % os.getpid())
    VmRSS:	    5864 kB
    >>> x=1 << (2**29)
    >>> os.system("grep RSS /proc/%s/status" % os.getpid())
    VmRSS:	    5968 kB


    vstinner commented May 2, 2014

    Looks like you forgot to actually use the use_calloc parameter you put in the long_alloc prototype. It accepts it, but it's never used or passed to another function.

    Oh f###, you're right. See new patch.

    I ran the new benchmark again: my (updated) patch makes most operations slower, except for 1 << (2 ** 28). For 1 << (2 ** 28) it's much faster, but is it worth modifying long_lshift() for this rare use case?

    Updated benchmark:

    Common platform:
    Timer: time.perf_counter
    Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
    Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
    CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
    Python unicode implementation: PEP-393
    Bits: int=32, long=64, long long=64, size_t=64, void*=64
    CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes

    Platform of campaign orig:
    Python version: 3.5.0a0 (default:5b0fda8f5718, May 2 2014, 22:47:06) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    SCM: hg revision=5b0fda8f5718 tag=tip branch=default date="2014-05-02 22:31 +0200"
    Timer precision: 46 ns
    Date: 2014-05-02 22:47:12

    Platform of campaign calloc2:
    Python version: 3.5.0a0 (default:5b0fda8f5718+, May 2 2014, 23:04:59) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
    Date: 2014-05-02 23:05:23
    SCM: hg revision=5b0fda8f5718+ tag=tip branch=default date="2014-05-02 22:31 +0200"
    Timer precision: 45 ns

    ----------------+-------------+----------------
    Tests           |        orig |         calloc2
    ----------------+-------------+----------------
    1 << (2 ** 0)   |   39 ns (*) |    50 ns (+30%)
    1 << (2 ** 1)   |   38 ns (*) |    47 ns (+22%)
    1 << (2 ** 3)   |   38 ns (*) |    48 ns (+26%)
    1 << (2 ** 5)   |   42 ns (*) |    50 ns (+17%)
    1 << (2 ** 10)  |   43 ns (*) |    53 ns (+23%)
    1 << (2 ** 15)  |  157 ns (*) |   172 ns (+10%)
    1 << (2 ** 20)  | 3.91 us (*) |         3.95 us
    1 << (2 ** 25)  |  206 us (*) |    218 us (+5%)
    1 << (2 ** 28)  | 4.06 ms (*) | 4.29 us (-100%)
    123 << 1        |   11 ns (*) |           11 ns
    12345678 << 1   |   11 ns (*) |           11 ns
    12345678 << 100 |   11 ns (*) |           11 ns
    ----------------+-------------+----------------
    Total           | 4.27 ms (*) |   226 us (-95%)
    ----------------+-------------+----------------


    vstinner commented May 2, 2014

    Without the patch, 1 << (2**29) allocates 69.9 MB. With the patch, 1 << (2**29) allocates 0.1 MB (104 KB).

    This is still true with long_lshift2.patch (except that in my new test, it allocates 196 kB).


    MojoVampire mannequin commented May 2, 2014

    While you're doing this, might it make sense to add a special case to long_pow so it identifies cases where a (digit-sized) value with an absolute value equal to a power of 2 is being raised to a positive integer exponent, and convert said cases to equivalent left shifts?

    Identifying powers of 2 can be done in constant time with no memory overhead for register sized values (see http://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2 ); if you've already handled the case for 0, then checking if it's a power of 2 is just:

    (v & (v - 1)) == 0
    

    It adds a little complexity, but even just handling 2 alone specially would at least make it so the semi-common case of making a large power of 2 doesn't have significantly different performance using 2 ** (numbits - 1) instead of 1 << (numbits - 1).


    MojoVampire mannequin commented May 2, 2014

    I swear, I need to refresh before I post a long comment.

    If this is slowing everything down a little just to make 1 << (2 ** 29) faster (and did you really mean 1 << (1 << 29) ? :-) ), then I'd say drop it.


    MojoVampire mannequin commented May 2, 2014

    One possible way to salvage it: Have you considered declaring long_alloc as Py_LOCAL_INLINE, or, now that I've checked bpo-5553, a macro for long_alloc, so it gets inlined, and doesn't add the check overhead to everything (since properly inlining with a constant argument to use_calloc should make the optimizer trim the unused code path at compile time)?

    Probably not very PEP-7 friendly to macro-ize something that long I'm guessing...


    MojoVampire mannequin commented May 2, 2014

    And now that I'm thinking about it, the probable cause of the slowdown is that, for any PyLongObject smaller than PAGESIZE (give or take), using Calloc means calloc simply memsets the whole allocation to zero, including the PyLongObject header and the "used" high-order digits, along with everything else. The original Malloc code made a single pass over the memory, initializing each byte in order and writing it exactly once; the Calloc approach makes two passes, and the second one isn't wholly sequential, so it may be getting bitten by the CPU cache not keeping up.

    vstinner commented May 12, 2014

    The optimization of 2**n looks to be useful only for very large values of n, where the result is larger than 1 MB. This use case is very rare, and you should probably use another library (GMP, NumPy, or something else) for such large numbers. I'm closing the issue.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022