Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster index range checks #72583

Closed
serhiy-storchaka opened this issue Oct 9, 2016 · 22 comments
Closed

Faster index range checks #72583

serhiy-storchaka opened this issue Oct 9, 2016 · 22 comments
Labels
3.7 (EOL) end of life extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serhiy-storchaka
Copy link
Member

BPO 28397
Nosy @rhettinger, @vstinner, @skrah, @socketpair, @serhiy-storchaka, @sir-sigurd
PRs
  • Micro-optimize list index range checks #9784
  • Files
  • check_index.patch
  • check_index.cocci
  • issue28397-2.c
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-10-11.14:25:25.607>
    created_at = <Date 2016-10-09.12:32:05.573>
    labels = ['extension-modules', 'interpreter-core', '3.7', 'performance']
    title = 'Faster index range checks'
    updated_at = <Date 2018-10-10.12:03:32.312>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2018-10-10.12:03:32.312>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-10-11.14:25:25.607>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules', 'Interpreter Core']
    creation = <Date 2016-10-09.12:32:05.573>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['45034', '45035', '45055']
    hgrepos = []
    issue_num = 28397
    keywords = ['patch']
    message_count = 22.0
    messages = ['278355', '278397', '278408', '278410', '278469', '278471', '278472', '278473', '278474', '278475', '278476', '278477', '278478', '278479', '278481', '278484', '278485', '278486', '278489', '278491', '278492', '327467']
    nosy_count = 6.0
    nosy_names = ['rhettinger', 'vstinner', 'skrah', 'socketpair', 'serhiy.storchaka', 'sir-sigurd']
    pr_nums = ['9784']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue28397'
    versions = ['Python 3.7']

    @serhiy-storchaka
    Copy link
    Member Author

    The expression for testing that some index is in half-open range from 0 to limit can be written as

    index >= 0 && index < limit
    

    or as

    (size_t)index < (size_t)limit
    

    The latter form generates simpler machine code. This is idiomatic code, it is used in many C and C++ libraries (including C++ stdlib implementations). It already is used in CPython (in deque implementation).

    Proposed patch rewrites index range checks in more efficient way. The patch was generated automatically by coccinelle script, and then manually cleaned up.

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Oct 9, 2016
    @rhettinger
    Copy link
    Contributor

    Don't change the code in the collections module. While semantically valid, the change obfuscates the code. The meaning of maxlen < 0 is that there is no maximum length. I don't want this test hidden; otherwise, I would have changed it long ago as was done elsewhere in the module where it made sense.

    Elsewhere, consider making a macro with clear name and comment (as was done with NEEDS_TRIM) in the collections module. Otherwise, you're just leaving behind constipated and tricky code with no indication of why a signed variable is being coerced to unsigned.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 10, 2016

    Serhiy, could you please take out _decimal.c?

    I thought we had been through that before. :) :) :)

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 10, 2016

    The same applies to memoryview.c and _testbuffer.c -- I doubt that there's any measurable speed benefit and the clarity is reduced (I also don't want macros there).

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 11, 2016

    In the following program, with gcc-5.3 doit() is significantly faster than doit2() in 64-bit Linux:

    ================================================================

    #include <stdint.h>
    
    int
    doit(int64_t index, int64_t nitems)
    {
        return index < 0 || index >= nitems;
    }
    
    int
    doit2(int64_t index, int64_t nitems)
    {
        return (uint64_t)index >= (uint64_t)nitems;
    }
    
    int
    main(void)
    {
        int count, i;
    
        for (i = 0; i < 1000000000; i++) {
            count += doit(832921, i);
        }
    
        return count;
    }

    ================================================================

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 11, 2016

    The difference in favor of doit() is even more pronounced with this
    loop (also sorry for the uninitialized variable, but that does not
    make a difference for benchmarking):

    =====================================

    for (i = 0; i < 10000; i++) {
            for (j = 0; j < 10000; j++) {
                count += doit(i, j);
            }
        }

    ======================================

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    I have tested. performace differs in about of two times.
    gcc -O3

    0.22 nanoseconds per comparison.
    vs
    0.58 nanoseconds per comparison.

    Does it cost a time that we spent to discuss here ?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 11, 2016

    Which version is faster in your tests?

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    Much more conveniet way is to use unsigned variables in appropriate places.

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    oh shi....doit() i.e.

    return index < 0 || index >= nitems;

    is faster!

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    Without optimisation in compiler (-O0) speed is the same.

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    -O2 -- the same speed too!
    -O3 really helps.

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    $ gcc --version
    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 11, 2016

    That matches my results as well:

    -O2 gives about the same speed, with -O3 doit() has a huge advantage.

    I'm not sure this is an optimization at all.

    @serhiy-storchaka
    Copy link
    Member Author

    In your example functions are inlined. If prohibit inlining, the second function is faster.

    $ gcc -O3 -o issue28397 issue28397-2.c 
    $ time ./issue28397 0

    real 0m8.097s
    user 0m7.992s
    sys 0m0.012s
    $ time ./issue28397 1

    real 0m5.467s
    user 0m5.436s
    sys 0m0.024s

    @socketpair
    Copy link
    Mannequin

    socketpair mannequin commented Oct 11, 2016

    $ gcc -O3 -DDOIT=doit ./zzz.c -o zzz && time ./zzz

    real 0m1.675s
    user 0m1.672s
    sys 0m0.000s

    $ gcc -O3 -DDOIT=doit2 ./zzz.c -o zzz && time ./zzz

    real 0m1.657s
    user 0m1.656s
    sys 0m0.000s

    ====================================================

    #include <stdint.h>
    
    static int __attribute__((noinline)) doit(int64_t index, int64_t nitems)
    {
        return index < 0 || index >= nitems;
    }
    
    static int __attribute__((noinline)) doit2(int64_t index, int64_t nitems)
    {
        return (uint64_t)index >= (uint64_t)nitems;
    }
    
    int main(void)
    {
        int count=0, i;
    
        for (i = 0; i < 1000000000; i++) {
            count += DOIT(832921, i);
        }
    
        return count;
    }

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Oct 11, 2016

    On 64-bit Linux there's no difference:

    $ ./usr/bin/gcc -O3 -o issue28397-2 issue28397-2.c
    $  time ./issue28397-2 0

    real 0m2.486s
    user 0m2.424s
    sys 0m0.014s
    $ time ./issue28397-2 1

    real 0m2.433s
    user 0m2.422s
    sys 0m0.008s

    Also, most of the time "index < 0 || index >= nitems" *is* inlined,
    and it was at least three times faster here.

    I guess the general point is that such micro-optimizations are
    unpredictable on modern architectures and modern compilers.

    Note that the fast inlined version used SSE instructions.

    @vstinner
    Copy link
    Member

    Serhiy: "The latter form generates simpler machine code."

    Since this issue is an optimization, can you please provide results of Python benchmarks? Maybe run the performance benchmark suite? I just released performance 0.3 with new benchmarks and less bugs! ;-)

    http://github.com/python/performance

    @serhiy-storchaka
    Copy link
    Member Author

    Unlikely this optimization have measurable affect on benchmarks.

    I opened this issue because this optimization already was applied to deque (bpo-23553), and it is easy to write a script for applying it to all code. But since there are doubts about this optimization, I withdraw my patch.

    @vstinner
    Copy link
    Member

    Serhiy Storchaka: "I opened this issue because this optimization already was applied to deque (bpo-23553)"

    Ah, change 1e89094998b2 written by Raymond Hettinger last year, Raymond who wrote (msg278397): "Don't change the code in the collections module. While semantically valid, the change obfuscates the code." :-)

    To stay consistent, I suggest to revert the useless micro-optimization 1e89094998b2. Python uses Py_ssize_t instead of size_t to support "i >= 0" checks", it's a deliberate choice to catch bugs.

    @serhiy-storchaka
    Copy link
    Member Author

    Raymond extracted his optimization into separate function and commented it.

    @serhiy-storchaka
    Copy link
    Member Author

    See also PR 9784 where Raymond shown assempler code generated for two variants.

    There is the similar difference on 64-bit platform with GCC 7.3:

    $ gcc -O3 -o issue28397 issue28397-2.c 
    $ time ./issue28397-2 0

    real 0m2,740s
    user 0m2,739s
    sys 0m0,000s
    $ time ./issue28397-2 1

    real 0m2,449s
    user 0m2,449s
    sys 0m0,000s

    But with GCC 8.2 there is not a difference.

    $ time ./issue28397-2 0

    real 0m2,498s
    user 0m2,498s
    sys 0m0,000s
    $ time ./issue28397-2 1

    real 0m2,500s
    user 0m2,496s
    sys 0m0,004s

    Both versions generate the same code for tested functions, there are only minor differences in the main() function (some independent instructions are reordered). I don't know what is the cause of such difference.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants