classification
Title: Faster index range checks
Type: performance Stage: resolved
Components: Extension Modules, Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: rhettinger, serhiy.storchaka, sir-sigurd, skrah, socketpair, vstinner
Priority: normal Keywords: patch

Created on 2016-10-09 12:32 by serhiy.storchaka, last changed 2018-10-10 12:03 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
check_index.patch serhiy.storchaka, 2016-10-09 12:32 review
check_index.cocci serhiy.storchaka, 2016-10-09 12:33
issue28397-2.c serhiy.storchaka, 2016-10-11 13:05
Pull Requests
URL Status Linked Edit
PR 9784 rhettinger, 2018-10-10 04:31
Messages (22)
msg278355 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-09 12:32
The expression for testing that some index is in half-open range from 0 to limit can be written as

    index >= 0 && index < limit

or as

    (size_t)index < (size_t)limit

The latter form generates simpler machine code. This is idiomatic code, it is used in many C and C++ libraries (including C++ stdlib implementations). It already is used in CPython (in deque implementation).

Proposed patch rewrites index range checks in more efficient way. The patch was generated automatically by coccinelle script, and then manually cleaned up.
msg278397 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-10-10 02:17
Don't change the code in the collections module.  While semantically valid, the change obfuscates the code.  The meaning of maxlen < 0 is that there is no maximum length.  I don't want this test hidden; otherwise, I would have changed it long ago as was done elsewhere in the module where it made sense.

Elsewhere, consider making a macro with clear name and comment (as was done with NEEDS_TRIM) in the collections module.  Otherwise, you're just leaving behind constipated and tricky code with no indication of why a signed variable is being coerced to unsigned.
msg278408 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-10 10:05
Serhiy, could you please take out _decimal.c?

I thought we had been through that before. :) :) :)
msg278410 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-10 11:55
The same applies to memoryview.c and _testbuffer.c -- I doubt that there's any measurable speed benefit and the clarity is reduced (I also don't want macros there).
msg278469 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-11 11:05
In the following program, with gcc-5.3 doit() is significantly faster than doit2() in 64-bit Linux:

================================================================
#include <stdint.h>

int
doit(int64_t index, int64_t nitems)
{
    return index < 0 || index >= nitems;
}

int
doit2(int64_t index, int64_t nitems)
{
    return (uint64_t)index >= (uint64_t)nitems;
}

int
main(void)
{
    int count, i;

    for (i = 0; i < 1000000000; i++) {
        count += doit(832921, i);
    }

    return count;
}
================================================================
msg278471 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-11 11:45
The difference in favor of doit() is even more pronounced with this
loop (also sorry for the uninitialized variable, but that does not
make a difference for benchmarking):

=====================================
for (i = 0; i < 10000; i++) {
        for (j = 0; j < 10000; j++) {
            count += doit(i, j);
        }
    }
======================================
msg278472 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 11:45
I have tested. performace differs in about of two times.
gcc -O3

0.22 nanoseconds per comparison.
vs
0.58 nanoseconds per comparison.

Does it cost a time that we spent to discuss here ?
msg278473 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-11 11:47
Which version is faster in your tests?
msg278474 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 11:47
Much more conveniet way is to use unsigned variables in appropriate places.
msg278475 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 11:49
oh shi....doit() i.e.

return index < 0 || index >= nitems;

is faster!
msg278476 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 11:50
Without optimisation in compiler (-O0) speed is the same.
msg278477 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 11:51
-O2 -- the same speed too!
-O3 really helps.
msg278478 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 11:52
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609
msg278479 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-11 11:54
That matches my results as well:

-O2 gives about the same speed, with -O3 doit() has a huge advantage.


I'm not sure this is an optimization at all.
msg278481 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-11 13:05
In your example functions are inlined. If prohibit inlining, the second function is faster.

$ gcc -O3 -o issue28397 issue28397-2.c 
$ time ./issue28397 0

real    0m8.097s
user    0m7.992s
sys     0m0.012s
$ time ./issue28397 1

real    0m5.467s
user    0m5.436s
sys     0m0.024s
msg278484 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-10-11 13:24
$ gcc -O3 -DDOIT=doit ./zzz.c -o zzz && time ./zzz

real	0m1.675s
user	0m1.672s
sys	0m0.000s

$ gcc -O3 -DDOIT=doit2 ./zzz.c -o zzz && time ./zzz

real	0m1.657s
user	0m1.656s
sys	0m0.000s

====================================================

#include <stdint.h>

static int __attribute__((noinline)) doit(int64_t index, int64_t nitems)
{
    return index < 0 || index >= nitems;
}

static int __attribute__((noinline)) doit2(int64_t index, int64_t nitems)
{
    return (uint64_t)index >= (uint64_t)nitems;
}

int main(void)
{
    int count=0, i;

    for (i = 0; i < 1000000000; i++) {
        count += DOIT(832921, i);
    }

    return count;
}
msg278485 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-10-11 13:30
On 64-bit Linux there's no difference:

$ ./usr/bin/gcc -O3 -o issue28397-2 issue28397-2.c
$  time ./issue28397-2 0

real    0m2.486s
user    0m2.424s
sys     0m0.014s
$ time ./issue28397-2 1

real    0m2.433s
user    0m2.422s
sys     0m0.008s



Also, most of the time "index < 0 || index >= nitems" *is* inlined,
and it was at least three times faster here.


I guess the general point is that such micro-optimizations are
unpredictable on modern architectures and modern compilers.

Note that the fast inlined version used SSE instructions.
msg278486 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-11 13:32
Serhiy: "The latter form generates simpler machine code."

Since this issue is an optimization, can you please provide results of Python benchmarks? Maybe run the performance benchmark suite? I just released performance 0.3 with new benchmarks and less bugs! ;-)

http://github.com/python/performance
msg278489 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-11 14:25
Unlikely this optimization have measurable affect on benchmarks.

I opened this issue because this optimization already was applied to deque (issue23553), and it is easy to write a script for applying it to all code. But since there are doubts about this optimization, I withdraw my patch.
msg278491 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-11 14:39
Serhiy Storchaka: "I opened this issue because this optimization already was applied to deque (issue23553)"

Ah, change 1e89094998b2 written by Raymond Hettinger last year, Raymond who wrote (msg278397): "Don't change the code in the collections module.  While semantically valid, the change obfuscates the code." :-)

To stay consistent, I suggest to revert the useless micro-optimization 1e89094998b2. Python uses Py_ssize_t instead of size_t to support "i >= 0" checks", it's a deliberate choice to catch bugs.
msg278492 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-10-11 14:50
Raymond extracted his optimization into separate function and commented it.
msg327467 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-10 12:03
See also PR 9784 where Raymond shown assempler code generated for two variants.

There is the similar difference on 64-bit platform with GCC 7.3:

$ gcc -O3 -o issue28397 issue28397-2.c 
$ time ./issue28397-2 0

real    0m2,740s
user    0m2,739s
sys     0m0,000s
$ time ./issue28397-2 1

real    0m2,449s
user    0m2,449s
sys     0m0,000s

But with GCC 8.2 there is not a difference.

$ time ./issue28397-2 0

real    0m2,498s
user    0m2,498s
sys     0m0,000s
$ time ./issue28397-2 1

real    0m2,500s
user    0m2,496s
sys     0m0,004s

Both versions generate the same code for tested functions, there are only minor differences in the main() function (some independent instructions are reordered). I don't know what is the cause of such difference.
History
Date User Action Args
2018-10-10 12:03:32serhiy.storchakasetmessages: + msg327467
2018-10-10 09:55:34sir-sigurdsetnosy: + sir-sigurd
2018-10-10 04:31:40rhettingersetpull_requests: + pull_request9170
2016-10-11 14:50:06serhiy.storchakasetmessages: + msg278492
2016-10-11 14:39:06vstinnersetmessages: + msg278491
2016-10-11 14:25:25serhiy.storchakasetstatus: open -> closed
resolution: rejected
messages: + msg278489

stage: patch review -> resolved
2016-10-11 13:32:50vstinnersetnosy: + vstinner
messages: + msg278486
2016-10-11 13:30:58skrahsetmessages: + msg278485
2016-10-11 13:24:31socketpairsetmessages: + msg278484
2016-10-11 13:05:46serhiy.storchakasetfiles: + issue28397-2.c

messages: + msg278481
2016-10-11 11:54:46skrahsetmessages: + msg278479
2016-10-11 11:52:11socketpairsetmessages: + msg278478
2016-10-11 11:51:41socketpairsetmessages: + msg278477
2016-10-11 11:50:14socketpairsetmessages: + msg278476
2016-10-11 11:49:03socketpairsetmessages: + msg278475
2016-10-11 11:47:37socketpairsetmessages: + msg278474
2016-10-11 11:47:03skrahsetmessages: + msg278473
2016-10-11 11:45:29socketpairsetnosy: + socketpair
messages: + msg278472
2016-10-11 11:45:08skrahsetmessages: + msg278471
2016-10-11 11:05:20skrahsetmessages: + msg278469
2016-10-10 11:55:00skrahsetmessages: + msg278410
2016-10-10 10:05:39skrahsetnosy: + skrah
messages: + msg278408
2016-10-10 02:17:39rhettingersetnosy: + rhettinger
messages: + msg278397
2016-10-09 12:33:18serhiy.storchakasetfiles: + check_index.cocci
2016-10-09 12:32:05serhiy.storchakacreate