ceval.c: implement fast path for integers with a single digit #66154
Comments
Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers (the "int" type) if the operation does not overflow. We lost this fast path in Python 3 when we dropped the int type in favor of the long type. Antoine Pitrou proposed a fast path, but only for int singletons (integers in the range [-5; 255]): issue bpo-10044. His patch was rejected because it introduced undefined behaviour.

I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers. Pseudo-code for BINARY_ADD:

```c
if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
    && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
{
    stwodigits a = ..., b = ...;
    stwodigits c;
    if (... a+b will not overflow ...) {
        c = a + b;
        return PyLong_FromLongLong(c);
    }
}
/* fall back to PyNumber_Add() */
```

The code can be copied from longobject.c, which already has fast paths for single-digit numbers. See for example long_mul():

```c
/* fast path for single-digit multiplication */
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
    ....
}
```

As with any other optimization, it should be proved to be faster with benchmarks.
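For illustration, a fleshed-out version of that pseudocode might look like the sketch below. It relies on CPython's internal long representation of that era (Py_SIZE() encodes the sign and digit count, ob_digit holds the digits); the helper name one_digit_value() is made up, not part of any attached patch, and <= 1 is used so that zero also takes the fast path:

```c
#include "Python.h"
#include "longintrepr.h"   /* PyLongObject, digit, stwodigits */

/* Value of a PyLong known to have at most one digit (|Py_SIZE| <= 1). */
static inline stwodigits
one_digit_value(PyObject *op)
{
    Py_ssize_t size = Py_SIZE(op);
    stwodigits v = (size == 0) ? 0
                 : (stwodigits)((PyLongObject *)op)->ob_digit[0];
    return (size < 0) ? -v : v;
}

/* Hypothetical single-digit fast path for BINARY_ADD. */
static PyObject *
fast_add(PyObject *x, PyObject *y)
{
    if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) <= 1 &&
        PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) <= 1)
    {
        /* The sum of two single-digit values cannot overflow stwodigits. */
        stwodigits c = one_digit_value(x) + one_digit_value(y);
        return PyLong_FromLongLong((long long)c);
    }
    return PyNumber_Add(x, y);   /* generic fallback */
}
``` |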
On: "if (... a+b will not overflow ...) {"

Since you limited the optimization for addition to single-digit numbers, at least for addition and subtraction, overflow is impossible. The signed twodigits type you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30-bit) digits, if you used a fixed-width 32-bit type for addition/subtraction, and a fixed-width 64-bit type for multiplication, overflow would be impossible regardless of whether you used 15- or 30-bit digits.

On a related note: presumably you should check abs(size) <= 1 like in longobject.c, not == 1, or you omit the fast path for 0. Doesn't come up much, and it's not worth paying extra to optimize, but it costs nothing to handle it.
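To make the overflow argument concrete, here are the bounds, assuming 30-bit digits (the 15-bit configuration is analogous with smaller numbers):

```c
#include <stdint.h>

/* With PYLONG_BITS_IN_DIGIT == 30, any single-digit value satisfies
 *   |a|, |b| <= 2**30 - 1
 * so
 *   |a + b| <= 2**31 - 2               -- fits in a signed 32-bit type
 *   |a * b| <= (2**30 - 1)**2 < 2**60  -- fits in a signed 64-bit type
 * and stwodigits (a signed 64-bit type in this configuration) can never
 * overflow on single-digit addition, subtraction or multiplication. */
_Static_assert(((INT64_C(1) << 30) - 1) * ((INT64_C(1) << 30) - 1)
               < (INT64_C(1) << 60),
               "single-digit product fits comfortably in 64 bits");
``` |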
Let's try. As I understand it, bpo-10044 was rejected because it complicated the code too much. Maybe a new attempt will be more successful. |
Serhiy Storchaka added the comment:
I read that Mark rejected issue bpo-10044 because it introduces undefined behaviour. |
I'm not interested in working on this issue right now. If anyone is interested, feel free to pick it up. |
There also used to be a fast path for binary subscript operations (BINARY_SUBSCR) with integer indexes. I would like to see that performance regression fixed if it can be done cleanly.
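For reference, the Python 2 fast path special-cased list[int]; a hedged sketch of the equivalent idea in Python 3 terms (fast_subscr is a hypothetical name, not the actual patch from this issue) could be:

```c
#include "Python.h"

/* Hypothetical BINARY_SUBSCR fast path for list[int], in the spirit of
   Python 2's ceval.c. */
static PyObject *
fast_subscr(PyObject *container, PyObject *sub)
{
    if (PyList_CheckExact(container) && PyLong_CheckExact(sub)) {
        Py_ssize_t i = PyLong_AsSsize_t(sub);
        if (i >= 0 && i < PyList_GET_SIZE(container)) {
            /* In-range, non-negative index: fetch the item directly. */
            PyObject *res = PyList_GET_ITEM(container, i);
            Py_INCREF(res);
            return res;
        }
        PyErr_Clear();   /* clear a possible OverflowError from the cast */
    }
    /* Negative, out-of-range or non-list cases: generic protocol. */
    return PyObject_GetItem(container, sub);
}
``` |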
So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups afterwards: small improvement. Haven't looked at optimizing BINARY_SUBSCR yet. |
Thank you Zach. I even found a small regression.

Before:

```
$ ./python -m timeit -s "x = 10" "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.51 usec per loop
```

After:

```
$ ./python -m timeit -s "x = 10" "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.6 usec per loop
``` |
bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digit. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.

Benchmark result (patch=21955.patch, inline=inline.patch):

```
Common platform:
Platform of campaign orig:
Platform of campaign patch:
Platform of campaign inline:
--------------------+-------------+---------------+---------------
[table body not preserved]
```

(I removed my earlier message because I posted the wrong benchmark output; the inline column was missing.) |
Confirmed a speedup of about 20%. Surprisingly it affects even integers outside of the range of preallocated small integers (-5...255).

Before:

```
$ ./python -m timeit -s "x=10" "x+x"
10000000 loops, best of 3: 0.143 usec per loop
$ ./python -m timeit -s "x=1000" "x+x"
1000000 loops, best of 3: 0.247 usec per loop
```

After:

```
$ ./python -m timeit -s "x=10" "x+x"
10000000 loops, best of 3: 0.117 usec per loop
$ ./python -m timeit -s "x=1000" "x+x"
1000000 loops, best of 3: 0.209 usec per loop
```

All measures are made with a modified timeit (bpo-21988). |
Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test... |
I did something similar for BINARY_SUBSCR after looking at the 2.7 source, as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.

```
Platform of campaign orig:
Platform of campaign patch:
---------------------+-------------+---------------
[table body not preserved]
``` |
Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks |
I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.

```
Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686

Benchmarks: 2to3, call_method_slots, call_method_unknown, call_simple, chaos,
django_v2, fastpickle, float, json_dump_v2, mako, mako_v2, meteor_contest,
nbody, nqueens, pickle_dict, raytrace, regex_effbot, regex_v8, richards,
silent_logging, simple_logging, spectral_norm, tornado_http, unpack_sequence,
unpickle_list
[per-benchmark results not preserved]
``` |
What's the status of this issue? |
I haven't looked at it since I posted the benchmark results for 21955_2.patch. |
Anybody still looking at this? I can take another stab at it if it's still in scope. |
There were some visible speedups from your patch -- I think we should merge this optimization. Can you figure out why unpack_sequence and other benchmarks were slower? |
I'm assigning this patch to myself to commit it in 3.6 later. |
I took another look at this, tried applying it to 3.6, and ran the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. spectral_norm makes sense: it's doing lots of FP addition. The UNPACK_SEQUENCE instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to UNPACK_SEQUENCE, LOAD_FAST, and STORE_FAST, so I still don't know what's going on there. |
Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here. |
unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack". This code doesn't even touch BINARY_SUBSCR or BINARY_ADD. Zach, could you please run your benchmarks in rigorous mode (perf.py -r)? I'd also suggest experimenting with putting the baseline CPython as the first arg and as the second -- maybe your machine runs the second interpreter slightly faster. |
I ran 6 benchmarks on my work machine (not the same one as the last set) overnight. unpack_sequence is nowhere to be seen and spectral_norm is faster now... |
Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=. I also removed the optimization of the [] operator -- that should be done in a separate patch and in a separate issue.

Some nano-benchmarks (best of 3):

```
python -m timeit "sum([x + x + 1 for x in range(100)])"
python -m timeit "sum([x - x - 1 for x in range(100)])"
python -m timeit "sum([x * x * 1 for x in range(100)])"
```

Python 3.6 vs 3.5 (spectral_norm, rigorous run): [numbers not preserved]

Zach, thanks a lot for the research! I'm glad that unpack_sequence finally proved to be irrelevant. Could you please take a look at the updated patch?
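For a sense of what a compact, ceval.c-only change looks like, here is a hedged sketch of such a per-operator fast-path macro (the real fastint5 macro differs; one_digit_value() is the hypothetical helper sketched earlier in this thread, and POP/TOP/SECOND/SET_TOP/DISPATCH are ceval.c's stack macros):

```c
/* Hedged sketch, not the actual patch: used as FAST_BINARY_OP(+) at the
   top of the BINARY_ADD case, FAST_BINARY_OP(*) for BINARY_MULTIPLY, etc.
   On a type mismatch it falls through to the generic implementation. */
#define FAST_BINARY_OP(OP)                                              \
    do {                                                                \
        PyObject *_right = TOP();                                       \
        PyObject *_left = SECOND();                                     \
        if (PyLong_CheckExact(_left) && Py_ABS(Py_SIZE(_left)) <= 1 &&  \
            PyLong_CheckExact(_right) && Py_ABS(Py_SIZE(_right)) <= 1)  \
        {                                                               \
            /* single-digit operands: compute directly in C */          \
            stwodigits _a = one_digit_value(_left);                     \
            stwodigits _b = one_digit_value(_right);                    \
            PyObject *_res = PyLong_FromLongLong((long long)(_a OP _b));\
            (void)POP();            /* drop the right operand */        \
            SET_TOP(_res);          /* replace the left operand */      \
            Py_DECREF(_left);                                           \
            Py_DECREF(_right);                                          \
            if (_res == NULL)                                           \
                goto error;                                             \
            DISPATCH();                                                 \
        }                                                               \
    } while (0)
```

Since ints are immutable, the in-place variants (INPLACE_ADD, INPLACE_MULTIPLY, ...) can reuse the same macro. |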
If you only want to benchmark x*y, x+y and list comprehensions, you can simply use timeit micro-benchmarks (e.g. the bench_long.py script attached earlier). |
I don't understand what this table means (why 4 columns?). Can you explain what you did? |
I agree that optimizing only a very few cases may be better. We need to collect statistics on how often different operations are used with different types in long runs of tests or benchmarks. If, say, division is used 100 times less than addition, we shouldn't complicate the ceval loop to optimize it.
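One hedged way to gather such numbers (purely illustrative instrumentation, not an attached patch; record_add_types and add_stats are made-up names) would be to count operand-type pairs at the top of BINARY_ADD and dump the counters at interpreter exit:

```c
#include "Python.h"

/* Illustrative counters for operand-type pairs seen by BINARY_ADD. */
static unsigned long long add_stats[4];   /* int+int, float+float,
                                             str+str, everything else */

static inline void
record_add_types(PyObject *left, PyObject *right)
{
    if (PyLong_CheckExact(left) && PyLong_CheckExact(right))
        add_stats[0]++;
    else if (PyFloat_CheckExact(left) && PyFloat_CheckExact(right))
        add_stats[1]++;
    else if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right))
        add_stats[2]++;
    else
        add_stats[3]++;
}
``` |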
Benchmark on inline-2.patch. No speedup, only slowdown. I'm now running the benchmark on fastint5_4.patch.

```
$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python.inline-2
Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64

### json_load ###
### regex_v8 ###
[results not preserved]

The following not significant results are hidden, use -v to show them:

real 58m32.662s
``` |
Benchmark on fastint5_4.patch.

```
$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python_fastint5_4
Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64

### django_v3 ###
### fastunpickle ###
### json_dump_v2 ###
### nbody ###
### regex_v8 ###
[results not preserved]

The following not significant results are hidden, use -v to show them:
``` |
I think this is a random fluctuation; that benchmark (and the re lib) doesn't use the operators much. It can't be THAT much slower just because of optimizing a few binops. |
You're also running a very small subset of all benchmarks available. Please try the '-b all' option. I'll also run benchmarks on my machines. |
Alright, I ran a few benchmarks myself. In rigorous mode regex_v8 has the same performance on my 2013 MacBook Pro and on an 8-year-old i7 CPU (Linux). Here are the results of "perf.py -b raytrace,spectral_norm,meteor_contest,nbody ../cpython/python.exe ../cpython-git/python.exe -r":

fastint5:

```
### nbody ###
### spectral_norm ###
[results not preserved]

The following not significant results are hidden, use -v to show them:
```

======

inline-2:

```
### raytrace ###
### spectral_norm ###
[results not preserved]

The following not significant results are hidden, use -v to show them:
``` |
From what I can see there is no negative impact of the patch on stable macro benchmarks. There is quite a detectable positive impact on most integer and float operations from my patch; 13-16% on the nbody and spectral_norm benchmarks is still impressive. And you can see a huge improvement in various timeit micro-benchmarks. fastint5 is a very compact patch that only touches the ceval.c file. It doesn't complicate the code, as the macro is very straightforward. Since the patch has passed the code review, thorough benchmarking and discussion stages, I'd like to commit it. |
Please don't commit it right now. Yes, due to using macros the patch looks simple, but the macros expand to complex code. We need more statistics. |
But what will you use to gather statistics? The test suite isn't representative, and we already know what the benchmark suite will show. I can assist with writing some code for stats, but what's the plan? |
Can I suggest the mpmath test suite as a good benchmark? I've used it to test the various optimizations in gmpy2 and it has always been a valuable real-world benchmark. And it is slower in Python 3 than in Python 2... |
Be careful with test suites: first, they might exercise code that would never be a critical point for performance in a real-world application; second and most important, unittest seems to have gotten slower between 2.x and 3.x, so you would really be comparing apples to oranges. |
Attaching another patch, fastint6.patch, that only optimizes longs (no FP fast path).
What benchmark did you use? What were the numbers? I'm asking because before, you benchmarked different patches that are conceptually similar to fastint5, and the result was that decimal was 5% faster with fast paths for just longs, and 6% slower with fast paths for longs & floats.

Also, some quick timeit results (quite stable from run to run):

```
-m timeit -s "x=2" "x + 10 + x*20 + x*10 + 20 - x"
-m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x/2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
``` |
Yury Selivanov:

I'm disappointed by the results. In short, these patches have *no* impact on macro benchmarks, other than the two which are stressing the int and float types. Maybe we are just losing our time on this issue... I understand that the patches are only useful to get xx% speedup (where xx% is smaller than 25%) if your whole application is blocked by numeric computations. If that's the case, I would suggest moving to PyPy, Numba, Cython, etc. I expect something more interesting than xx% faster: a much more impressive speedup. http://speed.pypy.org/ : PyPy/CPython 2.7 for spectral_norm is 0.04: 25x faster. For nbody (nbody_modified), it's 0.09: 11x faster. With the patches of this issue, the *best* speedup is only 1.16x faster... We are *very* far from 11x or 25x faster. It's not even 2x faster...

Yury Selivanov:

According to my micro-benchmark msg259706, inline-2.patch is faster than fastint5_4.patch. I would suggest "finishing" inline-2.patch to optimize other operations, and *maybe* adding a fast path for float. On the macro benchmarks, inline-2.patch is slower than fastint5_4.patch, but that was easy to expect since I only added fast paths for int-int and only for a few operators. The question is whether it is worth getting xx% speedup on one or two specific benchmarks where CPython really sucks compared to other languages and other implementations of Python...

Stefan Krah:

How do you run your benchmark?

Case Van Horsen:

Where can we find this benchmark?

Case Van Horsen:

What do you mean by "real-world benchmark"? :-) |
mpmath is a library for arbitrary-precision floating-point arithmetic. It uses Python's native long type or gmpy2's mpz type for computations. It is available at https://pypi.python.org/pypi/mpmath. The test suite can be run directly from the source tree. It includes timing information for individual tests and for the entire run. Sample invocation:

```
~/src/mpmath-0.19/mpmath/tests$ time py36 runtests.py -local
```

For example, I've tried to optimize gmpy2's handling of binary operations between its mpz type and short Python integers. I've found it to provide useful results: improvements that are significant on a micro-benchmark (say 20%) will usually cause a small but repeatable improvement, and some improvements that looked good on a micro-benchmark would slow down mpmath.

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall improvement when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%. I will run more tests tonight. |
Please try to test fastint5 too (fast paths for long & floats, whereas fastint6 is only focused on longs). |
I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

fastint6 results without gmpy: 0.25% faster
fastint5_4 results without gmpy: 1.5% slower |
Case Van Horsen added the comment:
I'm more and more disappointed by this issue... If even a test suite doing a lot of integer operations is not measurably faster, the optimization doesn't look useful in practice. Maybe we should just close the issue? |
I'll take a closer look at gmpy later. Please don't close. |
I extracted the slowest test (test_polyroots_legendre) and put it in a loop of 5 iterations: see attached mpmath_bench.py. I ran this benchmark on Linux with 4 isolated CPUs (/sys/devices/system/cpu/isolated=2-3,6-7). On such a setup, the benchmark looks stable. Example:

```
Run #1/5: 12.28 sec
```

test_polyroots_legendre (min of 5 runs): [numbers not preserved]
I ran the tests without GMP, to stress the Python int type. I guess that the benchmark is dominated by CPU time spent computing operations on large Python ints, not by the time spent in ceval.c, so the speedup is low (2%). Such a use case doesn't seem to benefit from the micro-optimizations discussed in this issue. mpmath does arbitrary-precision floating-point arithmetic using Python ints (or GMP if available). |
Maybe we should adopt a different approach. There is something called "inline caching": put the cache between instructions, in the same memory block. Example of a paper on CPython: "Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009). Maybe we can build something on top of issue bpo-26219, "implement per-opcode cache in ceval"?
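To illustrate the idea only (this is a made-up layout, not how bpo-26219 or Brunthaler's scheme is implemented): an inline cache stores per-instruction type feedback next to the instruction, so a repeated operation on the same types can skip the generic dispatch:

```c
#include "Python.h"

/* Conceptual inline-cache slot for one BINARY_ADD site (hypothetical). */
typedef PyObject *(*binop_handler)(PyObject *, PyObject *);

typedef struct {
    PyTypeObject *left_type;    /* operand types seen on the last hit */
    PyTypeObject *right_type;
    binop_handler handler;      /* handler specialized for those types */
} BinopCache;

static PyObject *
cached_binary_add(BinopCache *cache, PyObject *left, PyObject *right)
{
    if (Py_TYPE(left) == cache->left_type &&
        Py_TYPE(right) == cache->right_type)
    {
        /* Cache hit: skip the generic nb_add resolution entirely. */
        return cache->handler(left, right);
    }
    /* Cache miss: do the generic operation and remember the types.
       A real implementation would install a specialized handler here. */
    cache->left_type = Py_TYPE(left);
    cache->right_type = Py_TYPE(right);
    cache->handler = PyNumber_Add;
    return PyNumber_Add(left, right);
}
```

The win is that a monomorphic site pays a couple of pointer comparisons instead of the full binary-op resolution; the cost is exactly the extra memory and branch-predictor footprint Antoine mentioned above. |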
bpo-14757 has an implementation of inline caching, which at least seemed to slow down some use cases. Then again, whenever someone posts a new speedup suggestion, it seems to slow down things I'm working on. At least Case van Horsen independently verified the phenomenon in this issue. :) |
Between inline2.patch and fastint6.patch, it seems like inline2.patch is faster (between 9% and 12% faster than fastint6.patch).

Microbenchmark on Python default (rev 554fb699af8c), compilation using LTO (./configure --with-lto), GCC 6.2.1 on Fedora 24, Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, perf 0.8.3 (dev version, just after 0.8.2). Commands:

```
./python -m perf timeit --name='x+y' -s 'x=1; y=2' 'x+y' --dup 1000 -v -o timeit-$branch.json
```

Results:

```
$ python3 -m perf compare_to timeit-master.json timeit-inline2.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.14x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.29x faster

$ python3 -m perf compare_to timeit-master.json timeit-fastint6.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-fastint6] 6.09 us +- 0.11 us: 1.02x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-fastint6] 12.7 ns +- 0.2 ns: 1.18x faster

$ python3 -m perf compare_to timeit-fastint6.json timeit-inline2.json
sum: Median +- std dev: [timeit-fastint6] 6.09 us +- 0.11 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.12x faster
x+y: Median +- std dev: [timeit-fastint6] 12.7 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.09x faster
``` |
Results with performance 0.3.3 (and perf 0.8.3). No major benchmark is faster. A few benchmarks even seem to be slower using fastint6.patch (but I don't really trust pybench).

== fastint6.patch ==

```
$ python3 -m perf compare_to master.json fastint6.json --group-by-speed --min-speed=5
Slower (3):
- pybench.ConcatUnicode: 52.7 ns +- 0.0 ns -> 56.1 ns +- 0.4 ns: 1.06x slower
- pybench.ConcatStrings: 52.7 ns +- 0.3 ns -> 56.1 ns +- 0.1 ns: 1.06x slower
- pybench.CompareInternedStrings: 16.5 ns +- 0.0 ns -> 17.4 ns +- 0.0 ns: 1.05x slower
Faster (4): (...)
Benchmark hidden because not significant (114): 2to3, call_method, (...)
```

== inline2.patch ==

```
haypo@selma$ python3 -m perf compare_to master.json inline2.json --group-by-speed --min-speed=5
Benchmark hidden because not significant (119): 2to3, call_method, (...)
``` |
fastint6_inline2_json.tar.gz: archive of JSON files
|
The fastest patch (inline2.patch) has a negligible impact on benchmarks. The purpose of an optimization is to make Python faster; that's not the case here, so I am closing the issue. Using timeit, the largest speedup is 1.29x. Using performance, spectral_norm is 1.07x faster and pybench.SimpleLongArithmetic is 1.06x faster. I consider that spectral_norm and pybench.SimpleLongArithmetic are microbenchmarks and so not representative of a real application. The issue was fun, thank you for playing the game of micro-optimization with me ;-) Let's move on to more interesting optimizations with a larger impact on more realistic workloads, like caching global variables, optimizing method calls, fastcalls, etc. |
New changeset 61fcb12a9873 by Victor Stinner in branch 'default': |