classification
Title: ceval.c: implement fast path for integers with a single digit
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.6
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: yselivanov Nosy List: Yury.Selivanov, casevh, josh.r, lemburg, mark.dickinson, pitrou, python-dev, rhettinger, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Priority: normal Keywords: patch

Created on 2014-07-11 09:10 by vstinner, last changed 2016-10-20 10:19 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
21955.patch zbyrne, 2014-07-16 00:29 review
bench_long.py vstinner, 2014-07-16 08:13
inline.patch vstinner, 2014-07-16 08:13 review
21955_2.patch zbyrne, 2014-07-22 02:34 review
bench_results.txt zbyrne, 2016-02-03 16:40
fastint1.patch yselivanov, 2016-02-03 17:00 review
fastint2.patch yselivanov, 2016-02-04 06:02 review
fastint_alt.patch serhiy.storchaka, 2016-02-04 10:30 review
fastintfloat_alt.patch serhiy.storchaka, 2016-02-04 16:36 review
fastint4.patch yselivanov, 2016-02-05 01:37 review
fastint5.patch yselivanov, 2016-02-05 04:04 review
bench_long2.py vstinner, 2016-02-05 15:58
compare.txt vstinner, 2016-02-05 15:58
compare_to.txt vstinner, 2016-02-05 15:58
fastint5_2.patch yselivanov, 2016-02-06 00:10 review
fastint5_3.patch yselivanov, 2016-02-06 00:45 review
fastint5_4.patch yselivanov, 2016-02-06 01:29 review
inline-2.patch vstinner, 2016-02-06 01:31 review
fastint6.patch yselivanov, 2016-02-07 21:32 review
mpmath_bench.py vstinner, 2016-02-10 10:02
fastint6_inline2_json.tar.gz vstinner, 2016-10-20 09:31
Messages (110)
msg222731 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-11 09:10
Python 2 has a fast path in ceval.c for operations (a+b, a-b, etc.) on small integers (the "int" type) if the operation does not overflow.

We lost these fast paths in Python 3 when we dropped the int type in favor of the long type.

Antoine Pitrou proposed a fast path, but only for int singletons (integers in the range [-5; 255]) in issue #10044. His patch was rejected because it introduced undefined behaviour.

I propose to reimplement the Python 2 optimization for longs with a single digit, which are the most common numbers.

Pseudo-code for BINARY_ADD:
---
if (PyLong_CheckExact(x) && Py_ABS(Py_SIZE(x)) == 1
    && PyLong_CheckExact(y) && Py_ABS(Py_SIZE(y)) == 1)
{
   stwodigits a = ..., b = ...;
   stwodigits c;
   if (... a+b will not overflow ...) { 
      c = a + b;
      return PyLong_FromLongLong(c);
   }
}
/* fall back to PyNumber_Add() */
---

The code can be copied from longobject.c; there are already fast paths for single-digit numbers. See for example long_mul():
---
    /* fast path for single-digit multiplication */
    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        ....
    }
---

As any other optimization, it should be proved to be faster with benchmarks.
msg222804 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-07-11 22:23
On:  if (... a+b will not overflow ...) { 

Since you limited the optimization for addition to single digit numbers, at least for addition and subtraction, overflow is impossible. The signed twodigit you use for the result is guaranteed to be able to store far larger numbers than addition of single digits can produce. In fact, due to the extra wasted bit on large (30 bit) digits, if you used a fixed width 32 bit type for addition/subtraction, and a fixed width 64 bit type for multiplication, overflow would be impossible regardless of whether you used 15 or 30 bit digits.

On a related note: presumably you should check whether abs(size) <= 1, as in longobject.c, not == 1, or you omit the fast path for 0. It doesn't come up much and isn't worth paying extra to optimize, but it costs nothing to handle.
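As a quick sanity check of the overflow argument above (not part of any patch here; the digit widths are those of CPython's longintrepr.h, where PyLong_SHIFT is 15 or 30):

```python
# CPython stores longs as arrays of 15- or 30-bit digits (PyLong_SHIFT).
# Verify that two single-digit operands can never overflow the fixed-width
# C types discussed above, for either digit size.
for shift in (15, 30):
    max_digit = 2**shift - 1  # largest magnitude of a single-digit long

    # addition/subtraction of two single-digit longs fits a signed 32-bit type
    assert max_digit + max_digit <= 2**31 - 1

    # multiplication of two single-digit longs fits a signed 64-bit type
    assert max_digit * max_digit <= 2**63 - 1

print("single-digit add/sub/mul cannot overflow")
```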
msg222824 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-12 07:19
Let's try. As I understand, issue10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.
msg222829 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-12 09:01
Serhiy Storchaka added the comment:
> Let's try. As I understand, issue10044 was rejected because it complicates the code too much. Maybe a new attempt will be more successful.

I read that Mark rejected issue #10044 because it introduced
undefined behaviour.
msg222830 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-12 09:01
I'm not interested in working on this issue right now. If anyone is
interested, please go ahead!
msg222985 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-07-14 00:42
There also used to be a fast path for binary subscriptions with integer indexes.  I would like to see that performance regression fixed if it can be done cleanly.
msg223162 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-16 00:29
So I'm trying something pretty similar to Victor's pseudo-code and just using timeit to look for speedups:
timeit('x+x', 'x=10', number=10000000)
before:
1.1934231410000393
1.1988609210002323
1.1998214110003573
1.206968028999654
1.2065417159997196

after:
1.1698650090002047
1.1705158909999227
1.1752884750003432
1.1744818619999933
1.1741297110002051
1.1760422649999782

Small improvement. Haven't looked at optimizing BINARY_SUBSCR yet.
msg223177 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-16 06:23
Thank you, Zach. I even found a small regression.

Before:

$ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.51 usec per loop

After:

$ ./python -m timeit -s "x = 10"  "x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x; x+x"
1000000 loops, best of 3: 1.6 usec per loop
msg223180 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-16 08:14
bench_long.py: micro-benchmark for x+y. I confirm a slowdown with 21955.patch. IMO you should at least inline PyLong_AsLong(), which can be simplified if the number has 0 or 1 digit. Here is my patch "inline.patch", which is 21955.patch with PyLong_AsLong() inlined.

Benchmark result (patch=21955.patch, inline=inline.patch):

Common platform:
Platform: Linux-3.14.8-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python unicode implementation: PEP 393
Timer: time.perf_counter

Platform of campaign orig:
Date: 2014-07-16 10:04:27
Python version: 3.5.0a0 (default:08b3ee523577, Jul 16 2014, 10:04:23) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577 tag=tip branch=default date="2014-07-15 13:23 +0300"
Timer precision: 40 ns

Platform of campaign patch:
Timer precision: 40 ns
Date: 2014-07-16 10:04:01
Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:02:12) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

Platform of campaign inline:
Timer precision: 31 ns
Date: 2014-07-16 10:11:21
Python version: 3.5.0a0 (default:08b3ee523577+, Jul 16 2014, 10:10:48) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]
SCM: hg revision=08b3ee523577+ tag=tip branch=default date="2014-07-15 13:23 +0300"

--------------------+-------------+---------------+---------------
Tests               |        orig |         patch |         inline
--------------------+-------------+---------------+---------------
1+2                 |   23 ns (*) |         24 ns |   21 ns (-12%)
"1+2" ran 100 times | 1.61 us (*) | 1.74 us (+7%) | 1.39 us (-14%)
--------------------+-------------+---------------+---------------
Total               | 1.64 us (*) | 1.76 us (+7%) | 1.41 us (-14%)
--------------------+-------------+---------------+---------------

(I removed my message because I posted the wrong benchmark output, inline column was missing.)
msg223186 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-07-16 09:28
Confirmed a speedup of about 20%. Surprisingly it affects even integers outside the range of preallocated small integers (-5...255).

Before:

$ ./python -m timeit -s "x=10"  "x+x"
10000000 loops, best of 3: 0.143 usec per loop
$ ./python -m timeit -s "x=1000"  "x+x"
1000000 loops, best of 3: 0.247 usec per loop

After:

$ ./python -m timeit -s "x=10"  "x+x"
10000000 loops, best of 3: 0.117 usec per loop
$ ./python -m timeit -s "x=1000"  "x+x"
1000000 loops, best of 3: 0.209 usec per loop

All measurements were made with a modified timeit (issue21988).
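The preallocated small-int cache mentioned above is observable from pure Python; a minimal demonstration (the exact cache bounds are a CPython implementation detail, and int(str) is used here to defeat compile-time constant folding):

```python
# CPython keeps one preallocated object per small integer, so equal small
# values are the very same object; larger ints are allocated per use.
small_a, small_b = int("100"), int("100")
big_a, big_b = int("10000"), int("10000")

print(small_a is small_b)  # True on CPython: 100 comes from the small-int cache
print(big_a is big_b)      # False on CPython: 10000 is a fresh object each time
```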
msg223214 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-16 14:40
Well, don't I feel silly. I confirmed both my regression and the inline speedup using the benchmark Victor added. I wonder if I got my binaries backwards in my first test...
msg223623 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-22 02:34
I did something similar to BINARY_SUBSCR after looking at the 2.7 source as Raymond suggested. Hopefully I got my binaries straight this time :) The new patch includes Victor's inlining and my new subscript changes.

Platform of campaign orig:
Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 15 2014, 18:11:28) [GCC 4.6.3]
Timer precision: 6 ns
Date: 2014-07-21 20:28:30

Platform of campaign patch:
Python version: 3.5.0a0 (default:c8ce5bca0fcd+, Jul 21 2014, 20:21:20) [GCC 4.6.3]
Timer precision: 20 ns
Date: 2014-07-21 20:28:39

---------------------+-------------+---------------
Tests                |        orig |          patch
---------------------+-------------+---------------
1+2                  |  118 ns (*) |  103 ns (-13%)
"1+2" ran 100 times  | 7.28 us (*) | 5.93 us (-19%)
x[1]                 |  120 ns (*) |   98 ns (-19%)
"x[1]" ran 100 times | 7.35 us (*) | 5.31 us (-28%)
---------------------+-------------+---------------
Total                | 14.9 us (*) | 11.4 us (-23%)
---------------------+-------------+---------------
msg223711 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-07-23 01:20
Please run the actual benchmark suite to get interesting numbers: http://hg.python.org/benchmarks
msg223726 - (view) Author: Zach Byrne (zbyrne) * Date: 2014-07-23 06:00
I ran the whole benchmark suite. There are a few that are slower: call_method_slots, float, pickle_dict, and unpack_sequence.

Report on Linux zach-vbox 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686
Total CPU cores: 1

### 2to3 ###
24.789549 -> 24.809551: 1.00x slower

### call_method_slots ###
Min: 1.743554 -> 1.780807: 1.02x slower
Avg: 1.751735 -> 1.792814: 1.02x slower
Significant (t=-26.32)
Stddev: 0.00576 -> 0.01823: 3.1660x larger

### call_method_unknown ###
Min: 1.828094 -> 1.739625: 1.05x faster
Avg: 1.852225 -> 1.806721: 1.03x faster
Significant (t=2.28)
Stddev: 0.01874 -> 0.24320: 12.9783x larger

### call_simple ###
Min: 1.353581 -> 1.263386: 1.07x faster
Avg: 1.397946 -> 1.302046: 1.07x faster
Significant (t=24.28)
Stddev: 0.03667 -> 0.03154: 1.1629x smaller

### chaos ###
Min: 1.199377 -> 1.115550: 1.08x faster
Avg: 1.230859 -> 1.146573: 1.07x faster
Significant (t=16.24)
Stddev: 0.02663 -> 0.02525: 1.0544x smaller

### django_v2 ###
Min: 2.682884 -> 2.633110: 1.02x faster
Avg: 2.747521 -> 2.690486: 1.02x faster
Significant (t=9.90)
Stddev: 0.02744 -> 0.03010: 1.0970x larger

### fastpickle ###
Min: 1.751475 -> 1.597340: 1.10x faster
Avg: 1.771805 -> 1.613533: 1.10x faster
Significant (t=64.81)
Stddev: 0.01177 -> 0.01263: 1.0727x larger

### float ###
Min: 1.254858 -> 1.293067: 1.03x slower
Avg: 1.336045 -> 1.365787: 1.02x slower
Significant (t=-3.30)
Stddev: 0.04851 -> 0.04135: 1.1730x smaller

### json_dump_v2 ###
Min: 17.871819 -> 16.968647: 1.05x faster
Avg: 18.428747 -> 17.483397: 1.05x faster
Significant (t=4.10)
Stddev: 1.60617 -> 0.27655: 5.8078x smaller

### mako ###
Min: 0.241614 -> 0.231678: 1.04x faster
Avg: 0.253730 -> 0.240585: 1.05x faster
Significant (t=8.93)
Stddev: 0.01912 -> 0.01327: 1.4417x smaller

### mako_v2 ###
Min: 0.225664 -> 0.213179: 1.06x faster
Avg: 0.234850 -> 0.225984: 1.04x faster
Significant (t=10.12)
Stddev: 0.01379 -> 0.01391: 1.0090x larger

### meteor_contest ###
Min: 0.777612 -> 0.758924: 1.02x faster
Avg: 0.799580 -> 0.780897: 1.02x faster
Significant (t=3.97)
Stddev: 0.02482 -> 0.02212: 1.1221x smaller

### nbody ###
Min: 0.969724 -> 0.883935: 1.10x faster
Avg: 0.996416 -> 0.918375: 1.08x faster
Significant (t=12.65)
Stddev: 0.02426 -> 0.03627: 1.4951x larger

### nqueens ###
Min: 1.142745 -> 1.128195: 1.01x faster
Avg: 1.296659 -> 1.162443: 1.12x faster
Significant (t=2.75)
Stddev: 0.34462 -> 0.02680: 12.8578x smaller

### pickle_dict ###
Min: 1.433264 -> 1.467394: 1.02x slower
Avg: 1.468122 -> 1.506908: 1.03x slower
Significant (t=-7.20)
Stddev: 0.02695 -> 0.02691: 1.0013x smaller

### raytrace ###
Min: 5.454853 -> 5.538799: 1.02x slower
Avg: 5.530943 -> 5.676983: 1.03x slower
Significant (t=-8.64)
Stddev: 0.05152 -> 0.10791: 2.0947x larger

### regex_effbot ###
Min: 0.205875 -> 0.194776: 1.06x faster
Avg: 0.211118 -> 0.198759: 1.06x faster
Significant (t=5.10)
Stddev: 0.01305 -> 0.01112: 1.1736x smaller

### regex_v8 ###
Min: 0.141628 -> 0.133819: 1.06x faster
Avg: 0.147024 -> 0.140053: 1.05x faster
Significant (t=2.72)
Stddev: 0.01163 -> 0.01388: 1.1933x larger

### richards ###
Min: 0.734472 -> 0.727501: 1.01x faster
Avg: 0.760795 -> 0.743484: 1.02x faster
Significant (t=3.50)
Stddev: 0.02778 -> 0.02127: 1.3061x smaller

### silent_logging ###
Min: 0.344678 -> 0.336087: 1.03x faster
Avg: 0.357982 -> 0.347361: 1.03x faster
Significant (t=2.76)
Stddev: 0.01992 -> 0.01852: 1.0755x smaller

### simple_logging ###
Min: 1.104831 -> 1.072921: 1.03x faster
Avg: 1.146844 -> 1.117068: 1.03x faster
Significant (t=4.02)
Stddev: 0.03552 -> 0.03848: 1.0833x larger

### spectral_norm ###
Min: 1.710336 -> 1.688910: 1.01x faster
Avg: 1.872578 -> 1.738698: 1.08x faster
Significant (t=2.35)
Stddev: 0.40095 -> 0.03331: 12.0356x smaller

### tornado_http ###
Min: 0.849374 -> 0.852209: 1.00x slower
Avg: 0.955472 -> 0.916075: 1.04x faster
Significant (t=4.82)
Stddev: 0.07059 -> 0.04119: 1.7139x smaller

### unpack_sequence ###
Min: 0.000030 -> 0.000020: 1.52x faster
Avg: 0.000164 -> 0.000174: 1.06x slower
Significant (t=-13.11)
Stddev: 0.00011 -> 0.00013: 1.2256x larger

### unpickle_list ###
Min: 1.333952 -> 1.212805: 1.10x faster
Avg: 1.373228 -> 1.266677: 1.08x faster
Significant (t=16.32)
Stddev: 0.02894 -> 0.03597: 1.2428x larger
msg238437 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-18 13:31
What's the status of this issue?
msg238455 - (view) Author: Zach Byrne (zbyrne) * Date: 2015-03-18 15:53
I haven't looked at it since I posted the benchmark results for 21955_2.patch.
msg258057 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-01-12 02:10
Anybody still looking at this? I can take another stab at it if it's still in scope.
msg258060 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-01-12 02:41
> Anybody still looking at this? I can take another stab at it if it's still in scope.

There were some visible speedups from your patch -- I think we should merge this optimization.  Can you figure out why unpack_sequence and other benchmarks were slower?
msg258062 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-01-12 03:37
> Can you figure out why unpack_sequence and other benchmarks were slower?
I didn't look really closely. A few of the slower ones were floating-point heavy, which would incur the slow-path penalty, but I can dig into unpack_sequence.
msg259417 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-02 18:55
I'm assigning this patch to myself to commit it in 3.6 later.
msg259428 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-02 20:37
I took another look at this, and tried applying it to 3.6 and running the latest benchmarks. It applied cleanly, and the benchmark results were similar; this time unpack_sequence and spectral_norm were slower. Spectral norm makes sense: it's doing lots of FP addition. The unpack_sequence instruction looks like it already has optimizations for unpacking lists and tuples onto the stack, and running dis on the test showed that it's completely dominated by calls to unpack_sequence, load_fast, and store_fast, so I still don't know what's going on there.
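The dis observation above is easy to reproduce on a minimal version of the benchmark body (hypothetical function name; exact opcode names vary slightly across CPython versions):

```python
import dis

def unpack(to_unpack):
    # one line of the unpack_sequence benchmark body
    a, b, c, d, e, f, g, h, i, j = to_unpack

# list the opcodes the compiler emits for this function
ops = [ins.opname for ins in dis.get_instructions(unpack)]
print(ops)  # dominated by UNPACK_SEQUENCE / STORE_FAST; no BINARY_ADD or BINARY_SUBSCR
```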
msg259429 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-02 20:56
Any change that increases the cache or branch predictor footprint of the evaluation loop may make the interpreter slower, even if the change doesn't seem related to a particular benchmark. That may be the reason here.
msg259431 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-02 21:06
unpack_sequence contains 400 lines of this: "a, b, c, d, e, f, g, h, i, j = to_unpack".  This code doesn't even touch BINARY_SUBSCR or BINARY_ADD.

Zach, could you please run your benchmarks in rigorous mode (perf.py -r)?  I'd also suggest experimenting with passing the baseline cpython as the first arg and then as the second -- maybe your machine runs the second interpreter slightly faster.
msg259490 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-03 16:40
I ran 6 benchmarks on my work machine(not the same one as the last set) overnight.
Two with just the BINARY_ADD change, two with the BINARY_SUBSCR change, and two with both.
I'm attaching the output from all my benchmark runs, but here are the highlights
In this table I've flipped the results for running the modified build as the reference, but in the new attachment, slower in the right column means faster, I think :)
|------------------|---------------------------------------|-----------------------------------|
|Build             | Baseline Reference                    | Modified Reference                |
|------------------|--------------------|------------------|--------------------|--------------|
|                  | Faster             | Slower           | Faster             | Slower       |
|------------------|--------------------|------------------|--------------------|--------------|
|BINARY_ADD        | chameleon_v2       | etree_parse      | chameleon_v2       | call_simple  |
|                  | chaos              | nbody            | fannkuch           | nbody        |
|                  | django             | normal_startup   | normal_startup     | pickle_dict  |
|                  | etree_generate     | pickle_dict      | nqueens            | regex_v8     |
|                  | fannkuch           | pickle_list      | regex_compile      |              |
|                  | formatted_logging  | regex_effbot     | spectral_norm      |              |
|                  | go                 |                  | unpickle_list      |              |
|                  | json_load          |                  |                    |              |
|                  | regex_compile      |                  |                    |              |
|                  | simple_logging     |                  |                    |              |
|                  | spectral_norm      |                  |                    |              |
|------------------|--------------------|------------------|--------------------|--------------|
|BINARY_SUBSCR     | chameleon_v2       | call_simple      | 2to3               | etree_parse  |
|                  | chaos              | go               | call_method_slots  | json_dump_v2 |
|                  | etree_generate     | pickle_list      | chaos              | pickle_dict  |
|                  | fannkuch           | telco            | fannkuch           |              |
|                  | fastpickle         |                  | formatted_logging  |              |
|                  | hexiom2            |                  | go                 |              |
|                  | json_load          |                  | hexiom2            |              |
|                  | mako_v2            |                  | mako_v2            |              |
|                  | meteor_contest     |                  | meteor_contest     |              |
|                  | nbody              |                  | nbody              |              |
|                  | regex_v8           |                  | normal_startup     |              |
|                  | spectral_norm      |                  | nqueens            |              |
|                  |                    |                  | pickle_list        |              |
|                  |                    |                  | simple_logging     |              |
|                  |                    |                  | spectral_norm      |              |
|                  |                    |                  | telco              |              |
|------------------|--------------------|------------------|--------------------|--------------|
|BOTH              | chameleon_v2       | call_simple      | chameleon_v2       | fastpickle   |
|                  | chaos              | etree_parse      | chaos              | pickle_dict  |
|                  | etree_generate     | pathlib          | etree_generate     | pickle_list  |
|                  | etree_process      | pickle_list      | etree_process      | telco        |
|                  | fannkuch           |                  | fannkuch           |              |
|                  | fastunpickle       |                  | float              |              |
|                  | float              |                  | formatted_logging  |              |
|                  | formatted_logging  |                  | go                 |              |
|                  | hexiom2            |                  | hexiom2            |              |
|                  | nbody              |                  | nbody              |              |
|                  | nqueens            |                  | normal_startup     |              |
|                  | regex_v8           |                  | nqueens            |              |
|                  | spectral_norm      |                  | simple_logging     |              |
|                  | unpickle_list      |                  | spectral_norm      |              |
|------------------|--------------------|------------------|--------------------|--------------|

unpack_sequence is nowhere to be seen and spectral_norm is faster now...
msg259491 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 17:00
Attaching a new patch -- rewritten to optimize -, *, +, -=, *= and +=.  I also removed the optimization of [] operator -- that should be done in a separate patch and in a separate issue.

Some nano-benchmarks (best of 3):

python -m timeit  "sum([x + x + 1 for x in range(100)])"
2.7: 7.71     3.5: 8.54      3.6: 7.33

python -m timeit  "sum([x - x - 1 for x in range(100)])"
2.7: 7.81     3.5: 8.59      3.6: 7.57

python -m timeit  "sum([x * x * 1 for x in range(100)])"
2.7: 9.28     3.5: 10.6      3.6: 9.44


Python 3.6 vs 3.5 (spectral_norm, rigorous run):
Min: 0.315917 -> 0.276785: 1.14x faster
Avg: 0.321006 -> 0.284909: 1.13x faster


Zach, thanks a lot for the research!  I'm glad that unpack_sequence finally proved to be irrelevant.  Could you please take a look at the updated patch?
msg259493 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 17:05
> python -m timeit  "sum([x * x * 1 for x in range(100)])"

If you only want to benchmark x*y, x+y and list-comprehension, you
should use a tuple for the iterator.
msg259494 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-03 17:07
> In this table I've flipped the results for running the modified build
> as the reference, but in the new attachment, slower in the right
> column means faster, I think :)

I don't understand what this table means (why 4 columns?). Can you explain what you did?
msg259495 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-03 17:15
> I don't understand what this table means (why 4 columns?). Can you explain what you did?

Yury suggested running perf.py twice with the binaries swapped
So "faster" and "slower" underneath "Baseline Reference" are runs where the unmodified python binary was the first argument to perf, and the "Modified Reference" is where the patched binary is the first argument.

ie. "perf.py -r -b all python patched_python" vs "perf.py -r -b all patched_python python"

bench_results.txt has the actual output in it, and the "slower in the right column" comment was referring to the contents of that file, not the table. Sorry for the confusion.
msg259496 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 17:21
> Yury suggested running perf.py twice with the binaries swapped

Yeah, I had some experience with perf.py when its results were skewed depending on what you test first.  Hopefully Victor's new patch will fix that http://bugs.python.org/issue26275
msg259497 - (view) Author: Zach Byrne (zbyrne) * Date: 2016-02-03 17:47
> Could you please take a look at the updated patch?
Looks ok to me, for whatever that's worth.
msg259499 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-03 17:52
On 03/02/2016 18:21, Yury Selivanov wrote:
> 
> Yury Selivanov added the comment:
> 
>> Yury suggested running perf.py twice with the binaries swapped
> 
> Yeah, I had some experience with perf.py when its results were
> skewed depending on what you test first.

Have you tried disabling turbo on your CPU? (or any kind of power
management that would change the CPU clock depending on the current
workload)
msg259500 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-03 18:50
On 03.02.2016 18:05, STINNER Victor wrote:
> 
>> python -m timeit  "sum([x * x * 1 for x in range(100)])"
> 
> If you only want to benchmark x*y, x+y and list-comprehension, you
> should use a tuple for the iterator.

... and precalculate that in the setup:

python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"

# python -m timeit "sum([x * x * 1 for x in range(100)])"
100000 loops, best of 3: 5.74 usec per loop
# python -m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
100000 loops, best of 3: 5.56 usec per loop

(python = Python 2.7)
msg259502 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:04
Antoine, yeah, it's probably turbo boost related.  There is no easy way to turn it off on mac os x, though.  I hope Victor's patch to perf.py will help to mitigate this. 

Victor, Marc-Andre,

Updated results of nano-bench (best of 10):

-m timeit -s "loops=tuple(range(100))" "sum([x * x * 1 for x in loops])"
2.7  8.5     3.5  10.1     3.6  8.91

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7  7.27    3.5  8.2      3.6  7.13

-m timeit -s "loops=tuple(range(100))" "sum([x - x - 1 for x in loops])"
2.7  7.01    3.5  8.1      3.6  6.95

Antoine, Serhiy, I'll upload a new patch soon.  Probably Serhiy's idea of using a switch statement will make it slightly faster.  I'll also add a fast path for integer division.
msg259503 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-03 19:19
A fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact ints, and apply the switch optimization inside.
msg259505 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:26
> A fast path is already implemented in long_mul(). Maybe we should just use this function if both arguments are exact ints, and apply the switch optimization inside.

Agree.

BTW, what do you think about using __int128 when available?  That way we can also optimize twodigit PyLongs.
msg259506 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 19:29
I don't think, I run benchmarks (for __int128) :-)
msg259508 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:35
> I don't think, I run benchmarks (for __int128) :-)

Never mind...  Seems that __int128 is still an experimental feature and some versions of clang even had bugs with it.
msg259509 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-03 19:43
> BTW, what do you think about using __int128 when available?  That way we can also optimize twodigit PyLongs.

__int128 is not always available and it would add too much complexity for possibly little gain. There are many ways to optimize the code and we should choose those that have the best gain/complexity ratio.

Let's split the patch into smaller parts: 1) directly use the long-specialized functions in ceval.c, and 2) optimize the fast path in these functions, and test them separately and combined. Maybe only one of them will add a gain.
msg259530 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-04 06:02
Attaching a second version of the patch.  (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).


I decided to add fast paths for floats & single-digit longs and their combinations.  +, -, *, /, //, and their inplace versions are optimized now.


I'll have a full result of macro-benchmarks run tomorrow morning, but here's a result for spectral_norm (rigorous run, best of 3):

### spectral_norm ###
Min: 0.300269 -> 0.233037: 1.29x faster
Avg: 0.301700 -> 0.234282: 1.29x faster
Significant (t=399.89)
Stddev: 0.00147 -> 0.00083: 1.7619x smaller


Some nano-benchmarks (best of 3):

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1 for x in loops])"
2.7  7.23    3.5  8.17      3.6  7.57

-m timeit -s "loops=tuple(range(100))" "sum([x + x + 1.0 for x in loops])"
2.7  9.08    3.5  11.7      3.6  7.22

-m timeit -s "loops=tuple(range(100))" "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in loops])"
2.7  17.9    3.5  24.3      3.6  11.8
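The command-line nano-benchmarks above can also be run from a script; a sketch using timeit.repeat (absolute numbers depend entirely on the machine and build, so none are reproduced here):

```python
import timeit

SETUP = "loops = tuple(range(100))"
STMTS = [
    "sum([x + x + 1 for x in loops])",
    "sum([x + x + 1.0 for x in loops])",
    "sum([x/2.2 + 2 + x*2.5 + 1.0 for x in loops])",
]

for stmt in STMTS:
    # best of 3 runs, expressed as microseconds per executed statement
    best = min(timeit.repeat(stmt, SETUP, repeat=3, number=10_000))
    print(f"{best * 100:.2f} usec  {stmt}")
```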
msg259540 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-04 07:58
On 04.02.2016 07:02, Yury Selivanov wrote:
> Attaching a second version of the patch.  (BTW, Serhiy, I tried your idea of using a switch statement to optimize branches (https://github.com/1st1/cpython/blob/fastint2/Python/ceval.c#L5390) -- no detectable speed improvement).

It would be better to consistently have the fast_*() helpers
return -1 in case of an error, instead of -1 or 1.

Overall, I see two problems with doing too many of these
fast paths:

 * the ceval loop may no longer fit into the CPU cache on
   systems with small cache sizes, since the compiler will likely
   inline all the fast_*() functions (I guess it would be possible
   to simply eliminate all fast paths using a compile time
   flag)

 * maintenance will get more difficult

In a numerics-heavy application it's likely that all fast paths
will trigger somewhere, but those will likely be better off
using numpy or numba. For a text-heavy application such as a web
server, only a few fast paths will trigger and so the various
checks only add overhead.

Since 'a'+'b' is a very common operation in the latter type of
application, please make sure that this fast path gets more
priority in your patch.

Please also check the effects of the fast paths for cases
where they don't trigger, e.g. 'a'+'b' or 'a'*2.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg259541 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 08:01
"In a numerics heavy application it's like that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text heavy application such as a web server, only few fast paths will trigger and so the various checks only add overhead."

Hum, I disagree. See benchmark results in other messages. Examples:

### django_v2 ###
Min: 2.682884 -> 2.633110: 1.02x faster

### unpickle_list ###
Min: 1.333952 -> 1.212805: 1.10x faster

These benchmarks are not written for numeric, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, as the str type.
msg259542 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 08:13
+        if (Py_SIZE(left) != 0) {
+            if (Py_SIZE(right) != 0) {
+
+#ifdef HAVE_LONG_LONG
+                mul = PyLong_FromLongLong(
+                        (long long)SINGLE_DIGIT_LONG_AS_LONG(left) *
+                            SINGLE_DIGIT_LONG_AS_LONG(right));
+#else
+                mul = PyNumber_Multiply(left, right);
+#endif

Why don't you use the same code as long_mul() (you need #include "longintrepr.h")?
----------------
        stwodigits v = (stwodigits)(MEDIUM_VALUE(a)) * MEDIUM_VALUE(b);
#ifdef HAVE_LONG_LONG
        return PyLong_FromLongLong((PY_LONG_LONG)v);
#else
        /* if we don't have long long then we're almost certainly
           using 15-bit digits, so v will fit in a long.  In the
           unlikely event that we're using 30-bit digits on a platform
           without long long, a large v will just cause us to fall
           through to the general multiplication code below. */
        if (v >= LONG_MIN && v <= LONG_MAX)
            return PyLong_FromLong((long)v);
#endif
----------------

I guess that long_mul() was always well optimized, no need to experiment something new.
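For context on why the quoted long_mul() code is overflow-safe: with CPython's usual 30-bit digits, a single-digit int has absolute value below 2**30, so the product of two of them stays below 2**60 and always fits in a signed 64-bit stwodigits. A quick Python sanity check of that bound (a sketch of the arithmetic, not the C internals; PyLong_SHIFT here mirrors the constant from longintrepr.h):

```python
# With 30-bit digits, any single-digit int x satisfies |x| < 2**30,
# so |x * y| < 2**60, comfortably inside the signed 64-bit range.
PyLong_SHIFT = 30                   # digit size on typical 64-bit builds
max_digit = 2**PyLong_SHIFT - 1     # largest single-digit magnitude

product_bound = max_digit * max_digit
int64_max = 2**63 - 1

# the single-digit multiplication fast path cannot overflow
assert product_bound < int64_max
```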
msg259545 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-04 08:56
On 04.02.2016 09:01, STINNER Victor wrote:
> 
> "In a numerics-heavy application it's likely that all fast paths will trigger somewhere, but those will likely be better off using numpy or numba. For a text-heavy application such as a web server, only a few fast paths will trigger, so the various checks only add overhead."
> 
> Hum, I disagree. See benchmark results in other messages. Examples:
> 
> ### django_v2 ###
> Min: 2.682884 -> 2.633110: 1.02x faster
> 
> ### unpickle_list ###
> Min: 1.333952 -> 1.212805: 1.10x faster
> 
> These benchmarks are not written for numeric, but are more "general" benchmarks. int is just a core feature of Python, simply used everywhere, as the str type.

Sure, some integer math is used in text applications as well,
e.g. for indexing, counting and slicing, but the patch puts more
emphasis on numeric operations, e.g. fast_add() tests for integers
and floats before falling back to the Unicode check.

It would be interesting to know how often these paths trigger
or not in the various benchmarks.

BTW: The django_v2 benchmark result does not really say
much. Differences of +/- 2% do not have much meaning in
benchmark results :-)
msg259549 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 09:35
I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

The point of optimizing integers is that they are used for many purposes, not only "math" (e.g. indexing).
msg259552 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 09:37
> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

Oh wait, I may have misunderstood Marc-Andre's comment. If the question is only about float: I'm OK with dropping the fast path for float. By the way, I would prefer to see PyLong_CheckExact() in the main loop, and only call fast_mul() if both operands are Python ints.
msg259554 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-04 10:30
fastint2.patch adds a small regression for string multiplication:

$ ./python -m timeit -s "x = 'x'" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
Unpatched:  1.46 usec per loop
Patched:    1.54 usec per loop

Here is an alternative patch. It just uses the existing specialized functions for integers: long_add, long_sub and long_mul. It doesn't add a regression for the above example with string multiplication, and it looks faster than fastint2.patch for integer multiplication.

$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "
Unpatched:          0.887 usec per loop
fastint2.patch:     0.841 usec per loop
fastint_alt.patch:  0.804 usec per loop
msg259560 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 12:50
I prefer fastint_alt.patch design, it's simpler. I added a comment on the review.

My numbers, best of 5 timeit runs:

$ ./python -m timeit -s "x = 12345" -- "x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; x*2; "

* original: 299 ns
* fastint2.patch: 282 ns (-17 ns, -6%)
* fastint_alt.patch: 267 ns (-32 ns, -11%)
msg259562 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-04 13:54
> I agree with Marc-Andre, people doing FP-heavy math in Python use Numpy (possibly with Numba, Cython or any other additional library). Micro-optimizing floating-point operations in the eval loop makes little sense IMO.

I disagree.

30% faster floats (sic!) is a serious improvement that shouldn't just be discarded.  Many applications have floating point calculations one way or another, but don't use numpy because it's overkill.

Python 2 is much faster than Python 3 on any kind of numeric calculations.  This point is being frequently brought up in every python2 vs 3 debate.  I think it's unacceptable.


> * the ceval loop may no longer fit in to the CPU cache on
   systems with small cache sizes, since the compiler will likely
   inline all the fast_*() functions (I guess it would be possible
   to simply eliminate all fast paths using a compile time
   flag)

That's speculation.  It may still fit.  Or it never really fitted; it's already huge.  I tested the patch on an 8-year-old desktop CPU, no performance degradation on our benchmarks.

### raytrace ###
Avg: 1.858527 -> 1.652754: 1.12x faster

### nbody ###
Avg: 0.310281 -> 0.285179: 1.09x faster

### float ###
Avg: 0.392169 -> 0.358989: 1.09x faster

### chaos ###
Avg: 0.355519 -> 0.326400: 1.09x faster

### spectral_norm ###
Avg: 0.377147 -> 0.303928: 1.24x faster

### telco ###
Avg: 0.012845 -> 0.013006: 1.01x slower

The last benchmark (telco) is especially interesting.  It uses decimals for calculation, which means that it uses overloaded numeric operators.  Still no significant performance degradation.

> * maintenance will get more difficult

The fast path for floats is easy to understand for every core dev who works with ceval; there is no rocket science there (if you want rocket science that is hard to maintain, look at generators/yield from).  If you don't like inlining floating point calculations, we can export float_add, float_sub, float_div, and float_mul and use them in ceval.

Why not combine my patch and Serhiy's?  First we check if left & right are both longs.  Then we check if they are unicode (for +).  And then we have a fastpath for floats.
msg259563 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 14:01
> Why not combine my patch and Serhiy's?  First we check if left & right are both longs.  Then we check if they are unicode (for +).  And then we have a fastpath for floats.

See my comment on Serhiy's patch. Maybe we can start by checking that the types of both operands are the same, and then use PyLong_CheckExact and PyUnicode_CheckExact.

Using such a design, we may add a _PyFloat_Add(). But the next question is then the overhead on the "slow" path, which requires a benchmark too! For example, use a subtype of int.
msg259564 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 14:06
Le 04/02/2016 14:54, Yury Selivanov a écrit :
> 
> 30% faster floats (sic!) is a serious improvement, that shouldn't
> just be discarded. Many applications have floating point calculations one way
> or another, but don't use numpy because it's an overkill.

Can you give any example of such an application and how they would
actually benefit from "faster floats"? I'm curious why anyone who wants
fast FP calculations would use pure Python with CPython...

Discarding Numpy because it's "overkill" sounds misguided to me.
That's like discarding asyncio because it's "less overkill" to write
your own select() loop. It's often far more productive to use the
established, robust, optimized library rather than tweak your own
low-level code.

(and Numpy is easier to learn than asyncio ;-))

I'm not violently opposing the patch, but I think maintenance effort
devoted to such micro-optimizations is a bit wasted. And once you add
such a micro-optimization, trying to remove it often faces a barrage of
FUD about making Python slower, even if the micro-optimization is
practically worthless.

> Python 2 is much faster than Python 3 on any kind of numeric
> calculations.

Actually, it shouldn't really be faster on FP calculations, since the
float object hasn't changed (as opposed to int/long). So I'm skeptical
of FP-heavy code that would have been made slower by Python 3 (unless
there's also integer handling in that, e.g. indexing).
msg259565 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-04 14:18
>But the next question is then the overhead on the "slow" path, which requires a benchmark too! For example, use a subtype of int.

telco is such a benchmark (although it's very unstable).  It uses decimals extensively.  I've tested it many times on three different CPUs, and it doesn't seem to become any slower.


> Discarding Numpy because it's "overkill" sounds misguided to me.
That's like discarding asyncio because it's "less overkill" to write
your own select() loop. It's often far more productive to use the
established, robust, optimized library rather than tweak your own
low-level code.

Don't get me wrong, numpy is simply amazing!  But if you have a 100,000-line application that happens to have a few FP-related calculations here and there, you won't use numpy (unless you have had experience with it before).

My opinion on this: numeric operations in Python (and any general purpose language) should be as fast as we can make them.


> Python 2 is much faster than Python 3 on any kind of numeric
> calculations.

> Actually, it shouldn't really be faster on FP calculations, since the
float object hasn't changed (as opposed to int/long). So I'm skeptical
of FP-heavy code that would have been made slower by Python 3 (unless
there's also integer handling in that, e.g. indexing).

But it is faster.  That's visible on many benchmarks.  Even simple timeit one-liners can show that.  Probably it's because such benchmarks usually combine floats and ints, i.e. "2 * smth" instead of "2.0 * smth".
msg259567 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 14:24
Le 04/02/2016 15:18, Yury Selivanov a écrit :
> 
> But it is faster. That's visible on many benchmarks. Even simple
timeit one-liners can show that. Probably it's because such
benchmarks usually combine floats and ints, i.e. "2 * smth" instead of
"2.0 * smth".

So it's not about FP-related calculations anymore. It's about ints
having become slower ;-)
msg259568 - (view) Author: Yury Selivanov (Yury.Selivanov) * Date: 2016-02-04 14:27
>> But it is faster. That's visible on many benchmarks. Even simple
> timeit oneliners can show that. Probably it's because that such
> benchmarks usually combine floats and ints, i.e. "2 * smth" instead of
> "2.0 * smth".
> 
> So it's not about FP-related calculations anymore. It's about ints
> having become slower ;-)

I should have written 2 * smth_float vs 2.0 * smth_float
msg259571 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 15:40
It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C after the first cdecimal result, 5 repetitions or so).

fastint2.patch speeds up floats enormously and slows down decimal by 6%.
fastint_alt.patch slows down float *and* decimal (5% or so).

Overall the status quo isn't that bad, but I concede that float benchmarks like that are useful for PR.
msg259573 - (view) Author: Yury Selivanov (Yury.Selivanov) * Date: 2016-02-04 15:56
> 
> Stefan Krah added the comment:
> 
> It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C after the first cdecimal result, 5 repetitions or so).
> 
> fastint2.patch speeds up floats enormously and slows down decimal by 6%.
> fastint_alt.patch slows down float *and* decimal (5% or so).
> 
> Overall the status quo isn't that bad, but I concede that float benchmarks like that are useful for PR.
> 

Thanks Stefan! I'll update my patch to include Serhiy's ideas. The fact that fastint_alt slows down floats *and* decimals is not acceptable.

I'm all for keeping cpython and ceval loop simple, but we should not pass on opportunities to improve some things in a significant way.
msg259574 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-04 16:36
It is easy to extend fastint_alt.patch to support floats too. Here is a new patch.

> It's instructive to run ./python Modules/_decimal/tests/bench.py (Hit Ctrl-C after the first cdecimal result, 5 repetitions or so).

Note that this benchmark is not very stable. I ran it a few times and the difference between runs was about 20%.
msg259577 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 16:42
I've never seen 20% fluctuation in that benchmark between runs. The benchmark is very stable if you take the average of 10 runs.
msg259578 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 16:44
I mean, if you run the benchmark 10 times and the unpatched result is always between 11.3 and 12.0 for floats while the patched result is
between 12.3 and 12.9, for me the situation is clear.
msg259601 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 22:55
People should stop getting hung up about benchmark numbers and instead should first think about what they are trying to *achieve*. FP performance in pure Python does not seem like an important goal in itself. Also, some benchmarks may show variations which are randomly correlated with a patch (e.g. because of different code placement by the compiler interfering with instruction cache wayness). It is important not to block a patch because some random benchmark on some random machine shows an unexpected slowdown.

That said, both of Serhiy's patches are probably ok IMO.
msg259605 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 00:09
> People should stop getting hung up about benchmarks numbers and instead should first think about what they are trying to *achieve*. FP performance in pure Python does not seem like an important goal in itself.

I'm not sure how to respond to that.  Every performance aspect *is* important.  numpy isn't shipped with CPython, not everyone uses it.  In one of my programs I used colorsys extensively -- did I need to rewrite it using numpy?  Probably I could, but that was a simple shell script without any dependencies.

It also harms Python 3 adoption a little bit, since many benchmarks are still slower.  Some of them are FP related.

In any case, I think that if we can optimize something - we should.


> Also, some benchmarks may show variations which are randomly correlated with a patch (e.g. because of different code placement by the compiler interfering with instruction cache wayness). 

30-50% speed improvement is not a variation.  It's just that a lot less code gets executed if we inline some operations.


> It is important not to block a patch because some random benchmark on some random machine shows an unexpected slowdown.

Nothing is blocked atm, we're just discussing various approaches.


> That said, both of Serhiy's patches are probably ok IMO.

Current Serhiy's patches are incomplete.  In any case, I've been doing some research and will post another message shortly.
msg259607 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-05 01:05
Hi Yury,

> I'm not sure how to respond to that. Every performance aspect *is*
> important.

Performance is not a religion (not any more than security or any other
matter).  It is not helpful to brandish results on benchmarks which have
no relevance to real-world applications.

It helps to define what we should achieve and why we want to achieve it.
 Once you start asking "why", the prospect of speeding up FP
computations in the eval loop starts becoming dubious.

> numpy isn't shipped with CPython, not everyone uses it.

That's not the point. *People doing FP-heavy computations* should use
Numpy or any of the packages that can make FP-heavy computations faster
(Numba, Cython, Pythran, etc.).

You should use the right tool for the job.  There is no need to
micro-optimize a hammer for driving screws when you could use a
screwdriver instead.  Lists or tuples of Python float objects are an
awful representation for what should be vectorized native data.  They
eat more memory in addition to being massively slower (they will also be
slower to serialize from/to disk, etc.).

"Not using" Numpy when you would benefit from it is silly.
Numpy is not only massively faster on array-wide tasks, it also makes it
easier to write high-level, readable, reusable code instead of writing
loops and iterating by hand.  Because it has been designed explicitly
for such use cases (which the Python core was not, despite the existence
of the colorsys module ;-)).  It also gives you access to a large
ecosystem of third-party modules implementing various domain-specific
operations, actively maintained by experts in the field.

Really, the mindset of "people shouldn't need to use Numpy, they can do
FP computations in the interpreter loop" is counter-productive.  I
understand that it's seductive to think that Python core should stand on
its own, but it's also a dangerous fallacy.

You *should* advocate people use Numpy for FP computations.  It's an
excellent library, and it's currently a major selling point for Python.
Anyone doing FP-heavy computations with Python should learn to use
Numpy, even if they only use it from time to time.  Downplaying its
importance, and pretending core Python is sufficient, is not helpful.

> It also harms Python 3 adoption a little bit, since many benchmarks
> are still slower. Some of them are FP related.

The Python 3 migration is happening already. There is no need to worry
about it... Even the diehard 3.x haters have stopped talking of
releasing a 2.8 ;-)

> In any case, I think that if we can optimize something - we should.

That's not true. Some optimizations add maintenance overhead for no real
benefit. Some may even hinder performance as they add conditional
branches in a critical path (increasing the load on the CPU's branch
predictors and making them potentially less efficient).

Some optimizations are obviously good, like the method call optimization
which caters to real-world use cases (and, by the way, kudos for that...
you are doing much better than all previous attempts ;-)). But some are
solutions waiting for a problem to solve.
msg259612 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 01:37
tl;dr   I'm attaching a new patch - fastint4 -- the fastest of them all. It incorporates Serhiy's suggestion to export long/float functions and use them.  I think it's reasonably complete -- please review it, and let's get it committed.

== Benchmarks ==

spectral_norm (fastint_alt)    -> 1.07x faster
spectral_norm (fastintfloat)   -> 1.08x faster
spectral_norm (fastint3.patch) -> 1.29x faster
spectral_norm (fastint4.patch) -> 1.16x faster

spectral_norm (fastint**.patch)-> 1.31x faster
nbody (fastint**.patch)        -> 1.16x faster

Where:
- fastint3 - is my previous patch that nobody likes (it inlined a lot of logic from longobject/floatobject)

- fastint4 - is the patch I'm attaching and ideally want to commit

- fastint** - is a modification of fastint4.  This is very interesting -- I started to profile different approaches and found two bottlenecks that really made Serhiy's and my other patches slower than fastint3.  What I found is that PyLong_AsDouble can be significantly optimized, and PyLong_FloorDiv is super inefficient.

PyLong_AsDouble can be sped up several times if we add a fastpath for 1-digit longs:

    // longobject.c: PyLong_AsDouble
    if (PyLong_CheckExact(v) && Py_ABS(Py_SIZE(v)) <= 1) {
        /* fast path; a single digit always fits in a double */
        return (double)MEDIUM_VALUE((PyLongObject *)v);
    }
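This fast path is exact because an IEEE-754 double has a 53-bit significand, so any 30-bit (single-digit) int converts without rounding. A quick Python check of that claim (a sketch assuming the usual 30-bit digit size):

```python
# Any |v| < 2**30 round-trips through float exactly, since a double's
# 53-bit significand easily covers 30 bits; that is what makes the
# single-digit fast path in PyLong_AsDouble safe.
for v in (0, 1, -1, 2**30 - 1, -(2**30 - 1)):
    assert int(float(v)) == v      # conversion loses nothing
```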


PyLong_FloorDiv (fastint4 adds it) can be specialized for single digits, which gives it a tremendous boost.

With those two optimizations, fastint4 becomes as fast as fastint3.  I'll create separate issues for PyLong_AsDouble and FloorDiv.

== Micro-benchmarks ==

Floats + ints:  -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"

2.7:          0.42 (usec)
3.5:          0.619
fastint_alt   0.619
fastintfloat: 0.52
fastint3:     0.289
fastint4:     0.51
fastint**:    0.314

===

Ints:  -m timeit -s "x=2" "x + 10 + x * 20 - x // 3 + x* 10 + 20 -x"

2.7:          0.151 (usec)
3.5:          0.19
fastint_alt:  0.136
fastintfloat: 0.135
fastint3:     0.135
fastint4:     0.122
fastint**:    0.122


P.S. I have another variant of fastint4 that uses fast_* functions in the ceval loop, instead of a big macro.  Its performance is slightly worse than with the macro.
msg259614 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 01:48
Antoine, FWIW I agree on most of your points :)  And yes, numpy, scipy, numba, etc rock.

Please take a look at my fastint4.patch.  All tests pass, no performance regressions, no crazy inlining of floating point exceptions etc.  And yet we have a nice improvement for both ints and floats.
msg259626 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 04:04
Attaching another approach -- fastint5.patch.

Similar to what fastint4.patch does, but doesn't export any new APIs.  Instead, similarly to abstract.c, it uses type slots directly.
msg259663 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 15:10
Unless there are any objections, I'll commit fastint5.patch in a day or two.
msg259664 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 15:14
> Unless there are any objections, I'll commit fastint5.patch in a day or two.

Please don't. I would like to have time to benchmark all these patches (there are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's feedback on your latest patches.
msg259666 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-02-05 15:26
On 05.02.2016 16:14, STINNER Victor wrote:
> 
> Please don't. I would like to have time to benchmark all these patches (there are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's feedback on your latest patches.

Regardless of the performance, the fastint5.patch looks like the
least invasive approach to me. It also doesn't incur as much
maintenance overhead as the others do.

I'd only rename the macro MAYBE_DISPATCH_FAST_NUM_OP to
TRY_FAST_NUMOP_DISPATCH :-)

BTW: I do wonder why this approach is as fast as the others. Have
compilers grown smart enough to realize that the number slot
functions will not change and can thus be inlined?
msg259667 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 15:32
>> Unless there are any objections, I'll commit fastint5.patch in a day or two.

> Please don't. I would like to have time to benchmark all these patches (there are now 9 patches attached to the issue :-)) and I would like to hear Serhiy's feedback on your latest patches.

Sure, I'd very appreciate a review of fastint5.

I can save you some time on benchmarking -- it's really about fastint_alt vs fastint5.  The latter optimizes ALL ops on longs AND floats.  The former only optimizes some ops on longs.  So please be sure you're comparing oranges to oranges ;)
msg259668 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 15:43
> Regardless of the performance, the fastint5.patch looks like the
least invasive approach to me. It also doesn't incur as much
maintenance overhead as the others do.

Thanks.  It's a result of an enlightenment that can only come
after running benchmarks all day :)

> I'd only rename the macro MAYBE_DISPATCH_FAST_NUM_OP to
TRY_FAST_NUMOP_DISPATCH :-)

Yeah, your name is better.

> BTW: I do wonder why this approach is as fast as the others. Have
compilers grown smart enough to realize that the number slot
functions will not change and can thus be inlined ?

It looks like it; I'm very impressed myself.  I'd expect fastint3 (which just inlines a lot of logic directly in ceval.c) to be the fastest one.  But it seems that the compiler does an excellent job here.

Victor, BTW, if you want to test fastint3 vs fastint5, don't forget to apply the patch from issue #26288 over fastint5 (fixes slow performance of PyLong_AsDouble)
msg259669 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 15:58
bench_long2.py: my updated microbenchmark to test many types and more operations.

compare.txt: compare Python original, fastint_alt.patch, fastintfloat_alt.patch and fastint5.patch. "(*)" marks the minimum of the line, percents are relative to the minimum (if larger than +/-5%).

compare_to.txt: similar to compare.txt, but percents are relative to the original Python.
msg259670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 16:15
My analysis of benchmarks.

Even using CPU isolation to run benchmarks, the results look unreliable for very short benchmarks like 3 ** 2.0: I don't think that fastint_alt can make the operation 16% slower since it doesn't touch this code, no?

Well... as expected, the speedup is quite *small*: the largest difference is on "3 * 2" run 100 times: 18% faster with fastint_alt. We are talking about 1.82 us => 1.49 us: a delta of 330 ns. I expect a much larger difference if you compile a function to machine code using Cython or a JIT like Numba or PyPy. Remember that we are running *micro*-benchmarks, so we should not push overkill optimizations unless the speedup is really impressive.

It's quite obvious from the tables that fastint_alt.patch only optimizes int (float is not optimized). If we choose to optimize float too, fastintfloat_alt.patch and fastint5.patch look to have the *same* speed.

I don't see any overhead on Decimal + Decimal with any patch: good.

--

Between fastintfloat_alt.patch and fastint5.patch, I prefer fastintfloat_alt.patch, which is much easier to read, so probably much easier to debug. I hate huge macros when I have to debug code in gdb :-( I also like very much the idea of *reusing* existing functions, rather than duplicating code.

Even if Antoine doesn't seem interested in optimizations on float, I think that it's OK to add a few lines for this type; fastintfloat_alt.patch is not so complex. What do *you* think?

Why not optimize a**b? It's a common operation, especially 2**k, no?
msg259671 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 16:18
> Between fastintfloat_alt.patch and fastint5.patch, I prefer fastintfloat_alt.patch which is much easier to read, so probably much easier to debug. I hate huge macro when I have to debug code in gdb :-( I also like very much the idea of *reusing* existing functions, rather than duplicating code.

I disagree.

fastintfloat_alt exports a lot of functions from longobject/floatobject, something that I really don't like.  Lots of repetitive code in ceval.c also makes it harder to make sure everything is correct.
msg259672 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 16:22
Anyways, if it's about macro vs non-macro, I can inline the macro by hand (which I think is an inferior approach here).  But I'd like the final code to use my approach of using slots directly, instead of modifying longobject/floatobject to export lots of *internal* stuff.
msg259673 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 16:32
As to whether we want this patch committed or not, here's a mini-macro-something benchmark:


$ ./python.exe -m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
10000000 loops, best of 3: 0.115 usec per loop

$ python3.5 -m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
10000000 loops, best of 3: 0.141 usec per loop


$ ./python.exe -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
1000000 loops, best of 3: 0.308 usec per loop

$ python3.5 -m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
1000000 loops, best of 3: 0.652 usec per loop


Still, longs are 30-50% faster, FP is 100% faster.  I think it's a very good result.  Please don't block this patch.
msg259675 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-05 17:01
My patches were just samples. I'm glad that Yury incorporated the main idea and that this helps. If we apply any patch, I would prefer fastint5.patch. But I don't quite understand why it adds any gain. Is this just due to the overhead of calling PyNumber_Add? Then we should test with other compilers and with the LTO option. fastint5.patch adds overhead for type checks and increases the size of the ceval loop. What outweighs this overhead?

As for tests, it would be more honest to test with data whose results fall outside the small-int cache range (-5..256).
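Serhiy's point matters because CPython preallocates the ints in [-5, 256]: an operation whose result lands in that range skips allocation entirely, which can flatter a benchmark relative to the general case. A quick CPython-specific demonstration (the values are built at runtime to defeat constant folding and constant deduplication):

```python
# ints in [-5, 256] are cached singletons in CPython, so producing one
# allocates nothing; results outside the range allocate a fresh object
a = int(str(100))        # cached small int
b = int(str(100))
big_a = int(str(12345))  # outside the cache
big_b = int(str(12345))

assert a is b              # same preallocated object
assert big_a is not big_b  # two distinct allocations
```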
msg259678 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-05 17:16
Thanks, Serhiy,

> But I don't quite understand why it adds any gain. 

Perhaps, and this is just a guess - the fast path does only a couple of equality tests and one call for the actual op.  If it's long+long then long_add will be called directly.

PyNumber_Add has more overhead on:
- at least one extra call
- a few extra checks to guard against NotImplemented
- abstract.c/binary_op1 has a few more checks/slot lookups

So it looks like there are just far fewer instructions to execute.  If this guess is correct, then an LTO build without fast paths will still be somewhat slower.
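The extra work listed above can be made concrete. Here is a rough Python rendering of the generic dispatch in abstract.c's binary_op1, using __add__/__radd__ as stand-ins for the single C-level nb_add slot; it shows the NotImplemented checks and slot lookups that a direct long_add call skips:

```python
def binary_op1(v, w):
    """Sketch of the generic binary-op dispatch (abstract.c:binary_op1)."""
    # both slots are looked up before any arithmetic happens
    slotv = getattr(type(v), '__add__', None)
    slotw = getattr(type(w), '__radd__', None)
    # a proper subclass on the right gets first crack at the operation
    if (slotw is not None and type(w) is not type(v)
            and issubclass(type(w), type(v))):
        res = slotw(w, v)
        if res is not NotImplemented:
            return res
        slotw = None
    if slotv is not None:
        res = slotv(v, w)          # may answer NotImplemented...
        if res is not NotImplemented:
            return res
    if slotw is not None:
        res = slotw(w, v)          # ...forcing yet another call
        if res is not NotImplemented:
            return res
    raise TypeError('unsupported operand types')
```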

> Is this just due to overhead of calling PyNumber_Add? Then we should test with other compilers and with the LTO option.

I actually tried to compile CPython with LTO -- but couldn't.  Almost all of the C extension modules failed to link.  Do we compile official binaries with LTO?
msg259695 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-05 22:37
Serhiy Storchaka: "My patches were just samples. I'm glad that Yury incorporated the main idea and that this helps."

Oh, if even Serhiy prefers Yury's patches, I should read them again :-)

--

I read fastint5.patch one more time and I finally understood the following macros:

+#define NB_SLOT(slot) offsetof(PyNumberMethods, slot)
+#define NB_BINOP(nb_methods, slot) \
+    (*(binaryfunc*)(& ((char*)nb_methods)[NB_SLOT(slot)]))
+#define PY_LONG_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyLong_Type.tp_as_number, slot))(left, right)
+#define PY_FLOAT_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyFloat_Type.tp_as_number, slot))(left, right)

In short, a+b calls long_add(a, b) with that. On first read, I understood it as casting objects to C long or C double (don't ask me why).


I see a difference between fastint5.patch and fastintfloat_alt.patch: fastint5.patch resolves the address of long_add() at runtime, whereas fastintfloat_alt.patch gets a direct pointer to _PyLong_Add() at compile time. I expected a subtle speedup, but I'm unable to see it in benchmarks (again, both patches have the same speed).

The float path is simpler in fastint5.patch because it uses the same code whether right is float or long, but it adds more checks on the slow path. Neither patch seems to have a real impact on the slow path. Is it worth changing the second if to PyFloat_CheckExact() and then checking the type of right in the if body, to avoid the extra checks on the slow path?

(C checks look very cheap, so I think that I already replied to my own question :-))

--

fastint5.patch optimizes a+b, a-b, a*b, a/b and a//b. Why not other operators? List of operators from my constant folding optimization in fatoptimizer:

* int, float: a+b, a-b, a*b, a/b, +x, -x, ~x, a//b, a%b, a**b
* int only: a<<b, a>>b, a&b, a|b, a^b

If we optimize a//b, I suggest to also optimize a%b to be consistent. For integers, a**b, a<<b and a>>b would make sense too. Coming from the C language, I would prefer a<<b and a>>b over a*2**k or a//2**k, since I expect better performance.

For float, -x and +x may be common, but less so than a+b, a-b, a*b, a/b.

Well, what I'm trying to say: if we choose the fastintfloat_alt.patch design, we will have to expose a lot of new C functions in headers, and duplicate a lot of code.

To support more than 4 operators, we need a macro.

If we use a macro, it's cheap (in terms of code maintenance) to use it for most or even all operators.

--

> But I don't quite understand why it adds any gain. Is this just due to overhead of calling PyNumber_Add?

Hum, that's a good question.


> Then we should test with other compilers and with the LTO option.

There are projects for that (I don't recall the issue number), but I would prefer to handle LTO separately. Python supports platforms and compilers which don't implement LTO.


> fastint5.patch adds an overhead for type checks and increases the size of ceval loop. What is outweigh this overhead?

I stopped guessing the speedup just by reading the code or a patch. I only trust benchmarks :-)

Advice: don't trust yourself! Only trust benchmarks.
msg259702 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 00:10
Attached is the new version of the fastint5 patch.  I fixed most of the review comments.  I also optimized the %, << and >> operators.  I didn't optimize other operators because they are less common.  I guess we have to draw a line somewhere...

Victor, thanks a lot for your suggestion to drop the NB_SLOT etc. macros!  Without them the code is even simpler.
msg259706 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 01:31
inline-2.patch: more complete version of inline.patch.

It optimizes the same instructions as Python 2: BINARY_ADD, INPLACE_ADD, BINARY_SUBTRACT, INPLACE_SUBTRACT.


Quick & *dirty* microbenchmark:

$ ./python -m timeit -s 'x=1' 'x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x+x'

* Original: 287 ns
* fastint5_2.patch: 261 ns (-9%)
* inline-2.patch: 212 ns (-26%)


$ ./python -m timeit -s 'x=1000; y=1' 'x-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y-y'

* Original: 517 ns
* fastint5_2.patch: 469 ns (-9%)
* inline-2.patch: 442 ns (-15%)


Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

In terms of speedup, I expect that the Python 2 design (inline-2.patch) cannot be beaten by any other option, since it doesn't need any call into C helper functions and does everything inline in ceval.c.
msg259707 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 01:36
> Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

To no-one's surprise I prefer fastint5, because it optimizes almost all binary operators on both ints and floats.

inline-2.patch optimizes just + and -, and just for ints.  If the + and - performance of inline-2 is really important, I suggest merging it into fastint5, but I'd keep it simple ;)
msg259712 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 01:52
msg222985: Raymond Hettinger
"There also used to be a fast path for binary subscriptions with integer indexes.  I would like to see that performance regression fixed if it can be done cleanly."

The issue #26280 was opened to track this optimization.
msg259713 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 02:12
msg223186, Serhiy Storchaka about inline.patch: "Confirmed speed up about 20%. Surprisingly it affects even integers outside of the of preallocated small integers (-5...255)."

The optimization applies to Python ints with 0 or 1 digit, i.e. in the range [-2^30+1; 2^30-1].

Small integers in [-5; 256] might be faster still, but that comes from PyLong_FromLong() returning preallocated singletons.
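The 0-or-1-digit range follows from CPython's internal int representation: digits of PyLong_SHIFT bits each. A small sketch, assuming 30-bit digits (typical on 64-bit builds; 15-bit-digit builds also exist):

```python
PyLong_SHIFT = 30  # digit width; an assumption, configurable at build time

def ndigits(n):
    """Number of base-2**PyLong_SHIFT digits CPython needs for abs(n);
    0 for n == 0, matching ob_size semantics."""
    return (abs(n).bit_length() + PyLong_SHIFT - 1) // PyLong_SHIFT

# The fast path covers ints with 0 or 1 digit, i.e. [-2**30 + 1; 2**30 - 1]:
assert ndigits(0) == 0
assert ndigits(2**30 - 1) == 1
assert ndigits(-(2**30 - 1)) == 1
assert ndigits(2**30) == 2
```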
msg259714 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 02:22
myself> Ok. Now I'm lost. We have so many patches :-) Which one do you prefer?

I re-read this *old* issue in full -- well, *almost* all the messages.

Well, it's clear that no consensus has been found yet :-) I see two main trends: optimize most cases (most operators for int and float, e.g. fastint5_4.patch) versus optimize very few cases to limit changes and limit effects on ceval.c (e.g. inline-2.patch).

Marc-Andre and Antoine asked not to stick to micro-optimizations but to think wider: run macro benchmarks, like perf.py, and suggest PyPy, Numba, Cython & co. to users who need the best performance for numeric functions.

They also warned about subtle side-effects of any kind of change in ceval.c, which may be counter-productive. The long list of patches showed that some of them introduced performance *regressions*.

I don't expect that CPython can beat any compiler emitting machine code. CPython will always have to pay the price of boxing/unboxing and of its loop evaluating bytecode. We can do *better*; the question is "how far?".

I think that we have gone far enough in investigating *all* the different options to optimize 1+2 ;-) Each option was micro-benchmarked very carefully.

Now I suggest to focus on *macro* benchmarks to help us make a decision. I will run perf.py on fastint5_4.patch and inline-2.patch.
msg259722 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-06 08:37
> I see two main trends: optimize most cases (optimize most operators for int and float,  ex: fastint5_4.patch) versus optimize very few cases to limit changes and to limit effects on ceval.c (ex: inline-2.patch).

I agree that optimizing only very few cases may be better. We need to collect statistics on the use of different operations with different types in long runs of tests or benchmarks. If, say, division is used 100 times less often than addition, we shouldn't complicate the ceval loop to optimize it.
msg259729 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 15:47
Benchmark on inline-2.patch. No speedup, only slowdown.

I'm now running benchmark on fastint5_4.patch.

$ python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python.inline-2

Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### json_load ###
Min: 0.707290 -> 0.723411: 1.02x slower
Avg: 0.707845 -> 0.724238: 1.02x slower
Significant (t=-297.25)
Stddev: 0.00026 -> 0.00049: 1.8696x larger

### regex_v8 ###
Min: 0.066663 -> 0.070435: 1.06x slower
Avg: 0.066947 -> 0.071378: 1.07x slower
Significant (t=-17.98)
Stddev: 0.00172 -> 0.00177: 1.0313x larger

The following not significant results are hidden, use -v to show them:
2to3, chameleon_v2, django_v3, fastpickle, fastunpickle, json_dump_v2, nbody, tornado_http.

real    58m32.662s
user    57m43.058s
sys     0m47.428s
msg259730 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-06 15:49
Benchmark on fastint5_4.patch.

python3 -u perf.py --affinity=2-3,6-7 --rigorous ../default/python.orig ../default/python_fastint5_4

Report on Linux smithers 4.3.4-300.fc23.x86_64 #1 SMP Mon Jan 25 13:39:23 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### django_v3 ###
Min: 0.563959 -> 0.578181: 1.03x slower
Avg: 0.565383 -> 0.579137: 1.02x slower
Significant (t=-152.48)
Stddev: 0.00075 -> 0.00050: 1.4900x smaller

### fastunpickle ###
Min: 0.551076 -> 0.563469: 1.02x slower
Avg: 0.555481 -> 0.567028: 1.02x slower
Significant (t=-27.05)
Stddev: 0.00278 -> 0.00324: 1.1687x larger

### json_dump_v2 ###
Min: 2.737429 -> 2.662615: 1.03x faster
Avg: 2.754239 -> 2.685404: 1.03x faster
Significant (t=55.63)
Stddev: 0.00610 -> 0.01077: 1.7662x larger

### nbody ###
Min: 0.228548 -> 0.212292: 1.08x faster
Avg: 0.230082 -> 0.213574: 1.08x faster
Significant (t=73.74)
Stddev: 0.00175 -> 0.00139: 1.2567x smaller

### regex_v8 ###
Min: 0.041323 -> 0.048099: 1.16x slower
Avg: 0.041624 -> 0.049318: 1.18x slower
Significant (t=-45.38)
Stddev: 0.00123 -> 0.00116: 1.0613x smaller

The following not significant results are hidden, use -v to show them:
2to3, chameleon_v2, fastpickle, json_load, tornado_http.
msg259733 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 17:00
> ### regex_v8 ###
> Min: 0.041323 -> 0.048099: 1.16x slower
> Avg: 0.041624 -> 0.049318: 1.18x slower

I think this is a random fluctuation; that benchmark (and the re lib) doesn't use the operators much.  It can't be THAT much slower just because of optimizing a few binops.
msg259734 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 17:02
You're also running a very small subset of all benchmarks available. Please try the '-b all' option.  I'll also run benchmarks on my machines.
msg259743 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-06 17:38
Alright, I ran a few benchmarks myself.  In rigorous mode regex_v8 has the same performance on my 2013 MacBook Pro and on an 8-year-old i7 CPU (Linux).

Here're results of "perf.py -b raytrace,spectral_norm,meteor_contest,nbody ../cpython/python.exe ../cpython-git/python.exe -r"


fastint5:

### nbody ###
Min: 0.227683 -> 0.197046: 1.16x faster
Avg: 0.229366 -> 0.198889: 1.15x faster
Significant (t=137.31)
Stddev: 0.00170 -> 0.00142: 1.1977x smaller

### spectral_norm ###
Min: 0.296840 -> 0.262279: 1.13x faster
Avg: 0.299616 -> 0.265387: 1.13x faster
Significant (t=74.52)
Stddev: 0.00331 -> 0.00319: 1.0382x smaller

The following not significant results are hidden, use -v to show them:
meteor_contest, raytrace.


======


inline-2:


### raytrace ###
Min: 1.188825 -> 1.213788: 1.02x slower
Avg: 1.199827 -> 1.227276: 1.02x slower
Significant (t=-18.12)
Stddev: 0.00559 -> 0.01408: 2.5184x larger

### spectral_norm ###
Min: 0.296535 -> 0.277025: 1.07x faster
Avg: 0.299044 -> 0.278071: 1.08x faster
Significant (t=87.40)
Stddev: 0.00220 -> 0.00097: 2.2684x smaller

The following not significant results are hidden, use -v to show them:
meteor_contest, nbody.
msg259790 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-07 15:01
From what I can see there is no negative impact of the patch on stable macro benchmarks.

There is a quite detectable positive impact on most integer and float operations from my patch.  13-16% on the nbody and spectral_norm benchmarks is still impressive.  And you can see a huge improvement in various timeit micro-benchmarks.

fastint5 is a very compact patch that only touches the ceval.c file.  It doesn't complicate the code, as the macro is very straightforward.  Since the patch has passed the code review, thorough benchmarking and discussion stages, I'd like to commit it.
msg259791 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-07 15:08
Please don't commit it right now. Yes, due to the use of macros the patch looks simple, but the macros expand to complex code. We need more statistics.
msg259792 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-07 15:16
> Please don't commit it right now. Yes, due to the use of macros the patch looks simple, but the macros expand to complex code. We need more statistics.

But what will you use to gather the statistics?  The test suite isn't representative, and we already know what the benchmark suite will show.  I can assist with writing some code for stats, but what's the plan?
msg259793 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-07 16:18
#26288 brought a great speedup for floats. With fastint5_4.patch *on top of #26288* I see no improvement for floats and a big slowdown for _decimal.
msg259800 - (view) Author: Case Van Horsen (casevh) Date: 2016-02-07 19:30
Can I suggest the mpmath test suite as a good benchmark? I've used it to test the various optimizations in gmpy2 and it has always been a valuable real-world benchmark. And it is slower in Python 3 than Python 2....
msg259801 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-07 19:42
Be careful with test suites: first, they might exercise code that would never be a critical point for performance in a real-world application; second and most important, unittest seems to have gotten slower between 2.x and 3.x, so you would really be comparing apples to oranges.
msg259804 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-07 21:32
Attaching another patch - fastint6.patch that only optimizes longs (no FP fast path).

> #26288 brought a great speedup for floats. With fastint5_4.patch *on top of #26288* I see no improvement for floats and a big slowdown for _decimal.

What benchmark did you use?  What were the numbers?  I'm asking because earlier you benchmarked different patches that are conceptually similar to fastint5, and the result was that decimal was 5% faster with fast paths for just longs, and 6% slower with fast paths for longs & floats.

Also, some quick timeit results (quite stable from run to run):


-m timeit -s "x=2" "x + 10 + x * 20  + x* 10 + 20 -x"
3.6: 0.150usec           3.6+fastint: 0.112usec


-m timeit -s "x=2" "x*2.2 + 2 + x*2.5 + 1.0 - x / 2.0 + (x+0.1)/(x-0.1)*2 + (x+10)*(x-30)"
3.6: 0.425usec           3.6+fastint: 0.302usec
msg259832 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-08 09:46
Yury Selivanov:
> Alright, I ran a few benchmarks myself. (...)
> From what I can see there is no negative impact of the patch on stable macro benchmarks.

I'm disappointed by the results. In short, these patches have *no* impact on macro benchmarks, other than the two which stress the int and float types. Maybe we are just wasting our time on this issue...

I understand that the patches are only useful to get an xx% speedup (where xx% is smaller than 25%) if your whole application is dominated by numeric computations. If that's the case, I would suggest moving to PyPy, Numba, Cython, etc. From those tools I expect something more interesting than xx% faster: a much more impressive speedup.

http://speed.pypy.org/ : PyPy/CPython 2.7 for spectral_norm is 0.04: 25x faster. For nbody (nbody_modified), it's 0.09: 11x faster.

With patches of this issue, the *best* speedup is only 1.16x faster... We are *very* far from 11x or 25x faster. It's not even 2x faster...


Yury Selivanov:
> fastint5 is a very compact patch that only touches the ceval.c file.  It doesn't complicate the code, as the macro is very straightforward.  Since the patch has passed the code review, thorough benchmarking and discussion stages, I'd like to commit it.

According to my micro-benchmark msg259706, inline-2.patch is faster than fastint5_4.patch. I would suggest "finishing" inline-2.patch to optimize other operations, and *maybe* add fast paths for float.

On macro benchmarks, inline-2.patch is slower than fastint5_4.patch, but that was to be expected since I only added fast paths for int-int and only for a few operators.

The question is whether it is worth getting an xx% speedup on one or two specific benchmarks where CPython really sucks compared to other languages and other implementations of Python...


Stefan Krah:
> With fastint5_4.patch *on top of #26288* I see no improvement for floats and a big slowdown for _decimal.

How do you run your benchmark?


Case Van Horsen:
> Can I suggest the mpmath test suite as a good benchmark?

Where can we find this benchmark?


Case Van Horsen:
> it has always been a valuable real-world benchmark

What do you mean by "real-world benchmark"? :-)
msg259859 - (view) Author: Case Van Horsen (casevh) Date: 2016-02-08 16:30
mpmath is a library for arbitrary-precision floating-point arithmetic. It uses Python's native long type or gmpy2's mpz type for computations. It is available at https://pypi.python.org/pypi/mpmath.

The test suite can be run directly from the source tree. The test suite includes timing information for individual tests and for the entire run. Sample invocation:

~/src/mpmath-0.19/mpmath/tests$ time py36 runtests.py -local

For example, I've tried to optimize gmpy2's handling of binary operations between its mpz type and short Python integers. I've found it to provide useful results: improvements that are significant on a micro-benchmark (say 20%) will usually cause a small but repeatable improvement. And some improvements that looked good on a micro-benchmark would slow down mpmath.

I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall speedup when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%.

I will run more tests tonight.
msg259860 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-08 16:40
> I ran the mpmath test suite with Python 3.6 and with the fastint6 patch. The overall speedup when using Python's long type was about 1%. When using gmpy2's mpz type, there was a slowdown of about 2%.

> I will run more tests tonight.

Please try to test fastint5 too (fast paths for long & floats, whereas fastint6 is only focused on longs).
msg259918 - (view) Author: Case Van Horsen (casevh) Date: 2016-02-09 08:25
I ran the mpmath test suite with the fastint6 and fastint5_4 patches.

fastint6 results

without gmpy: 0.25% faster
with gmpy: 3% slower

fastint5_4 results

without gmpy: 1.5% slower
with gmpy: 5.5% slower
msg259919 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-09 09:15
Case Van Horsen added the comment:
> I ran the mpmath test suite with the fastint6 and fastint5_4 patches.
>
> fastint6 results
> without gmpy: 0.25% faster
> with gmpy: 3% slower
>
> fastint5_4 results
> without gmpy: 1.5% slower
> with gmpy: 5.5% slower

I'm more and more disappointed by this issue... If even a test
stressing int & float is *slower* (or less than 1% faster) with a
patch supposed to optimize them, what's the point? I'm also concerned
by the slowdown for other types (gmpy types).

Maybe we should just close the issue?
msg259948 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-09 18:24
> Maybe we should just close the issue?

I'll take a closer look at gmpy later. Please don't close.
msg259999 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-10 10:02
> The test suite can be run directly from the source tree. The test suite includes timing information for individual tests and for the entire run. Sample invocation:

I extracted the slowest test (test_polyroots_legendre) and put it in a loop of 5 iterations: see attached mpmath_bench.py. I ran this benchmark on Linux with 4 isolated CPUs (/sys/devices/system/cpu/isolated=2-3,6-7).
http://haypo-notes.readthedocs.org/misc.html#reliable-micro-benchmarks

On such setup, the benchmark looks stable. Example:

Run #1/5: 12.28 sec
Run #2/5: 12.27 sec
Run #3/5: 12.29 sec
Run #4/5: 12.28 sec
Run #5/5: 12.30 sec

test_polyroots_legendre (min of 5 runs):

* Original: 12.51 sec
* fastint5_4.patch: 12.27 sec (-1.9%)
* fastint6.patch: 12.21 sec (-2.4%)
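The "min of 5 runs" figures above can be reproduced with a small helper along these lines (a sketch; the actual script is the attached mpmath_bench.py, and `best_of` is a made-up name):

```python
import time

def best_of(func, runs=5):
    # run func several times and keep the minimum: for a CPU-bound job,
    # noise (scheduling, frequency scaling) only ever adds time, so the
    # minimum is the least noisy estimate
    best = float("inf")
    for i in range(runs):
        t0 = time.perf_counter()
        func()
        elapsed = time.perf_counter() - t0
        print("Run #%d/%d: %.2f sec" % (i + 1, runs, elapsed))
        best = min(best, elapsed)
    return best
```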

I ran tests without GMP, to stress the Python int type.

I guess that the benchmark is dominated by CPU time spent computing operations on large Python ints, not by the time spent in ceval.c. So the speedup is low (2%). Such a use case doesn't seem to benefit from the micro-optimizations discussed in this issue.

mpmath is an arbitrary-precision floating-point arithmetic library using Python ints (or GMP if available).
msg264018 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 14:05
Maybe we should adopt a different approach.

There is something called "inline caching": put the cache between the instructions, in the same memory block. Example of a paper on CPython:

"Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009)
https://www.sba-research.org/wp-content/uploads/publications/sac10.pdf

Maybe we can build something on top of the issue #26219 "implement per-opcode cache in ceval"?
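The inline-caching idea can be sketched in pure Python (an illustration with made-up names; a real implementation would attach the cache to the instruction itself and handle NotImplemented fallbacks, as in the issue #26219 work):

```python
class BinopCallSite:
    """One cache slot per BINARY_ADD instruction: remember the last pair
    of operand types and the handler resolved for them."""

    def __init__(self):
        self.types = None
        self.handler = None

    def add(self, a, b):
        pair = (type(a), type(b))
        if pair != self.types:
            # cache miss: do the generic (slow) lookup once, then memoize
            self.types = pair
            self.handler = type(a).__add__
        # cache hit: call the memoized handler directly
        return self.handler(a, b)

site = BinopCallSite()
assert site.add(1, 2) == 3          # miss: resolves int.__add__
assert site.add(3, 4) == 7          # hit: cached handler reused
assert site.add(1.0, 2.0) == 3.0    # miss: re-resolves for floats
```

The win is that a loop adding ints hits the cache on every iteration after the first, skipping the generic slot lookup entirely.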
msg264019 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-04-22 14:24
#14757 has an implementation of inline caching, which at least seemed to slow down some use cases. Then again, whenever someone posts a new speedup suggestion, it seems to slow down things I'm working on. At least Case van Horsen independently verified the phenomenon in this issue. :)
msg279021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 09:25
Between inline2.patch and fastint6.patch, it seems like inline2.patch is faster (between 9% and 12% faster than fastint6.patch).

Microbenchmark on Python default (rev 554fb699af8c), compilation using LTO (./configure --with-lto), GCC 6.2.1 on Fedora 24, Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz, perf 0.8.3 (dev version, just after 0.8.2).

Commands:

./python -m perf timeit --name='x+y' -s 'x=1; y=2' 'x+y' --dup 1000 -v -o timeit-$branch.json
./python -m perf timeit --name=sum -s "R=range(100)" "[x + x + 1 for x in R]" --dup 1000 -v --append timeit-$branch.json

Results:

$ python3 -m perf compare_to timeit-master.json timeit-inline2.json
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.14x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.29x faster

$ python3 -m perf compare_to timeit-master.json timeit-fastint6.json 
sum: Median +- std dev: [timeit-master] 6.23 us +- 0.13 us -> [timeit-fastint6] 6.09 us +- 0.11 us: 1.02x faster
x+y: Median +- std dev: [timeit-master] 15.0 ns +- 0.2 ns -> [timeit-fastint6] 12.7 ns +- 0.2 ns: 1.18x faster

$ python3 -m perf compare_to timeit-fastint6.json  timeit-inline2.json
sum: Median +- std dev: [timeit-fastint6] 6.09 us +- 0.11 us -> [timeit-inline2] 5.45 us +- 0.09 us: 1.12x faster
x+y: Median +- std dev: [timeit-fastint6] 12.7 ns +- 0.2 ns -> [timeit-inline2] 11.6 ns +- 0.2 ns: 1.09x faster
msg279022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 09:29
Result of performance 0.3.3 (and perf 0.8.3).

No major benchmark is faster. A few benchmarks even seem slower with fastint6.patch (but I don't really trust pybench).


== fastint6.patch ==

$ python3 -m perf compare_to master.json fastint6.json --group-by-speed --min-speed=5
Slower (3):
- pybench.ConcatUnicode: 52.7 ns +- 0.0 ns -> 56.1 ns +- 0.4 ns: 1.06x slower
- pybench.ConcatStrings: 52.7 ns +- 0.3 ns -> 56.1 ns +- 0.1 ns: 1.06x slower
- pybench.CompareInternedStrings: 16.5 ns +- 0.0 ns -> 17.4 ns +- 0.0 ns: 1.05x slower

Faster (4):
- pybench.SimpleIntFloatArithmetic: 441 ns +- 2 ns -> 400 ns +- 6 ns: 1.10x faster
- pybench.SimpleIntegerArithmetic: 441 ns +- 2 ns -> 401 ns +- 5 ns: 1.10x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 608 ns +- 6 ns: 1.06x faster
- genshi_text: 79.6 ms +- 0.5 ms -> 75.5 ms +- 0.8 ms: 1.05x faster

Benchmark hidden because not significant (114): 2to3, call_method, (...)


== inline2.patch ==

haypo@selma$ python3 -m perf compare_to master.json inline2.json --group-by-speed --min-speed=5
Faster (2):
- spectral_norm: 223 ms +- 1 ms -> 209 ms +- 1 ms: 1.07x faster
- pybench.SimpleLongArithmetic: 643 ns +- 4 ns -> 606 ns +- 7 ns: 1.06x faster

Benchmark hidden because not significant (119): 2to3, call_method, (...)
msg279023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 09:31
fastint6_inline2_json.tar.gz: archive of JSON files

- fastint6.json
- inline2.json
- master.json
- timeit-fastint6.json
- timeit-inline2.json
- timeit-master.json
msg279026 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-10-20 10:11
The fastest patch (inline2.patch) has a negligible impact on benchmarks. The purpose of an optimization is to make Python faster; that's not the case here, so I close the issue.

Using timeit, the largest speedup is 1.29x faster. Using performance, spectral_norm is 1.07x faster and pybench.SimpleLongArithmetic is 1.06x faster. I consider that spectral_norm and pybench.SimpleLongArithmetic are microbenchmarks and so not representative of a real application.

The issue was fun, thank you for playing the micro-optimization game with me ;-) Let's move on to more interesting optimizations with a larger impact on more realistic workloads, like caching global variables, optimizing method calls, fastcalls, etc.
msg279027 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-20 10:19
New changeset 61fcb12a9873 by Victor Stinner in branch 'default':
Issue #21955: Please don't try to optimize int+int
https://hg.python.org/cpython/rev/61fcb12a9873
History
Date User Action Args
2016-10-20 10:19:59python-devsetnosy: + python-dev
messages: + msg279027
2016-10-20 10:12:46vstinnersetresolution: fixed -> rejected
2016-10-20 10:11:39vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg279026
2016-10-20 09:31:18vstinnersetfiles: + fastint6_inline2_json.tar.gz

messages: + msg279023
2016-10-20 09:29:35vstinnersetmessages: + msg279022
2016-10-20 09:25:38vstinnersetmessages: + msg279021
2016-04-22 14:24:45skrahsetmessages: + msg264019
2016-04-22 14:05:45vstinnersetmessages: + msg264018
2016-02-10 10:02:15vstinnersetfiles: + mpmath_bench.py

messages: + msg259999
2016-02-09 18:24:01yselivanovsetmessages: + msg259948
2016-02-09 09:15:55vstinnersetmessages: + msg259919
2016-02-09 08:25:42casevhsetmessages: + msg259918
2016-02-08 16:40:24yselivanovsetmessages: + msg259860
2016-02-08 16:30:16casevhsetmessages: + msg259859
2016-02-08 09:46:11vstinnersetmessages: + msg259832
2016-02-07 21:32:54yselivanovsetfiles: + fastint6.patch

messages: + msg259804
2016-02-07 19:42:01pitrousetmessages: + msg259801
2016-02-07 19:30:03casevhsetmessages: + msg259800
2016-02-07 16:18:24skrahsetmessages: + msg259793
2016-02-07 15:16:35yselivanovsetmessages: + msg259792
2016-02-07 15:08:44serhiy.storchakasetmessages: + msg259791
2016-02-07 15:01:06yselivanovsetmessages: + msg259790
2016-02-06 17:38:20yselivanovsetmessages: + msg259743
2016-02-06 17:02:55yselivanovsetmessages: + msg259734
2016-02-06 17:00:25yselivanovsetmessages: + msg259733
2016-02-06 15:49:02vstinnersetmessages: + msg259730
2016-02-06 15:47:45vstinnersetmessages: + msg259729
2016-02-06 08:37:30serhiy.storchakasetmessages: + msg259722
2016-02-06 02:22:08vstinnersetmessages: + msg259714
2016-02-06 02:12:02vstinnersetmessages: + msg259713
2016-02-06 01:52:51vstinnersetmessages: + msg259712
2016-02-06 01:36:45yselivanovsetmessages: + msg259707
2016-02-06 01:31:57vstinnersetfiles: + inline-2.patch

messages: + msg259706
2016-02-06 01:29:55yselivanovsetfiles: + fastint5_4.patch
2016-02-06 00:45:07yselivanovsetfiles: + fastint5_3.patch
2016-02-06 00:10:27yselivanovsetfiles: + fastint5_2.patch

messages: + msg259702
2016-02-05 22:37:27vstinnersetmessages: + msg259695
2016-02-05 17:17:00yselivanovsetmessages: + msg259678
2016-02-05 17:01:35serhiy.storchakasetmessages: + msg259675
2016-02-05 16:32:58yselivanovsetmessages: + msg259673
2016-02-05 16:22:39yselivanovsetmessages: + msg259672
2016-02-05 16:18:59yselivanovsetmessages: + msg259671
2016-02-05 16:15:24vstinnersetmessages: + msg259670
2016-02-05 15:58:30vstinnersetfiles: + compare_to.txt
2016-02-05 15:58:24vstinnersetfiles: + compare.txt
2016-02-05 15:58:18vstinnersetfiles: + bench_long2.py

messages: + msg259669
2016-02-05 15:43:39yselivanovsetmessages: + msg259668
2016-02-05 15:32:30yselivanovsetmessages: + msg259667
2016-02-05 15:26:13lemburgsetmessages: + msg259666
2016-02-05 15:14:25vstinnersetmessages: + msg259664
2016-02-05 15:10:26yselivanovsetmessages: + msg259663
2016-02-05 04:04:35yselivanovsetfiles: + fastint5.patch

messages: + msg259626
2016-02-05 01:48:02yselivanovsetmessages: + msg259614
2016-02-05 01:37:43yselivanovsetfiles: + fastint4.patch

messages: + msg259612
2016-02-05 01:06:01pitrousetmessages: + msg259607
2016-02-05 00:09:37yselivanovsetmessages: + msg259605
2016-02-04 22:55:42pitrousetmessages: + msg259601
2016-02-04 16:44:07skrahsetmessages: + msg259578
2016-02-04 16:42:10skrahsetmessages: + msg259577
2016-02-04 16:36:19serhiy.storchakasetfiles: + fastintfloat_alt.patch

messages: + msg259574
2016-02-04 15:56:36Yury.Selivanovsetmessages: + msg259573
2016-02-04 15:40:09skrahsetnosy: + skrah
messages: + msg259571
2016-02-04 14:27:21Yury.Selivanovsetnosy: + Yury.Selivanov
messages: + msg259568
2016-02-04 14:24:48pitrousetmessages: + msg259567
2016-02-04 14:18:41yselivanovsetmessages: + msg259565
2016-02-04 14:06:50pitrousetmessages: + msg259564
2016-02-04 14:01:39vstinnersetmessages: + msg259563
2016-02-04 13:54:55yselivanovsetmessages: + msg259562
2016-02-04 12:50:15vstinnersetmessages: + msg259560
2016-02-04 10:30:04serhiy.storchakasetfiles: + fastint_alt.patch

messages: + msg259554
2016-02-04 09:37:42vstinnersetmessages: + msg259552
2016-02-04 09:35:46pitrousetmessages: + msg259549
2016-02-04 08:56:21lemburgsetmessages: + msg259545
2016-02-04 08:13:35vstinnersetmessages: + msg259542
2016-02-04 08:01:51vstinnersetmessages: + msg259541
2016-02-04 07:58:06lemburgsetmessages: + msg259540
2016-02-04 06:02:49yselivanovsetfiles: + fastint2.patch

messages: + msg259530
2016-02-03 19:43:20serhiy.storchakasetmessages: + msg259509
2016-02-03 19:35:22yselivanovsetmessages: + msg259508
2016-02-03 19:29:16vstinnersetmessages: + msg259506
2016-02-03 19:26:33yselivanovsetmessages: + msg259505
2016-02-03 19:19:30serhiy.storchakasetmessages: + msg259503
2016-02-03 19:04:03yselivanovsetmessages: + msg259502
2016-02-03 18:50:03lemburgsetnosy: + lemburg
messages: + msg259500
2016-02-03 17:52:40pitrousetmessages: + msg259499
2016-02-03 17:47:22zbyrnesetmessages: + msg259497
2016-02-03 17:21:01yselivanovsetmessages: + msg259496
2016-02-03 17:15:50zbyrnesetmessages: + msg259495
2016-02-03 17:07:18pitrousetmessages: + msg259494
2016-02-03 17:05:22vstinnersetmessages: + msg259493
2016-02-03 17:00:24yselivanovsetfiles: + fastint1.patch

messages: + msg259491
2016-02-03 16:40:31zbyrnesetfiles: + bench_results.txt

messages: + msg259490
2016-02-02 21:06:54yselivanovsetmessages: + msg259431
2016-02-02 20:56:51pitrousetmessages: + msg259429
2016-02-02 20:37:15zbyrnesetmessages: + msg259428
2016-02-02 18:55:11yselivanovsetversions: + Python 3.6, - Python 3.5
messages: + msg259417

assignee: yselivanov
components: + Interpreter Core
stage: patch review
2016-01-12 03:37:31zbyrnesetmessages: + msg258062
2016-01-12 02:42:00yselivanovsetnosy: + yselivanov
messages: + msg258060
2016-01-12 02:10:11zbyrnesetmessages: + msg258057
2015-03-18 15:53:44zbyrnesetmessages: + msg238455
2015-03-18 13:31:59vstinnersetmessages: + msg238437
2014-07-23 06:00:19zbyrnesetmessages: + msg223726
2014-07-23 01:20:41pitrousetnosy: + pitrou
messages: + msg223711
2014-07-22 02:34:35zbyrnesetfiles: + 21955_2.patch

messages: + msg223623
2014-07-16 18:31:45casevhsetnosy: + casevh
2014-07-16 14:40:57zbyrnesetmessages: + msg223214
2014-07-16 09:28:43serhiy.storchakasetmessages: + msg223186
2014-07-16 08:14:29vstinnersetmessages: + msg223180
2014-07-16 08:13:51vstinnersetmessages: - msg223179
2014-07-16 08:13:29vstinnersetfiles: + inline.patch
2014-07-16 08:13:14vstinnersetfiles: + bench_long.py

messages: + msg223179
2014-07-16 06:23:25serhiy.storchakasetmessages: + msg223177
2014-07-16 00:29:28zbyrnesetfiles: + 21955.patch

nosy: + zbyrne
messages: + msg223162

keywords: + patch
2014-07-14 00:42:07rhettingersetnosy: + rhettinger
messages: + msg222985
2014-07-12 09:01:55vstinnersetmessages: + msg222830
2014-07-12 09:01:30vstinnersetmessages: + msg222829
2014-07-12 07:19:28serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg222824
2014-07-11 22:23:35josh.rsetnosy: + josh.r
messages: + msg222804
2014-07-11 09:10:27vstinnercreate