msg294739 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-30 05:56 |
In excellent Peter Cawley's article "Why are slots so slow?" [1] analysed causes why `a + b` is slower than `a.__add__(b)` for custom __add__ and provided suggestions for optimizing type slot calls. `a + b` and `a.__add__(b)` execute the same user code, `a + b` should have smaller overhead of bytecode interpreting, but it was 2 times slower than `a.__add__(b)`. In the article `a + b` has been made 16% faster than `a.__add__(b)`.
In 3.7 the difference between two ways is smaller, but `a + b` still is 25-30% slower than `a.__add__(b)`. After analyzing the article and comparing it with the current code I have found that virtually all proposed optimization steps already applied in 3.7 by Victor! The difference is only in details.
The proposed patch tweaks the code and makes `a + b` only 12% slower than `a.__add__(b)`. There is similar effect for other type slot calls.
[1] https://www.corsix.org/content/why-are-slots-so-slow
|
msg294740 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-30 06:02 |
Despite the fact that `a + b` still is slower than `a.__add__(b)` in 3.7, it is more than 2 times faster than in 2.7 and 3.5.
|
msg294742 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-30 07:06 |
I have other patch that makes `a + b` yet tens percents faster, faster than `a.__add__(b)`, by adding functions and declarations that are never used. Confusing.
|
msg294749 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-05-30 11:01 |
> I have other patch that makes `a + b` yet tens percents faster, faster than `a.__add__(b)`, by adding functions and declarations that are never used. Confusing.
It seems like you are a victim of the "deadcode" issue related to code locality:
https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
To run microbenchmarks on such very tiny functions taken less than 200 ns, it's more reliable to compile Python using LTO+PGO.
|
msg294750 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-05-30 11:05 |
type-slot-calls.diff: Can you please create a pull request?
> `a + b` still is 25-30% slower than `a.__add__(b)`
Hum, can you please post a microbenchmark results to see the effect of the patch?
> After analyzing the article and comparing it with the current code I have found that virtually all proposed optimization steps already applied in 3.7 by Victor! The difference is only in details.
The article has two main points:
* the calling convention of the Python C API requires to create a tuple, and that's expensive
* "a + b" has a complex semantics which requires to check for __radd__, check for issubclass(), etc.
Yeah, it seems like the FASTCALL changes I made in typeobject.c removed the overhead of the temporary tuple. Yury's and Naoki's work on CALL_METHOD also improved performances here on method calls.
I don't think that we can change the semantics, only try to optimize the implementation.
|
msg294757 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-30 14:00 |
> type-slot-calls.diff: Can you please create a pull request?
I provided just a patch because I expected that you perhaps will want to play with it and propose alternative patch. It is simpler to compare patches with Rietveld than on GitHub. But if you prefer, I'll make a PR.
> Hum, can you please post a microbenchmark results to see the effect of the patch?
$ cat x.py
class A(object):
def __add__(self, other):
return 42
$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a.__add__(b)'
Unpatched: Mean +- std dev: 256 ns +- 9 ns
Patched: Mean +- std dev: 255 ns +- 10 ns
$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a + b'
Unpatched: Mean +- std dev: 332 ns +- 10 ns
Patched: Mean +- std dev: 286 ns +- 5 ns
> * the calling convention of the Python C API requires to create a tuple, and that's expensive
It also makes other optimizations, like avoiding using varargs and creating immediate method object. All this already is applied as side effects of your changes.
* "a + b" has a complex semantics which requires to check for __radd__, check for issubclass(), etc.
Since a and b have the same type the complex semantic doesn't play a role here.
|
msg294762 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-05-30 14:29 |
> I provided just a patch because I expected that you perhaps will want to play with it and propose alternative patch. It is simpler to compare patches with Rietveld than on GitHub.
It seems like Rietveld is broken: there is no [Review] button on your patch. I wouldn't be suprised that it's broken since CPython migrated to Git.
|
msg294763 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-05-30 14:37 |
The PR makes different changes:
* replace lookup_method() with lookup_maybe_method()
* specialize call_xxx() functions for a fixed number of parameters
* rename lookup_maybe() to _PyObject_LookupSpecial()
If possible, I would prefer to not have to duplicate functions for 0, 1 and 2 parameters (3 variants). I would like to know which changes are responsible for the speedup.
To ease the review, would it be possible to split your change into smaller changes? At least, separated commits, maybe even a first "cleanup" PR before the "optimization" PR.
|
msg294817 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-31 06:56 |
PR 1883 cleans up the code related to calling type slots.
* Use _PyObject_LookupSpecial() instead of lookup_maybe() in set_names(). This isn't performance critical.
* Inline lookup_maybe() in _PyObject_LookupSpecial(). This was the only remaining use of lookup_maybe().
* Replace lookup_maybe_method() and following raising an exception with lookup_method() in call_method().
* Replace lookup_method() with lookup_maybe_method() if the exception is ignored.
* Update outdated comments.
* Use call_method() in slot_sq_item. The comment about super-optimized version was outdated, now call_method() implements this.
|
msg294839 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-31 12:24 |
Sorry, wrong data. PR 1883 makes indexing 1.2 times faster, PR 1861 makes it 1.7 times faster
$ ./python -m perf timeit -s 'class A:' -s ' def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' --duplicate 100 'list(a)'
Unpatched: Mean +- std dev: 498 us +- 26 us
PR 1883: Mean +- std dev: 351 us +- 10 us
PR 1861: Mean +- std dev: 288 us +- 7 us
|
msg294843 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-05-31 14:02 |
FYI you can use "./python -m perf timeit --compare-to=./python-ref" if you keep the "reference" Python (unpatched), so perf computes for you the "?.??x slower/faster" factor ;-)
|
msg294848 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-05-31 16:13 |
Thank you, I know about this, but it takes twice more time, so I don't use it regularly. And it doesn't allow to compare three versions. :-(
|
msg294905 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-06-01 05:18 |
New changeset 4e624ca50a665d7e4d527ab98932347ff43a19b0 by Serhiy Storchaka in branch 'master':
bpo-30509: Clean up calling type slots. (#1883)
https://github.com/python/cpython/commit/4e624ca50a665d7e4d527ab98932347ff43a19b0
|
msg295062 - (view) |
Author: Terry J. Reedy (terry.reedy) * |
Date: 2017-06-03 00:31 |
I believe Rietveld does not work with git-format patches. I don't know if git can produce the format hg did.
|
msg295953 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-06-13 19:22 |
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a.__add__(b)'
Mean +- std dev: [python-orig] 229 ns +- 9 ns -> [python] 235 ns +- 13 ns: 1.02x slower (+2%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a + b'
Mean +- std dev: [python-orig] 277 ns +- 10 ns -> [python] 251 ns +- 23 ns: 1.10x faster (-9%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __add__(s, o): return s' -s 'a = [A() for i in range(1000)]' 'sum(a, A())'
Mean +- std dev: [python-orig] 259 us +- 17 us -> [python] 218 us +- 16 us: 1.19x faster (-16%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' 'list(a)'
Mean +- std dev: [python-orig] 324 us +- 14 us -> [python] 300 us +- 16 us: 1.08x faster (-8%)
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s ' def __neg__(s): return s' -s 'a = A()' --duplicate=100 '(----------a)'
Mean +- std dev: [python-orig] 2.12 us +- 0.13 us -> [python] 1.91 us +- 0.11 us: 1.11x faster (-10%)
|
msg295963 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2017-06-13 22:07 |
I'm not sure about adding Py_LOCAL_INLINE() (static inline). I'm not sure that it's needed when you use PGO compilation.
Would it be possible to run again your benchmark without added Py_LOCAL_INLINE() please?
It's hard to say no to a change makes so many core Python functions faster. I'm just suprised that "specializing" the "call_unbound" and "call_method" functions make the code up to 1.2x faster.
|
msg295984 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2017-06-14 05:21 |
Without Py_LOCAL_INLINE all mickrobenchmarks become about 20% slower.
I'm not sure that all these changes are needed. Maybe the same effect can be achieved by smaller changes. But I tried and failed to achieve the same performance with a smaller patch yet. Maybe you will be more lucky.
Note that even with this patch type slots still about 5% slower than ordinal methods (despite the fact that using operators needs less bytecode instructions than calling a method). There is some additional overhead.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:47 | admin | set | github: 74694 |
2019-05-17 22:24:57 | cheryl.sabella | set | versions:
+ Python 3.8, - Python 3.7 |
2017-06-14 05:21:34 | serhiy.storchaka | set | messages:
+ msg295984 |
2017-06-13 22:07:49 | vstinner | set | messages:
+ msg295963 |
2017-06-13 19:22:42 | serhiy.storchaka | set | messages:
+ msg295953 |
2017-06-03 00:31:32 | terry.reedy | set | nosy:
+ terry.reedy messages:
+ msg295062
|
2017-06-01 05:18:28 | serhiy.storchaka | set | messages:
+ msg294905 |
2017-05-31 16:13:18 | serhiy.storchaka | set | messages:
+ msg294848 |
2017-05-31 14:02:43 | vstinner | set | messages:
+ msg294843 |
2017-05-31 12:24:56 | serhiy.storchaka | set | messages:
- msg294837 |
2017-05-31 12:24:41 | serhiy.storchaka | set | messages:
+ msg294839 |
2017-05-31 12:18:51 | serhiy.storchaka | set | messages:
+ msg294837 |
2017-05-31 06:56:19 | serhiy.storchaka | set | messages:
+ msg294817 |
2017-05-31 06:47:28 | serhiy.storchaka | set | pull_requests:
+ pull_request1959 |
2017-05-30 14:37:47 | vstinner | set | messages:
+ msg294763 |
2017-05-30 14:29:23 | vstinner | set | messages:
+ msg294762 |
2017-05-30 14:03:35 | serhiy.storchaka | set | pull_requests:
+ pull_request1944 |
2017-05-30 14:00:33 | serhiy.storchaka | set | messages:
+ msg294757 |
2017-05-30 11:05:34 | vstinner | set | messages:
+ msg294750 |
2017-05-30 11:01:51 | vstinner | set | messages:
+ msg294749 |
2017-05-30 07:08:10 | serhiy.storchaka | set | files:
- type-slot-calls.diff |
2017-05-30 07:08:02 | serhiy.storchaka | set | files:
+ type-slot-calls.diff |
2017-05-30 07:06:43 | serhiy.storchaka | set | messages:
+ msg294742 |
2017-05-30 06:02:04 | serhiy.storchaka | set | messages:
+ msg294740 |
2017-05-30 05:56:40 | serhiy.storchaka | create | |