classification
Title: Optimize calling type slots
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: haypo, pitrou, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2017-05-30 05:56 by serhiy.storchaka, last changed 2017-06-14 05:21 by serhiy.storchaka.

Files
File name Uploaded Description Edit
type-slot-calls.diff serhiy.storchaka, 2017-05-30 07:08
Pull Requests
URL Status Linked Edit
PR 1861 open serhiy.storchaka, 2017-05-30 14:03
PR 1883 merged serhiy.storchaka, 2017-05-31 06:47
Messages (17)
msg294739 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-30 05:56
Peter Cawley's excellent article "Why are slots so slow?" [1] analyses why `a + b` is slower than `a.__add__(b)` for a custom __add__ and suggests ways to optimize type slot calls. `a + b` and `a.__add__(b)` execute the same user code, and `a + b` should have less bytecode-interpreting overhead, yet it was 2 times slower than `a.__add__(b)`. In the article `a + b` was made 16% faster than `a.__add__(b)`.

In 3.7 the difference between the two ways is smaller, but `a + b` is still 25-30% slower than `a.__add__(b)`. After analyzing the article and comparing it with the current code, I found that virtually all of the proposed optimization steps were already applied in 3.7 by Victor! The difference is only in details.

The proposed patch tweaks the code and makes `a + b` only 12% slower than `a.__add__(b)`. There is a similar effect for other type slot calls.

[1] https://www.corsix.org/content/why-are-slots-so-slow
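[For readers following along: a minimal sketch, not part of the issue, of the two call forms being compared. Both dispatch to the same user-defined __add__; `a + b` goes through the nb_add type slot, while `a.__add__(b)` is an ordinary attribute lookup plus call.]

```python
class A:
    def __add__(self, other):
        return 42

a, b = A(), A()

# Both spellings execute the same user code in __add__;
# only the dispatch path through the interpreter differs.
assert a + b == 42
assert a.__add__(b) == 42
```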
msg294740 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-30 06:02
Despite the fact that `a + b` is still slower than `a.__add__(b)` in 3.7, it is more than 2 times faster than in 2.7 and 3.5.
msg294742 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-30 07:06
I have another patch that makes `a + b` faster by tens of percent, even faster than `a.__add__(b)`, just by adding functions and declarations that are never used. Confusing.
msg294749 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-30 11:01
> I have another patch that makes `a + b` faster by tens of percent, even faster than `a.__add__(b)`, just by adding functions and declarations that are never used. Confusing.

It seems like you are a victim of the "deadcode" issue related to code locality:
https://haypo.github.io/journey-to-stable-benchmark-deadcode.html

For microbenchmarks on such tiny functions, taking less than 200 ns, it's more reliable to compile Python using LTO+PGO.
msg294750 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-30 11:05
type-slot-calls.diff: Can you please create a pull request?

> `a + b` is still 25-30% slower than `a.__add__(b)`

Hum, can you please post microbenchmark results to show the effect of the patch?

> After analyzing the article and comparing it with the current code, I found that virtually all of the proposed optimization steps were already applied in 3.7 by Victor! The difference is only in details.

The article has two main points:

* the calling convention of the Python C API requires creating a tuple, and that's expensive
* "a + b" has complex semantics which require checking for __radd__, checking for issubclass(), etc.

Yeah, it seems like the FASTCALL changes I made in typeobject.c removed the overhead of the temporary tuple. Yury's and Naoki's work on CALL_METHOD also improved the performance of method calls here.

I don't think that we can change the semantics; we can only try to optimize the implementation.
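[A hedged illustration, not from the issue, of the second point: the operator form must honour the reflected method and the subclass-precedence rule, checks that an explicit a.__add__(b) call skips entirely.]

```python
class Base:
    def __add__(self, other):
        return "Base.__add__"

class Derived(Base):
    def __radd__(self, other):
        return "Derived.__radd__"

# When the right operand's type is a subclass of the left's and
# overrides the reflected method, Python tries __radd__ first.
assert Base() + Derived() == "Derived.__radd__"

# The explicit method call performs none of these checks.
assert Base().__add__(Derived()) == "Base.__add__"
```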
msg294757 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-30 14:00
> type-slot-calls.diff: Can you please create a pull request?

I provided just a patch because I expected that you might want to play with it and propose an alternative patch. It is simpler to compare patches with Rietveld than on GitHub. But if you prefer, I'll make a PR.

> Hum, can you please post microbenchmark results to show the effect of the patch?

$ cat x.py
class A(object):
    def __add__(self, other):
        return 42

$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a.__add__(b)'
Unpatched:  Mean +- std dev: 256 ns +- 9 ns
Patched:    Mean +- std dev: 255 ns +- 10 ns

$ ./python -m perf timeit -s 'from x import A; a = A(); b = A()' --duplicate 100 'a + b'
Unpatched:  Mean +- std dev: 332 ns +- 10 ns
Patched:    Mean +- std dev: 286 ns +- 5 ns

> * the calling convention of the Python C API requires to create a tuple, and that's expensive

The article also proposes other optimizations, like avoiding varargs and avoiding the creation of an intermediate method object. All of this is already applied as a side effect of your changes.

* "a + b" has complex semantics which require checking for __radd__, checking for issubclass(), etc.

Since a and b have the same type, the complex semantics don't come into play here.
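[A small sketch, mine rather than from the issue, of why the same-type case is simpler: when both operands have the same type and __add__ succeeds, the reflected __radd__ is never consulted.]

```python
calls = []

class A:
    def __add__(self, other):
        calls.append("__add__")
        return 42
    def __radd__(self, other):
        calls.append("__radd__")
        return -1

a, b = A(), A()
assert a + b == 42
# Only the left operand's __add__ ran; __radd__ was never consulted.
assert calls == ["__add__"]
```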
msg294762 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-30 14:29
> I provided just a patch because I expected that you perhaps will want to play with it and propose alternative patch. It is simpler to compare patches with Rietveld than on GitHub.

It seems like Rietveld is broken: there is no [Review] button on your patch. I wouldn't be surprised if it has been broken since CPython migrated to Git.
msg294763 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-30 14:37
The PR makes different changes:

* replace lookup_method() with lookup_maybe_method()
* specialize call_xxx() functions for a fixed number of parameters
* rename lookup_maybe() to _PyObject_LookupSpecial()

If possible, I would prefer not to duplicate functions for 0, 1 and 2 parameters (3 variants). I would like to know which changes are responsible for the speedup.

To ease the review, would it be possible to split your change into smaller changes? At least separate commits, maybe even a first "cleanup" PR before the "optimization" PR.
msg294817 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-31 06:56
PR 1883 cleans up the code related to calling type slots.

* Use _PyObject_LookupSpecial() instead of lookup_maybe() in set_names(). This isn't performance critical.
* Inline lookup_maybe() in _PyObject_LookupSpecial(). This was the only remaining use of lookup_maybe().
* Replace lookup_maybe_method() followed by raising an exception with lookup_method() in call_method().
* Replace lookup_method() with lookup_maybe_method() where the exception is ignored.
* Update outdated comments.
* Use call_method() in slot_sq_item. The comment about a super-optimized version was outdated; call_method() now implements this.
msg294839 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-31 12:24
Sorry, wrong data. PR 1883 makes indexing 1.2 times faster, and PR 1861 makes it 1.7 times faster.

$ ./python -m perf timeit -s 'class A:' -s '  def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' --duplicate 100 'list(a)'

Unpatched:  Mean +- std dev: 498 us +- 26 us
PR 1883:    Mean +- std dev: 351 us +- 10 us
PR 1861:    Mean +- std dev: 288 us +- 7 us
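[Context for readers: this benchmark exercises the legacy sequence iteration protocol through the sq_item slot. Since A defines no __iter__, list() calls __getitem__ with 0, 1, 2, ... until IndexError is raised. A self-contained sketch, not from the issue:]

```python
class A:
    def __init__(self, data):
        self.data = data
    def __getitem__(self, i):
        # Raises IndexError past the end, which terminates iteration.
        return self.data[i]

a = A(tuple(range(5)))
# No __iter__ defined: list() falls back to repeated sq_item calls.
assert list(a) == [0, 1, 2, 3, 4]
```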
msg294843 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-31 14:02
FYI you can use "./python -m perf timeit --compare-to=./python-ref" if you keep the "reference" Python (unpatched), so that perf computes the "?.??x slower/faster" factor for you ;-)
msg294848 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-31 16:13
Thank you, I know about this, but it takes twice as much time, so I don't use it regularly. And it doesn't allow comparing three versions. :-(
msg294905 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-01 05:18
New changeset 4e624ca50a665d7e4d527ab98932347ff43a19b0 by Serhiy Storchaka in branch 'master':
bpo-30509: Clean up calling type slots. (#1883)
https://github.com/python/cpython/commit/4e624ca50a665d7e4d527ab98932347ff43a19b0
msg295062 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-06-03 00:31
I believe Rietveld does not work with git-format patches.  I don't know if git can produce the format hg did.
msg295953 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-13 19:22
$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a.__add__(b)'
Mean +- std dev: [python-orig] 229 ns +- 9 ns -> [python] 235 ns +- 13 ns: 1.02x slower (+2%)

$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __add__(s, o): return s' -s 'a = A(); b = A()' --duplicate=100 'a + b'
Mean +- std dev: [python-orig] 277 ns +- 10 ns -> [python] 251 ns +- 23 ns: 1.10x faster (-9%)

$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __add__(s, o): return s' -s 'a = [A() for i in range(1000)]' 'sum(a, A())'
Mean +- std dev: [python-orig] 259 us +- 17 us -> [python] 218 us +- 16 us: 1.19x faster (-16%)

$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __getitem__(s, i): return t[i]' -s 'a = A(); t = tuple(range(1000))' 'list(a)'
Mean +- std dev: [python-orig] 324 us +- 14 us -> [python] 300 us +- 16 us: 1.08x faster (-8%)

$ ./python -m perf timeit -q --compare-to=./python-orig -s 'class A:' -s '  def __neg__(s): return s' -s 'a = A()' --duplicate=100 '(----------a)'
Mean +- std dev: [python-orig] 2.12 us +- 0.13 us -> [python] 1.91 us +- 0.11 us: 1.11x faster (-10%)
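[For readers: the (----------a) expression chains ten unary minus operators, so each timing-loop iteration makes ten nb_negative slot calls into __neg__. A sketch counting the calls, mine rather than from the issue:]

```python
class A:
    count = 0
    def __neg__(self):
        # Each unary minus in the chain lands here once.
        A.count += 1
        return self

a = A()
result = ----------a  # ten chained unary minus applications
assert result is a
assert A.count == 10
```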
msg295963 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-06-13 22:07
I'm not sure about adding Py_LOCAL_INLINE() (static inline); I'm not sure that it's needed when you use PGO compilation.

Would it be possible to run your benchmark again without the added Py_LOCAL_INLINE(), please?

It's hard to say no to a change that makes so many core Python functions faster. I'm just surprised that "specializing" the "call_unbound" and "call_method" functions makes the code up to 1.2x faster.
msg295984 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-14 05:21
Without Py_LOCAL_INLINE all microbenchmarks become about 20% slower.

I'm not sure that all these changes are needed. Maybe the same effect can be achieved with smaller changes. But so far I have tried and failed to achieve the same performance with a smaller patch. Maybe you will have more luck.

Note that even with this patch type slots are still about 5% slower than ordinary methods (despite the fact that using operators needs fewer bytecode instructions than calling a method). There is some additional overhead.
History
Date User Action Args
2017-06-14 05:21:34serhiy.storchakasetmessages: + msg295984
2017-06-13 22:07:49hayposetmessages: + msg295963
2017-06-13 19:22:42serhiy.storchakasetmessages: + msg295953
2017-06-03 00:31:32terry.reedysetnosy: + terry.reedy
messages: + msg295062
2017-06-01 05:18:28serhiy.storchakasetmessages: + msg294905
2017-05-31 16:13:18serhiy.storchakasetmessages: + msg294848
2017-05-31 14:02:43hayposetmessages: + msg294843
2017-05-31 12:24:56serhiy.storchakasetmessages: - msg294837
2017-05-31 12:24:41serhiy.storchakasetmessages: + msg294839
2017-05-31 12:18:51serhiy.storchakasetmessages: + msg294837
2017-05-31 06:56:19serhiy.storchakasetmessages: + msg294817
2017-05-31 06:47:28serhiy.storchakasetpull_requests: + pull_request1959
2017-05-30 14:37:47hayposetmessages: + msg294763
2017-05-30 14:29:23hayposetmessages: + msg294762
2017-05-30 14:03:35serhiy.storchakasetpull_requests: + pull_request1944
2017-05-30 14:00:33serhiy.storchakasetmessages: + msg294757
2017-05-30 11:05:34hayposetmessages: + msg294750
2017-05-30 11:01:51hayposetmessages: + msg294749
2017-05-30 07:08:10serhiy.storchakasetfiles: - type-slot-calls.diff
2017-05-30 07:08:02serhiy.storchakasetfiles: + type-slot-calls.diff
2017-05-30 07:06:43serhiy.storchakasetmessages: + msg294742
2017-05-30 06:02:04serhiy.storchakasetmessages: + msg294740
2017-05-30 05:56:40serhiy.storchakacreate