classification
Title: Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects
Type: performance Stage: resolved
Components: Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: haypo, inada.naoki, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2017-01-13 12:24 by haypo, last changed 2017-06-09 11:24 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
tp_fastcall.patch haypo, 2017-01-13 15:13 review
tp_fastcall-2.patch haypo, 2017-01-17 02:41 review
tp_fastcall-3.patch haypo, 2017-01-18 01:24
tp_fastcall-4.patch haypo, 2017-01-18 07:47 review
tp_fastcall-5.patch haypo, 2017-01-20 00:08 review
tp_fastcall-6.patch haypo, 2017-01-26 02:19 review
Pull Requests
URL Status Linked Edit
PR 65 haypo, 2017-01-13 12:24
PR 1886 merged haypo, 2017-05-31 14:39
Messages (48)
msg285388 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 12:24
A new FASTCALL calling convention was added to Python 3.6. It makes it possible to avoid creating a temporary tuple to pass positional arguments and a temporary dictionary to pass keyword arguments. A new METH_FASTCALL calling convention was added for C functions. Most functions now support fastcall, except objects with a __call__() method, which have to go through slot_tp_call(), which still requires a tuple and a dictionary.
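To make the difference concrete, here is a rough Python model of the two conventions (the helper names are made up for illustration; the real implementation is C):

```python
# Hypothetical model of the two calling conventions, not actual CPython code.

def call_regular(func, args, kwargs):
    # Regular tp_call convention: positional args packed in a tuple,
    # keyword args packed in a dict (both allocated by the caller).
    return func(*args, **kwargs)

def call_fastcall(func, stack, nargs, kwnames):
    # FASTCALL convention: arguments live in a flat array ("stack").
    # The first nargs entries are positional; the remaining entries are
    # the values of the keyword arguments named in kwnames.
    positional = stack[:nargs]
    keywords = dict(zip(kwnames, stack[nargs:]))
    return func(*positional, **keywords)

def greet(name, punct="!"):
    return "Hello, " + name + punct

assert call_regular(greet, ("world",), {"punct": "?"}) == "Hello, world?"
assert call_fastcall(greet, ["world", "?"], 1, ("punct",)) == "Hello, world?"
```

In C, the fastcall callee reads arguments straight off the caller's stack, so neither the tuple nor the dict ever has to be allocated.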

I tried multiple implementations to support fast calls to the __call__() method, but I ran into practical and technical issues.

First, I tried to reuse the tp_call field of PyTypeObject: it could hold either a regular call (tuple/dict for arguments) or a fast call, discriminated by a new flag in the tp_flags field. It was tricky to support class inheritance and to decide when to set or clear the flag. But the real blocker is that it obviously breaks backward compatibility: existing code calling tp_call directly with the regular calling convention crashes immediately, and the error is not caught during compilation, even if the code is recompiled.

I propose a different design: add a new tp_fastcall field to PyTypeObject and use a wrapper for tp_call when tp_fastcall is defined. If a type defines tp_fastcall but not tp_call, the tp_call wrapper "simply" calls tp_fastcall. Advantages:

* The wrapper is trivial
* Minor changes to PyType_Ready() to support inheritance (simple logic)
* Fully backward compatible
* If tp_call is called directly without keyword arguments, there is no overhead, only a speedup!

Inheritance:

* If a type only defines tp_call, tp_fastcall is not inherited from the parent: tp_fastcall is set to NULL.
* If a type only defines tp_fastcall: tp_fastcall is always used (tp_call uses the wrapper)
* If a type defines tp_call and tp_fastcall, PyObject_Call() uses tp_call whereas _PyObject_FastCallDict() uses tp_fastcall.
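A toy Python model of these three rules (made-up names, only a sketch of the proposed slot machinery):

```python
# Toy model of the proposed slot dispatch, not CPython internals.

class Slots:
    def __init__(self, tp_call=None, tp_fastcall=None):
        if tp_fastcall is not None and tp_call is None:
            # Rule 2: tp_fastcall defined alone -> tp_call is a trivial wrapper.
            def wrapper(args, kwargs):
                stack = list(args) + list(kwargs.values())
                return tp_fastcall(stack, len(args), tuple(kwargs))
            tp_call = wrapper
        self.tp_call = tp_call
        self.tp_fastcall = tp_fastcall

def object_call(slots, args, kwargs):
    # PyObject_Call() path: uses tp_call (rule 3 prefers tp_call here).
    return slots.tp_call(args, kwargs)

def fastcall_dict(slots, stack, nargs, kwnames):
    # _PyObject_FastCallDict() path: prefers tp_fastcall when available.
    if slots.tp_fastcall is not None:
        return slots.tp_fastcall(stack, nargs, kwnames)
    args = tuple(stack[:nargs])
    kwargs = dict(zip(kwnames, stack[nargs:]))
    return slots.tp_call(args, kwargs)

def fast_add(stack, nargs, kwnames):
    return sum(stack[:nargs])

s = Slots(tp_fastcall=fast_add)
assert object_call(s, (1, 2), {}) == 3       # goes through the tp_call wrapper
assert fastcall_dict(s, [1, 2], 2, ()) == 3  # direct fastcall, no packing
```

Either entry point works whichever slot a type defines; the packing/unpacking only happens when crossing between the two conventions.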

Functions of the C API will be modified to use tp_fastcall if available.

The plan is then to patch most Python types to replace their tp_call with tp_fastcall. First, the most important (common) types like Python and C functions, descriptors, and the various kinds of wrappers should be patched. Later, we should discuss on a case-by-case basis whether it's worth it for other types.

I will try to run benchmarks before going further.
msg285389 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 12:26
Naoki wants to use FASTCALL for CALL_METHOD:
http://bugs.python.org/issue26110#msg283093

tp_fastcall should make it easy to implement.
msg285390 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 12:33
I started to work on FASTCALL because I dislike the "cached tuple" hack used in some performance-critical code; the hack causes various tricky but severe issues (it can lead to a segfault).

Thanks to tp_fastcall, it becomes possible to drop the "cached tuple" hack from property_descr_get() *and* keep good performance.

First, a benchmark to show the performance gain of using the "cached tuple". I modified property_descr_get() to use the Python 3.4 code, which doesn't have the optimization:

$ ./python -m perf compare_to py34.json ref.json 
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [ref] 50.0 ns +- 0.9 ns: 1.50x faster (-33%)

It's MUCH faster, good job. But it requires complex and fragile code. Ok, let's see with operator.itemgetter() supporting tp_fastcall, Python modified to use tp_fastcall and without the "cached arg" hack:

$ ./python -m perf compare_to ref.json fastcall_wrapper.json 
Median +- std dev: [ref] 50.0 ns +- 0.9 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.04x faster (-4%)

It's a little bit faster, but that's not the point. The point is that it isn't slower and it doesn't require modifying C code to benefit from the optimization! Just to be clear, another benchmark result on property_descr_get() without "cached args", without fastcall (py34) and with fastcall ("fastcall_wrapper"):

$ ./python -m perf compare_to py34.json fastcall_wrapper.json 
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.56x faster (-36%)

Summary:

* tp_fastcall allows removing the "cached args" hack, which will fix severe issues in corner cases
* tp_fastcall makes existing code faster for free. I mean, builtin types should be modified to support tp_fastcall, but most code *calling* these types doesn't need any change.
msg285400 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-13 14:47
> * If a type only defines tp_fastcall: tp_fastcall is always used (tp_call uses the wrapper)

Is tp_call set to the wrapper rather than inherited? What if tp_call is defined in a superclass?

> * If a type defines tp_call and tp_fastcall, PyObject_Call() uses tp_call whereas _PyObject_FastCallDict() uses tp_fastcall.

I would consider this a bug. It would be weird if different ways of calling caused different code to be executed.

What about dynamically changed Python types? What if you set or delete the __call__ attribute of Python class?
msg285404 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 15:08
> * If a type only defines tp_fastcall: tp_fastcall is always used (tp_call uses the wrapper)
> 
> Is tp_call set to the wrapper rather than inherited?

It is set to the wrapper. Defining tp_fastcall should work like defining tp_call: it should override both the tp_call and tp_fastcall slots.


>  What if tp_call is defined in a superclass?

What is a superclass? Do you have an example?



> * If a type defines tp_call and tp_fastcall, PyObject_Call() uses tp_call whereas _PyObject_FastCallDict() uses tp_fastcall.
>
> I would consider this as a bug. It would be weird if different ways of calling cause executing different code.

It's a micro-optimization to avoid argument unpacking/packing: conversions are needed when you switch from the regular calling convention to the fast-call convention, or the reverse.

I don't think that types defining tp_call and tp_fastcall will be common.


> What about dynamically changed Python types?

What do you mean? Do you have an example?


> What if you set or delete the __call__ attribute of Python class?

I don't know how these things work :-) Let me try on a patched Python (using tp_fastcall):

$ ./python
>>> class A:
...  def __call__(self): print("A")
... 
>>> a=A()
>>> a()
A

>>> A.__call__=lambda self: print("B!")
>>> a()
B!

>>> del A.__call__
>>> a()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'A' object is not callable

It seems like "it just works", but I don't know how it works internally :-)

a() uses a dynamic lookup of a.__class__.__call__, no?
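Yes: a() performs a fresh lookup of __call__ on type(a) (not on the instance) at each call, which is why mutating the class "just works". A self-contained demo of the same behaviour:

```python
class A:
    def __call__(self):
        return "A"

a = A()
assert a() == "A"

A.__call__ = lambda self: "B!"       # changing the class changes the result
assert a() == "B!"

a.__call__ = lambda: "instance"      # a per-instance attribute is ignored:
assert a() == "B!"                   # special methods are looked up on the type

del A.__call__
try:
    a()
    raised = False
except TypeError:                    # 'A' object is not callable
    raised = True
assert raised
```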
msg285405 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 15:13
Ok, here is a first patch: tp_fastcall.patch

* Add tp_fastcall to PyTypeObject
* Add fastcall_wrapper() and use it for tp_call in PyType_Ready()
* Modify _PyObject_FastCallDict() and _PyObject_FastCallKeywords() to use tp_fastcall if defined
* Modify a lot of types to define tp_fastcall instead of tp_call (to use FASTCALL calling convention)
* Add a few new helper functions:

  - _PyObject_FastCall_Prepend() -- similar to _PyObject_Call_Prepend()
  - _Py_FastCall_FromArgs() -- ugly name, sorry about that, it should be renamed :-)
  - PyArg_UnpackStack() -- similar to PyArg_UnpackTuple()
  - _PyArg_NoStackKeywords() -- similar to _PyArg_NoKeywords

Hum, I should make sure that these new functions are not public. By the way, I should probably rename PyArg_UnpackStack() to _PyArg_UnpackStack().
msg285411 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 15:51
tp_fastcall.patch modifies property_descr_get() to use tp_fastcall if available, or fall back on the cached tuple. Once performance is validated (no regression), the cached tuple used to pass positional arguments should go away.

By the way, msg285390 uses the following micro-benchmark:
---
import perf
bench = perf.Runner()
bench.timeit("namedtuple.attr",
             "a.a",
             "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)",
             duplicate=20)
---

It tests property_descr_get() with operator.itemgetter(). It would be nice to benchmark property_descr_get() with other functions.
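For context (an assumption about the 3.6-era implementation, worth double-checking): namedtuple builds each field as a plain property wrapping operator.itemgetter(), so "a.a" exercises property_descr_get() plus an itemgetter call:

```python
from operator import itemgetter

# Simplified version of what namedtuple generates for each field.
class Point(tuple):
    x = property(itemgetter(0))
    y = property(itemgetter(1))

p = Point((1, 2))
assert p.x == 1 and p.y == 2   # each access: property get -> itemgetter call
```

That itemgetter call is exactly the place where the "cached args" tuple (or tp_fastcall) matters.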
msg285416 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-13 16:22
PyArg_UnpackStack() is declared in the limited API as existing since 3.3. If you want to add PyArg_UnpackStack() to the limited API, define it as added in 3.7.

For compatibility with extensions built with older Pythons you should define a new type flag and read tp_fastcall only if the flag is set. See for example Py_TPFLAGS_HAVE_FINALIZE.
msg285419 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 16:48
Ok, I pushed a new commit "Add Py_TPFLAGS_HAVE_FASTCALL flag" in my Git branch.

(By the way, I'm using the Git branch as a working copy, it can change anytime to get a nice list of commits. If you want reliable changes, use patches attached to this issue.)


Serhiy Storchaka:
> For compatibility with extensions built with older Pythons you should define new type flag and read tp_fastcall only if the flag is set. See for example Py_TPFLAGS_HAVE_FINALIZE.

Wait, there is maybe an issue with the stable API in my design :-/

Let's say that a builtin type NEW defines tp_fastcall and uses fastcall_wrapper() for tp_call. A type OLD inherits from NEW and doesn't override tp_call (nor tp_fastcall). Type OLD comes from an extension compiled with Python 3.6 and uses the stable ABI. Does the type OLD contain the tp_fastcall field in memory? Is it possible to copy the tp_fastcall field from type NEW into type OLD?

If not, fastcall_wrapper() should be smarter: not read tp_fastcall from the type OLD, but follow the MRO to find the first type with the Py_TPFLAGS_HAVE_FASTCALL flag. Is it possible that this short lookup loop has a cost on performance?
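The fallback lookup could look like this toy Python model (made-up names; the real code would walk tp_mro in C):

```python
# Toy model of "walk the MRO until a type advertises the fastcall flag".
HAVE_FASTCALL = 1 << 0

class FakeType:
    def __init__(self, name, flags=0, fastcall=None, bases=()):
        self.name, self.flags, self.fastcall = name, flags, fastcall
        # Simplified linearization: self first, then each base's MRO.
        self.mro = (self,) + tuple(t for base in bases for t in base.mro)

def find_fastcall(tp):
    for base in tp.mro:
        if base.flags & HAVE_FASTCALL:
            return base.fastcall
    return None

NEW = FakeType("NEW", flags=HAVE_FASTCALL, fastcall=lambda: "fast")
OLD = FakeType("OLD", bases=(NEW,))   # built against an older ABI: no flag set

assert find_fastcall(OLD)() == "fast"   # found on the base class
```

The loop usually terminates on the first or second entry, so the cost should be small, but only a benchmark can confirm that.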
msg285420 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 16:53
I ran benchmarks on tp_fastcall.patch. Hum, it seems like there is a performance issue somewhere :-( Almost all benchmarks are much slower with the patch.

haypo@speed-python$ python3 -m perf compare_to 2017-01-11_00-07-default-b9404639a18c.json tp_fastcall_ref_b9404639a18c.json -G
Slower (63):
- unpickle_pure_python: 671 us +- 16 us -> 952 us +- 9 us: 1.42x slower (+42%)
- pickle_pure_python: 1.02 ms +- 0.02 ms -> 1.40 ms +- 0.02 ms: 1.37x slower (+37%)
- telco: 16.6 ms +- 0.6 ms -> 22.7 ms +- 0.3 ms: 1.37x slower (+37%)
- xml_etree_generate: 215 ms +- 3 ms -> 290 ms +- 7 ms: 1.35x slower (+35%)
- hexiom: 17.7 ms +- 0.2 ms -> 23.6 ms +- 0.1 ms: 1.33x slower (+33%)
- xml_etree_process: 182 ms +- 3 ms -> 240 ms +- 4 ms: 1.32x slower (+32%)
- genshi_text: 69.0 ms +- 0.8 ms -> 90.4 ms +- 0.8 ms: 1.31x slower (+31%)
- logging_simple: 27.5 us +- 0.4 us -> 35.8 us +- 0.4 us: 1.30x slower (+30%)
(...)

Faster (1):
- spectral_norm: 271 ms +- 2 ms -> 257 ms +- 2 ms: 1.06x faster (-5%)
msg285425 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 17:15
Strange, I'm unable to reproduce the result on a different computer:

haypo@smithers$ ./python -m perf compare_to ref.json tp_fastcall.json  -v
Median +- std dev: [ref] 20.0 ms +- 0.5 ms -> [tp_fastcall] 20.0 ms +-
0.6 ms: 1.00x slower (+0%)
Not significant!
msg285426 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-13 17:16
> Strange, I'm unable to reproduce the result on a different computer:
>
> haypo@smithers$ ./python -m perf compare_to ref.json tp_fastcall.json  -v
> Median +- std dev: [ref] 20.0 ms +- 0.5 ms -> [tp_fastcall] 20.0 ms +-
> 0.6 ms: 1.00x slower (+0%)
> Not significant!

Oops, I forgot to mention that I tried the telco benchmark.
msg285451 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-14 00:46
> Slower (63):
> - unpickle_pure_python: 671 us +- 16 us -> 952 us +- 9 us: 1.42x slower (+42%)
> ...

This benchmark was run with LTO+PGO. It seems like PGO produced less efficient machine code with the patch. Maybe a missed optimization.

Oh, I just realized that I announced that objects with a __call__() method are faster, but it's not the case yet! Such objects still get slot_tp_call() as tp_call, which isn't a fastcall!
msg285463 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-14 07:13
1. _PyCFunction_FastCallKeywords() calls _PyStack_AsDict(), then calls _PyCFunction_FastCallDict().
2. _PyCFunction_FastCallDict() calls _Py_FastCall_FromArgs() when METH_FASTCALL.
3. _Py_FastCall_FromArgs() calls _PyStack_UnpackDict(), then calls fastcall().

Can you remove these _PyStack_AsDict() & _PyStack_UnpackDict() calls by creating a shortcut in _PyCFunction_FastCallKeywords()?
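In other words, the dict is built and immediately flattened again. A toy Python model of the round trip (names mirror the C helpers, but this is plain Python, not the CPython API):

```python
def stack_as_dict(stack, nargs, kwnames):
    # Models _PyStack_AsDict(): build a kwargs dict from the flat stack.
    return dict(zip(kwnames, stack[nargs:]))

def stack_unpack_dict(args, kwargs):
    # Models _PyStack_UnpackDict(): flatten (args, kwargs) back into a stack.
    stack = list(args) + list(kwargs.values())
    return stack, len(args), tuple(kwargs)

stack, nargs, kwnames = ["x", 1, 2], 1, ("a", "b")
kwargs = stack_as_dict(stack, nargs, kwnames)            # dict built...
stack2, nargs2, kwnames2 = stack_unpack_dict(stack[:nargs], kwargs)

# ...then immediately flattened again; the proposed shortcut skips both steps.
assert (stack2, nargs2, kwnames2) == (stack, nargs, kwnames)
```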
msg285484 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-14 14:47
I've created pull request about it:
https://github.com/haypo/cpython/pull/1
msg285568 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-16 15:53
> Strange, I'm unable to reproduce the result on a different computer: (...)

Ah, the benchmark results changed after a reboot. I reran the same benchmark (LTO+PGO on speed-python) after a reboot:
---
haypo@speed-python$ python3 -m perf compare_to ~/test/2017-01-13_19-18-default-3486425ed82b.json ~/patchdir/patch.json  -G 
Slower (29):
- spectral_norm: 275 ms +- 4 ms -> 301 ms +- 16 ms: 1.09x slower (+9%)
- fannkuch: 1.10 sec +- 0.01 sec -> 1.16 sec +- 0.18 sec: 1.06x slower (+6%)
- scimark_monte_carlo: 282 ms +- 3 ms -> 298 ms +- 6 ms: 1.06x slower (+6%)
- sqlite_synth: 9.39 us +- 0.19 us -> 9.82 us +- 0.15 us: 1.05x slower (+5%)
- nqueens: 261 ms +- 2 ms -> 271 ms +- 2 ms: 1.04x slower (+4%)
- meteor_contest: 209 ms +- 1 ms -> 217 ms +- 2 ms: 1.04x slower (+4%)
- scimark_fft: 684 ms +- 22 ms -> 709 ms +- 9 ms: 1.04x slower (+4%)
- pickle_pure_python: 1.35 ms +- 0.01 ms -> 1.39 ms +- 0.02 ms: 1.03x slower (+3%)
- xml_etree_iterparse: 227 ms +- 5 ms -> 233 ms +- 6 ms: 1.03x slower (+3%)
- sympy_integrate: 49.3 ms +- 0.3 ms -> 50.5 ms +- 0.3 ms: 1.02x slower (+2%)
- genshi_text: 90.3 ms +- 1.0 ms -> 92.4 ms +- 1.1 ms: 1.02x slower (+2%)
- regex_dna: 283 ms +- 1 ms -> 289 ms +- 1 ms: 1.02x slower (+2%)
- telco: 22.3 ms +- 0.5 ms -> 22.6 ms +- 0.3 ms: 1.02x slower (+2%)
- chaos: 295 ms +- 3 ms -> 300 ms +- 3 ms: 1.02x slower (+2%)
- go: 598 ms +- 5 ms -> 607 ms +- 7 ms: 1.01x slower (+1%)
- json_dumps: 30.5 ms +- 0.4 ms -> 30.9 ms +- 0.5 ms: 1.01x slower (+1%)
- richards: 183 ms +- 4 ms -> 186 ms +- 5 ms: 1.01x slower (+1%)
- deltablue: 17.1 ms +- 0.5 ms -> 17.4 ms +- 0.3 ms: 1.01x slower (+1%)
- logging_format: 42.1 us +- 0.6 us -> 42.5 us +- 0.6 us: 1.01x slower (+1%)
- sympy_str: 511 ms +- 4 ms -> 515 ms +- 5 ms: 1.01x slower (+1%)
- 2to3: 760 ms +- 4 ms -> 766 ms +- 4 ms: 1.01x slower (+1%)
- regex_compile: 462 ms +- 4 ms -> 465 ms +- 5 ms: 1.01x slower (+1%)
- django_template: 447 ms +- 5 ms -> 450 ms +- 4 ms: 1.01x slower (+1%)
- logging_simple: 35.6 us +- 0.5 us -> 35.8 us +- 0.6 us: 1.01x slower (+1%)
- hexiom: 23.0 ms +- 0.2 ms -> 23.2 ms +- 0.1 ms: 1.01x slower (+1%)
- python_startup: 16.0 ms +- 0.1 ms -> 16.1 ms +- 0.1 ms: 1.00x slower (+0%)
- regex_effbot: 5.08 ms +- 0.03 ms -> 5.10 ms +- 0.09 ms: 1.00x slower (+0%)
- call_method: 13.8 ms +- 0.1 ms -> 13.9 ms +- 0.2 ms: 1.00x slower (+0%)
- python_startup_no_site: 9.36 ms +- 0.04 ms -> 9.37 ms +- 0.07 ms: 1.00x slower (+0%)

Faster (19):
- scimark_lu: 518 ms +- 18 ms -> 458 ms +- 25 ms: 1.13x faster (-11%)
- xml_etree_process: 242 ms +- 4 ms -> 234 ms +- 4 ms: 1.03x faster (-3%)
- unpickle: 36.0 us +- 0.5 us -> 35.0 us +- 0.5 us: 1.03x faster (-3%)
- chameleon: 30.6 ms +- 0.2 ms -> 29.8 ms +- 0.3 ms: 1.03x faster (-3%)
- nbody: 240 ms +- 2 ms -> 234 ms +- 2 ms: 1.02x faster (-2%)
- call_method_slots: 13.9 ms +- 0.2 ms -> 13.6 ms +- 0.4 ms: 1.02x faster (-2%)
- unpickle_pure_python: 924 us +- 8 us -> 904 us +- 9 us: 1.02x faster (-2%)
- xml_etree_generate: 289 ms +- 4 ms -> 283 ms +- 4 ms: 1.02x faster (-2%)
- mako: 43.2 ms +- 0.4 ms -> 42.3 ms +- 0.4 ms: 1.02x faster (-2%)
- pickle_dict: 65.5 us +- 4.1 us -> 64.4 us +- 0.9 us: 1.02x faster (-2%)
- call_simple: 12.8 ms +- 0.4 ms -> 12.6 ms +- 0.4 ms: 1.01x faster (-1%)
- xml_etree_parse: 294 ms +- 13 ms -> 291 ms +- 9 ms: 1.01x faster (-1%)
- pickle_list: 9.38 us +- 0.35 us -> 9.32 us +- 0.14 us: 1.01x faster (-1%)
- unpack_sequence: 118 ns +- 3 ns -> 118 ns +- 1 ns: 1.00x faster (-0%)
- sympy_expand: 1.14 sec +- 0.01 sec -> 1.13 sec +- 0.01 sec: 1.00x faster (-0%)
- sqlalchemy_declarative: 338 ms +- 3 ms -> 337 ms +- 3 ms: 1.00x faster (-0%)
- crypto_pyaes: 236 ms +- 2 ms -> 235 ms +- 1 ms: 1.00x faster (-0%)
- pidigits: 321 ms +- 1 ms -> 320 ms +- 1 ms: 1.00x faster (-0%)
- regex_v8: 45.6 ms +- 0.7 ms -> 45.6 ms +- 0.3 ms: 1.00x faster (-0%)

Benchmark hidden because not significant (16): call_method_unknown, dulwich_log, float, genshi_xml, html5lib, json_loads, logging_silent, pathlib, pickle, raytrace, scimark_sor, scimark_sparse_mat_mult, sqlalchemy_imperative, sympy_sum, tornado_http, unpickle_list
---

Less verbose output with a threshold of 5%:
---
haypo@speed-python$ python3 -m perf compare_to ~/test/2017-01-13_19-18-default-3486425ed82b.json ~/patchdir/patch.json  -G --min-speed=5
Slower (3):
- spectral_norm: 275 ms +- 4 ms -> 301 ms +- 16 ms: 1.09x slower (+9%)
- fannkuch: 1.10 sec +- 0.01 sec -> 1.16 sec +- 0.18 sec: 1.06x slower (+6%)
- scimark_monte_carlo: 282 ms +- 3 ms -> 298 ms +- 6 ms: 1.06x slower (+6%)

Faster (1):
- scimark_lu: 518 ms +- 18 ms -> 458 ms +- 25 ms: 1.13x faster (-11%)

Benchmark hidden because not significant (60): 2to3, call_method, ...
---

spectral_norm and scimark_lu are two unstable benchmarks. Sadly, it seems like the current patch is not really significant on benchmarks. But again, it doesn't optimize the __call__() slot yet.
msg285570 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-16 16:12
Additionally, there are not enough METH_FASTCALL C functions.

I'm interested in the performance difference between METH_VARARGS and METH_FASTCALL. If METH_FASTCALL is significantly faster than METH_VARARGS, we can convert some common methods from VARARGS to FASTCALL, by hand or by AC.
msg285571 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-16 16:32
New changeset 2ccdf2b3819d by Victor Stinner in branch 'default':
Optimize _PyCFunction_FastCallKeywords()
https://hg.python.org/cpython/rev/2ccdf2b3819d
msg285602 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 02:37
> Additionally, there are not enough METH_FASTCALL CFunctions.

I pushed many changes tonight which convert many C functions to METH_FASTCALL.


> ..., by hand or by AC.

IMHO performance is a good motivation to convert code to Argument Clinic ;-) But sometimes, AC can be painful, so by hand is also acceptable ;-)
msg285603 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 02:41
Patch version 2:

* Add slot_tp_fastcall(): __call__() wrapper using the FASTCALL calling convention
* Add Py_TPFLAGS_HAVE_FASTCALL flag
* Remove completely the "cached args" optimization from property_descr_get()
* Rebase the patch series on top of the default branch to benefit from recent changes related to FASTCALL

I'm not sure that the Py_TPFLAGS_HAVE_FASTCALL flag is used correctly. Maybe this new flag disables optimizations in some cases; I didn't read that part of the code carefully.
msg285666 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 16:42
Back on performance: if you combine tp_fastcall-2.patch + 3 minor changes to use METH_FASTCALL, the bm_telco benchmark becomes 22% faster, which is significant for a benchmark:
http://bugs.python.org/issue29301#msg285665

"20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms"

It also means that 22% of the benchmark time is currently spent handling function arguments! I expected most of the time to be spent on math and I/O!
msg285667 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 16:57
I checked the effect of individual patches:

* tp_fastcall-2
* print
* struct
* decimal

tp_fastcall-2 + print + struct + decimal: 16.3 ms +- 0.6 ms
tp_fastcall-2 + struct + decimal: 21.2 ms +- 0.3 ms
tp_fastcall-2 + print: 16.7 ms +- 0.2 ms

Oh wow, I didn't expect print to be the bottleneck of this benchmark!? There is a single print() in the hot code of bm_telco:

   print(t, file=outfil)

Maybe it's the expensive price of creating a temporary dictionary?
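A quick pure-Python experiment around that hypothesis (an illustration, not the fix itself): writing to the file object directly produces the same output without passing any keyword argument at all:

```python
import io

outfil = io.StringIO()
t = "1.23"

print(t, file=outfil)       # keyword-argument path: file= goes through kwargs
outfil.write(t + "\n")      # equivalent output, no keyword argument involved

assert outfil.getvalue() == "1.23\n1.23\n"
```

Under the old convention the file= keyword forces a temporary dict on every call; METH_FASTCALL passes the keyword value on the stack instead.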
msg285674 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-17 18:08
> Maybe it's the expensive price of creating a temporary dictionary?

Please see the patch on issue29295.
PyDict_GetItemString() is a bad part, too.
It creates a str object, computes its hash, looks up the dict, and deallocates the str object.
msg285693 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-18 01:24
Ok, I fixed all remaining issues. The new patch version 3 is now waiting for your review ;-)


Patch version 3:

* Add Py_TPFLAGS_HAVE_FINALIZE and Py_TPFLAGS_HAVE_FASTCALL to Py_TPFLAGS_DEFAULT
* Don't read or write tp_fastcall if the type doesn't have the Py_TPFLAGS_HAVE_FASTCALL feature
* fastcall_wrapper() looks into base classes to find a tp_fastcall field: use the first base class with the Py_TPFLAGS_HAVE_FASTCALL feature
* Fix how tp_call and tp_fastcall are inherited
* Use designated initializers for tp_fastcall, as asked by Naoki (and discussed on python-dev)
* Rebase the change and revert now useless or unwanted changes
* No more FIXME ;-)

About the stable ABI: since version 3, it should just work thanks to the change to Py_TPFLAGS_DEFAULT. Code compiled with Python 3.7 benefits directly from tp_fastcall. Code compiled with Python 3.6 or older will get fastcall_wrapper(), which finds tp_fastcall in base classes.
msg285704 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-18 06:22
@haypo, Rietveld didn't accept your patch.  (I don't know what format Rietveld accepts...)

Is PR 65 the same as the third patch?
msg285705 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-18 07:47
Oops, I had a local commit for testing purpose. I regenerated the patch: tp_fastcall-4.patch.

> Is PR 65 the same as the third patch?

Yes, patches are generated from my Git pull request:
https://github.com/python/cpython/pull/65

The Git history became a little bit uglier after many changes :-) I should try to rework it to get better changes. I will likely do that before pushing, once the change is accepted.
msg285708 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-18 08:44
I commented at the pull request.
msg285710 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 09:32
New changeset 261b108b9468 by Victor Stinner in branch 'default':
Remove unused func parameter of _PyStack_UnpackDict()
https://hg.python.org/cpython/rev/261b108b9468
msg285712 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 09:39
New changeset 0b219348ec9e by Victor Stinner in branch 'default':
Optimize methoddescr_call(): avoid temporary PyCFunction
https://hg.python.org/cpython/rev/0b219348ec9e
msg285715 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 10:28
New changeset 1b7a5bdd05ed by Victor Stinner in branch 'default':
_PyObject_FastCallKeywords() now checks the result
https://hg.python.org/cpython/rev/1b7a5bdd05ed
msg285724 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 12:50
New changeset a3865d0c1844 by Victor Stinner in branch 'default':
Fix Python 2.6 support in python-gdb.py
https://hg.python.org/cpython/rev/a3865d0c1844
msg285726 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 13:23
New changeset a241817424e5 by Victor Stinner in branch 'default':
Fix _PyMethodDef_RawFastCallDict() argument parsing
https://hg.python.org/cpython/rev/a241817424e5

New changeset a8d35309dcc0 by Victor Stinner in branch 'default':
PyCFunction_Call() now calls _PyCFunction_FastCallDict()
https://hg.python.org/cpython/rev/a8d35309dcc0

New changeset ee6e1b1151a8 by Victor Stinner in branch 'default':
_PyObject_FastCallKeywords() now checks !PyErr_Occurred()
https://hg.python.org/cpython/rev/ee6e1b1151a8

New changeset 11039ece46b9 by Victor Stinner in branch 'default':
Rephrase !PyErr_Occurred() comment: may=>can
https://hg.python.org/cpython/rev/11039ece46b9

New changeset 8cc5d78d9b18 by Victor Stinner in branch 'default':
Cleanup _PyMethodDef_RawFastCallDict()
https://hg.python.org/cpython/rev/8cc5d78d9b18
msg285731 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-18 14:54
See also issue #29306: "Check usage of Py_EnterRecursiveCall() and Py_LeaveRecursiveCall() in new FASTCALL functions".
msg285741 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-18 16:23
New changeset 127401469628 by Victor Stinner in branch 'default':
Update and enhance python-gdb.py
https://hg.python.org/cpython/rev/127401469628
msg285776 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-19 11:41
New changeset bf6728085b01 by Victor Stinner in branch 'default':
_PyStack_AsDict() now checks kwnames != NULL
https://hg.python.org/cpython/rev/bf6728085b01
msg285863 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-20 00:08
tp_fastcall-5.patch: Rebased patch. I already tried to push changes not directly related to tp_fastcall, so the new patch should be a *little bit* shorter.
msg286289 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 01:53
Patch version 6: type inheritance and all wrappers should now be handled correctly. Changes:

* Add fastwrapper to wrapperbase structure
* Add d_use_fastwrapper to PyWrapperDescrObject structure
* wrapper_call() uses fast call
* Protect access to tp_fastcall with Py_TPFLAGS_HAVE_FASTCALL in all cases
* _Py_RawFastCallDict() doesn't check the result anymore; it's now checked in the caller: self can be NULL for static methods (try the test_bytes unit test!)
* Fix type inheritance: update_one_slot() now makes sure that tp_call and tp_fastcall are synchronized. Before, the update was done afterwards, but not in all cases.
* Fix fastcall_wrapper() to support "StaticType.tp_call(obj, ...);" calls: Py_TYPE(obj)->tp_fastcall can be NULL, we have to search non-NULL tp_fastcall in base classes in this case.

IMHO the most tricky parts of the change are in update_one_slot() and add_operators().

Many fixes were "backported" from my work on issue #29358 (tp_fastnew and tp_fastinit). Technically, I cherry-picked commits and fixed all the merge conflicts.

It may be worth using FASTCALL in more wrappers (the wrap_xxx functions in typeobject.c) of slotdefs.
msg286291 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 02:19
Oh, I had a local change not committed yet, so tp_fastcall-6.patch didn't get the [review] button. I just pushed the commit and then posted the exact same patch again.
msg286299 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 08:46
Performance results on tp_fastcall-6.patch (diff >= 2%):
---
haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-01-23_14-02-default-cebc9c7ad195.json tp_fastcall-6_ref_cebc9c7ad195.json -G --min-speed=2
Slower (6):
- scimark_monte_carlo: 202 ms +- 5 ms -> 222 ms +- 8 ms: 1.10x slower (+10%)
- telco: 13.0 ms +- 0.2 ms -> 14.2 ms +- 0.4 ms: 1.09x slower (+9%)
- call_method: 11.0 ms +- 0.3 ms -> 11.5 ms +- 0.5 ms: 1.05x slower (+5%)
- scimark_sparse_mat_mult: 8.57 ms +- 0.14 ms -> 8.92 ms +- 0.15 ms: 1.04x slower (+4%)
- fannkuch: 938 ms +- 7 ms -> 974 ms +- 11 ms: 1.04x slower (+4%)
- scimark_fft: 667 ms +- 9 ms -> 690 ms +- 10 ms: 1.03x slower (+3%)

Faster (17):
- scimark_lu: 362 ms +- 13 ms -> 322 ms +- 8 ms: 1.12x faster (-11%)
- chameleon: 23.5 ms +- 0.3 ms -> 21.3 ms +- 0.3 ms: 1.10x faster (-9%)
- xml_etree_process: 179 ms +- 3 ms -> 169 ms +- 3 ms: 1.06x faster (-5%)
- xml_etree_generate: 208 ms +- 4 ms -> 199 ms +- 3 ms: 1.05x faster (-4%)
- logging_silent: 586 ns +- 8 ns -> 562 ns +- 7 ns: 1.04x faster (-4%)
- richards: 147 ms +- 2 ms -> 141 ms +- 3 ms: 1.04x faster (-4%)
- pickle_pure_python: 1.03 ms +- 0.02 ms -> 994 us +- 15 us: 1.04x faster (-4%)
- regex_compile: 379 ms +- 2 ms -> 367 ms +- 4 ms: 1.03x faster (-3%)
- raytrace: 1.08 sec +- 0.01 sec -> 1.04 sec +- 0.02 sec: 1.03x faster (-3%)
- unpickle: 31.0 us +- 0.4 us -> 30.0 us +- 0.4 us: 1.03x faster (-3%)
- pathlib: 42.3 ms +- 0.5 ms -> 41.0 ms +- 0.5 ms: 1.03x faster (-3%)
- scimark_sor: 391 ms +- 7 ms -> 379 ms +- 12 ms: 1.03x faster (-3%)
- genshi_xml: 167 ms +- 2 ms -> 162 ms +- 2 ms: 1.03x faster (-3%)
- django_template: 356 ms +- 5 ms -> 347 ms +- 4 ms: 1.03x faster (-3%)
- regex_dna: 283 ms +- 1 ms -> 276 ms +- 4 ms: 1.03x faster (-3%)
- hexiom: 18.0 ms +- 0.1 ms -> 17.6 ms +- 0.2 ms: 1.02x faster (-2%)
- call_simple: 10.7 ms +- 0.4 ms -> 10.5 ms +- 0.2 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (41): (...)
---

Or less verbose output, only show difference >= 5%:
---
haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-01-23_14-02-default-cebc9c7ad195.json tp_fastcall-6_ref_cebc9c7ad195.json -G --min-speed=5
Slower (2):
- scimark_monte_carlo: 202 ms +- 5 ms -> 222 ms +- 8 ms: 1.10x slower (+10%)
- telco: 13.0 ms +- 0.2 ms -> 14.2 ms +- 0.4 ms: 1.09x slower (+9%)

Faster (3):
- scimark_lu: 362 ms +- 13 ms -> 322 ms +- 8 ms: 1.12x faster (-11%)
- chameleon: 23.5 ms +- 0.3 ms -> 21.3 ms +- 0.3 ms: 1.10x faster (-9%)
- xml_etree_process: 179 ms +- 3 ms -> 169 ms +- 3 ms: 1.06x faster (-5%)

Benchmark hidden because not significant (59): (...)
---

To be honest, I'm disappointed; I expected more benchmarks to be faster. Maybe the cost of the temporary tuple/dict used to pass arguments is not as high as I expected.

telco is also my favorite benchmark, I'm sad that it's 1.09x slower. I should investigate that.

The scimark_monte_carlo and scimark_lu benchmarks are still unstable for unknown reasons, so I don't care too much about these two benchmarks.

* https://speed.python.org/timeline/#/?exe=5&ben=scimark_monte_carlo&env=1&revs=50&equid=off&quarts=on&extr=on
* https://speed.python.org/timeline/#/?exe=5&ben=scimark_lu&env=1&revs=50&equid=off&quarts=on&extr=on
msg286303 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 10:13
tp_fastcall-6.patch: "call_method: 11.0 ms +- 0.3 ms -> 11.5 ms +- 0.5 ms: 1.05x slower (+5%)"

I looked at the C code: the patch doesn't change any C function used by this benchmark. IMHO it's only a matter of code locality. Maybe the PGO compilation produced less efficient code with the patch.
msg286307 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-26 11:45
Another run at my machine. (tp_fastcall-6.patch)

+ ../python.default -m perf compare_to default.json patched.json -G --min-speed=2
Slower (8):
- logging_silent: 736 ns +- 7 ns -> 783 ns +- 12 ns: 1.06x slower (+6%)
- unpickle_pure_python: 829 us +- 13 us -> 875 us +- 12 us: 1.06x slower (+6%)
- scimark_sor: 482 ms +- 8 ms -> 507 ms +- 14 ms: 1.05x slower (+5%)
- xml_etree_iterparse: 223 ms +- 6 ms -> 228 ms +- 7 ms: 1.02x slower (+2%)
- scimark_fft: 728 ms +- 16 ms -> 745 ms +- 19 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 8.71 ms +- 0.32 ms -> 8.90 ms +- 0.28 ms: 1.02x slower (+2%)
- json_dumps: 29.3 ms +- 0.3 ms -> 29.9 ms +- 0.3 ms: 1.02x slower (+2%)
- pathlib: 51.2 ms +- 0.6 ms -> 52.3 ms +- 0.8 ms: 1.02x slower (+2%)

Faster (8):
- unpack_sequence: 132 ns +- 2 ns -> 123 ns +- 2 ns: 1.08x faster (-7%)
- xml_etree_parse: 319 ms +- 11 ms -> 303 ms +- 11 ms: 1.05x faster (-5%)
- call_simple: 14.3 ms +- 0.5 ms -> 13.7 ms +- 0.5 ms: 1.04x faster (-4%)
- call_method_unknown: 16.6 ms +- 1.1 ms -> 16.0 ms +- 0.2 ms: 1.04x faster (-4%)
- call_method: 15.1 ms +- 0.1 ms -> 14.5 ms +- 0.5 ms: 1.04x faster (-4%)
- xml_etree_process: 241 ms +- 4 ms -> 235 ms +- 4 ms: 1.03x faster (-3%)
- call_method_slots: 14.7 ms +- 0.2 ms -> 14.3 ms +- 0.2 ms: 1.02x faster (-2%)
- genshi_xml: 197 ms +- 3 ms -> 193 ms +- 2 ms: 1.02x faster (-2%)
msg286313 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-01-26 13:55
While I feel your work is great, the performance benefit seems very small
compared to the complexity of this patch.

I think AST-level optimization is more important. co_consts and co_stacksize
can be smaller if constant folding is done at the AST level.

And I think global cache (#10401 or #28158) is important too.
msg286314 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-26 14:08
"While I feel your work is great, the performance benefit seems very small
compared to the complexity of this patch."

I have to agree. I spent a lot of time benchmarking these tp_fast* changes. While one or two benchmarks are faster, it's not really the case for the others.

I also agree about the complexity. In Python 3.6, most FASTCALL changes were internal. For example, PyObject_CallFunctionObjArgs() now uses FASTCALL internally, without callers of the API having to be modified. I tried to only use _PyObject_FastCallDict/Keywords() in the few places where the speedup was significant.

The main visible change of Python 3.6 FASTCALL is the new METH_FASTCALL calling convention for C functions. Your change modifying print() to use METH_FASTCALL has a significant impact on the telco benchmark, with no drawback. I tested further changes to use METH_FASTCALL in the struct and decimal modules, and they optimize telco even more.

To continue the optimization work, I guess that using METH_FASTCALL in more cases, using Argument Clinic whenever possible, would have a more concrete and measurable impact on performance than this big tp_fastcall patch.

But I'm not ready to abandon the whole approach yet, so I'm changing the status to Pending. I may come back in one or two months to check whether I missed anything obvious that would unlock even more optimizations ;-)
msg287374 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-09 01:03
New changeset 31342913fb1e by Victor Stinner in branch 'default':
Fix PyCFunction_Call() performance issue
https://hg.python.org/cpython/rev/31342913fb1e
msg287376 - (view) Author: Roundup Robot (python-dev) Date: 2017-02-09 02:00
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master':
Fix PyCFunction_Call() performance issue
https://github.com/python/cpython/commit/e9807e307f1f561a2dfe3e9f97a38444528dba86
msg288458 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-23 16:00
I don't completely reject the idea. I may come back later if I find a way to make it more efficient. But I prefer to close the issue as REJECTED to make it clear that, with the current implementation and benchmark results, it's not worth it.
msg293583 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-05-12 22:21
New changeset d05f7fdf6cf77724bd3064fb5a0846ef5cfe0c88 by Victor Stinner in branch '3.6':
[3.6] bpo-30345: Update test_gdb.py and python-gdb.py from master (#1549)
https://github.com/python/cpython/commit/d05f7fdf6cf77724bd3064fb5a0846ef5cfe0c88
msg295517 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-06-09 11:24
New changeset f0ff849adc6b4a01f9d1f08d9ad0f1511ff84541 by Victor Stinner in branch '3.6':
bpo-30524: Fix _PyStack_UnpackDict() (#1886)
https://github.com/python/cpython/commit/f0ff849adc6b4a01f9d1f08d9ad0f1511ff84541
History
Date User Action Args
2017-06-09 11:24:56  haypo             set     messages: + msg295517
2017-05-31 14:39:28  haypo             set     pull_requests: + pull_request1963
2017-05-12 22:21:53  haypo             set     messages: + msg293583
2017-02-23 16:00:47  haypo             set     status: open -> closed
                                               resolution: rejected
                                               messages: + msg288458
                                               stage: resolved
2017-02-09 02:00:22  python-dev        set     messages: + msg287376
2017-02-09 01:03:06  python-dev        set     status: pending -> open
                                               messages: + msg287374
2017-02-02 23:13:20  inada.naoki       unlink  issue29263 dependencies
2017-01-26 14:08:49  haypo             set     status: open -> pending
                                               messages: + msg286314
2017-01-26 13:55:30  inada.naoki       set     messages: + msg286313
2017-01-26 11:45:42  inada.naoki       set     messages: + msg286307
2017-01-26 10:13:55  haypo             set     messages: + msg286303
2017-01-26 08:46:56  haypo             set     messages: + msg286299
2017-01-26 02:19:38  haypo             set     files: + tp_fastcall-6.patch
                                               messages: + msg286291
2017-01-26 02:18:12  haypo             set     files: - tp_fastcall-6.patch
2017-01-26 01:53:46  haypo             set     files: + tp_fastcall-6.patch
                                               messages: + msg286289
2017-01-20 00:08:21  haypo             set     files: + tp_fastcall-5.patch
                                               messages: + msg285863
2017-01-19 11:41:13  python-dev        set     messages: + msg285776
2017-01-18 16:23:46  python-dev        set     messages: + msg285741
2017-01-18 14:54:13  haypo             set     messages: + msg285731
2017-01-18 13:23:00  python-dev        set     messages: + msg285726
2017-01-18 12:50:33  python-dev        set     messages: + msg285724
2017-01-18 10:28:05  python-dev        set     messages: + msg285715
2017-01-18 09:39:51  python-dev        set     messages: + msg285712
2017-01-18 09:32:59  python-dev        set     messages: + msg285710
2017-01-18 08:44:17  inada.naoki       set     messages: + msg285708
2017-01-18 07:47:24  haypo             set     files: + tp_fastcall-4.patch
                                               messages: + msg285705
2017-01-18 06:22:27  inada.naoki       set     messages: + msg285704
2017-01-18 01:25:15  haypo             set     files: + tp_fastcall-3.patch
                                               messages: + msg285693
2017-01-17 18:08:30  inada.naoki       set     messages: + msg285674
2017-01-17 16:57:09  haypo             set     messages: + msg285667
2017-01-17 16:42:21  haypo             set     messages: + msg285666
2017-01-17 02:42:06  haypo             set     files: + tp_fastcall-2.patch
                                               messages: + msg285603
2017-01-17 02:37:50  haypo             set     messages: + msg285602
2017-01-16 16:32:18  python-dev        set     nosy: + python-dev
                                               messages: + msg285571
2017-01-16 16:12:53  inada.naoki       set     messages: + msg285570
2017-01-16 15:53:50  haypo             set     messages: + msg285568
2017-01-14 14:47:37  inada.naoki       set     messages: + msg285484
2017-01-14 07:13:13  inada.naoki       set     messages: + msg285463
2017-01-14 00:46:50  haypo             set     messages: + msg285451
2017-01-13 17:16:05  haypo             set     messages: + msg285426
2017-01-13 17:15:44  haypo             set     messages: + msg285425
2017-01-13 16:53:33  haypo             set     messages: + msg285420
2017-01-13 16:48:41  haypo             set     messages: + msg285419
2017-01-13 16:22:59  serhiy.storchaka  set     messages: + msg285416
2017-01-13 16:10:46  haypo             link    issue29263 dependencies
2017-01-13 15:51:14  haypo             set     messages: + msg285411
2017-01-13 15:13:29  haypo             set     files: + tp_fastcall.patch
                                               keywords: + patch
                                               messages: + msg285405
2017-01-13 15:08:22  haypo             set     messages: + msg285404
2017-01-13 14:47:35  serhiy.storchaka  set     messages: + msg285400
2017-01-13 12:33:21  haypo             set     messages: + msg285390
2017-01-13 12:26:01  haypo             set     messages: + msg285389
2017-01-13 12:24:34  haypo             create