Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects #73445
Comments
A new FASTCALL calling convention was added to Python 3.6. It avoids creating a temporary tuple to pass positional arguments and a temporary dictionary to pass keyword arguments. A new METH_FASTCALL calling convention was added for C functions. Most functions now support fastcall, except objects with a __call__() method, which have to go through slot_tp_call(), which still requires a tuple and a dictionary. I tried multiple implementations to support fast calls to the __call__() method, but I ran into practical and technical issues. First, I tried to reuse the tp_call field of PyTypeObject: it can hold either a regular call (tuple/dict for arguments) or a fast call, discriminated by a new flag added to the tp_flags field. It was tricky to support class inheritance and to decide when to set or clear the flag. But the real blocker is that it obviously breaks backward compatibility: existing code calling tp_call directly with the regular calling convention will crash immediately, and the error is not caught during compilation, even if the code is recompiled. I propose a different design: add a new tp_fastcall field to PyTypeObject and use a wrapper for tp_call when tp_fastcall is defined. If a type defines tp_fastcall but not tp_call, the tp_call wrapper "simply" calls tp_fastcall. Advantages:
Inheritance:
Functions of the C API will be modified to use tp_fastcall if available. The plan is then to patch most Python types to replace their tp_call with tp_fastcall. First, the most important (common) types like Python and C functions, descriptors, and the various kinds of wrappers should be patched. Later, we should discuss on a case-by-case basis whether it's worth it for the remaining types. I will try to run benchmarks before any change of this kind. |
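The wrapper design described above can be sketched with a rough Python analogy. This is illustrative only: the names `fastcall_wrapper`, `stack`, and `kwnames` stand in for the C-level tp_call wrapper, the argument stack, and the keyword-name tuple; the real implementation lives in C and operates on PyTypeObject slots.

```python
# Rough Python analogy of the proposed design (illustrative only; the real
# code is C). A "fastcall" callable receives positional arguments as a flat
# sequence plus a separate tuple of keyword names, avoiding the temporary
# args tuple and kwargs dict of the regular tp_call convention.

def fastcall_wrapper(fastcall):
    """Adapt a fastcall-style callable to the regular (args, kwargs) style."""
    def tp_call(args, kwargs):
        # Flatten: positional values first, then keyword values, with the
        # keyword names carried separately instead of in a dict.
        stack = list(args) + list(kwargs.values())
        kwnames = tuple(kwargs)
        return fastcall(stack, len(args), kwnames)
    return tp_call

def my_fastcall(stack, nargs, kwnames):
    # A toy fastcall implementation that just reconstructs (args, kwargs).
    args = tuple(stack[:nargs])
    kwargs = dict(zip(kwnames, stack[nargs:]))
    return (args, kwargs)

call = fastcall_wrapper(my_fastcall)
print(call((1, 2), {"x": 3}))  # ((1, 2), {'x': 3})
```

In this analogy, a type that only defines the fastcall variant still remains callable through the regular convention, at the cost of one adapter step, which is exactly the compatibility property the tp_call wrapper is meant to provide.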
Naoki wants to use FASTCALL for CALL_METHOD: tp_fastcall should make it easy to implement. |
I started to work on FASTCALL because I dislike the "cached tuple" hack used in some performance-critical code, and the hack causes various tricky but severe issues (it can lead to segfaults). Thanks to tp_fastcall, it becomes possible to drop the "cached tuple" hack from property_descr_get() *and* keep good performance. First, a benchmark to show the performance gain of using the "cached tuple". I modified property_descr_get() to use the Python 3.4 code, which doesn't have the optimization: $ ./python -m perf compare_to py34.json ref.json
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [ref] 50.0 ns +- 0.9 ns: 1.50x faster (-33%) It's MUCH faster, good job. But it requires complex and fragile code. Ok, let's see with operator.itemgetter() supporting tp_fastcall, Python modified to use tp_fastcall and without the "cached arg" hack: $ ./python -m perf compare_to ref.json fastcall_wrapper.json
Median +- std dev: [ref] 50.0 ns +- 0.9 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.04x faster (-4%) It's a little bit faster, but that's not the point. The point is that it isn't slower and it doesn't require modifying C code to benefit from the optimization! Just to be clear, here's another benchmark result on property_descr_get() without "cached args": without fastcall (py34) vs. with fastcall ("fastcall_wrapper"): $ ./python -m perf compare_to py34.json fastcall_wrapper.json
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.56x faster (-36%) Summary:
|
Is tp_call set to the wrapper rather than inherited? What if tp_call is defined in a superclass?
I would consider this a bug. It would be weird if different ways of calling executed different code. What about dynamically changed Python types? What if you set or delete the __call__ attribute of a Python class? |
It is set to the wrapper. Defining tp_fastcall should behave like defining tp_call: it should override both the tp_call and tp_fastcall slots.
What is a superclass? Do you have an example?
It's a micro-optimization to avoid argument packing/unpacking: conversions are needed when you switch from the regular calling convention to the fast call convention, or the opposite. I don't think that types defining both tp_call and tp_fastcall will be common.
What do you mean? Do you have an example?
I don't know how these things work :-) Let me try on a patched Python (using tp_fastcall): $ ./python
>>> class A:
... def __call__(self): print("A")
...
>>> a=A()
>>> a()
A
>>> A.__call__=lambda self: print("B!")
>>> a()
B!
>>> del A.__call__
>>> a()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'A' object is not callable It seems like "it just works", but I don't know how it works internally :-) a() uses a dynamic lookup of a.__class__.__call__, no? |
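On the "dynamic lookup" question above: the reason it "just works" is standard Python semantics, not anything specific to the patch. Implicit calls like a() look up __call__ on the type (i.e. the tp_call slot), never on the instance, which a quick check shows:

```python
class A:
    def __call__(self):
        return "type slot"

a = A()
# An instance attribute named __call__ is ignored by the a() syntax:
# implicit special-method lookup goes through type(a), so it hits the
# type's slot, not the instance dict.
a.__call__ = lambda: "instance attr"
print(a())            # type slot
print(a.__call__())   # instance attr (explicit attribute lookup)
```

So setting or deleting A.__call__ on the class updates the slot (via update_one_slot()), while instance attributes never interfere with the call protocol.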
Ok, here is a first patch: tp_fastcall.patch
Hum, I should make sure that these new functions are not public. By the way, I should probably rename PyArg_UnpackStack() to _PyArg_UnpackStack(). |
tp_fastcall.patch modifies property_descr_get() to use tp_fastcall if available, or fall back to the cached tuple. Once performance is validated (no regression), the cached tuple used to pass positional arguments should go away. By the way, msg285390 uses the following micro-benchmark: import perf
bench = perf.Runner()
bench.timeit("namedtuple.attr",
"a.a",
"from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)",
duplicate=20) It tests property_descr_get() with operator.itemgetter(). It would be nice to benchmark property_descr_get() with other functions. |
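As a sketch of benchmarking property_descr_get() with a function other than itemgetter, the stdlib timeit module can be used directly (the class and attribute names below are made up for illustration; absolute numbers depend on the machine):

```python
import timeit

# Hypothetical example: compare attribute access through a plain Python
# property (which goes through property_descr_get() and calls the getter)
# with direct attribute access, which skips the property machinery.
class Point:
    def __init__(self):
        self._x = 1.0

    @property
    def x(self):            # property_descr_get() ends up calling this
        return self._x

p = Point()
t_prop = timeit.timeit("p.x", globals={"p": p}, number=100_000)
t_attr = timeit.timeit("p._x", globals={"p": p}, number=100_000)
print(f"property: {t_prop:.4f}s  direct: {t_attr:.4f}s")
```

The gap between the two timings bounds how much property_descr_get() overhead there is to win back for this getter.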
PyArg_UnpackStack() is declared in the limited API as existing since 3.3. If you want to add PyArg_UnpackStack() to the limited API, declare it as added in 3.7. For compatibility with extensions built with older Pythons, you should define a new type flag and read tp_fastcall only if the flag is set. See for example Py_TPFLAGS_HAVE_FINALIZE. |
Ok, I pushed a new commit "Add Py_TPFLAGS_HAVE_FASTCALL flag" to my Git branch. (By the way, I'm using the Git branch as a working copy; it can change at any time to keep a nice list of commits. If you want reliable changes, use the patches attached to this issue.) Serhiy Storchaka:
Wait, there may be an issue with the stable ABI in my design :-/ Let's say that a builtin type NEW defines tp_fastcall and uses fastcall_wrapper() for tp_call. A type OLD inherits from NEW and doesn't override tp_call (nor tp_fastcall). Type OLD comes from an extension compiled with Python 3.6 and uses the stable ABI. Does the type OLD contain the tp_fastcall field in memory? Is it possible to copy the tp_fastcall field from type NEW into type OLD? If not, fastcall_wrapper() should be smarter: not read tp_fastcall from type OLD, but follow the MRO to find the first type with the Py_TPFLAGS_HAVE_FASTCALL flag. Is it possible that this short lookup loop has a cost on performance? |
I ran benchmarks on tp_fastcall.patch. Hum, it seems like there is a performance issue somewhere :-( Almost all benchmarks are much slower with the patch. haypo@speed-python$ python3 -m perf compare_to 2017-01-11_00-07-default-b9404639a18c.json tp_fastcall_ref_b9404639a18c.json -G
Faster (1):
|
Strange, I'm unable to reproduce the result on a different computer: haypo@smithers$ ./python -m perf compare_to ref.json tp_fastcall.json -v |
Oops, I forgot to mention that I tried the telco benchmark. |
This benchmark was run with LTO+PGO. It seems like PGO produced less efficient machine code with the patch. Maybe a missed optimization. Oh, I just realized that I announced that objects with a __call__() method are faster, but that's not the case yet! Such objects still get slot_tp_call() as tp_call, which isn't a fastcall! |
Can you remove this _PyStack_AsDict() & _PyStack_UnpackDict() by creating a shortcut in _PyCFunction_FastCallKeywords()? |
I've created pull request about it: |
Ah, the benchmark results changed after a reboot. I reran the same benchmark (LTO+PGO on speed-python) after a reboot:
Faster (19):
Benchmark hidden because not significant (16): call_method_unknown, dulwich_log, float, genshi_xml, html5lib, json_loads, logging_silent, pathlib, pickle, raytrace, scimark_sor, scimark_sparse_mat_mult, sqlalchemy_imperative, sympy_sum, tornado_http, unpickle_list Lighter verbose with a threshold of 5%:
Faster (1):
Benchmark hidden because not significant (60): 2to3, call_method, ... spectral_norm and scimark_lu are two unstable benchmarks. Sadly, it seems like the current patch does not make a significant difference on benchmarks. But again, it doesn't optimize the __call__() slot yet. |
Additionally, there are not enough METH_FASTCALL C functions. I'm interested in the performance difference between METH_VARARGS and METH_FASTCALL. If METH_FASTCALL is significantly faster than METH_VARARGS, |
New changeset 2ccdf2b3819d by Victor Stinner in branch 'default': |
I pushed many changes tonight which convert many C functions to METH_FASTCALL.
IMHO performance is a good motivation to convert code to Argument Clinic ;-) But sometimes AC can be painful, so doing it by hand is also acceptable ;-) |
Patch version 2:
I'm not sure that the Py_TPFLAGS_HAVE_FASTCALL flag is used correctly. Maybe this new flag disables optimizations in some cases; I didn't read that part of the code carefully. |
Back on performance: if you combine tp_fastcall-2.patch + 3 minor changes to use METH_FASTCALL, the bm_telco benchmark becomes 22% faster, which is significant for a benchmark: "20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms". It also means that 22% of the benchmark time is currently spent handling function arguments! I expected most of the time to be spent on maths and I/O! |
I checked the effect of individual patches:
tp_fastcall-2 + print + struct + decimal: 16.3 ms +- 0.6 ms Oh wow, I didn't expect print to be the bottleneck of this benchmark!? There is a single print() in the hot code of bm_telco: print(t, file=outfil) Maybe it's the expensive price of creating a temporary dictionary for the keyword argument? |
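The suspicion about the keyword argument can be probed with a rough micro-benchmark. This is a sketch, not the benchmark used in the thread: it compares print() with a file= keyword (which, under the regular calling convention, means packing a temporary kwargs dict on every call) against a bare write() call taking one positional argument. Absolute numbers depend on the machine and Python version.

```python
import io
import timeit

buf = io.StringIO()

# print() with a keyword argument: each call must carry file=buf, which the
# regular calling convention passes via a temporary dict.
t_kw = timeit.timeit("print('x', file=buf)",
                     globals={"buf": buf}, number=10_000)

# Baseline: write() takes a single positional argument, no kwargs dict.
t_write = timeit.timeit("buf.write('x\\n')",
                        globals={"buf": buf}, number=10_000)

print(f"print(file=...): {t_kw:.4f}s  buf.write(): {t_write:.4f}s")
```

Note that print() also does more work than write() (stringification, separator/end handling), so this only brackets the keyword-passing cost rather than isolating it.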
Please see the patch on bpo-29295. |
Ok, I fixed all remaining issues. The new patch version 3 is now waiting for your review ;-) Patch version 3:
About the stable ABI: since version 3, it should just work thanks to the change on Py_TPFLAGS_DEFAULT. Code compiled with Python 3.7 benefits directly from tp_fastcall. Code compiled with Python 3.6 or older will get fastcall_wrapper(), which finds tp_fastcall in base classes. |
Oops, I had a local commit for testing purpose. I regenerated the patch: tp_fastcall-4.patch.
Yes, patches are generated from my Git pull request: The Git history became a little uglier after many changes :-) I should try to rework it into better-structured changes. I will likely do that before pushing, once the change is accepted. |
I commented at the pull request. |
New changeset 261b108b9468 by Victor Stinner in branch 'default': |
New changeset 0b219348ec9e by Victor Stinner in branch 'default': |
New changeset 1b7a5bdd05ed by Victor Stinner in branch 'default': |
New changeset a3865d0c1844 by Victor Stinner in branch 'default': |
New changeset a241817424e5 by Victor Stinner in branch 'default': New changeset a8d35309dcc0 by Victor Stinner in branch 'default': New changeset ee6e1b1151a8 by Victor Stinner in branch 'default': New changeset 11039ece46b9 by Victor Stinner in branch 'default': New changeset 8cc5d78d9b18 by Victor Stinner in branch 'default': |
See also issue bpo-29306: "Check usage of Py_EnterRecursiveCall() and Py_LeaveRecursiveCall() in new FASTCALL functions". |
New changeset 127401469628 by Victor Stinner in branch 'default': |
New changeset bf6728085b01 by Victor Stinner in branch 'default': |
tp_fastcall-5.patch: rebased patch. I already pushed the changes not directly related to tp_fastcall separately, so the new patch should be a *little bit* shorter. |
Patch version 6: type inheritance and all wrappers should now be handled correctly. Changes:
IMHO the trickiest parts of the change are in update_one_slot() and add_operators(). Many fixes were "backported" from my work on issue bpo-29358 (tp_fastnew and tp_fastinit). Technically, I cherry-picked commits and fixed all the merge conflicts. It may be worth using FASTCALL in more wrappers (wrap_xxx functions in typeobject.c) of slotdefs. |
Oh, I had a local change not yet committed, so tp_fastcall-6.patch didn't get the [review] button. I just pushed the commit and then posted the exact same patch again. |
performance results on tp_fastcall-6.patch (diff >= 2%):
Faster (17):
Benchmark hidden because not significant (41): (...) Or less verbose output, only show difference >= 5%:
Faster (3):
Benchmark hidden because not significant (59): (...) To be honest, I'm disappointed; I expected more benchmarks to be faster. Maybe the cost of the temporary tuple/dict used to pass arguments is not as high as I expected. telco is also my favorite benchmark; I'm sad that it's 1.09x slower. I should investigate that. The scimark_monte_carlo and scimark_lu benchmarks are still unstable for unknown reasons, so I don't care too much about these two benchmarks. |
tp_fastcall-6.patch: "call_method: 11.0 ms +- 0.3 ms -> 11.5 ms +- 0.5 ms: 1.05x slower (+5%)" I looked at the C code: the patch doesn't change any C function used by this benchmark. IMHO it's only a matter of code locality. Maybe the PGO compilation produced less efficient code with the patch. |
Another run on my machine (tp_fastcall-6.patch): ../python.default -m perf compare_to default.json patched.json -G --min-speed=2
Slower (8):
- logging_silent: 736 ns +- 7 ns -> 783 ns +- 12 ns: 1.06x slower (+6%)
- unpickle_pure_python: 829 us +- 13 us -> 875 us +- 12 us: 1.06x slower (+6%)
- scimark_sor: 482 ms +- 8 ms -> 507 ms +- 14 ms: 1.05x slower (+5%)
- xml_etree_iterparse: 223 ms +- 6 ms -> 228 ms +- 7 ms: 1.02x slower (+2%)
- scimark_fft: 728 ms +- 16 ms -> 745 ms +- 19 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 8.71 ms +- 0.32 ms -> 8.90 ms +- 0.28 ms: 1.02x slower (+2%)
- json_dumps: 29.3 ms +- 0.3 ms -> 29.9 ms +- 0.3 ms: 1.02x slower (+2%)
- pathlib: 51.2 ms +- 0.6 ms -> 52.3 ms +- 0.8 ms: 1.02x slower (+2%) Faster (8):
|
"While I feel your work is great, the performance benefit seems very small." I have to agree. I spent a lot of time benchmarking these tp_fast* changes. While one or two benchmarks are faster, it's not really the case for the others. I also agree about the complexity. In Python 3.6, most FASTCALL changes were internal. For example, PyObject_CallFunctionObjArgs() now uses FASTCALL internally, without having to modify callers of the API. I tried to only use _PyObject_FastCallDict/Keywords() in a few places where the speedup was significant. The main visible change of Python 3.6 FASTCALL is the new METH_FASTCALL calling convention for C functions. Your change modifying print() to use METH_FASTCALL has a significant impact on the telco benchmark, with no drawback. I tested further changes to use METH_FASTCALL in the struct and decimal modules, and they optimize telco even more. To continue the optimization work, I guess that using METH_FASTCALL in more cases, using Argument Clinic whenever possible, would have a more concrete and measurable impact on performance than this big tp_fastcall patch. But I'm not ready to abandon the whole approach yet, so I'm changing the status to Pending. I may come back in one or two months to check whether I missed anything obvious to unlock even more optimizations ;-) |
New changeset 31342913fb1e by Victor Stinner in branch 'default': |
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master': |
I don't completely reject the issue. I may come back later if I find a way to make it even more efficient. But I prefer to close the issue as REJECTED to make it clear that right now with the current implementation and benchmark results, it's not worth it. |
Can you comment on #4944 why you think that such compatibility should be guaranteed? According to PEP-384, the layout of PyTypeObject is not part of the stable ABI. |