Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects #73445
Comments
A new FASTCALL calling convention was added to Python 3.6. It avoids creating a temporary tuple to pass positional arguments and a temporary dictionary to pass keyword arguments. A new METH_FASTCALL calling convention was added for C functions. Most functions now support fastcall, except objects with a __call__() method, which have to go through slot_tp_call(), which still requires a tuple and a dictionary. I tried multiple implementations to support fast calls to the __call__() method, but I ran into practical and technical issues. First, I tried to reuse the tp_call field of PyTypeObject: it can hold either a regular call (tuple/dict for arguments) or a fast call, discriminated by a new flag added to the tp_flags field. It was tricky to support class inheritance and to decide when to set or clear the flag. But the real blocker is that it obviously breaks backward compatibility: existing code calling tp_call directly with the regular calling convention will crash immediately, and the error is not caught during compilation, even if the code is recompiled. I propose a different design: add a new tp_fastcall field to PyTypeObject and use a wrapper for tp_call when tp_fastcall is defined. If a type defines tp_fastcall but not tp_call, the tp_call wrapper "simply" calls tp_fastcall. Advantages:
Inheritance:
Functions of the C API will be modified to use tp_fastcall if available. The plan is then to patch most Python types to replace their tp_call with tp_fastcall. First, the most important (common) types like Python and C functions, descriptors, and the various kinds of wrappers should be patched. Later, we should discuss on a case-by-case basis whether it's worth it for the remaining types. I will try to run benchmarks before any change of this kind. |
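The wrapper design described above can be sketched with a rough Python analogy. This is illustrative only: the names `fastcall_wrapper`, `stack`, and `kwnames` stand in for the C-level tp_call wrapper, the argument stack, and the keyword-name tuple; the real implementation lives in C and operates on PyTypeObject slots.

```python
# Rough Python analogy of the proposed design (illustrative only; the real
# code is C). A "fastcall" callable receives positional arguments as a flat
# sequence plus a separate tuple of keyword names, avoiding the temporary
# args tuple and kwargs dict of the regular tp_call convention.

def fastcall_wrapper(fastcall):
    """Adapt a fastcall-style callable to the regular (args, kwargs) style."""
    def tp_call(args, kwargs):
        # Flatten: positional values first, then keyword values, with the
        # keyword names carried separately instead of in a dict.
        stack = list(args) + list(kwargs.values())
        kwnames = tuple(kwargs)
        return fastcall(stack, len(args), kwnames)
    return tp_call

def my_fastcall(stack, nargs, kwnames):
    # A toy fastcall implementation that just reconstructs (args, kwargs).
    args = tuple(stack[:nargs])
    kwargs = dict(zip(kwnames, stack[nargs:]))
    return (args, kwargs)

call = fastcall_wrapper(my_fastcall)
print(call((1, 2), {"x": 3}))  # ((1, 2), {'x': 3})
```

In this analogy, a type that only defines the fastcall variant still remains callable through the regular convention, at the cost of one adapter step, which is exactly the compatibility property the tp_call wrapper is meant to provide.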
Naoki wants to use FASTCALL for CALL_METHOD: tp_fastcall should make it easy to implement. |
I started to work on FASTCALL because I dislike the "cached tuple" hack used in some performance-critical code, and the hack causes various tricky but severe issues (it can lead to segfaults). Thanks to tp_fastcall, it becomes possible to drop the "cached tuple" hack from property_descr_get() *and* keep good performance. First, a benchmark to show the performance gain of using the "cached tuple". I modified property_descr_get() to use the Python 3.4 code, which doesn't have the optimization: $ ./python -m perf compare_to py34.json ref.json
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [ref] 50.0 ns +- 0.9 ns: 1.50x faster (-33%) It's MUCH faster, good job. But it requires complex and fragile code. Ok, let's see with operator.itemgetter() supporting tp_fastcall, Python modified to use tp_fastcall and without the "cached arg" hack: $ ./python -m perf compare_to ref.json fastcall_wrapper.json
Median +- std dev: [ref] 50.0 ns +- 0.9 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.04x faster (-4%) It's a little bit faster, but that's not the point. The point is that it isn't slower and it doesn't require modifying C code to benefit from the optimization! Just to be clear, here's another benchmark result on property_descr_get() without "cached args": without fastcall (py34) vs. with fastcall ("fastcall_wrapper"): $ ./python -m perf compare_to py34.json fastcall_wrapper.json
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.56x faster (-36%) Summary:
|
Is tp_call set to the wrapper rather than inherited? What if tp_call is defined in a superclass?
I would consider this a bug. It would be weird if different ways of calling executed different code. What about dynamically changed Python types? What if you set or delete the __call__ attribute of a Python class? |
It is set to the wrapper. Defining tp_fastcall should behave like defining tp_call: it should override both the tp_call and tp_fastcall slots.
What is a superclass? Do you have an example?
It's a micro-optimization to avoid argument packing/unpacking: conversions are needed when you switch from the regular calling convention to the fast call convention, or the opposite. I don't think that types defining both tp_call and tp_fastcall will be common.
What do you mean? Do you have an example?
I don't know how these things work :-) Let me try on a patched Python (using tp_fastcall): $ ./python
>>> class A:
... def __call__(self): print("A")
...
>>> a=A()
>>> a()
A
>>> A.__call__=lambda self: print("B!")
>>> a()
B!
>>> del A.__call__
>>> a()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'A' object is not callable It seems like "it just works", but I don't know how it works internally :-) a() uses a dynamic lookup of a.__class__.__call__, no? |
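On the "dynamic lookup" question above: the reason it "just works" is standard Python semantics, not anything specific to the patch. Implicit calls like a() look up __call__ on the type (i.e. the tp_call slot), never on the instance, which a quick check shows:

```python
class A:
    def __call__(self):
        return "type slot"

a = A()
# An instance attribute named __call__ is ignored by the a() syntax:
# implicit special-method lookup goes through type(a), so it hits the
# type's slot, not the instance dict.
a.__call__ = lambda: "instance attr"
print(a())            # type slot
print(a.__call__())   # instance attr (explicit attribute lookup)
```

So setting or deleting A.__call__ on the class updates the slot (via update_one_slot()), while instance attributes never interfere with the call protocol.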
Ok, here is a first patch: tp_fastcall.patch
Hum, I should make sure that these new functions are not public. By the way, I should probably rename PyArg_UnpackStack() to _PyArg_UnpackStack(). |
tp_fastcall.patch modifies property_descr_get() to use tp_fastcall if available, or fall back to the cached tuple. Once performance is validated (no regression), the cached tuple used to pass positional arguments should go away. By the way, msg285390 uses the following micro-benchmark: import perf
bench = perf.Runner()
bench.timeit("namedtuple.attr",
"a.a",
"from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)",
duplicate=20) It tests property_descr_get() with operator.itemgetter(). It would be nice to benchmark property_descr_get() with other functions. |
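As a sketch of benchmarking property_descr_get() with a function other than itemgetter, the stdlib timeit module can be used directly (the class and attribute names below are made up for illustration; absolute numbers depend on the machine):

```python
import timeit

# Hypothetical example: compare attribute access through a plain Python
# property (which goes through property_descr_get() and calls the getter)
# with direct attribute access, which skips the property machinery.
class Point:
    def __init__(self):
        self._x = 1.0

    @property
    def x(self):            # property_descr_get() ends up calling this
        return self._x

p = Point()
t_prop = timeit.timeit("p.x", globals={"p": p}, number=100_000)
t_attr = timeit.timeit("p._x", globals={"p": p}, number=100_000)
print(f"property: {t_prop:.4f}s  direct: {t_attr:.4f}s")
```

The gap between the two timings bounds how much property_descr_get() overhead there is to win back for this getter.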
PyArg_UnpackStack() is declared in the limited API as existing since 3.3. If you want to add PyArg_UnpackStack() to the limited API, declare it as added in 3.7. For compatibility with extensions built with older Pythons, you should define a new type flag and read tp_fastcall only if the flag is set. See for example Py_TPFLAGS_HAVE_FINALIZE. |
Ok, I pushed a new commit "Add Py_TPFLAGS_HAVE_FASTCALL flag" to my Git branch. (By the way, I'm using the Git branch as a working copy; it can change at any time to keep a nice list of commits. If you want reliable changes, use the patches attached to this issue.) Serhiy Storchaka:
Wait, there may be an issue with the stable ABI in my design :-/ Let's say that a builtin type NEW defines tp_fastcall and uses fastcall_wrapper() for tp_call. A type OLD inherits from NEW and doesn't override tp_call (nor tp_fastcall). Type OLD comes from an extension compiled with Python 3.6 and uses the stable ABI. Does the type OLD contain the tp_fastcall field in memory? Is it possible to copy the tp_fastcall field from type NEW into type OLD? If not, fastcall_wrapper() should be smarter: not read tp_fastcall from type OLD, but follow the MRO to find the first type with the Py_TPFLAGS_HAVE_FASTCALL flag. Is it possible that this short lookup loop has a cost on performance? |
I ran benchmarks on tp_fastcall.patch. Hum, it seems like there is a performance issue somewhere :-( Almost all benchmarks are much slower with the patch. haypo@speed-python$ python3 -m perf compare_to 2017-01-11_00-07-default-b9404639a18c.json tp_fastcall_ref_b9404639a18c.json -G
Faster (1):
|
Strange, I'm unable to reproduce the result on a different computer: haypo@smithers$ ./python -m perf compare_to ref.json tp_fastcall.json -v |
Oops, I forgot to mention that I tried the telco benchmark. |
This benchmark was run with LTO+PGO. It seems like PGO produced less efficient machine code with the patch. Maybe a missed optimization. Oh, I just realized that I announced that objects with a __call__() method are faster, but that's not the case yet! Such objects still get slot_tp_call() as tp_call, which isn't a fastcall! |
Can you remove this _PyStack_AsDict() & _PyStack_UnpackDict() by creating a shortcut in _PyCFunction_FastCallKeywords()? |
I've created pull request about it: |
Ah, the benchmark results changed after a reboot. I reran the same benchmark (LTO+PGO on speed-python) after a reboot:
Faster (19):
Benchmark hidden because not significant (16): call_method_unknown, dulwich_log, float, genshi_xml, html5lib, json_loads, logging_silent, pathlib, pickle, raytrace, scimark_sor, scimark_sparse_mat_mult, sqlalchemy_imperative, sympy_sum, tornado_http, unpickle_list Lighter verbose with a threshold of 5%:
Faster (1):
Benchmark hidden because not significant (60): 2to3, call_method, ... spectral_norm and scimark_lu are two unstable benchmarks. Sadly, it seems like the current patch does not make a significant difference on benchmarks. But again, it doesn't optimize the __call__() slot yet. |
Additionally, there are not enough METH_FASTCALL C functions. I'm interested in the performance difference between METH_VARARGS and METH_FASTCALL. If METH_FASTCALL is significantly faster than METH_VARARGS, |
New changeset 2ccdf2b3819d by Victor Stinner in branch 'default': |
I pushed many changes tonight which convert many C functions to METH_FASTCALL.
IMHO performance is a good motivation to convert code to Argument Clinic ;-) But sometimes AC can be painful, so doing it by hand is also acceptable ;-) |
Patch version 2:
I'm not sure that the Py_TPFLAGS_HAVE_FASTCALL flag is used correctly. Maybe this new flag disables optimizations in some cases; I didn't read that part of the code carefully. |
Back on performance: if you combine tp_fastcall-2.patch + 3 minor changes to use METH_FASTCALL, the bm_telco benchmark becomes 22% faster, which is significant for a benchmark: "20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms". It also means that 22% of the benchmark time is currently spent handling function arguments! I expected most of the time to be spent on maths and I/O! |
I checked the effect of individual patches:
tp_fastcall-2 + print + struct + decimal: 16.3 ms +- 0.6 ms Oh wow, I didn't expect print to be the bottleneck of this benchmark!? There is a single print() in the hot code of bm_telco: print(t, file=outfil) Maybe it's the expensive price of creating a temporary dictionary for the keyword argument? |
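The suspicion about the keyword argument can be probed with a rough micro-benchmark. This is a sketch, not the benchmark used in the thread: it compares print() with a file= keyword (which, under the regular calling convention, means packing a temporary kwargs dict on every call) against a bare write() call taking one positional argument. Absolute numbers depend on the machine and Python version.

```python
import io
import timeit

buf = io.StringIO()

# print() with a keyword argument: each call must carry file=buf, which the
# regular calling convention passes via a temporary dict.
t_kw = timeit.timeit("print('x', file=buf)",
                     globals={"buf": buf}, number=10_000)

# Baseline: write() takes a single positional argument, no kwargs dict.
t_write = timeit.timeit("buf.write('x\\n')",
                        globals={"buf": buf}, number=10_000)

print(f"print(file=...): {t_kw:.4f}s  buf.write(): {t_write:.4f}s")
```

Note that print() also does more work than write() (stringification, separator/end handling), so this only brackets the keyword-passing cost rather than isolating it.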
Please see the patch on bpo-29295. |
Ok, I fixed all remaining issues. The new patch version 3 is now waiting for your review ;-) Patch version 3:
About the stable ABI: since version 3, it should just work thanks to the change on Py_TPFLAGS_DEFAULT. Code compiled with Python 3.7 benefits directly from tp_fastcall. Code compiled with Python 3.6 or older will get fastcall_wrapper(), which finds tp_fastcall in base classes. |
Oops, I had a local commit for testing purpose. I regenerated the patch: tp_fastcall-4.patch.
Yes, patches are generated from my Git pull request: The Git history became a little uglier after many changes :-) I should try to rework it into better-structured changes. I will likely do that before pushing, once the change is accepted. |
I commented at the pull request. |
New changeset 261b108b9468 by Victor Stinner in branch 'default': |
New changeset 0b219348ec9e by Victor Stinner in branch 'default': |
New changeset 1b7a5bdd05ed by Victor Stinner in branch 'default': |
New changeset a3865d0c1844 by Victor Stinner in branch 'default': |
New changeset a241817424e5 by Victor Stinner in branch 'default': New changeset a8d35309dcc0 by Victor Stinner in branch 'default': New changeset ee6e1b1151a8 by Victor Stinner in branch 'default': New changeset 11039ece46b9 by Victor Stinner in branch 'default': New changeset 8cc5d78d9b18 by Victor Stinner in branch 'default': |
See also issue bpo-29306: "Check usage of Py_EnterRecursiveCall() and Py_LeaveRecursiveCall() in new FASTCALL functions". |
New changeset 127401469628 by Victor Stinner in branch 'default': |
New changeset bf6728085b01 by Victor Stinner in branch 'default': |
tp_fastcall-5.patch: rebased patch. I already pushed the changes not directly related to tp_fastcall separately, so the new patch should be a *little bit* shorter. |
Patch version 6: type inheritance and all wrappers should now be handled correctly. Changes:
IMHO the trickiest parts of the change are in update_one_slot() and add_operators(). Many fixes were "backported" from my work on issue bpo-29358 (tp_fastnew and tp_fastinit). Technically, I cherry-picked commits and fixed all the merge conflicts. It may be worth using FASTCALL in more wrappers (wrap_xxx functions in typeobject.c) of slotdefs. |
Oh, I had a local change not yet committed, so tp_fastcall-6.patch didn't get the [review] button. I just pushed the commit and then posted the exact same patch again. |
performance results on tp_fastcall-6.patch (diff >= 2%):
Faster (17):
Benchmark hidden because not significant (41): (...) Or less verbose output, only show difference >= 5%:
Faster (3):
Benchmark hidden because not significant (59): (...) To be honest, I'm disappointed; I expected more benchmarks to be faster. Maybe the cost of the temporary tuple/dict used to pass arguments is not as high as I expected. telco is also my favorite benchmark; I'm sad that it's 1.09x slower. I should investigate that. The scimark_monte_carlo and scimark_lu benchmarks are still unstable for unknown reasons, so I don't care too much about these two benchmarks. |
tp_fastcall-6.patch: "call_method: 11.0 ms +- 0.3 ms -> 11.5 ms +- 0.5 ms: 1.05x slower (+5%)" I looked at the C code: the patch doesn't change any C function used by this benchmark. IMHO it's only a matter of code locality. Maybe the PGO compilation produced less efficient code with the patch. |
Another run on my machine (tp_fastcall-6.patch): ../python.default -m perf compare_to default.json patched.json -G --min-speed=2
Slower (8):
- logging_silent: 736 ns +- 7 ns -> 783 ns +- 12 ns: 1.06x slower (+6%)
- unpickle_pure_python: 829 us +- 13 us -> 875 us +- 12 us: 1.06x slower (+6%)
- scimark_sor: 482 ms +- 8 ms -> 507 ms +- 14 ms: 1.05x slower (+5%)
- xml_etree_iterparse: 223 ms +- 6 ms -> 228 ms +- 7 ms: 1.02x slower (+2%)
- scimark_fft: 728 ms +- 16 ms -> 745 ms +- 19 ms: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 8.71 ms +- 0.32 ms -> 8.90 ms +- 0.28 ms: 1.02x slower (+2%)
- json_dumps: 29.3 ms +- 0.3 ms -> 29.9 ms +- 0.3 ms: 1.02x slower (+2%)
- pathlib: 51.2 ms +- 0.6 ms -> 52.3 ms +- 0.8 ms: 1.02x slower (+2%) Faster (8):
|
"While I feel your work is great, the performance benefit seems very small." I have to agree. I spent a lot of time benchmarking these tp_fast* changes. While one or two benchmarks are faster, it's not really the case for the others. I also agree about the complexity. In Python 3.6, most FASTCALL changes were internal. For example, PyObject_CallFunctionObjArgs() now uses FASTCALL internally, without having to modify callers of the API. I tried to only use _PyObject_FastCallDict/Keywords() in a few places where the speedup was significant. The main visible change of Python 3.6 FASTCALL is the new METH_FASTCALL calling convention for C functions. Your change modifying print() to use METH_FASTCALL has a significant impact on the telco benchmark, with no drawback. I tested further changes to use METH_FASTCALL in the struct and decimal modules, and they optimize telco even more. To continue the optimization work, I guess that using METH_FASTCALL in more cases, using Argument Clinic whenever possible, would have a more concrete and measurable impact on performance than this big tp_fastcall patch. But I'm not ready to abandon the whole approach yet, so I'm changing the status to Pending. I may come back in one or two months to check whether I missed anything obvious to unlock even more optimizations ;-) |
New changeset 31342913fb1e by Victor Stinner in branch 'default': |
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master': |
I don't completely reject the issue. I may come back later if I find a way to make it even more efficient. But I prefer to close the issue as REJECTED to make it clear that right now with the current implementation and benchmark results, it's not worth it. |
Can you comment on #4944 why you think that such compatibility should be guaranteed? According to PEP-384, the layout of PyTypeObject is not part of the stable ABI. |