classification
Title: _PyFunction_FastCallDict(): replace PyTuple_New() with PyMem_Malloc()
Type: performance
Stage:
Components:
Versions: Python 3.7
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: haypo, josh.r, python-dev, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2016-11-30 11:32 by haypo, last changed 2017-01-11 01:22 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
fastcalldict.patch haypo, 2016-11-30 11:32 review
fastcalldict-2.patch haypo, 2016-11-30 11:49 review
fastcalldict-3.patch haypo, 2016-12-01 10:05 review
bench_fastcalldict.py haypo, 2016-12-01 12:08
fastcalldict-4.patch haypo, 2017-01-03 01:22 review
Messages (18)
msg282079 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-30 11:32
Attached patch is a minor optimization for _PyFunction_FastCallDict(): avoid the creation of a tuple to pass keyword arguments, use a simple C array allocated by PyMem_Malloc().

It also uses a small stack of 80 bytes (2*5*sizeof(PyObject*)) allocated on the C stack to pass up to 5 keyword arguments (5 key+value pairs): this completely avoids heap allocation in the common case.

I wrote _PyFunction_FastCallDict() (issue #27809): the code was based on function_call() which also uses PyTuple_New(). When I wrote the function, I didn't notice that PyEval_EvalCodeEx() doesn't expect a Python tuple object, but a C array.

The patch also modifies function_call() to call _PyFunction_FastCallDict(), so it gets _PyFunction_FastCallDict() optimizations.
msg282080 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-30 11:49
fastcalldict.patch avoided INCREF/DECREF on keyword keys and values. This is wrong: we must hold strong references because the keyword dictionary can technically be modified during the call: see issue #2016 and test_extcall.

Hum, I'm quite sure that it's not the first time that I was bitten by this bug. That's maybe why I didn't try to implement this optimization the first time.

fastcalldict-2.patch keeps INCREF/DECREF and so doesn't crash on test_extcall.
msg282095 - (view) Author: Josh Rosenberg (josh.r) * Date: 2016-11-30 19:35
Given you can't avoid the refcounting overhead, how much does this really help? Are there meaningful benefits in microbenchmarks? I'd worry that unconditional allocation from PyMem_Malloc might lose out relative to PyTuple_New, which is likely to not involve memory allocation at all (pulling from tuple free list).
msg282098 - (view) Author: Josh Rosenberg (josh.r) * Date: 2016-11-30 19:49
Minor correction: No allocation when small stack used, so you'd only see (possibly) regressions with 6+ keyword arguments (assuming the tuple free list applies for tuples that large). Admittedly a minor concern; keyword processing is already pretty slow, and large numbers of keywords being passed are likely a tiny fraction of call cases, so I'd guess microbenchmarks wouldn't show a big change, and broad benchmarks would be completely unaffected, but worth checking.
msg282152 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 09:14
I agree with Josh: PyTuple_New() can be faster than PyMem_Malloc() thanks to the tuple free list. small_stack also increases C stack consumption even for calls without keyword arguments. This is a serious problem since we can't control stack overflow.
msg282154 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 10:05
New patch version: fix the "if (0)" so that the small stack allocated on the C stack is actually used.
msg282161 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 12:07
bench_fastcalldict.py: a hardcore microbenchmark on _PyFunction_FastCallDict(). It passes keyword arguments to the tp_init slot of a Python class constructor. Results for 1, 5 and 10 keyword arguments:

kw1: Median +- std dev: [ref] 329 ns +- 21 ns -> [patch] 306 ns +- 17 ns: 1.07x faster (-7%)
kw5: Median +- std dev: [ref] 508 ns +- 22 ns -> [patch] 481 ns +- 25 ns: 1.05x faster (-5%)
kw10: Median +- std dev: [ref] 829 ns +- 45 ns -> [patch] 805 ns +- 39 ns: 1.03x faster (-3%)

As expected, the difference is small, but it's faster :-) An indirect benefit is that the garbage collector should be less stressed :-) (tuples are tracked by the GC).

Note: Using a simple printf() in the C code, I noticed that it is not uncommon for _PyFunction_FastCallDict() to be called with an empty dictionary for keyword arguments. Without the patch, an empty tuple was created. With my patch, "unpacking" the empty dictionary costs nothing.
msg282162 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 12:08
(Oops, I attached the wrong benchmark script. It's now fixed.)
msg282168 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 12:44
> Note: Using a simple printf() in the C code, I noticed that it is not uncommon that _PyFunction_FastCallDict() is called with an empty dictionary for keyword arguments.

Simplified Python example where _PyFunction_FastCallDict() is called with an empty dictionary:
---
def f2():
    pass

def wrapper(func, *args, **kw):
    # CALL_FUNCTION_EX: func(**{}) calls PyObject_Call() with kwargs={} which
    # calls _PyFunction_FastCallDict()
    func(*args, **kw)

def f():
    # CALL_FUNCTION: calling wrapper calls fast_function() which calls
    # _PyEval_EvalCodeWithName() which creates an empty dictionary for kw
    wrapper(f2)

f()
---

But in this specific case, the speedup is *very* small: 3 nanoseconds :-)

./python -m perf timeit -s 'kw={}' -s 'def func(): pass' --duplicate=1000 'func(**kw)'
(...)
Median +- std dev: [ref] 108 ns +- 4 ns -> [patch] 105 ns +- 5 ns: 1.02x faster (-2%)
msg282169 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 12:47
Serhiy: "small_stack increases C stack consumption even for calls without keyword arguments. This is serious problem since we can't control stack overflow."

This problem is not new and is worked around by the Py_EnterRecursiveCall() macro, which counts the depth of the Python stack.

I didn't notice any stack overflow crash with my patch. We can reduce the "small_stack" size later if needed.
msg282175 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 13:25
> I agree with Josh, PyTuple_New() can be faster than PyMem_Malloc() due to tuple free list.

According to benchmarks, PyTuple_New() is slower than PyMem_Malloc(). That doesn't surprise me: using a tuple object requires extra work:

* Track and then untrack the object from the garbage collector
* Destructor uses Py_TRASHCAN_SAFE_BEGIN/Py_TRASHCAN_SAFE_END macros
* Some additional indirections

When I started working on "fastcall", I was surprised that not creating tuples has a *significant* (positive) effect on performance. It seems to be between 5% and 45% faster. Obviously, it depends on the speed of the function body. The speedup is higher for faster functions, like fast functions implemented in C.
msg282181 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 14:27
The problem with C stack overflow is not new, but your patch may make it worse (I don't know whether it actually does). Py_EnterRecursiveCall() is used for limiting the Python stack. It can't prevent C stack overflow.
msg282191 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-01 16:03
I pushed the change b9c9691c72c5 to replace PyObject_CallFunctionObjArgs() with _PyObject_CallNoArg() or _PyObject_CallArg1(). These functions are simpler and don't allocate memory on the C stack.

In Python 3.6, I made PyObject_CallFunctionObjArgs() similar to this issue: it doesn't create a tuple, but uses a small stack allocated on the C stack, or allocates heap memory, to pass a C array of PyObject* to _PyObject_FastCall().

Currently, PyObject_CallFunctionObjArgs() uses a small stack for up to 5 positional arguments: it allocates 40 bytes on the C stack.

Serhiy Storchaka: "The problem with C stack overflow is not new, but your patch may make it worse (I don't know if it actually make it worse)."

I consider 80 bytes small enough for the C stack. As I wrote, we can reduce this "small stack" in the future if someone reports issues.


"Py_EnterRecursiveCall() is used for limiting Python stack. It can't prevent C stack overflow."

I know that the protection is not perfect. It's a heuristic.

I don't even know if it counts Python builtin functions, or only pure Python functions.


But I'm not sure that I understood your comment: do you suggest using a tuple and rejecting this issue? Reducing the size of the small stack? Or only allocating memory on the heap?

If the issue is the memory allocated on the C stack, maybe we can use a free list for "stacks" (C arrays of PyObject*), as done for tuples? I'm not sure that a free list on top of PyMem_Malloc()/PyMem_Free() would be useful, since they use our efficient pymalloc allocator.
msg284518 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-03 00:48
Quick update on the fastcall work.

> I pushed the change b9c9691c72c5 to replace PyObject_CallFunctionObjArgs() with _PyObject_CallNoArg() or _PyObject_CallArg1(). These functions are simpler and don't allocate memory on the C stack.

Using _PyObject_CallArg1() increased the usage of the C stack: see issue #28858. The fix was to remove the _PyObject_CallArg1() macro, replaced with PyObject_CallFunctionObjArgs(func, arg, NULL).

There is still an open issue #28870 to reduce the usage of C stack memory even further, but that is a more general optimization.
msg284521 - (view) Author: Roundup Robot (python-dev) Date: 2017-01-03 01:05
New changeset 5f7cd3b6c9b1 by Victor Stinner in branch 'default':
Issue #28839: Optimize function_call()
https://hg.python.org/cpython/rev/5f7cd3b6c9b1

New changeset f9dd607dc04c by Victor Stinner in branch 'default':
Optimize _PyFunction_FastCallDict() when kwargs is {}
https://hg.python.org/cpython/rev/f9dd607dc04c
msg284522 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-03 01:10
I pushed the two obvious and safe optimizations of fastcalldict-3.patch.
msg284523 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-03 01:22
fastcalldict-4.patch: Rebased patch.

kw1: Median +- std dev: [ref] 290 ns +- 3 ns -> [patch] 253 ns +- 21 ns: 1.14x faster (-13%)
kw5: Median +- std dev: [ref] 438 ns +- 33 ns -> [patch] 394 ns +- 27 ns: 1.11x faster (-10%)
kw10: Median +- std dev: [ref] 663 ns +- 14 ns -> [patch] 617 ns +- 16 ns: 1.07x faster (-7%)

This patch still always allocates 10 pointers (80 bytes) on the C stack: see issue #28870, which discusses options to use less memory.
msg285177 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-11 01:22
Ok, I give up on that one. I don't think that it's worth it.
History
Date User Action Args
2017-01-11 01:22:44 haypo set status: open -> closed
resolution: rejected
messages: + msg285177
2017-01-03 01:22:40 haypo set files: + fastcalldict-4.patch
messages: + msg284523
2017-01-03 01:10:17 haypo set messages: + msg284522
2017-01-03 01:05:49 python-dev set nosy: + python-dev
messages: + msg284521
2017-01-03 00:48:39 haypo set messages: + msg284518
2016-12-01 16:03:23 haypo set messages: + msg282191
2016-12-01 14:27:48 serhiy.storchaka set messages: + msg282181
2016-12-01 13:25:12 haypo set messages: + msg282175
2016-12-01 12:47:26 haypo set messages: + msg282169
2016-12-01 12:44:29 haypo set messages: + msg282168
2016-12-01 12:08:28 haypo set files: + bench_fastcalldict.py
messages: + msg282162
2016-12-01 12:07:44 haypo set files: - bench_fastcalldict.py
2016-12-01 12:07:32 haypo set files: + bench_fastcalldict.py
messages: + msg282161
2016-12-01 10:05:27 haypo set files: + fastcalldict-3.patch
messages: + msg282154
2016-12-01 09:14:42 serhiy.storchaka set messages: + msg282152
2016-11-30 19:49:16 josh.r set messages: + msg282098
2016-11-30 19:35:18 josh.r set nosy: + josh.r
messages: + msg282095
2016-11-30 11:49:26 haypo set files: + fastcalldict-2.patch
messages: + msg282080
2016-11-30 11:32:47 haypo create