classification
Title: Use fast call in method_call() and slot_tp_new()
Type: performance Stage:
Components: Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, pitrou, python-dev, scoder, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2016-08-23 14:54 by haypo, last changed 2016-08-24 23:19 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
call_prepend.patch haypo, 2016-08-23 14:54 review
Messages (7)
msg273461 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-08-23 14:54
Attached patch avoids the creation of a temporary tuple in method_call() and slot_tp_new() by using the new fast call calling convention.

It uses a small buffer allocated on the stack C if the function is called with 4 arguments or less, or it allocates a buffer in the heap memory.

The function also avoids INCREF/DECREF: references are borrowed, not strong references.

The patch adds a private _PyObject_Call_Preprend() helper function written to optimize such way of packing positional arguments, it's like:

   args = (obj,) + args
   func(*args, **kw)
msg273462 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-08-23 14:55
> It uses a small buffer allocated on the stack C if the function is called with 4 arguments or less, or it allocates a buffer in the heap memory.

Maybe 4 is too small. On 64 bit, it's just 5*8=40 bytes. Maybe we can use a buffer of 10 pointers: 80 bytes? It would optimize calls with up to 9 arguments (1 pointer is used for "obj" argument, the "prepended" argument).
msg273496 - (view) Author: Stefan Behnel (scoder) * Date: 2016-08-23 17:53
If you care so much about C stack space, you could also try to create two or three entry point functions that keep (say) a 4, 8 and 16 items array on the stack respectively, and then pass the pointer (and the overall length if you need it) of that array on into a single function that copies the argument pointers into it and calls the Python function. As long as they are all really simple static functions that end with tail calls (i.e. with "return other_function()"), the C compiler should always be able to inline the entry points into their caller and not waste any additional C stack space for calling them.
msg273517 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-08-23 21:15
Stefan Behnel added the comment:
> If you care so much about C stack space, you could also try to create two or three entry point functions that keep (say) a 4, 8 and 16 items array on the stack respectively, (...)

I should compute statistics, but I'm quite sure that most function
calls take 5 or less parameters. I don't think that allocating 5
PyObject* uses too much C stack, what do you think?

I don't think that using the heap memory for more parameters will
"kill" performances. Python 3.5 already does the same, a tuple is
allocated in the heap memory ;-) It just that I want to optimize the
common case.
msg273590 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-08-24 20:32
Tuples use a freelist ;-)
I think you can safely bump the small array to 7 or 8 elements, to satisfy even more cases. Though you are right that function calls with more than 5 arguments should be rare (and mostly to complicated functions where function call overhead doesn't really matter).
msg273610 - (view) Author: Roundup Robot (python-dev) Date: 2016-08-24 23:14
New changeset 316e5de8a96b by Victor Stinner in branch 'default':
method_call() and slot_tp_new() now uses fast call
https://hg.python.org/cpython/rev/316e5de8a96b
msg273612 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-08-24 23:19
> Tuples use a freelist ;-)

Honestly, I didn't expect that my "fast call" thing would give any significant speedup. I know that tuple allocator uses a free list.

Fast calls allows to avoid INCREF/DECREF on each argument, can use the C stack for short memory allocations and C arrays are not tracked by the garbage collector. It looks like all these minor things altogether provides a concrete and significant speedup on benchmarks.

So, I pushed call_prepend.patch with two changes:

* Fix a typo in the function name :-D
* Fix a reference leak in slot_tp_new() ;-)
History
Date User Action Args
2016-08-24 23:19:06hayposetstatus: open -> closed
resolution: fixed
messages: + msg273612
2016-08-24 23:14:31python-devsetnosy: + python-dev
messages: + msg273610
2016-08-24 20:32:57pitrousetnosy: + pitrou
messages: + msg273590
2016-08-23 21:15:00hayposetmessages: + msg273517
2016-08-23 17:53:06scodersetmessages: + msg273496
2016-08-23 14:55:31hayposetmessages: + msg273462
2016-08-23 14:54:11haypocreate