classification
Title: Tuple creation is too slow
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Demur Rumed, ezio.melotti, jstasiak, pitrou, python-dev, rhettinger, scoder, serhiy.storchaka, vstinner, yselivanov
Priority: low Keywords: patch

Created on 2015-02-24 08:08 by serhiy.storchaka, last changed 2016-12-18 13:02 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
reuse_argtuples.patch serhiy.storchaka, 2015-02-24 08:08 review
reuse_argtuples_2.patch serhiy.storchaka, 2015-02-26 10:44 review
bench_builtins.py vstinner, 2016-04-22 13:54
reuse_argtuples_3.patch vstinner, 2016-04-22 16:22 review
stack_overflow.py serhiy.storchaka, 2016-12-01 15:38
bench_builtins.py vstinner, 2016-12-01 15:39
reuse_argtuples_4.patch serhiy.storchaka, 2016-12-18 13:02 review
Messages (25)
msg236480 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-24 08:08
Currently tuple creation uses free lists, but this optimization is not enough. When a cached tuple is reused for the arguments of repeatedly called functions, the performance of some builtins can be significantly increased.

$ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(filter(f, s))"
Unpatched: 1000 loops, best of 3: 773 usec per loop
Patched:   1000 loops, best of 3: 558 usec per loop

$ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(map(f, s))"
Unpatched: 1000 loops, best of 3: 689 usec per loop
Patched:   1000 loops, best of 3: 556 usec per loop

$ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "sorted(s, key=f)"
Unpatched: 1000 loops, best of 3: 758 usec per loop
Patched:   1000 loops, best of 3: 550 usec per loop

The same effect can be achieved for itertools functions.

I don't propose to commit this complicated patch, but these results can be used as a guide for optimizing tuple creation. It is surprising to me that this patch has any effect at all.
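[Editor's note: the free-list mechanism mentioned at the top of this message can be sketched in plain C. This is a toy illustration, not the real code from Objects/tupleobject.c — the type and function names are invented. The idea: recently freed objects of each small size are kept on a per-size singly linked list, so the next allocation of that size skips malloc().]

```c
#include <stdlib.h>

#define MAX_SAVE_SIZE 20   /* small-size cutoff, as in CPython's tuple free lists */

/* Toy tuple: a header plus a flexible array of item pointers. */
typedef struct toy_tuple {
    size_t size;
    struct toy_tuple *next_free;   /* free-list link while the object is dead */
    void *items[];
} ToyTuple;

/* One free list per small size; empty lists are NULL. */
static ToyTuple *free_list[MAX_SAVE_SIZE];

static ToyTuple *toy_tuple_new(size_t n) {
    ToyTuple *t;
    if (n < MAX_SAVE_SIZE && free_list[n] != NULL) {
        /* Fast path: pop a cached object instead of calling malloc(). */
        t = free_list[n];
        free_list[n] = t->next_free;
    }
    else {
        t = malloc(sizeof(ToyTuple) + n * sizeof(void *));
    }
    t->size = n;
    t->next_free = NULL;
    return t;
}

static void toy_tuple_free(ToyTuple *t) {
    if (t->size < MAX_SAVE_SIZE) {
        /* Park the object on the free list for its size. */
        t->next_free = free_list[t->size];
        free_list[t->size] = t;
    }
    else {
        free(t);
    }
}
```

Even with this cache, every call still pays for popping, initializing, and re-parking an object, which is why reusing one dedicated argument tuple across calls can still win.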
msg236654 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-02-26 08:50
I had done something like this for zip() long ago and had gotten a nice speed boost.  I remember looking at whether it should be applied elsewhere but don't remember why I decided against it.
msg236656 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-26 08:58
Note the check for Py_REFCNT() == 1. Without it the code would be incorrect.
msg236661 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-26 10:44
Fixed bugs found by Victor. Thank you Victor.

Yes, zip() already uses a similar approach.
msg236691 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-02-26 17:45
Maybe we want a facility to create on-stack static-size tuples?
msg236692 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-02-26 17:45
How many functions can benefit from this approach, though?
msg236696 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-26 18:21
> Maybe we want a facility to create on-stack static-size tuples?

There is no guarantee that the called function doesn't save a reference to args.

> How many functions can benefit from this approach, though?

From hand-written caching? Any function that repeatedly calls other functions many times. In addition to sorted()/list.sort() with the key argument, filter() and map(), these are min() and max() with the key argument, and some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()). Maybe also some functions that are thin wrappers around a special method (e.g. round()).

But it is more interesting to investigate why PyTuple_New() is so slow in comparison with hand-written caching. If it can be sped up, then much more code can benefit from this.
msg240693 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-13 18:15
By the way, is PyTuple_New() really the culprit here? Or is it the valist handling in PyObject_CallFunctionObjArgs()?
msg240845 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-14 06:13
Unassigning because I don't have an interest in going further down this path.  I used this technique with zip() because tuple reuse nicely benefited the common case of "for a,b,c in zip(alist, blist, clist): ...".  

On another thread, I did see mention of removing the memset() from tuple creation and that seems like a promising direction.
msg240862 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-14 09:28
map() doesn't use PyObject_CallFunctionObjArgs(). Comparing the results for map() with those for filter() and list.sort(), it seems that PyTuple_New() is responsible for 24% and PyObject_CallFunctionObjArgs() adds another 14%.
msg263991 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 09:38
Hi, I started to work on a new top-down approach that completely avoids the temporary tuple/dict: "FASTCALL", issue #26814. As you may expect, the patch is much larger than Serhiy's patches attached to this issue, but IMHO the final code is much simpler: it doesn't have to use complex tricks on the reference count or set tuple items to NULL (which leads to segfaults like issue #26811).


Antoine Pitrou: "Maybe we want a facility to create on-stack static-size tuples?"

My current implementation of FASTCALL uses a buffer allocated on the stack to handle calls with up to 10 parameters (10 positional parameters or 5 keyword parameters, since a keyword uses 2 PyObject*).
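[Editor's note: the stack-buffer idea can be sketched in plain C. All names here are invented for illustration; the real FASTCALL code operates on PyObject* vectors. Small argument vectors live in a fixed array on the C stack, and only unusually large calls fall back to the heap.]

```c
#include <stdlib.h>
#include <string.h>

#define SMALL_STACK 10   /* matches the 10-slot buffer described above */

/* Stand-in for the callee; a real fast call would receive PyObject **args. */
static long sum_args(const long *args, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += args[i];
    return total;
}

static long call_with_small_buffer(const long *src, size_t n) {
    long small[SMALL_STACK];
    /* Common case: no heap allocation at all for <= SMALL_STACK arguments. */
    long *vec = (n <= SMALL_STACK) ? small : malloc(n * sizeof(long));
    memcpy(vec, src, n * sizeof(long));
    long result = sum_args(vec, n);
    if (vec != small)
        free(vec);
    return result;
}
```

The design choice is the usual small-buffer trade-off: the common case (few arguments) pays nothing for allocation, while the rare large call still works correctly via malloc().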


Antoine Pitrou: "How many functions can benefit from this approach, though?"

In my experimental FASTCALL branch, slot wrappers, PyObject_Call*() functions (except PyObject_Call()!), ceval.c, etc. benefit from FASTCALL. The question is not really how much code benefits from FASTCALL, but rather how many changes are worth making to avoid the temporary tuple/dict. For example, right now I decided not to add a FASTCALL flavor of tp_new and tp_init (instantiating a class or creating a new type) because it becomes much more complex when you have to handle inheritance.

Maybe my FASTCALL approach requires too many changes and is overkill. Maybe we need to find a compromise between FASTCALL (issue #26814) and Serhiy's changes, which are limited to a few functions.

Today, it's too early to decide, but I'm having fun with my FASTCALL experiment ;-) Slot wrappers are 40% faster, getting a namedtuple attribute is 25% faster (even though Serhiy already optimized this specific case!), etc.
msg264012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 13:54
msg264009: Serhiy Storchaka "Could you compare filter(), map() and sorted() performance with your patch and with issue23507 patch?"

Here are the results of bench_builtins.py. The performance looks to be the same, though I expect better results with less trivial callbacks (where fastcall would avoid the creation of other temporary tuples).

------------------------------------------------+-------------+--------------
Tests                                           |   argtuples |      fastcall
------------------------------------------------+-------------+--------------
filter(lambda x: x, range(1000))                | 80.2 us (*) | 72.7 us (-9%)
map(lambda x: x, range(1000))                   | 79.5 us (*) |       79.9 us
map(lambda x, y: x+y, range(1000), range(1000)) |  111 us (*) |        109 us
sorted(list, key=lambda x: x)                   | 82.6 us (*) | 75.7 us (-8%)
sorted(list)                                    | 15.7 us (*) |       15.6 us
------------------------------------------------+-------------+--------------
Total                                           |  369 us (*) |        353 us
------------------------------------------------+-------------+--------------

Comparison to the original Python 3.6:

------------------------------------------------+-------------+----------------+---------------
Tests                                           |    original |      argtuples |       fastcall
------------------------------------------------+-------------+----------------+---------------
filter(lambda x: x, range(1000))                |  109 us (*) | 80.2 us (-27%) | 72.7 us (-33%)
map(lambda x: x, range(1000))                   | 96.9 us (*) | 79.5 us (-18%) | 79.9 us (-18%)
map(lambda x, y: x+y, range(1000), range(1000)) |  126 us (*) |  111 us (-12%) |  109 us (-13%)
sorted(list, key=lambda x: x)                   |  114 us (*) | 82.6 us (-28%) | 75.7 us (-34%)
sorted(list)                                    | 16.1 us (*) |        15.7 us |        15.6 us
------------------------------------------------+-------------+----------------+---------------
Total                                           |  463 us (*) |  369 us (-20%) |  353 us (-24%)
------------------------------------------------+-------------+----------------+---------------


The difference lies more in the changes to bltinmodule.c. Example with filter_next():

Using fastcall:

-            good = PyObject_CallFunctionObjArgs(lz->func, item, NULL);
+            good = PyObject_CallArg1(lz->func, item);

reuse_argtuples_2.patch:

-            good = PyObject_CallFunctionObjArgs(lz->func,
-                                                item, NULL);
+            PyObject *argtuple = lz->argtuple;
+            lz->argtuple = NULL;
+            if (argtuple == NULL) {
+                argtuple = PyTuple_New(1);
+                if (argtuple == NULL) {
+                    Py_DECREF(item);
+                    return NULL;
+                }
+                Py_INCREF(argtuple);
+            }
+            assert(Py_REFCNT(argtuple) == 2);
+            assert(Py_SIZE(argtuple) == 1);
+            assert(PyTuple_GET_ITEM(argtuple, 0) == NULL);
+            PyTuple_SET_ITEM(argtuple, 0, item);
+            good = PyObject_Call(lz->func, argtuple, NULL);
+            if (Py_REFCNT(argtuple) == 2) {
+                PyTuple_SET_ITEM(argtuple, 0, NULL);
+                if (lz->argtuple == NULL)
+                    lz->argtuple = argtuple;
+                else {
+                    Py_DECREF(argtuple);
+                    Py_DECREF(argtuple);
+                }
+            }
+            else {
+                Py_INCREF(item);
+                Py_DECREF(argtuple);
+                Py_DECREF(argtuple);
+            }

(reuse_argtuples_2.patch requires a few other changes related to filter_next().)
msg264026 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 16:22
reuse_argtuples_3.patch: rebased reuse_argtuples_2.patch.
msg264061 - (view) Author: Demur Rumed (Demur Rumed) * Date: 2016-04-23 12:11
This code could be made to look a lot less hackish if the potential tuple reuse were abstracted behind a PyTuple_UniqueOrNew(PyTuple) helper, which returns its argument if REFCNT == 1 and otherwise allocates a tuple of the same size and returns that. It decrefs the argument if REFCNT != 1.
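[Editor's note: on a toy refcounted object, the suggested helper could look like the sketch below. This is plain C with invented names — a real implementation would use Py_REFCNT(), PyTuple_New() and Py_DECREF() — but it shows the reuse rule: the object may only be recycled when the caller is its sole owner.]

```c
#include <stdlib.h>

/* Toy stand-in for a CPython tuple: a refcounted box with one slot. */
typedef struct {
    int refcnt;
    void *slot;
} ArgBox;

static ArgBox *box_new(void) {
    ArgBox *b = malloc(sizeof(ArgBox));
    b->refcnt = 1;
    b->slot = NULL;
    return b;
}

static void box_decref(ArgBox *b) {
    if (--b->refcnt == 0)
        free(b);
}

/* The suggested "unique or new" pattern: return the argument itself when
   the caller is the sole owner (safe to scribble over its slot), otherwise
   drop our reference to the shared box and hand back a fresh one. */
static ArgBox *box_unique_or_new(ArgBox *b) {
    if (b->refcnt == 1)
        return b;
    box_decref(b);   /* someone else (e.g. the callee) kept it alive */
    return box_new();
}
```

This concentrates the refcount trickery in one audited helper, instead of repeating the INCREF/DECREF dance in every builtin as in reuse_argtuples_2.patch.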
msg282179 - (view) Author: Roundup Robot (python-dev) Date: 2016-12-01 14:09
New changeset b9c9691c72c5 by Victor Stinner in branch 'default':
Replace PyObject_CallFunctionObjArgs() with fastcall
https://hg.python.org/cpython/rev/b9c9691c72c5
msg282180 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 14:12
I think it would be better to optimize PyObject_CallFunctionObjArgs().
msg282186 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 15:38
Here is an example that shows increased stack consumption.

With the old code it counts to 74700; with the patch it counts only to 65400.

Please revert your changes.
msg282187 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-01 15:39
I rewrote bench_builtins.py to use my new perf module.

Python 3.7 is between 1.27x and 1.42x faster than Python 3.5, but sorted(list) is slower. I don't know why sorted(list) is slower: it doesn't use a key function, and so should not be impacted by the FASTCALL changes made since Python 3.6.

Benchmark results of Python 3.7 (include latest FASTCALL patches) compared to Python 3.5 (no FASTCALL):

filter(lambda x: x, range(1000)): Median +- std dev: [3.5] 126 us +- 7 us -> [3.7] 94.6 us +- 4.8 us: 1.34x faster (-25%)
map(lambda x, y: x+y, range(1000), range(1000)): Median +- std dev: [3.5] 163 us +- 10 us -> [3.7] 115 us +- 7 us: 1.42x faster (-30%)
map(lambda x: x, range(1000)): Median +- std dev: [3.5] 115 us +- 6 us -> [3.7] 90.9 us +- 5.4 us: 1.27x faster (-21%)
sorted(list): Median +- std dev: [3.5] 17.5 us +- 1.0 us -> [3.7] 19.7 us +- 1.1 us: 1.12x slower (+12%)
sorted(list, key=lambda x: x): Median +- std dev: [3.5] 122 us +- 10 us -> [3.7] 91.8 us +- 5.7 us: 1.33x faster (-25%)

Note: you need the development version of perf to run the script (future version 0.9.2).
msg282188 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-01 15:49
Serhiy: "I don't propose to commit this complicated patch, but these results can be used as a guide to the optimization of tuple creating. It is surprising to me that this patch has any effect at all."

A new "fast call" calling convention was added to Python 3.6 to avoid the creation of temporary tuples to pass function arguments. Benchmarks confirm that Python 3.7 is faster than Python 3.5 of functions that Serhiy suggested to optimize. I confirm that I was also surprised that fast call has a significant effect on performance!

Fast calls don't require complex code like reuse_argtuples_3.patch, which requires carefully checking the reference count. In the past, a similar optimization in property_descr_get() introduced subtle bugs: see issue #24276 and issue #26811. Fast calls don't have such issues.

The implementation of fast calls is currently incomplete: the tp_new, tp_init and tp_call slots still require a tuple and a dict for positional and keyword arguments. I'm working on an enhancement for Python 3.7 to also support fast calls for these slots. In the meantime, callbacks invoked through tp_call don't yet benefit from fast calls.

That's why I had to keep the tuple/refcount optimization in property_descr_get(), to not introduce a performance regression.

I consider that this issue can now be closed as "fixed".
msg282190 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 16:02
Victor, your changes introduced a regression.
msg282195 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-01 16:23
Serhiy Storchaka: "Victor, your changes introduced a regression."

I suppose that you are referring to sorted(list), which seems to be slower. Do you have a specific change in mind? As I wrote, it doesn't use a key function, and so should not be impacted by my work on fast calls. I opened a new issue #28852 to track this performance regression.
msg282214 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-02 02:02
>  I opened a new issue #28852 to track this performance regression.

So can I close again this issue?
msg282217 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-02 05:32
No, I'm referring to the crash. Code that worked before your changes now crashes. Please revert your changes and find a way to optimize function calling without increasing stack consumption.
msg282229 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-02 07:54
haypo: "status: open -> closed"

Oops, it's a mistake, sorry. I only wanted to ask the question, not to close the issue.

Serhiy Storchaka: "No, I'm referring to the crashing. The code that worked before your changes now are crashing."

Sorry, I didn't notice that you attached a script! I opened the issue #28858.
msg283554 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-18 13:02
Here is the rebased patch (just for the record). Caching the argtuple is not faster (or is even slower) than using fast call.

The regression in stack consumption was fixed in issue #28858.

Thanks Victor!
History
Date User Action Args
2016-12-18 13:02:51serhiy.storchakasetstatus: open -> closed
files: + reuse_argtuples_4.patch
messages: + msg283554

resolution: fixed
stage: resolved
2016-12-02 07:54:42vstinnersetmessages: + msg282229
2016-12-02 05:32:37serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg282217
2016-12-02 02:02:50vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg282214
2016-12-01 16:23:31vstinnersetmessages: + msg282195
2016-12-01 16:02:20serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg282190
2016-12-01 15:49:13vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg282188
2016-12-01 15:39:41vstinnersetfiles: + bench_builtins.py

messages: + msg282187
2016-12-01 15:38:27serhiy.storchakasetfiles: + stack_overflow.py

messages: + msg282186
2016-12-01 14:12:01serhiy.storchakasetmessages: + msg282180
2016-12-01 14:09:50python-devsetnosy: + python-dev
messages: + msg282179
2016-05-09 22:56:48jstasiaksetnosy: + jstasiak
2016-04-23 12:11:11Demur Rumedsetnosy: + Demur Rumed
messages: + msg264061
2016-04-22 20:41:19yselivanovsetnosy: + yselivanov
2016-04-22 16:22:13vstinnersetfiles: + reuse_argtuples_3.patch

messages: + msg264026
2016-04-22 13:54:27vstinnersetfiles: + bench_builtins.py

messages: + msg264012
2016-04-22 09:38:05vstinnersetnosy: + vstinner
messages: + msg263991
2015-04-14 09:28:12serhiy.storchakasetpriority: normal -> low

messages: + msg240862
2015-04-14 06:13:47rhettingersetassignee: rhettinger ->
messages: + msg240845
2015-04-13 18:15:11pitrousetmessages: + msg240693
2015-02-26 18:21:06serhiy.storchakasetmessages: + msg236696
2015-02-26 17:45:35pitrousetmessages: + msg236692
2015-02-26 17:45:18pitrousetnosy: + pitrou
messages: + msg236691
2015-02-26 10:44:08serhiy.storchakasetfiles: + reuse_argtuples_2.patch

messages: + msg236661
2015-02-26 08:58:26serhiy.storchakasetmessages: + msg236656
2015-02-26 08:50:57rhettingersetassignee: rhettinger

messages: + msg236654
nosy: + rhettinger
2015-02-24 17:41:00scodersetnosy: + scoder
2015-02-24 08:34:06ezio.melottisetnosy: + ezio.melotti
2015-02-24 08:09:04serhiy.storchakasettitle: Tuple creation is two slow -> Tuple creation is too slow
2015-02-24 08:08:32serhiy.storchakacreate