
Tuple creation is too slow #67695

Closed
serhiy-storchaka opened this issue Feb 24, 2015 · 25 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serhiy-storchaka
Member

BPO 23507
Nosy @rhettinger, @pitrou, @scoder, @vstinner, @ezio-melotti, @serhiy-storchaka, @1st1, @jstasiak, @serprex
Files
  • reuse_argtuples.patch
  • reuse_argtuples_2.patch
  • bench_builtins.py
  • reuse_argtuples_3.patch
  • stack_overflow.py
  • bench_builtins.py
  • reuse_argtuples_4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-12-18.13:02:51.224>
    created_at = <Date 2015-02-24.08:08:32.468>
    labels = ['interpreter-core', 'performance']
    title = 'Tuple creation is too slow'
    updated_at = <Date 2016-12-18.13:02:51.222>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2016-12-18.13:02:51.222>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-12-18.13:02:51.224>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2015-02-24.08:08:32.468>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['38223', '38244', '42569', '42571', '45725', '45726', '45953']
    hgrepos = []
    issue_num = 23507
    keywords = ['patch']
    message_count = 25.0
    messages = ['236480', '236654', '236656', '236661', '236691', '236692', '236696', '240693', '240845', '240862', '263991', '264012', '264026', '264061', '282179', '282180', '282186', '282187', '282188', '282190', '282195', '282214', '282217', '282229', '283554']
    nosy_count = 10.0
    nosy_names = ['rhettinger', 'pitrou', 'scoder', 'vstinner', 'ezio.melotti', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'jstasiak', 'Demur Rumed']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue23507'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Member Author

    Currently tuple creation uses free lists, but this optimization is not enough. By reusing a cached tuple for the arguments of repeatedly called functions, the performance of some builtins can be increased significantly.

    $ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(filter(f, s))"
    Unpatched: 1000 loops, best of 3: 773 usec per loop
    Patched:   1000 loops, best of 3: 558 usec per loop
    
    $ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(map(f, s))"
    Unpatched: 1000 loops, best of 3: 689 usec per loop
    Patched:   1000 loops, best of 3: 556 usec per loop
    
    $ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "sorted(s, key=f)"
    Unpatched: 1000 loops, best of 3: 758 usec per loop
    Patched:   1000 loops, best of 3: 550 usec per loop

    The same effect can be achieved for itertools functions.

    I don't propose to commit this complicated patch, but these results can be used as a guide for optimizing tuple creation. It is surprising to me that this patch has any effect at all.
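
    For illustration, here is a minimal C sketch of the caching trick (the helper name and the cached slot are illustrative, not taken from the patch; the real diff from reuse_argtuples_2.patch is quoted later in this thread). The core idea: keep an empty 1-tuple around, fill it, call, and reclaim it only when the refcount proves nobody else kept it:

        #include <Python.h>

        /* Reuse a 1-element argument tuple across calls.  'cached' would be a
         * field in the iterator object holding an empty tuple whose sole
         * reference we own.  Note the Py_REFCNT() == 1 check: without it,
         * reusing a tuple that the callee saved would be incorrect. */
        static PyObject *
        call_with_cached_argtuple(PyObject *func, PyObject *item,
                                  PyObject **cached)
        {
            PyObject *args = *cached;
            *cached = NULL;
            if (args == NULL) {
                args = PyTuple_New(1);
                if (args == NULL)
                    return NULL;
            }
            Py_INCREF(item);
            PyTuple_SET_ITEM(args, 0, item);   /* steals the new reference */
            PyObject *result = PyObject_Call(func, args, NULL);
            if (Py_REFCNT(args) == 1) {
                /* The callee did not keep args alive: empty the tuple and
                 * park it in the cache for the next call. */
                PyTuple_SET_ITEM(args, 0, NULL);
                Py_DECREF(item);
                *cached = args;
            }
            else {
                /* Someone saved a reference to args: it is no longer safe
                 * to reuse, so give the tuple up entirely. */
                Py_DECREF(args);
            }
            return result;
        }

    Note that the cached tuple transiently holds a NULL item slot; this is exactly the kind of fragile invariant that is mentioned later in this thread as a source of segfaults (bpo-26811).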

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Feb 24, 2015
    @serhiy-storchaka serhiy-storchaka changed the title Tuple creation is two slow Tuple creation is too slow Feb 24, 2015
    @rhettinger
    Contributor

    I had done something like this for zip() long ago and had gotten a nice speed boost. I remember looking at whether it should be applied elsewhere but don't remember why I decided against it.

    @rhettinger rhettinger self-assigned this Feb 26, 2015
    @serhiy-storchaka
    Member Author

    Note the check for Py_REFCNT() == 1. Without it the code would be incorrect.

    @serhiy-storchaka
    Member Author

    Fixed bugs found by Victor. Thank you Victor.

    Yes, zip() already uses a similar approach.

    @pitrou
    Member

    pitrou commented Feb 26, 2015

    Maybe we want a facility to create on-stack static-size tuples?

    @pitrou
    Member

    pitrou commented Feb 26, 2015

    How many functions can benefit from this approach, though?

    @serhiy-storchaka
    Member Author

    Maybe we want a facility to create on-stack static-size tuples?

    There is no guarantee that the called function doesn't save a reference to args.

    How many functions can benefit from this approach, though?

    From hand-written caching? Any function that repeatedly calls another function many times. In addition to sorted()/list.sort() with the key argument, filter() and map(), these are min() and max() with the key argument, and some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()). Maybe also some functions that are thin wrappers around a special method (e.g. round()).

    But it is more interesting to investigate why PyTuple_New() is so slow in comparison with hand-written caching. If it can be sped up, then much more code can benefit from this.

    @pitrou
    Member

    pitrou commented Apr 13, 2015

    By the way, is PyTuple_New() really the culprit here? Or is it the va_list handling in PyObject_CallFunctionObjArgs()?
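
    For context, here is a simplified sketch of what PyObject_CallFunctionObjArgs() has to do on every call (a teaching reduction, not the actual CPython source): the varargs are walked once to count them and again to copy them into a freshly allocated tuple, so every call pays for both the va_list handling and PyTuple_New():

        #include <Python.h>
        #include <stdarg.h>

        static PyObject *
        call_objargs_sketch(PyObject *func, ...)   /* NULL-terminated args */
        {
            va_list va, count_va;
            Py_ssize_t n = 0;

            va_start(va, func);
            va_copy(count_va, va);
            while (va_arg(count_va, PyObject *) != NULL)
                n++;                                /* first pass: count */
            va_end(count_va);

            PyObject *args = PyTuple_New(n);        /* heap allocation */
            if (args == NULL) {
                va_end(va);
                return NULL;
            }
            for (Py_ssize_t i = 0; i < n; i++) {
                PyObject *arg = va_arg(va, PyObject *);
                Py_INCREF(arg);
                PyTuple_SET_ITEM(args, i, arg);     /* second pass: copy */
            }
            va_end(va);

            PyObject *result = PyObject_Call(func, args, NULL);
            Py_DECREF(args);                        /* tuple thrown away */
            return result;
        }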

    @rhettinger
    Contributor

    Unassigning because I don't have an interest in going further down this path. I used this technique with zip() because tuple reuse nicely benefited the common case of "for a,b,c in zip(alist, blist, clist): ...".

    On another thread, I did see mention of removing the memset() from tuple creation and that seems like a promising direction.

    @rhettinger rhettinger removed their assignment Apr 14, 2015
    @serhiy-storchaka
    Member Author

    map() doesn't use PyObject_CallFunctionObjArgs(). Comparing the results for map() with the results for filter() and list.sort(), it seems that PyTuple_New() is responsible for 24% and PyObject_CallFunctionObjArgs() adds another 14%.

    @vstinner
    Member

    Hi, I started to work on a new top-down approach that completely avoids temporary tuples/dicts: "FASTCALL", issue bpo-26814. As you may expect, the patch is much larger than Serhiy's patches attached to this issue, but IMHO the final code is much simpler: it doesn't have to use complex tricks on the reference count or set tuple items to NULL (which led to segfaults like issue bpo-26811).

    Antoine Pitrou: "Maybe we want a facility to create on-stack static-size tuples?"

    My current implementation of FASTCALL uses a buffer allocated on the stack to handle calls with up to 10 parameters (10 positional parameters, or 5 keyword parameters, since each keyword uses 2 PyObject* slots).
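
    Roughly, the caller side looks like this (a sketch assuming a fastcall-style entry point that takes a C array plus a count, as the private _PyObject_FastCall() does in 3.6; the function name call_two_args_fast and the constant are illustrative):

        #include <Python.h>

        #define SMALL_STACK_SIZE 10   /* the limit quoted above; larger
                                         calls fall back to the heap */

        static PyObject *
        call_two_args_fast(PyObject *func, PyObject *a, PyObject *b)
        {
            /* The argument "tuple" lives on the C stack: no PyTuple_New(),
             * no refcount tricks, nothing for the callee to accidentally
             * keep alive. */
            PyObject *small_stack[SMALL_STACK_SIZE];
            small_stack[0] = a;
            small_stack[1] = b;
            return _PyObject_FastCall(func, small_stack, 2);
        }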

    Antoine Pitrou: "How many functions can benefit from this approach, though?"

    In my experimental FASTCALL branch, slot wrappers, the PyObject_Call*() functions (except PyObject_Call() itself!), ceval.c, etc. benefit from FASTCALL. The question is not really how many functions benefit from FASTCALL, but rather how many changes are worth making to avoid temporary tuples/dicts. For example, right now I decided not to add a FASTCALL flavor of tp_new and tp_init (instantiating a class or creating a new type) because it becomes much more complex when you have to handle inheritance.

    Maybe my FASTCALL requires too many changes and is overkill. Maybe we need to find a compromise between FASTCALL (issue bpo-26814) and Serhiy's changes, which are limited to a few functions.

    Today it's too early to decide, but I'm having fun with my FASTCALL experiment ;-) Slot wrappers are 40% faster, getting a namedtuple attribute is 25% faster (whereas Serhiy already optimized this specific case!), etc.

    @vstinner
    Member

    vstinner commented Apr 22, 2016

    msg264009, Serhiy Storchaka: "Could you compare filter(), map() and sorted() performance with your patch and with the bpo-23507 patch?"

    Here are the results of bench_builtins.py. The performance looks to be about the same, though I expect better results with less trivial callbacks (where fastcall would avoid the creation of other temporary tuples).

    ------------------------------------------------+-------------+--------------
    Tests                                           |   argtuples |      fastcall
    ------------------------------------------------+-------------+--------------
    filter(lambda x: x, range(1000))                | 80.2 us (*) | 72.7 us (-9%)
    map(lambda x: x, range(1000))                   | 79.5 us (*) |       79.9 us
    map(lambda x, y: x+y, range(1000), range(1000)) |  111 us (*) |        109 us
    sorted(list, key=lambda x: x)                   | 82.6 us (*) | 75.7 us (-8%)
    sorted(list)                                    | 15.7 us (*) |       15.6 us
    ------------------------------------------------+-------------+--------------
    Total                                           |  369 us (*) |        353 us
    ------------------------------------------------+-------------+--------------
    

    Comparison to the original Python 3.6:

    ------------------------------------------------+-------------+----------------+---------------
    Tests                                           |    original |      argtuples |       fastcall
    ------------------------------------------------+-------------+----------------+---------------
    filter(lambda x: x, range(1000))                |  109 us (*) | 80.2 us (-27%) | 72.7 us (-33%)
    map(lambda x: x, range(1000))                   | 96.9 us (*) | 79.5 us (-18%) | 79.9 us (-18%)
    map(lambda x, y: x+y, range(1000), range(1000)) |  126 us (*) |  111 us (-12%) |  109 us (-13%)
    sorted(list, key=lambda x: x)                   |  114 us (*) | 82.6 us (-28%) | 75.7 us (-34%)
    sorted(list)                                    | 16.1 us (*) |        15.7 us |        15.6 us
    ------------------------------------------------+-------------+----------------+---------------
    Total                                           |  463 us (*) |  369 us (-20%) |  353 us (-24%)
    ------------------------------------------------+-------------+----------------+---------------
    

    The difference lies more in the changes to bltinmodule.c. Example with filter_next():

    Using fastcall:

    -            good = PyObject_CallFunctionObjArgs(lz->func, item, NULL);
    +            good = PyObject_CallArg1(lz->func, item);

    reuse_argtuples_2.patch:

    -            good = PyObject_CallFunctionObjArgs(lz->func,
    -                                                item, NULL);
    +            PyObject *argtuple = lz->argtuple;
    +            lz->argtuple = NULL;
    +            if (argtuple == NULL) {
    +                argtuple = PyTuple_New(1);
    +                if (argtuple == NULL) {
    +                    Py_DECREF(item);
    +                    return NULL;
    +                }
    +                Py_INCREF(argtuple);
    +            }
    +            assert(Py_REFCNT(argtuple) == 2);
    +            assert(Py_SIZE(argtuple) == 1);
    +            assert(PyTuple_GET_ITEM(argtuple, 0) == NULL);
    +            PyTuple_SET_ITEM(argtuple, 0, item);
    +            good = PyObject_Call(lz->func, argtuple, NULL);
    +            if (Py_REFCNT(argtuple) == 2) {
    +                PyTuple_SET_ITEM(argtuple, 0, NULL);
    +                if (lz->argtuple == NULL)
    +                    lz->argtuple = argtuple;
    +                else {
    +                    Py_DECREF(argtuple);
    +                    Py_DECREF(argtuple);
    +                }
    +            }
    +            else {
    +                Py_INCREF(item);
    +                Py_DECREF(argtuple);
    +                Py_DECREF(argtuple);
    +            }

    (reuse_argtuples_2.patch requires a few other changes related to filter_next().)

    @vstinner
    Member

    reuse_argtuples_3.patch: rebased reuse_argtuples_2.patch.

    @serprex
    Mannequin

    serprex mannequin commented Apr 23, 2016

    This code could be made to look a lot less hackish if the potential tuple reuse were abstracted behind a PyTuple_UniqueOrNew(PyTuple) helper, which returns its argument if REFCNT == 1 and otherwise allocates a tuple of the same size and returns that, decrefing the original when REFCNT != 1.
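
    In code, the proposed helper would look roughly like this (PyTuple_UniqueOrNew() is hypothetical and was never added to CPython; this merely restates the comment above):

        #include <Python.h>

        static PyObject *
        PyTuple_UniqueOrNew(PyObject *tuple)
        {
            if (Py_REFCNT(tuple) == 1) {
                /* Sole owner: the caller may overwrite the items in place. */
                return tuple;
            }
            /* Shared: drop our reference and hand back a fresh tuple of the
             * same size (NULL on allocation failure). */
            Py_ssize_t size = PyTuple_GET_SIZE(tuple);
            Py_DECREF(tuple);
            return PyTuple_New(size);
        }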

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 1, 2016

    New changeset b9c9691c72c5 by Victor Stinner in branch 'default':
    Replace PyObject_CallFunctionObjArgs() with fastcall
    https://hg.python.org/cpython/rev/b9c9691c72c5

    @serhiy-storchaka
    Member Author

    I think it would be better to optimize PyObject_CallFunctionObjArgs().

    @serhiy-storchaka
    Member Author

    Here is an example that shows increased stack consumption.

    With the old code it counts to 74700; with the patch it counts only to 65400.

    Please revert your changes.

    @vstinner
    Member

    vstinner commented Dec 1, 2016

    I rewrote bench_builtins.py to use my new perf module.

    Python 3.7 is between 1.27x and 1.42x faster than Python 3.5, but sorted(list) is slower. I don't know why sorted(list) is slower: it doesn't use a key function, and so should not be impacted by the FASTCALL changes made since Python 3.6.

    Benchmark results of Python 3.7 (include latest FASTCALL patches) compared to Python 3.5 (no FASTCALL):

    filter(lambda x: x, range(1000)): Median +- std dev: [3.5] 126 us +- 7 us -> [3.7] 94.6 us +- 4.8 us: 1.34x faster (-25%)
    map(lambda x, y: x+y, range(1000), range(1000)): Median +- std dev: [3.5] 163 us +- 10 us -> [3.7] 115 us +- 7 us: 1.42x faster (-30%)
    map(lambda x: x, range(1000)): Median +- std dev: [3.5] 115 us +- 6 us -> [3.7] 90.9 us +- 5.4 us: 1.27x faster (-21%)
    sorted(list): Median +- std dev: [3.5] 17.5 us +- 1.0 us -> [3.7] 19.7 us +- 1.1 us: 1.12x slower (+12%)
    sorted(list, key=lambda x: x): Median +- std dev: [3.5] 122 us +- 10 us -> [3.7] 91.8 us +- 5.7 us: 1.33x faster (-25%)

    Note: you need the development version of perf to run the script (future version 0.9.2).

    @vstinner
    Member

    vstinner commented Dec 1, 2016

    Serhiy: "I don't propose to commit this complicated patch, but these results can be used as a guide to the optimization of tuple creating. It is surprising to me that this patch has any effect at all."

    A new "fast call" calling convention was added to Python 3.6 to avoid the creation of temporary tuples to pass function arguments. Benchmarks confirm that Python 3.7 is faster than Python 3.5 of functions that Serhiy suggested to optimize. I confirm that I was also surprised that fast call has a significant effect on performance!

    Fast calls don't require complex code like reuse_argtuples_3.patch, which requires carefully checking the reference counter. In the past, a similar optimization on property_descr_get() introduced complex bugs: see issue bpo-24276 and issue bpo-26811. Fast calls don't have such issues.

    The implementation of fast calls is currently incomplete: the tp_new, tp_init and tp_call slots still require a tuple and a dict for positional and keyword arguments. I'm working on an enhancement for Python 3.7 to support fast calls for these slots as well. In the meantime, callbacks using tp_call don't yet benefit from fast calls.

    That's why I had to keep the tuple/refcount optimization in property_descr_get(), so as not to introduce a performance regression.

    I consider that this issue can now be closed as "fixed".

    @vstinner vstinner closed this as completed Dec 1, 2016
    @serhiy-storchaka
    Member Author

    Victor, your changes introduced a regression.

    @vstinner
    Member

    vstinner commented Dec 1, 2016

    Serhiy Storchaka: "Victor, your changes introduced a regression."

    I suppose that you are referring to sorted(list), which seems to be slower. Do you have a specific change in mind? As I wrote, it doesn't use a key function and so should not be impacted by my work on fast calls. I opened a new issue, bpo-28852, to track this performance regression.

    @vstinner
    Member

    vstinner commented Dec 2, 2016

    I opened a new issue bpo-28852 to track this performance regression.

    So can I close this issue again?

    @vstinner vstinner closed this as completed Dec 2, 2016
    @serhiy-storchaka
    Member Author

    No, I'm referring to the crash. Code that worked before your changes now crashes. Please revert your changes and find a way to optimize function calls without increasing stack consumption.

    @vstinner
    Member

    vstinner commented Dec 2, 2016

    haypo: "status: open -> closed"

    Oops, it's a mistake, sorry. I only wanted to ask the question, not to close the issue.

    Serhiy Storchaka: "No, I'm referring to the crashing. The code that worked before your changes now are crashing."

    Sorry, I didn't notice that you had attached a script! I opened issue bpo-28858.

    @serhiy-storchaka
    Member Author

    Here is a rebased patch (just for the history). Caching the argtuple is not faster (or is even slower) than using fast call.

    The regression in stack consumption is fixed in bpo-28858.

    Thanks Victor!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022