classification
Title: Tuple creation is too slow
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Demur Rumed, ezio.melotti, jstasiak, pitrou, python-dev, rhettinger, scoder, serhiy.storchaka, vstinner, yselivanov
Priority: low Keywords: patch

Created on 2015-02-24 08:08 by serhiy.storchaka, last changed 2016-12-18 13:02 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
reuse_argtuples.patch serhiy.storchaka, 2015-02-24 08:08 review
reuse_argtuples_2.patch serhiy.storchaka, 2015-02-26 10:44 review
bench_builtins.py vstinner, 2016-04-22 13:54
reuse_argtuples_3.patch vstinner, 2016-04-22 16:22 review
stack_overflow.py serhiy.storchaka, 2016-12-01 15:38
bench_builtins.py vstinner, 2016-12-01 15:39
reuse_argtuples_4.patch serhiy.storchaka, 2016-12-18 13:02 review
Messages (25)
msg236480 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-24 08:08
Currently tuple creation uses free lists, but this optimization is not enough. When a cached tuple is reused for the arguments of repeatedly called functions, the performance of some builtins can be significantly increased.

$ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(filter(f, s))"
Unpatched: 1000 loops, best of 3: 773 usec per loop
Patched:   1000 loops, best of 3: 558 usec per loop

$ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "list(map(f, s))"
Unpatched: 1000 loops, best of 3: 689 usec per loop
Patched:   1000 loops, best of 3: 556 usec per loop

$ ./python -m timeit -s "f = lambda x: x" -s "s = list(range(1000))" -- "sorted(s, key=f)"
Unpatched: 1000 loops, best of 3: 758 usec per loop
Patched:   1000 loops, best of 3: 550 usec per loop

The same effect can be achieved for itertools functions.

I don't propose to commit this complicated patch, but these results can be used as a guide for optimizing tuple creation. It is surprising to me that this patch has any effect at all.
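[Editor's note: the free-list mechanism mentioned at the top of this message can be sketched in plain C. This is a toy illustration, not the real code from Objects/tupleobject.c — the type and function names are invented. The idea: recently freed objects of each small size are kept on a per-size singly linked list, so the next allocation of that size skips malloc().]

```c
#include <stdlib.h>

#define MAX_SAVE_SIZE 20   /* small-size cutoff, as in CPython's tuple free lists */

/* Toy tuple: a header plus a flexible array of item pointers. */
typedef struct toy_tuple {
    size_t size;
    struct toy_tuple *next_free;   /* free-list link while the object is dead */
    void *items[];
} ToyTuple;

/* One free list per small size; empty lists are NULL. */
static ToyTuple *free_list[MAX_SAVE_SIZE];

static ToyTuple *toy_tuple_new(size_t n) {
    ToyTuple *t;
    if (n < MAX_SAVE_SIZE && free_list[n] != NULL) {
        /* Fast path: pop a cached object instead of calling malloc(). */
        t = free_list[n];
        free_list[n] = t->next_free;
    }
    else {
        t = malloc(sizeof(ToyTuple) + n * sizeof(void *));
    }
    t->size = n;
    t->next_free = NULL;
    return t;
}

static void toy_tuple_free(ToyTuple *t) {
    if (t->size < MAX_SAVE_SIZE) {
        /* Park the object on the free list for its size. */
        t->next_free = free_list[t->size];
        free_list[t->size] = t;
    }
    else {
        free(t);
    }
}
```

Even with this cache, every call still pays for popping, initializing, and re-parking an object, which is why reusing one dedicated argument tuple across calls can still win.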
msg236654 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-02-26 08:50
I had done something like this for zip() long ago and had gotten a nice speed boost.  I remember looking at whether it should be applied elsewhere but don't remember why I decided against it.
msg236656 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-26 08:58
Note the check for Py_REFCNT() == 1. Without it the code would be incorrect.
msg236661 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-26 10:44
Fixed bugs found by Victor. Thank you Victor.

Yes, zip() already uses a similar approach.
msg236691 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-02-26 17:45
Maybe we want a facility to create on-stack static-size tuples?
msg236692 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-02-26 17:45
How many functions can benefit from this approach, though?
msg236696 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-26 18:21
> Maybe we want a facility to create on-stack static-size tuples?

There is no guarantee that the called function doesn't save a reference to args.

> How many functions can benefit from this approach, though?

From hand-written caching? Any function that repeatedly calls other functions many times. In addition to sorted()/list.sort() with the key argument, filter() and map(), these are min() and max() with the key argument, and some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()). Maybe also some functions that are thin wrappers around a special method (e.g. round()).

But it is more interesting to investigate why PyTuple_New() is so slow in comparison with hand-written caching. If it can be sped up, then much more code can benefit from this.
msg240693 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-13 18:15
By the way, is PyTuple_New() really the culprit here? Or is it the valist handling in PyObject_CallFunctionObjArgs()?
msg240845 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-04-14 06:13
Unassigning because I don't have an interest in going further down this path.  I used this technique with zip() because tuple reuse nicely benefited the common case of "for a,b,c in zip(alist, blist, clist): ...".  

On another thread, I did see mention of removing the memset() from tuple creation and that seems like a promising direction.
msg240862 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-04-14 09:28
map() doesn't use PyObject_CallFunctionObjArgs(). Comparing the results for map() with those for filter() and list.sort(), it seems that PyTuple_New() is responsible for 24% and PyObject_CallFunctionObjArgs() adds another 14%.
msg263991 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 09:38
Hi, I started to work on a new top-down approach that completely avoids the temporary tuple/dict: "FASTCALL", issue #26814. As you may expect, the patch is much larger than Serhiy's patches attached to this issue, but IMHO the final code is much simpler: it doesn't have to use complex tricks on the reference count or set tuple items to NULL (which leads to segfaults like issue #26811).


Antoine Pitrou: "Maybe we want a facility to create on-stack static-size tuples?"

My current implementation of FASTCALL uses a buffer allocated on the stack to handle calls with up to 10 parameters (10 positional parameters or 5 keyword parameters, since a keyword uses 2 PyObject*).
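[Editor's note: the stack-buffer idea can be sketched in plain C. All names here are invented for illustration; the real FASTCALL code operates on PyObject* vectors. Small argument vectors live in a fixed array on the C stack, and only unusually large calls fall back to the heap.]

```c
#include <stdlib.h>
#include <string.h>

#define SMALL_STACK 10   /* matches the 10-slot buffer described above */

/* Stand-in for the callee; a real fast call would receive PyObject **args. */
static long sum_args(const long *args, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += args[i];
    return total;
}

static long call_with_small_buffer(const long *src, size_t n) {
    long small[SMALL_STACK];
    /* Common case: no heap allocation at all for <= SMALL_STACK arguments. */
    long *vec = (n <= SMALL_STACK) ? small : malloc(n * sizeof(long));
    memcpy(vec, src, n * sizeof(long));
    long result = sum_args(vec, n);
    if (vec != small)
        free(vec);
    return result;
}
```

The design choice is the usual small-buffer trade-off: the common case (few arguments) pays nothing for allocation, while the rare large call still works correctly via malloc().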


Antoine Pitrou: "How many functions can benefit from this approach, though?"

In my experimental FASTCALL branch, slot wrappers, PyObject_Call*() functions (except PyObject_Call()!), ceval.c, etc. benefit from FASTCALL. The question is not really how much code benefits from FASTCALL, but rather how many changes are worth making to avoid the temporary tuple/dict. For example, right now I decided not to add a FASTCALL flavor of tp_new and tp_init (instantiating a class or creating a new type) because it becomes much more complex when you have to handle inheritance.

Maybe my FASTCALL approach requires too many changes and is overkill. Maybe we need to find a compromise between FASTCALL (issue #26814) and Serhiy's changes, which are limited to a few functions.

Today, it's too early to decide, but I'm having fun with my FASTCALL experiment ;-) Slot wrappers are 40% faster, getting a namedtuple attribute is 25% faster (even though Serhiy already optimized this specific case!), etc.
msg264012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 13:54
msg264009: Serhiy Storchaka "Could you compare filter(), map() and sorted() performance with your patch and with issue23507 patch?"

Here are the results of bench_builtins.py. The performance looks to be the same, though I expect better results with less trivial callbacks (where fastcall would avoid the creation of other temporary tuples).

------------------------------------------------+-------------+--------------
Tests                                           |   argtuples |      fastcall
------------------------------------------------+-------------+--------------
filter(lambda x: x, range(1000))                | 80.2 us (*) | 72.7 us (-9%)
map(lambda x: x, range(1000))                   | 79.5 us (*) |       79.9 us
map(lambda x, y: x+y, range(1000), range(1000)) |  111 us (*) |        109 us
sorted(list, key=lambda x: x)                   | 82.6 us (*) | 75.7 us (-8%)
sorted(list)                                    | 15.7 us (*) |       15.6 us
------------------------------------------------+-------------+--------------
Total                                           |  369 us (*) |        353 us
------------------------------------------------+-------------+--------------

Comparison to the original Python 3.6:

------------------------------------------------+-------------+----------------+---------------
Tests                                           |    original |      argtuples |       fastcall
------------------------------------------------+-------------+----------------+---------------
filter(lambda x: x, range(1000))                |  109 us (*) | 80.2 us (-27%) | 72.7 us (-33%)
map(lambda x: x, range(1000))                   | 96.9 us (*) | 79.5 us (-18%) | 79.9 us (-18%)
map(lambda x, y: x+y, range(1000), range(1000)) |  126 us (*) |  111 us (-12%) |  109 us (-13%)
sorted(list, key=lambda x: x)                   |  114 us (*) | 82.6 us (-28%) | 75.7 us (-34%)
sorted(list)                                    | 16.1 us (*) |        15.7 us |        15.6 us
------------------------------------------------+-------------+----------------+---------------
Total                                           |  463 us (*) |  369 us (-20%) |  353 us (-24%)
------------------------------------------------+-------------+----------------+---------------


The difference lies more in the changes to bltinmodule.c. Example with filter_next():

Using fastcall:

-            good = PyObject_CallFunctionObjArgs(lz->func, item, NULL);
+            good = PyObject_CallArg1(lz->func, item);

reuse_argtuples_2.patch:

-            good = PyObject_CallFunctionObjArgs(lz->func,
-                                                item, NULL);
+            PyObject *argtuple = lz->argtuple;
+            lz->argtuple = NULL;
+            if (argtuple == NULL) {
+                argtuple = PyTuple_New(1);
+                if (argtuple == NULL) {
+                    Py_DECREF(item);
+                    return NULL;
+                }
+                Py_INCREF(argtuple);
+            }
+            assert(Py_REFCNT(argtuple) == 2);
+            assert(Py_SIZE(argtuple) == 1);
+            assert(PyTuple_GET_ITEM(argtuple, 0) == NULL);
+            PyTuple_SET_ITEM(argtuple, 0, item);
+            good = PyObject_Call(lz->func, argtuple, NULL);
+            if (Py_REFCNT(argtuple) == 2) {
+                PyTuple_SET_ITEM(argtuple, 0, NULL);
+                if (lz->argtuple == NULL)
+                    lz->argtuple = argtuple;
+                else {
+                    Py_DECREF(argtuple);
+                    Py_DECREF(argtuple);
+                }
+            }
+            else {
+                Py_INCREF(item);
+                Py_DECREF(argtuple);
+                Py_DECREF(argtuple);
+            }

(reuse_argtuples_2.patch requires a few other changes related to filter_next().)
msg264026 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 16:22
reuse_argtuples_3.patch: rebased reuse_argtuples_2.patch.
msg264061 - (view) Author: Demur Rumed (Demur Rumed) * Date: 2016-04-23 12:11
This code could be made to look a lot less hackish if the potential tuple reuse were abstracted behind a PyTuple_UniqueOrNew(PyTuple) helper, which returns its argument if REFCNT == 1 and otherwise allocates a tuple of the same size and returns that. It decrefs the argument if REFCNT != 1.
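[Editor's note: on a toy refcounted object, the suggested helper could look like the sketch below. This is plain C with invented names — a real implementation would use Py_REFCNT(), PyTuple_New() and Py_DECREF() — but it shows the reuse rule: the object may only be recycled when the caller is its sole owner.]

```c
#include <stdlib.h>

/* Toy stand-in for a CPython tuple: a refcounted box with one slot. */
typedef struct {
    int refcnt;
    void *slot;
} ArgBox;

static ArgBox *box_new(void) {
    ArgBox *b = malloc(sizeof(ArgBox));
    b->refcnt = 1;
    b->slot = NULL;
    return b;
}

static void box_decref(ArgBox *b) {
    if (--b->refcnt == 0)
        free(b);
}

/* The suggested "unique or new" pattern: return the argument itself when
   the caller is the sole owner (safe to scribble over its slot), otherwise
   drop our reference to the shared box and hand back a fresh one. */
static ArgBox *box_unique_or_new(ArgBox *b) {
    if (b->refcnt == 1)
        return b;
    box_decref(b);   /* someone else (e.g. the callee) kept it alive */
    return box_new();
}
```

This concentrates the refcount trickery in one audited helper, instead of repeating the INCREF/DECREF dance in every builtin as in reuse_argtuples_2.patch.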
msg282179 - (view) Author: Roundup Robot (python-dev) Date: 2016-12-01 14:09
New changeset b9c9691c72c5 by Victor Stinner in branch 'default':
Replace PyObject_CallFunctionObjArgs() with fastcall
https://hg.python.org/cpython/rev/b9c9691c72c5
msg282180 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 14:12
I think it would be better to optimize PyObject_CallFunctionObjArgs().
msg282186 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 15:38
Here is an example that shows increased stack consumption.

With the old code it counts to 74700; with the patch it counts only to 65400.

Please revert your changes.
msg282187 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-01 15:39
I rewrote bench_builtins.py to use my new perf module.

Python 3.7 is between 1.27x and 1.42x faster than Python 3.5, but sorted(list) is slower. I don't know why sorted(list) is slower: it doesn't use a key function, and so should not be impacted by the FASTCALL changes made since Python 3.6.

Benchmark results of Python 3.7 (include latest FASTCALL patches) compared to Python 3.5 (no FASTCALL):

filter(lambda x: x, range(1000)): Median +- std dev: [3.5] 126 us +- 7 us -> [3.7] 94.6 us +- 4.8 us: 1.34x faster (-25%)
map(lambda x, y: x+y, range(1000), range(1000)): Median +- std dev: [3.5] 163 us +- 10 us -> [3.7] 115 us +- 7 us: 1.42x faster (-30%)
map(lambda x: x, range(1000)): Median +- std dev: [3.5] 115 us +- 6 us -> [3.7] 90.9 us +- 5.4 us: 1.27x faster (-21%)
sorted(list): Median +- std dev: [3.5] 17.5 us +- 1.0 us -> [3.7] 19.7 us +- 1.1 us: 1.12x slower (+12%)
sorted(list, key=lambda x: x): Median +- std dev: [3.5] 122 us +- 10 us -> [3.7] 91.8 us +- 5.7 us: 1.33x faster (-25%)

Note: you need the development version of perf to run the script (future version 0.9.2).
msg282188 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-01 15:49
Serhiy: "I don't propose to commit this complicated patch, but these results can be used as a guide to the optimization of tuple creating. It is surprising to me that this patch has any effect at all."

A new "fast call" calling convention was added to Python 3.6 to avoid the creation of temporary tuples to pass function arguments. Benchmarks confirm that Python 3.7 is faster than Python 3.5 of functions that Serhiy suggested to optimize. I confirm that I was also surprised that fast call has a significant effect on performance!

Fast calls don't require complex code like reuse_argtuples_3.patch, which requires carefully checking the reference count. In the past, a similar optimization in property_descr_get() introduced subtle bugs: see issue #24276 and issue #26811. Fast calls don't have such issues.

The implementation of fast calls is currently incomplete: the tp_new, tp_init and tp_call slots still require a tuple and a dict for positional and keyword arguments. I'm working on an enhancement for Python 3.7 to also support fast calls for these slots. In the meantime, callbacks invoked through tp_call don't yet benefit from fast calls.

That's why I had to keep the tuple/refcount optimization in property_descr_get(), to not introduce a performance regression.

I consider that this issue can now be closed as "fixed".
msg282190 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-01 16:02
Victor, your changes introduced a regression.
msg282195 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-01 16:23
Serhiy Storchaka: "Victor, your changes introduced a regression."

I suppose that you are referring to sorted(list), which seems to be slower. Do you have a specific change in mind? As I wrote, it doesn't use a key function, and so should not be impacted by my work on fast calls. I opened a new issue #28852 to track this performance regression.
msg282214 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-02 02:02
>  I opened a new issue #28852 to track this performance regression.

So can I close again this issue?
msg282217 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-02 05:32
No, I'm referring to the crash. Code that worked before your changes now crashes. Please revert your changes and find a way to optimize function calling without increasing stack consumption.
msg282229 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-02 07:54
haypo: "status: open -> closed"

Oops, it's a mistake, sorry. I only wanted to ask the question, not to close the issue.

Serhiy Storchaka: "No, I'm referring to the crashing. The code that worked before your changes now are crashing."

Sorry, I didn't notice that you attached a script! I opened the issue #28858.
msg283554 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-18 13:02
Here is the rebased patch (just for the record). Caching the argtuple is not faster (or is even slower) than using fast call.

The regression in stack consumption was fixed in issue #28858.

Thanks Victor!
History
Date User Action Args
2016-12-18 13:02:51serhiy.storchakasetstatus: open -> closed
files: + reuse_argtuples_4.patch
messages: + msg283554

resolution: fixed
stage: resolved
2016-12-02 07:54:42vstinnersetmessages: + msg282229
2016-12-02 05:32:37serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg282217
2016-12-02 02:02:50vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg282214
2016-12-01 16:23:31vstinnersetmessages: + msg282195
2016-12-01 16:02:20serhiy.storchakasetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg282190
2016-12-01 15:49:13vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg282188
2016-12-01 15:39:41vstinnersetfiles: + bench_builtins.py

messages: + msg282187
2016-12-01 15:38:27serhiy.storchakasetfiles: + stack_overflow.py

messages: + msg282186
2016-12-01 14:12:01serhiy.storchakasetmessages: + msg282180
2016-12-01 14:09:50python-devsetnosy: + python-dev
messages: + msg282179
2016-05-09 22:56:48jstasiaksetnosy: + jstasiak
2016-04-23 12:11:11Demur Rumedsetnosy: + Demur Rumed
messages: + msg264061
2016-04-22 20:41:19yselivanovsetnosy: + yselivanov
2016-04-22 16:22:13vstinnersetfiles: + reuse_argtuples_3.patch

messages: + msg264026
2016-04-22 13:54:27vstinnersetfiles: + bench_builtins.py

messages: + msg264012
2016-04-22 09:38:05vstinnersetnosy: + vstinner
messages: + msg263991
2015-04-14 09:28:12serhiy.storchakasetpriority: normal -> low

messages: + msg240862
2015-04-14 06:13:47rhettingersetassignee: rhettinger ->
messages: + msg240845
2015-04-13 18:15:11pitrousetmessages: + msg240693
2015-02-26 18:21:06serhiy.storchakasetmessages: + msg236696
2015-02-26 17:45:35pitrousetmessages: + msg236692
2015-02-26 17:45:18pitrousetnosy: + pitrou
messages: + msg236691
2015-02-26 10:44:08serhiy.storchakasetfiles: + reuse_argtuples_2.patch

messages: + msg236661
2015-02-26 08:58:26serhiy.storchakasetmessages: + msg236656
2015-02-26 08:50:57rhettingersetassignee: rhettinger

messages: + msg236654
nosy: + rhettinger
2015-02-24 17:41:00scodersetnosy: + scoder
2015-02-24 08:34:06ezio.melottisetnosy: + ezio.melotti
2015-02-24 08:09:04serhiy.storchakasettitle: Tuple creation is two slow -> Tuple creation is too slow
2015-02-24 08:08:32serhiy.storchakacreate