This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: add internal API function to create tuple without items array initialization
Type: performance Stage: resolved
Components: Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: josh.r, rhettinger, serhiy.storchaka, sir-sigurd, vstinner
Priority: normal Keywords: patch

Created on 2019-02-19 08:02 by sir-sigurd, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name  Uploaded  Description
build_tuple_bench.py sir-sigurd, 2019-02-27 08:14
list_as_tuple_bench.py sir-sigurd, 2019-02-27 08:14
Pull Requests
URL  Status  Linked
PR 11927 closed sir-sigurd, 2019-02-19 08:05
PR 11954 merged sir-sigurd, 2019-02-20 11:30
PR 12032 merged sir-sigurd, 2019-02-25 17:18
PR 12052 merged sir-sigurd, 2019-02-26 10:14
PR 15670 merged ZackerySpytz, 2019-09-03 22:37
Messages (25)
msg335897 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-02-19 08:02
PyTuple_New() fills the items array with NULLs so that Py_DECREF() is safe to use even when the array is not yet fully populated with real items.
There are multiple cases where this initialization step can be avoided to improve performance. For example, it gives this speed-up for PyList_AsTuple():

Before:
$ python -m perf timeit -s "l = [None] * 10**6" "tuple(l)"
.....................
Mean +- std dev: 4.43 ms +- 0.01 ms

After:
$ python -m perf timeit -s "l = [None] * 10**6" "tuple(l)"
.....................
Mean +- std dev: 4.11 ms +- 0.03 ms
msg335923 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-19 10:44
Sergey Fedoseev wrote similar change for list: bpo-34151.
msg336043 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-02-20 06:47
Zeroing memory is usually not expensive relative to the cost of filling it in.  Also, the timing loop is unrealistic.  We mostly care about small tuples.  


For long term maintenance of the project, I recommend against filling the code with these unsafe variants, which will segfault whenever someone follows the normal practice of decreffing when an error is encountered.  This would be yet another special case a person would have to learn in order to do maintenance or code review for CPython.

In general, there are a lot of small optimizations to be had if we were to forgo the principles of loose coupling, defensive programming, and ease of maintenance.  Once in a while, we find one that actually is worth it, but usually it is not a best practice to sprinkle these special cases all over the code (see McConnell's Code Complete for more discussion of the long term costs of this kind of coding).
msg336065 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-20 10:53
Raymond:
> Also, the timing loop is unrealistic.  We mostly care about small tuples.

Sergey: can you please run benchmarks on small tuples? For example: 1, 5, 10, 20, and 100 items. And maybe not only tuple(list) — try some other operations as well?
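The numbers requested above were gathered with the perf module (as in msg335897); as a self-contained sketch, the stdlib timeit module gives rougher but comparable figures. The helper name and its defaults below are illustrative, not from the issue:

```python
import timeit

# Illustrative stand-in for the perf runs requested above: time
# tuple(list) for the suggested small sizes with the stdlib timeit
# module (perf gives more stable numbers, but is a separate install).
def bench_tuple_from_list(sizes=(1, 5, 10, 20, 100), number=100_000):
    results = {}
    for n in sizes:
        setup = f"l = [None] * {n}; tuple_ = tuple"
        total = timeit.timeit("tuple_(l)", setup=setup, number=number)
        results[n] = total / number  # mean seconds per call
    return results
```

Comparing the per-call means from two interpreter builds approximates what `perf timeit --compare-to` reports, without perf's calibration and statistics.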
msg336074 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-20 11:37
I prefer the PyTuple_FromArray() API idea. It's safer by design than your "list to tuple" function, which steals references to the list items when refcnt(list) == 1.
msg336142 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-02-20 19:50
Here are benchmarks for the PyTuple_FromArray() PR:

$ python -m perf timeit -s "l = [None]*0; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python  
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 50.5 ns +- 1.0 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 45.6 ns +- 1.2 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 50.5 ns +- 1.0 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 45.6 ns +- 1.2 ns: 1.11x faster (-10%)

$ python -m perf timeit -s "l = [None]*1; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 61.8 ns +- 1.1 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 54.8 ns +- 0.8 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 61.8 ns +- 1.1 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 54.8 ns +- 0.8 ns: 1.13x faster (-11%)

$ python -m perf timeit -s "l = [None]*5; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 68.2 ns +- 1.3 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 61.8 ns +- 1.5 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 68.2 ns +- 1.3 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 61.8 ns +- 1.5 ns: 1.10x faster (-9%)

$ python -m perf timeit -s "l = [None]*10; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 88.1 ns +- 2.3 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 78.9 ns +- 3.1 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 88.1 ns +- 2.3 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 78.9 ns +- 3.1 ns: 1.12x faster (-10%)

$ python -m perf timeit -s "l = [None]*100; tuple_ = tuple" "tuple_(l)" --duplicate=100 --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 477 ns +- 7 ns
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 452 ns +- 6 ns
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 477 ns +- 7 ns -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 452 ns +- 6 ns: 1.05x faster (-5%)
msg336207 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-21 11:24
Please make PyTuple_FromArray private: rename it to _PyTuple_FromArray. I would prefer to only add it to Include/internal/pycore_tupleobject.h.
msg336228 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-21 15:15
I once wrote a similar patch adding PyTuple_FromArray, but never published it because I did not find enough use cases for the function. I considered using it only for removing some code duplication, but Sergey has shown that it can also give a small performance boost in some special cases. I am still not sure, but this argument makes the change a tiny bit more attractive. I leave this to Raymond.
msg336229 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-21 15:20
> I once wrote a similar patch adding PyTuple_FromArray, but never published it because I did not find enough use cases for the function. I considered using it only for removing some code duplication, but Sergey has shown that it can also give a small performance boost in some special cases. I am still not sure, but this argument makes the change a tiny bit more attractive. I leave this to Raymond.

The micro-benchmark results are not really impressive. I still like PR 11954 because it removes code (simplifies loops). _PyTuple_FromArray() has a well defined API and is safe (I'm saying that because PR 11927 adds an "unsafe" function). As long as it's private, I'm fine with it.

I'm more attracted by the code simplification than performance boost here.
msg336230 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-21 15:30
I reviewed PR 11954: I asked for the PR to be reworked to only add _PyTuple_FromArray(), leaving out the "unrelated" micro-optimization, so that the code simplification and the micro-optimization are easier to review separately.

If the micro-optimization doesn't make the code more complex and doesn't introduce subtle issues with the GC, I'm fine with taking a 10 ns optimization ;-)
msg336530 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-25 15:06
This change virtually renames _PyStack_AsTuple to _PyTuple_FromArray and uses it in appropriate places for code simplification. I am +1 on such a change.
msg336541 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-25 16:59
New changeset 234531b4462b20d668762bd78406fd2ebab129c9 by Victor Stinner (Sergey Fedoseev) in branch 'master':
bpo-36030: Add _PyTuple_FromArray() function (GH-11954)
https://github.com/python/cpython/commit/234531b4462b20d668762bd78406fd2ebab129c9
msg336561 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-25 21:37
New changeset f1b9abe35f75393351b3d954a392122a3f8f6951 by Victor Stinner (Sergey Fedoseev) in branch 'master':
bpo-36030: Remove _PyStack_AsTuple() and _PyStack_AsTupleSlice() (GH-12032)
https://github.com/python/cpython/commit/f1b9abe35f75393351b3d954a392122a3f8f6951
msg336638 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-02-26 10:23
I added a WIP PR with the discussed micro-optimization; here are the benchmark results:

$ python -m perf compare_to --min-speed 1 -G master.json tuple-untracked.json
Slower (1):
- sqlite_synth: 5.16 us +- 0.10 us -> 5.22 us +- 0.08 us: 1.01x slower (+1%)

Faster (19):
- python_startup: 12.9 ms +- 0.7 ms -> 12.2 ms +- 0.0 ms: 1.06x faster (-5%)
- python_startup_no_site: 8.96 ms +- 0.29 ms -> 8.56 ms +- 0.03 ms: 1.05x faster (-4%)
- raytrace: 882 ms +- 11 ms -> 854 ms +- 12 ms: 1.03x faster (-3%)
- mako: 27.9 ms +- 0.8 ms -> 27.1 ms +- 0.3 ms: 1.03x faster (-3%)
- scimark_monte_carlo: 176 ms +- 4 ms -> 171 ms +- 5 ms: 1.03x faster (-3%)
- logging_format: 17.7 us +- 0.4 us -> 17.2 us +- 0.3 us: 1.03x faster (-3%)
- telco: 11.0 ms +- 0.2 ms -> 10.8 ms +- 0.4 ms: 1.02x faster (-2%)
- richards: 123 ms +- 2 ms -> 120 ms +- 2 ms: 1.02x faster (-2%)
- pathlib: 35.1 ms +- 0.7 ms -> 34.6 ms +- 0.5 ms: 1.01x faster (-1%)
- scimark_sparse_mat_mult: 6.97 ms +- 0.20 ms -> 6.88 ms +- 0.29 ms: 1.01x faster (-1%)
- scimark_sor: 327 ms +- 6 ms -> 323 ms +- 3 ms: 1.01x faster (-1%)
- scimark_fft: 570 ms +- 5 ms -> 562 ms +- 4 ms: 1.01x faster (-1%)
- float: 184 ms +- 2 ms -> 182 ms +- 2 ms: 1.01x faster (-1%)
- logging_simple: 15.8 us +- 0.4 us -> 15.6 us +- 0.3 us: 1.01x faster (-1%)
- deltablue: 12.6 ms +- 0.2 ms -> 12.5 ms +- 0.3 ms: 1.01x faster (-1%)
- crypto_pyaes: 186 ms +- 2 ms -> 184 ms +- 2 ms: 1.01x faster (-1%)
- hexiom: 17.3 ms +- 0.1 ms -> 17.2 ms +- 0.1 ms: 1.01x faster (-1%)
- sqlalchemy_declarative: 253 ms +- 4 ms -> 251 ms +- 3 ms: 1.01x faster (-1%)
- spectral_norm: 225 ms +- 2 ms -> 223 ms +- 3 ms: 1.01x faster (-1%)
msg336640 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-02-26 10:32
This optimization can also be used for the BUILD_TUPLE opcode and in the pickle module, if it's OK to add a _PyTuple_StealFromArray() function :-)
msg336683 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-26 15:26
> This optimization can also be used for the BUILD_TUPLE opcode and in the pickle module, if it's OK to add a _PyTuple_StealFromArray() function :-)

I would like to see a micro-benchmark showing that it's faster.
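As background on the BUILD_TUPLE opcode discussed above: it is what the compiler emits for a tuple display with at least one non-constant item (all-constant tuples get folded into a single constant load instead). This can be checked with the stdlib dis module:

```python
import dis

# The tuple display below has a non-constant first item, so the
# compiler cannot constant-fold it and must emit BUILD_TUPLE.
def make_tuple(a):
    return (a, 1, 1)

ops = [ins.opname for ins in dis.get_instructions(make_tuple)]
print(ops)  # includes 'BUILD_TUPLE'
```

Any speed-up in the BUILD_TUPLE path would therefore apply to every such tuple display executed by the interpreter.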
msg336684 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-26 15:29
Can you please convert msg336142 into a perf script? See the doc:
https://perf.readthedocs.io/en/latest/developer_guide.html

And then run again these benchmarks on PR 12052.

If you have a script, you can do:

./python-ref script.py -o ref.json
./python-untracked script.py -o untracked.json
./python-untracked -m perf compare_to ref.json untracked.json

https://perf.readthedocs.io/en/latest/cli.html#compare-to-cmd
msg336693 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-02-26 16:42
Victor/vstinner: Isn't PR 12032 reintroducing the issue fixed in #29234? _PyStack_AsTuple was explicitly marked _Py_NO_INLINE because inlining created excessive stack consumption in its callers (the bytecode interpreter loop), but the new _PyTuple_FromArray isn't marked _Py_NO_INLINE, so the swap reintroduces the problem.

Seems like either:

1. _PyTuple_FromArray should also be marked _Py_NO_INLINE

or

2. _PyStack_AsTuple should continue to exist as the non-inlined version of _PyTuple_FromArray

It's possible this is less of an issue because _PyTuple_FromArray is in tupleobject.c (where it's only called in one place), while _PyStack_AsTuple was in call.c and was called from four places within call.c. That only holds if link-time optimization isn't involved, though (and in practice, most public distributions of CPython are built with link-time optimization now, correct?); if LTO is enabled, the same stack bloat issues are possible.
msg336717 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-26 22:31
> Victor/vstinner: Isn't PR 12032 reintroducing the issue fixed in #29234?

No according to manual tests:
https://github.com/python/cpython/pull/12032#issuecomment-467110233

Usage of the stack memory doesn't change with PR 12032.

FYI I also tested manually since I was very surprised, and I confirm that the PR doesn't change the usage of the stack memory :-)
msg336718 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-02-26 22:35
> if LTO is enabled, the same stack bloat issues are possible

Please test; I'm not interested in spending too much time on that topic.

To be clear, _Py_NO_INLINE was a hack and a micro-optimization. It doesn't solve a real bug. Python has very weak promises on the maximum stack depth.

My work on reducing the stack memory usage was mostly motivated by my work on FASTCALL, since some patches reduced the maximum stack depth. They increased the stack memory usage with micro-optimizations like "PyObject *small_stack[_PY_FASTCALL_SMALL_STACK];", which is allocated on the stack: see Objects/call.c.
msg336734 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-02-27 07:57
> Can you please convert msg336142 into a perf script?
> And then run again these benchmarks on PR 12052.

+--------------------+---------+------------------------------+
| Benchmark          | ref     | untracked                    |
+====================+=========+==============================+
| list_as_tuple(0)   | 50.7 ns | 45.5 ns: 1.12x faster (-10%) |
+--------------------+---------+------------------------------+
| list_as_tuple(1)   | 64.5 ns | 56.5 ns: 1.14x faster (-12%) |
+--------------------+---------+------------------------------+
| list_as_tuple(5)   | 72.0 ns | 62.6 ns: 1.15x faster (-13%) |
+--------------------+---------+------------------------------+
| list_as_tuple(10)  | 86.3 ns | 77.1 ns: 1.12x faster (-11%) |
+--------------------+---------+------------------------------+
| list_as_tuple(100) | 469 ns  | 450 ns: 1.04x faster (-4%)   |
+--------------------+---------+------------------------------+
msg336735 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-02-27 08:11
>> This optimization can also be used for the BUILD_TUPLE opcode and in the pickle module, if it's OK to add a _PyTuple_StealFromArray() function :-)

> I would like to see a micro-benchmark showing that it's faster.

+-----------------+-----------------+-----------------------------+
| Benchmark       | build_tuple_ref | build_tuple_untracked       |
+=================+=================+=============================+
| (a, )           | 19.9 ns         | 19.4 ns: 1.03x faster (-3%) |
+-----------------+-----------------+-----------------------------+
| (a, 1)          | 24.0 ns         | 22.6 ns: 1.06x faster (-6%) |
+-----------------+-----------------+-----------------------------+
| (a, 1, 1)       | 28.2 ns         | 25.9 ns: 1.09x faster (-8%) |
+-----------------+-----------------+-----------------------------+
| (a, 1, 1, 1)    | 31.0 ns         | 29.0 ns: 1.07x faster (-6%) |
+-----------------+-----------------+-----------------------------+
| (a, 1, 1, 1, 1) | 34.7 ns         | 32.2 ns: 1.08x faster (-7%) |
+-----------------+-----------------+-----------------------------+
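A table like the one above can be roughly reproduced with only the stdlib; the helper below is an illustrative sketch (its name and sizes are made up here), and perf's duplication and calibration make its figures more reliable for statements this short:

```python
import timeit

# Time tuple displays of increasing arity, mirroring the
# build_tuple benchmark: (a,), (a, 1), (a, 1, 1), ...
# 'a' is a non-constant so BUILD_TUPLE stays in play.
def bench_tuple_display(max_items=5, number=100_000):
    results = {}
    for n in range(1, max_items + 1):
        items = ["a"] + ["1"] * (n - 1)
        expr = "(" + ", ".join(items) + ("," if n == 1 else "") + ")"
        total = timeit.timeit(expr, setup="a = object()", number=number)
        results[n] = total / number  # mean seconds per call
    return results
```

Running this on a reference build and a patched build and comparing the per-call means gives numbers of the same shape as the table above.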
msg349700 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 14:10
New changeset 4fa10dde40356d7c71e5524233bafc221d9e2deb by Victor Stinner (Sergey Fedoseev) in branch 'master':
bpo-36030: Improve performance of some tuple operations (GH-12052)
https://github.com/python/cpython/commit/4fa10dde40356d7c71e5524233bafc221d9e2deb
msg349701 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 14:11
Well done Sergey Fedoseev: I merged your last PR for this issue ;-) I'm not sure whether it's worth documenting these tuple micro-optimizations in What's New in Python 3.8 / 3.9. In the meantime, I'm closing the issue.
msg351128 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-04 13:58
New changeset 60bd1f88f21073965a444c8b39c4202d015da5d6 by Victor Stinner (Zackery Spytz) in branch 'master':
bpo-36030: Fix a possible segfault in PyTuple_New() (GH-15670)
https://github.com/python/cpython/commit/60bd1f88f21073965a444c8b39c4202d015da5d6
History
Date User Action Args
2022-04-11 14:59:11  admin  set  github: 80211
2019-09-04 13:58:08  vstinner  set  messages: + msg351128
2019-09-03 22:37:24  ZackerySpytz  set  pull_requests: + pull_request15333
2019-08-14 14:11:29  vstinner  set  status: open -> closed
resolution: fixed
messages: + msg349701

stage: patch review -> resolved
2019-08-14 14:10:39  vstinner  set  messages: + msg349700
2019-02-27 08:14:51  sir-sigurd  set  files: + list_as_tuple_bench.py
2019-02-27 08:14:41  sir-sigurd  set  files: + build_tuple_bench.py
2019-02-27 08:11:53  sir-sigurd  set  messages: + msg336735
2019-02-27 07:57:24  sir-sigurd  set  messages: + msg336734
2019-02-26 22:35:25  vstinner  set  messages: + msg336718
2019-02-26 22:31:28  vstinner  set  messages: + msg336717
2019-02-26 16:42:10  josh.r  set  nosy: + josh.r
messages: + msg336693
2019-02-26 15:29:56  vstinner  set  messages: + msg336684
2019-02-26 15:26:58  vstinner  set  messages: + msg336683
2019-02-26 10:32:52  sir-sigurd  set  messages: + msg336640
2019-02-26 10:23:02  sir-sigurd  set  messages: + msg336638
2019-02-26 10:14:58  sir-sigurd  set  pull_requests: + pull_request12078
2019-02-25 21:37:41  vstinner  set  messages: + msg336561
2019-02-25 17:18:10  sir-sigurd  set  pull_requests: + pull_request12062
2019-02-25 16:59:15  vstinner  set  messages: + msg336541
2019-02-25 15:06:57  serhiy.storchaka  set  messages: + msg336530
2019-02-21 15:30:08  vstinner  set  messages: + msg336230
2019-02-21 15:20:55  vstinner  set  messages: + msg336229
2019-02-21 15:15:17  serhiy.storchaka  set  nosy: + serhiy.storchaka
messages: + msg336228
2019-02-21 11:24:21  vstinner  set  messages: + msg336207
2019-02-20 19:50:37  sir-sigurd  set  messages: + msg336142
2019-02-20 11:37:48  vstinner  set  messages: + msg336074
2019-02-20 11:30:20  sir-sigurd  set  pull_requests: + pull_request11979
2019-02-20 10:53:27  vstinner  set  messages: + msg336065
2019-02-20 06:47:28  rhettinger  set  nosy: + rhettinger
messages: + msg336043
2019-02-19 10:44:18  vstinner  set  messages: + msg335923
2019-02-19 08:06:56  matrixise  set  nosy: + vstinner
2019-02-19 08:05:28  sir-sigurd  set  keywords: + patch
stage: patch review
pull_requests: + pull_request11952
2019-02-19 08:02:44  sir-sigurd  create