classification
Title: Optimize functools.partial() for positional arguments
Type: performance Stage: resolved
Components: Extension Modules Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: methane, ncoghlan, rhettinger, serhiy.storchaka, vstinner, yselivanov
Priority: normal Keywords:

Created on 2017-03-06 13:22 by vstinner, last changed 2017-03-24 22:19 by vstinner. This issue is now closed.

Files
File name Uploaded Description
bench_fastcall_partial.py vstinner, 2017-03-14 12:08
partial_stack_usage.py vstinner, 2017-03-14 15:02
Pull Requests
URL Status Linked
PR 516 merged vstinner, 2017-03-06 13:29
Messages (10)
msg289100 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-06 13:22
The pull request makes functools.partial() faster for positional arguments. It avoids the creation of a tuple for positional arguments. It allocates a small buffer for up to 5 parameters. But it seems like even if the small buffer is not used, it's still faster.
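For context, the positional-argument merge performed on every call can be modeled in pure Python. This is a minimal sketch, not the C implementation: `model_partial` is a hypothetical name, and the explicit tuple concatenation only stands in for the temporary tuple that the pull request avoids.

```python
from functools import partial


def model_partial(func, *stored_args):
    """Hypothetical pure-Python model of partial() for positional arguments."""
    def call(*call_args):
        # One merged tuple is built per call. In the C code, this is the
        # allocation that the pull request replaces with a raw array.
        merged = stored_args + call_args
        return func(*merged)
    return call


def f(x, y):
    return (x, y)


modeled = model_partial(f, 1)
real = partial(f, 1)
print(modeled(2), real(2))
```

Both callables receive the stored argument followed by the call-time argument; only the way the merged argument list is built differs in the real C code.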

Use small buffer, total: 2 positional arguments.

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch --python-names=ref:patch
ref: ..................... 138 ns +- 1 ns
patch: ..................... 121 ns +- 1 ns

Median +- std dev: [ref] 138 ns +- 1 ns -> [patch] 121 ns +- 1 ns: 1.14x faster (-12%)


Don't use small buffer, total: 6 positional arguments.

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda a1, a2, a3, a4, a5, a6: None; g = partial(f, 1, 2, 3, 4, 5)' 'g(6)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch --python-names=ref:patch
ref: ..................... 156 ns +- 1 ns
patch: ..................... 136 ns +- 0 ns

Median +- std dev: [ref] 156 ns +- 1 ns -> [patch] 136 ns +- 0 ns: 1.15x faster (-13%)


Another benchmark with 10 positional arguments:

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda a1, a2, a3, a4, a5, a6, a7, a8, a9, a10: None; g = partial(f, 1, 2, 3, 4, 5)' 'g(6, 7, 8, 9, 10)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch --python-names=ref:patch
ref: ..................... 193 ns +- 1 ns
patch: ..................... 166 ns +- 2 ns

Median +- std dev: [ref] 193 ns +- 1 ns -> [patch] 166 ns +- 2 ns: 1.17x faster (-14%)
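The three cases above can be roughly reproduced with the standard library alone. This is a sketch that assumes `timeit` is an acceptable stand-in for the `perf` module used in this message; it measures a single interpreter, so it cannot compare a reference build against a patched build, and the absolute numbers will differ. The `SETUPS` table and the `bench` helper are names introduced for illustration.

```python
import timeit

# Setup and statement pairs copied from the perf commands above.
SETUPS = {
    "1+1 args": (
        "from functools import partial; f = lambda x, y: None; "
        "g = partial(f, 1)",
        "g(2)",
    ),
    "5+1 args": (
        "from functools import partial; "
        "f = lambda a1, a2, a3, a4, a5, a6: None; "
        "g = partial(f, 1, 2, 3, 4, 5)",
        "g(6)",
    ),
    "5+5 args": (
        "from functools import partial; "
        "f = lambda a1, a2, a3, a4, a5, a6, a7, a8, a9, a10: None; "
        "g = partial(f, 1, 2, 3, 4, 5)",
        "g(6, 7, 8, 9, 10)",
    ),
}


def bench(setup, stmt, number=100_000, repeat=5):
    """Return the best observed time per call, in nanoseconds."""
    timings = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(timings) / number * 1e9


results = {name: bench(setup, stmt) for name, (setup, stmt) in SETUPS.items()}
for name, ns in results.items():
    print(f"{name}: {ns:.1f} ns per call")
```

To compare two builds, the same script would have to be run once under each interpreter and the outputs compared by hand.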
msg289103 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-06 13:32
functools.partial() is commonly used in the asyncio module. The asyncio documentation suggests using it because of deliberate limitations of the asyncio API.
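As an illustration of that pattern, here is a minimal sketch: loop.call_soon() forwards only positional arguments to the callback, so keyword arguments are bound beforehand with partial(). The `report` callback and its `prefix` parameter are hypothetical names chosen for this example.

```python
import asyncio
from functools import partial


def report(results, message, *, prefix=""):
    """Hypothetical callback that needs a keyword argument."""
    results.append(prefix + message)


async def main():
    loop = asyncio.get_running_loop()
    results = []
    done = loop.create_future()
    # call_soon() passes positional arguments only, so the keyword
    # argument is bound ahead of time with partial().
    loop.call_soon(partial(report, results, "hello", prefix="> "))
    # Callbacks run in the order they were scheduled, so this one
    # resolves the future after report() has run.
    loop.call_soon(done.set_result, None)
    await done
    return results


results = asyncio.run(main())
print(results)
```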
msg289112 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-06 14:58
What about C stack consumption? Doesn't this increase it?

Since nested partial() objects are collapsed, you need to interleave them with another wrapper for testing.

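from functools import partial

n = 100  # nesting depth; arbitrary value, the original snippet leaves n undefined

def decorator(f):
    # Plain Python wrapper that keeps consecutive partial() objects apart.
    def wrapper(*args):
        return f(*args)
    return wrapper

def f(*args): pass

for i in range(n):
    f = partial(f)
    f = decorator(f)

f(1, 2)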
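The collapsing mentioned above can be observed through the public `func` and `args` attributes of partial objects. A small sketch of the behavior in CPython, where `nested` and `interleaved` are names introduced for illustration:

```python
from functools import partial


def decorator(f):
    def wrapper(*args):
        return f(*args)
    return wrapper


def f(*args):
    return args


# Directly nested partials are flattened into a single partial object.
nested = partial(partial(f, 1), 2)
print(nested.func is f, nested.args)

# A plain wrapper in between prevents the flattening, so every call
# really passes through two partial objects.
interleaved = partial(decorator(partial(f, 1)), 2)
print(interleaved.func is f, interleaved.args)
print(interleaved(3))
```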
msg289120 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-06 16:52
If the underlying function doesn't support fast call, and either args or pto->args is empty, partial_call() makes two unneeded copies. Arguments are copied from a tuple to the raw array and from the array to a new tuple. This is what the current code does, but it can be avoided.

If the underlying function doesn't support fast call, and both args and pto->args are not empty, the patched partial_call() makes one unneeded copy. Arguments are copied from the tuples to the raw array and from the array to the new tuple. Only one copy is needed (from the tuples to the new tuple).
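The difference between the two strategies can be modeled by counting item copies in pure Python. This is only an illustrative sketch: `merge_via_array` and `merge_direct` are hypothetical names, a list stands in for the raw C array, and the counts model the C-level copies rather than measure them.

```python
def merge_via_array(stored_args, call_args):
    """Model: tuples -> raw array -> new tuple (two passes over the items)."""
    stack = []
    copied = 0
    for source in (stored_args, call_args):
        for item in source:
            stack.append(item)  # first pass: tuples -> raw array
            copied += 1
    merged = tuple(stack)       # second pass: raw array -> new tuple
    copied += len(merged)
    return merged, copied


def merge_direct(stored_args, call_args):
    """Model: tuples -> new tuple (a single pass over the items)."""
    merged = stored_args + call_args
    return merged, len(merged)


print(merge_via_array((1, 2), (3,)))
print(merge_direct((1, 2), (3,)))
```

For three arguments the model counts six item copies on the array route and three on the direct route, which is the extra copy described above.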
msg289578 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-14 10:35
> If the underlying function doesn't support fast call, and both args and pto->args are not empty, the patched partial_call() makes one unneeded copy.

The simple workaround is to revert changes using FASTCALL in partial_call().

But for best performance, it seems like we need two code paths, depending on whether the function supports fastcall or not. I will try to write a patch for that.
msg289579 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-14 12:08
bench_fastcall_partial.py: more complete microbenchmark.

I rewrote my patch:

* I added _PyObject_HasFastCall(callable): it returns 1 if callable supports the FASTCALL calling convention for positional arguments
* I split partial_call() into two subfunctions: partial_fastcall() is specialized for FASTCALL, and partial_call_impl() uses PyObject_Call() with a tuple for positional arguments

The patch fixes the performance regression for VARARGS and optimizes FASTCALL:

haypo@smithers$ ./python -m perf compare_to ref.json patch.json --table 
+-----------------------------+---------+------------------------------+
| Benchmark                   | ref     | patch                        |
+=============================+=========+==============================+
| partial Python, 1+1 arg     | 135 ns  | 118 ns: 1.15x faster (-13%)  |
+-----------------------------+---------+------------------------------+
| partial Python, 2+0 arg     | 114 ns  | 91.4 ns: 1.25x faster (-20%) |
+-----------------------------+---------+------------------------------+
| partial Python, 5+1 arg     | 151 ns  | 135 ns: 1.12x faster (-11%)  |
+-----------------------------+---------+------------------------------+
| partial Python, 5+5 arg     | 192 ns  | 168 ns: 1.15x faster (-13%)  |
+-----------------------------+---------+------------------------------+
| partial C VARARGS, 2+0 arg  | 153 ns  | 127 ns: 1.20x faster (-17%)  |
+-----------------------------+---------+------------------------------+
| partial C FASTCALL, 1+1 arg | 111 ns  | 93.7 ns: 1.18x faster (-15%) |
+-----------------------------+---------+------------------------------+
| partial C FASTCALL, 2+0 arg | 63.9 ns | 64.6 ns: 1.01x slower (+1%)  |
+-----------------------------+---------+------------------------------+

Not significant (1): partial C VARARGS, 1+1 arg
msg289580 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-14 12:10
> What about C stack consumption? Doesn't this increase it?

Yes, my optimization consumes more C stack: small_stack allocates 80 bytes on the stack (for 5 positional arguments). Is it an issue?
msg289582 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-14 13:25
Nice results.

You did great work on decreasing C stack consumption. It would be sad to lose it without good reason. Could you please compare the two variants, with and without the small stack?
msg289594 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-14 15:02
I measured that my patch (pull request) increases stack usage by 64 bytes per partial_call() call. I consider that acceptable for a speedup between 1.12x and 1.25x.

The attached partial_stack_usage.py requires testcapi_stack_pointer.patch from issue #28870.

Original:

f(): [1000 calls] 624.0 B per call
f2(): [1000 calls] 624.0 B per call

Patched:

f(): [1000 calls] 688.0 B per call (+64 B)
f2(): [1000 calls] 688.0 B per call (+64 B)
msg290183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-24 22:19
New changeset 0f7b0b397e12514ee213bc727c9939b66585cbe2 by Victor Stinner in branch 'master':
bpo-29735: Optimize partial_call(): avoid tuple (#516)
https://github.com/python/cpython/commit/0f7b0b397e12514ee213bc727c9939b66585cbe2
History
Date User Action Args
2017-03-24 22:19:27 vstinner set messages: + msg290183
2017-03-14 20:42:37 vstinner set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-03-14 15:02:03 vstinner set files: + partial_stack_usage.py
messages: + msg289594
2017-03-14 13:25:02 serhiy.storchaka set messages: + msg289582
2017-03-14 12:10:04 vstinner set messages: + msg289580
2017-03-14 12:08:12 vstinner set files: + bench_fastcall_partial.py
messages: + msg289579
2017-03-14 10:35:05 vstinner set messages: + msg289578
2017-03-06 16:52:11 serhiy.storchaka set messages: + msg289120
2017-03-06 14:58:01 serhiy.storchaka set messages: + msg289112
components: + Extension Modules
stage: patch review
2017-03-06 13:32:55 vstinner set nosy: + rhettinger, ncoghlan
2017-03-06 13:32:22 vstinner set nosy: + methane, serhiy.storchaka, yselivanov
messages: + msg289103
2017-03-06 13:29:15 vstinner set pull_requests: + pull_request425
2017-03-06 13:22:45 vstinner create