
classification
Title: Performance regression in functools.partial()
Type: performance
Stage: resolved
Components: Interpreter Core
Versions: Python 3.7, Python 3.6
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To:
Nosy List: abarry, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2016-09-21 18:27 by serhiy.storchaka, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (13)
msg277176 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-21 18:27
There is a roughly 10% performance regression in calling a functools.partial() object in 3.6.

$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' -- 'g(2)'

Python 3.5:  Median +- std dev: 452 ns +- 25 ns
Python 3.6:  Median +- std dev: 491 ns +- 12 ns
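The command above relies on the third-party perf module (since renamed pyperf). As a rough way to reproduce the measurement with only the standard library, here is a minimal sketch using timeit. The helper name bench_ns and the loop counts are illustrative assumptions, not part of the original report, and the absolute numbers will differ from one machine and build to another.

```python
import timeit
from functools import partial

# Same setup as the perf command in the report.
f = lambda x, y: None
g = partial(f, 1)


def bench_ns(stmt, namespace, number=100_000, repeat=5):
    """Return the best observed time per call, in nanoseconds.

    bench_ns is a hypothetical helper; number and repeat are arbitrary.
    """
    timings = timeit.repeat(stmt, globals=namespace, number=number, repeat=repeat)
    return min(timings) / number * 1e9


partial_ns = bench_ns("g(2)", {"g": g})
direct_ns = bench_ns("f(1, 2)", {"f": f})
print(f"partial call: {partial_ns:.0f} ns, direct call: {direct_ns:.0f} ns")
```

Running the same script under two interpreter builds and comparing the "partial call" figures gives a crude equivalent of the comparison reported here. Unlike perf, timeit does not spawn multiple worker processes or report a standard deviation, so small differences should be treated with caution.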
msg277178 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-21 19:11
The perf regression can be related to the new fastcall calling
convention or the work in ceval.c (new 16-bit regular bytecode, new
CALL_FUNCTION bytecodes).
msg277180 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-21 19:50
Oh, functools.partial.__call__() doesn't use fastcall yet. So compared to Python 3.5, fastcall shouldn't explain a major performance difference. FYI I'm working on an extension of fastcall to also support fastcall calling convention for obj.__call__() ;-)

But maybe the regression is related to code moved in ceval.c to support fastcall. I noticed differences with fastcall when you don't compile Python with LTO+PGO.
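To make it clearer what work a partial call involves, here is a rough pure-Python sketch of the behaviour. It is an illustration only: the class name PyPartial is hypothetical, and the real implementation is written in C (partial_call()), so the details of how arguments are merged there may differ.

```python
from functools import partial


class PyPartial:
    """Hypothetical pure-Python stand-in for functools.partial (illustration only)."""

    def __init__(self, func, *args, **keywords):
        self.func = func
        self.args = args
        self.keywords = keywords

    def __call__(self, *args, **keywords):
        # Every call merges the stored and the new arguments into a fresh
        # positional sequence and a fresh keyword dict before forwarding.
        merged_keywords = {**self.keywords, **keywords}
        return self.func(*self.args, *args, **merged_keywords)


def f(x, y):
    return (x, y)


result = PyPartial(f, 1)(2)
print(result)  # prints (1, 2), same as partial(f, 1)(2)
```

The per-call argument merging is the kind of overhead that a calling convention such as fastcall aims to reduce.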

--

About the bytecode, "g(2)" in Python 3.6b1 is:

              0 LOAD_GLOBAL              0 (g)
              2 LOAD_CONST               1 (2)
              4 CALL_FUNCTION            1
              6 POP_TOP

In Python 3.5 the bytecode is similar, but the instruction size is variable (1 or 3 bytes) and the CALL_FUNCTION argument is made of two parts: the number of positional arguments and the number of keyword arguments.

              0 LOAD_GLOBAL              0 (g)
              3 LOAD_CONST               1 (2)
              6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
              9 POP_TOP
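A listing like the ones above can be produced with the standard dis module. The sketch below is a minimal example; the wrapper function call_g is a hypothetical name introduced here so that g is compiled as a global lookup, and the exact opcode names and offsets depend on the Python version that runs it.

```python
import dis


def call_g():
    # Never executed here: only the compiled bytecode is inspected,
    # so g does not need to exist.
    g(2)


instructions = list(dis.get_instructions(call_g))
for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)

# Names of the call-related opcodes (CALL_FUNCTION on 3.5/3.6, other
# CALL variants on newer versions).
call_ops = [ins.opname for ins in instructions if ins.opname.startswith("CALL")]
print(call_ops)
```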
msg277203 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-22 07:18
If I revert the issue27809 changes, performance returns to the 3.5 level.
msg277204 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-22 07:21
I meant just c1a698edfa1b.

Median +- std dev: 441 ns +- 26 ns
msg277221 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-22 11:40
"If revert c1a698edfa1b, the performance is returned to the level of 3.5."

Oh, so using "fastcall" makes partial_call() slower? That's really something bad :-/ It would be nice if you could confirm with all optimizations enabled (PGO+LTO): ./configure --with-optimizations.

For faster compilation and best performance, you might also try modifying PROFILE_TASK in Makefile.pre.in to run your microbenchmark (but you need to run it long enough that the compiler is able to detect hot code).
msg277237 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-22 21:06
With all optimizations enabled, the difference is much smaller, if it has not disappeared entirely.

Python 3.5:  Median +- std dev: 423 ns +- 9 ns
Python 3.7:  Median +- std dev: 427 ns +- 13 ns
msg277238 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-22 21:11
> Python 3.5:  Median +- std dev: 423 ns +- 9 ns
> Python 3.7:  Median +- std dev: 427 ns +- 13 ns

0.9% slower on a microbenchmark is not really what I would call significant :-)

But there is an underlying issue: when PGO+LTO is not used, Python 3.7 (and Python 3.6, no?) seems slower than Python 3.5. I recall that I moved some code from Python/ceval.c to Objects/abstract.c and made subtle changes to how functions are called. I guess that code locality has an impact on such a CPU-bound microbenchmark. Maybe we should move code, but I don't know where or how. As I understand it, PGO puts "hot" code in a special section so that the hot code sits closer together.
msg289083 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-06 10:09
Can this issue be closed now?
msg289088 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-06 10:29
I just ran a microbenchmark, 3.6 compared to 3.5:

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../3.5/python  --python-names=3.5:3.6
3.5: ..................... 151 ns +- 4 ns
3.6: ..................... 150 ns +- 4 ns

Median +- std dev: [3.5] 151 ns +- 4 ns -> [3.6] 150 ns +- 4 ns: 1.00x faster (-0%)
Not significant!

=> not significant, so I'm closing the issue.


FYI, the difference for 3.7 is not significant either:

$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../3.5/python  --python-names=3.5:3.7

Median +- std dev: [3.5] 150 ns +- 4 ns -> [3.7] 150 ns +- 3 ns: 1.00x faster (-0%)
Not significant!
msg289089 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-06 10:37
Thanks Victor!
msg289097 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-06 12:59
Oh wait, there was a major regression in my perf module :-( The --compare-to option was completely broken in the development branch (but it works for timeit --compare-to in the latest release). It's now fixed! So please ignore the results in my previous comment.


New benchmark, 3.6 compared to 3.5:

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../3.5/python --python-names=ref:patch --python-names=3.5:3.6
3.5: ..................... 152 ns +- 4 ns
3.6: ..................... 152 ns +- 1 ns

Median +- std dev: [3.5] 152 ns +- 4 ns -> [3.6] 152 ns +- 1 ns: 1.00x faster (-0%)
Not significant!

=> Ah! No change, it's still not significant! Same speed.


3.7 compared to 3.6:

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../3.6/python --python-names=ref:patch --python-names=3.6:3.7
3.6: ..................... 152 ns +- 1 ns
3.7: ..................... 138 ns +- 1 ns

Median +- std dev: [3.6] 152 ns +- 1 ns -> [3.7] 138 ns +- 1 ns: 1.10x faster (-9%)

=> Oh! 3.7 is 1.10x faster! I didn't compile Python with PGO, so maybe it's a minor change due to code placement? At least, it's not slower ;-)
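For reference, the "1.10x faster (-9%)" summary can be checked by hand from the two medians. The sketch below is a small worked example; the function name compare is an assumption made for illustration and is not part of the perf module.

```python
def compare(ref_ns, new_ns):
    """Return (speed ratio, percent change) of new_ns relative to ref_ns.

    A ratio above 1 means the new build is faster; a negative percent
    change means each call takes less time.
    """
    speed_ratio = ref_ns / new_ns
    percent_change = (new_ns - ref_ns) / ref_ns * 100.0
    return speed_ratio, percent_change


# 3.6 (152 ns) versus 3.7 (138 ns), medians taken from the run above.
ratio, change = compare(152, 138)
print(f"{ratio:.2f}x faster ({change:+.0f}%)")  # prints: 1.10x faster (-9%)

# 3.5 (423 ns) versus 3.7 (427 ns), medians from the earlier PGO+LTO run.
_, pgo_change = compare(423, 427)
print(f"{pgo_change:+.1f}%")  # prints: +0.9%
```

The second figure matches the "0.9% slower" quoted earlier in the thread for the fully optimized builds.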


I'm unable to see any performance slowdown in 3.6 compared to 3.5, so I'm keeping the issue closed.

Anyway, thanks for the report Serhiy! Our performance watcher ;-)
msg289102 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-06 13:30
> Oh, functools.partial.__call__() doesn't use fastcall yet.

This issue reminded me that I didn't finish optimizing partial_call(): see issue #29735 for a minor optimization.
History
Date                 User              Action  Args
2022-04-11 14:58:37  admin             set     github: 72430
2017-03-06 13:30:38  vstinner          set     messages: + msg289102
2017-03-06 12:59:34  vstinner          set     messages: + msg289097
2017-03-06 10:37:06  serhiy.storchaka  set     messages: + msg289089
2017-03-06 10:29:55  vstinner          set     status: open -> closed; resolution: out of date; messages: + msg289088; stage: resolved
2017-03-06 10:09:22  serhiy.storchaka  set     messages: + msg289083
2016-09-22 21:11:16  vstinner          set     messages: + msg277238
2016-09-22 21:06:32  serhiy.storchaka  set     messages: + msg277237
2016-09-22 11:40:22  vstinner          set     messages: + msg277221
2016-09-22 07:21:45  serhiy.storchaka  set     messages: + msg277204
2016-09-22 07:18:07  serhiy.storchaka  set     messages: + msg277203
2016-09-22 02:26:18  abarry            set     nosy: + abarry
2016-09-21 19:50:34  vstinner          set     messages: + msg277180
2016-09-21 19:11:57  vstinner          set     messages: + msg277178
2016-09-21 18:27:41  serhiy.storchaka  create