classification
Title: Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build
Type: performance Stage: patch review
Components: C API, Interpreter Core, Windows Versions: Python 3.11, Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, erlendaasland, gvanrossum, kj, lemburg, malin, neonene, pablogsal, paul.moore, rhettinger, steve.dower, tim.golden, vstinner, zach.ware
Priority: release blocker Keywords: patch

Created on 2021-09-06 15:27 by neonene, last changed 2021-09-23 07:35 by neonene.

Files
File name Uploaded Description Edit
310rc1_confirm_overhead.patch neonene, 2021-09-06 15:27
ceval_310rc1_patched.c neonene, 2021-09-06 15:29
b98e-no-inline-in-all.diff neonene, 2021-09-11 02:15
b98e-no-inline-in-eval.diff neonene, 2021-09-11 02:15
b98e-no-inline-in-the-others.diff neonene, 2021-09-11 02:16
pyproject_inlinestat.patch neonene, 2021-09-11 02:17
x64_28d2.log neonene, 2021-09-11 02:17
x64_b98e.log neonene, 2021-09-11 02:17
310rc2_benchmarks.txt neonene, 2021-09-13 23:37
310a7_vs_310rc2_bench.txt kj, 2021-09-18 16:26
PR28475_inline.log neonene, 2021-09-21 06:01
PR28475_vs_310rc2_vs_310a7.txt kj, 2021-09-21 09:10
PR28475_skip1test_bench.txt neonene, 2021-09-23 07:35
Pull Requests
URL Status Linked Edit
PR 28390 merged vstinner, 2021-09-16 16:08
PR 28419 merged vstinner, 2021-09-17 14:39
PR 28427 closed vstinner, 2021-09-17 21:30
PR 28475 open Mark.Shannon, 2021-09-20 13:16
Messages (46)
msg401143 - (view) Author: neonene (neonene) * Date: 2021-09-06 15:27
pyperformance on Windows shows some gap between 3.10a7 and 3.10b1.
The following are the ratios compared with 3.10a7 (the higher the slower).

-------------------------------------------------
Windows x64     |  PGO   release  official-binary
----------------+--------------------------------
20210405        |
    3.10a7      |  1.00   1.24    1.00 (PGO?)
20210408-07:58  |
    b98eba5     |  0.98
20210408-10:22  |
  * PR25244     |  1.04
20210503        |
    3.10b1      |  1.07   1.21    1.07
-------------------------------------------------
Windows x86     |  PGO   release  official-binary
----------------+--------------------------------
20210405        |
    3.10a7      |  1.00   1.25    1.27 (release?)
20210408-07:58  |
    b98eba5bc   |  1.00
20210408-10:22  |
  * PR25244     |  1.11
20210503        |
    3.10b1      |  1.14   1.28    1.29

Since PR25244 (28d28e053db6b69d91c2dfd579207cd8ccbc39e7),
_PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
At least the functions below have become un-inlined there at all.

  (1) _Py_DECREF()                 (from Py_DECREF,Py_CLEAR,Py_SETREF)
  (2) _Py_XDECREF()                (from Py_XDECREF,SETLOCAL)
  (3) _Py_IS_TYPE()                (from PyXXX_CheckExact)
  (4) _Py_atomic_load_32bit_impl() (from CHECK_EVAL_BREAKER)

I tried in vain other linker options like thread-safe-profiling, agressive-code-generation, /OPT:NOREF.
3.10a7 can inline them in the eval-loop even if profiling only test_array.py.

I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

-----------------------------------------------------------------
Windows x64     |  PGO   patched  overhead in eval-loop
----------------+------------------------------------------------
    3.10a7      |  1.00
20210802        |
    3.10rc1     |  1.09   1.05    4%  (slow 43, fast  5, same 10)
20210831-20:42  |
    863154c     |  0.95   0.90    5%  (slow 48, fast  3, same  7)
   (3.11a0+)    |
-----------------------------------------------------------------
Windows x86     |  PGO   patched  overhead in eval-loop
----------------+------------------------------------------------
    3.10a7      |  1.00
20210802        |
    3.10rc1     |  1.15   1.13    2%  (slow 29, fast 14, same 15)
20210831-20:42  |
    863154c     |  1.05   1.02    3%  (slow 44, fast  7, same  7)
   (3.11a0+)    |
msg401152 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-06 16:35
Rather than defining again functions as macro, you should consider using __forceinline function attribute: see bpo-45094.
msg401154 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-06 16:43
Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.
msg401182 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-06 21:49
Raymond:
> Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

These functions were not converted recently to static inline function. For example, Py_INCREF() was already a static inline function in Python 3.9. I don't think that any decision should be taken before performances have been analyzed in depth.

I'm not convinced that there is really a performance regression. I'm not sure how benchmarks are run on Windows.

neonene:
> I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

I don't understand how to read the table.

Usually, I expect a comparison between a reference build and a patch build, but here you seem to use 3.10a7 as the reference to compare results. I'm not sure that geometric means can be compared this way.
msg401183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-06 21:52
> Since PR25244 (28d28e053db6b69d91c2dfd579207cd8ccbc39e7),
_PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
> At least the functions below have become un-inlined there at all.

I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

The compiler is free to inline or not depending on the pressure on registers and stack memory. I'm not sure that inlining always make the code faster, since inlining have many side effects.

I don't know well MSC compiler.
msg401319 - (view) Author: neonene (neonene) * Date: 2021-09-07 18:13
@vstinner: __forceinline suggestion

Since PR25244 (mentioned above), it seems link.exe has got to get stuck on python310.dll.
Before the PR, it took 10x~ longer to link than without __forceinline function.
I can confirm with _Py_DECREF() and _Py_XDECREF() and one training-job (the more fucntions forced/jobs used, the slower to link).
Have you tried __forceinline on PGO ?


> I don't understand how to read the table.

Overhead field is the output of pyperf command, not subtraction (the answers are the same just luckily).

ex) 3.10rc1x86 PGO: 
     PGO      : pyperf compare_to 3.10a7 left
     patched  : pyperf compare_to 3.10a7 right
     overhead : pyperf compare_to right  left 
  are
     1.15x slower (slower 52, faster  4, not significant  2)
     1.13x slower (slower 50, faster  4, not significant  4)
     1.02x slower (slower 29, faster 14, not significant 15)


> I'm not sure if PGO builds are reproducible,

MSVC does not produce the same code. Inlining (all or nothing) might be a quite special case in the hottest section.
I suspect the profiler doesn't work well only for _PyEval_EvalFrameDefault(), including branch/align optimization.
So my posted macro or inlining is just for a mesureing, not the solution.
msg401329 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-07 19:15
I don't understand if adding __forceinline on the mentioned static inline functions has an impact on the performance of a PGO build on Windows.
msg401346 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-09-07 21:57
> I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

I'm 99% sure it's a tracing PGO rather than sampling, so provided everything runs in the same order and follows the same codepaths, wall-clock timings shouldn't have a significant effect.

Without looking at the generated code, it may be more effective to try and force the functions called from the macro to never be inlined. The optimiser may be missing that the calls are uncommon, and trying to inline them first, then deciding that the whole merged function is too big to inline in places where it would be useful.

There's also no particular reason we need to use these tests as the profile, it's just that nobody has taken the time to come up with anything better. I'd rather see us invest time in generating a good profile rather than trying to hand-optimise inlining. (Though the test suite is good in many ways, because it makes sure all the extension modules are covered. We definitely want as close to full code coverage as we can get, at least for happy paths, but may need more eval loop in the profile if that's what needs help.)
msg401364 - (view) Author: Ma Lin (malin) * Date: 2021-09-08 09:48
This article briefly introduces the inlining decisions in MSVC. 
https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/
msg401623 - (view) Author: neonene (neonene) * Date: 2021-09-11 02:15
Thanks for all suggestions. I focused on my bisected commit and the previous.

I run pyperformance with 4 functions never inlined in the sections below.

   _Py_DECREF()
   _Py_XDECREF()
   _Py_IS_TYPE()
   _Py_atomic_load_32bit_impl()

are

   (1) never inlined in _PyEval_EvalFrameDefault().
   (2) never inlined in the other funcitons.
   (3) never inlined in all functions.


slow downs                     [4-funcs never inlined section]
--------------------------------------------------------------
Windows x64 PGO (44job)            (*)    (1)    (2)    (3)
rebuild                            none   eval  others  all
--------------------------------------------------------------
b98eba5 (4 funcs inlined in eval)  1.00   1.05   1.09   1.14
PR25244 (    not inlined in eval)  1.06   1.07   1.18   1.17

pyperf compare_to upper lower:
   (*) 1.06x slower  (slower 45, faster  4, not not significant  9)
   (1) 1.02x slower  (slower 33, faster 13, not not significant 12)
   (2) 1.08x slower  (slower 48, faster  6, not not significant  4)
   (3) 1.03x slower  (slower 39, faster  6, not not significant 13)


--------------------------------------------------------------
Windows x86 PGO (44job)            (*)    (1)    (2)    (3)
rebuild                            none   eval  others  all
--------------------------------------------------------------
b98eba5 (4 funcs inlined in eval)  1.00   1.03   1.06   1.15
PR25244 (    not inlined in eval)  1.13   1.13   1.22   1.24

pyperf compare_to upper lower:
   (*) 1.13x slower  (slower 54, faster  2, not not significant  2)
   (1) 1.10x slower  (slower 47, faster  3, not not significant  8)
   (2) 1.14x slower  (slower 54, faster  1, not not significant  3)
   (3) 1.08x slower  (slower 43, faster  3, not not significant 12)


In both x64 and x86, it looks column (2) and (*) has similar gaps.
So, I would like to simply focus on the eval-loop.

I built PGO with "/d2inlinestats" and "/d2inlinelogfull:_PyEval_EvalFrameDefault" according to the blog.

I posted logs. As for PR25244, the logsize is 3x smaller than the previous and pgo rejects the 4 funcs above. I will look into it later.


Collecting:
> Before the PR, it took 10x~ longer to link than without __forceinline function.

Current build is 10x~ shorter than before to link.
Before the PR, __forceinline had no impact to me.
msg401624 - (view) Author: Ma Lin (malin) * Date: 2021-09-11 02:45
MSVC 2019 has a /Ob3 option:
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion

From the experience of another project, I conjecture /Ob3 increase the "global budget" mentioned in the blog.
I used /Ob3 for the 3.10 branch, and there seems to be no significant performance change. If you have time, neonene welcome to verify this.
msg401628 - (view) Author: neonene (neonene) * Date: 2021-09-11 03:30
According to:
https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-160

PGO seems to override /Ob3.
Around this issue, I posted benchmark on issue44381.
On python building, /Ob3 works for only non-pgo-supported dlls like,

_ctypes_test
_freeze_importlib
_msi
_testbuffer
_testcapi
_testconsole
_testembed
_testimportmultiple
_testinternalcapi
_testmultiphase
_uuid
liblzma
pylauncher
pyshellext
pywlauncher
sqlite3
venvlauncher
venvwlauncher
winsound

I use this option in _msvccompiler.py for my pyd.
I will try and report when PGO with /Ob3 makes difference in the log.
msg401743 - (view) Author: neonene (neonene) * Date: 2021-09-13 23:37
With msvc 16.10.3 and 16.11.2 (latest),
PR25244 told me the amount of code in _PyEval_EvalFrameDefault() is over the limit of PGO.
In the old version of _PyEval_EvalFrameDefault (b98eba5), the same issue can be caused adding any-code anywhere with more than 20 expressions/statements. For example, at the top/middle/end of the function, repeating "if (0) {}" 10times, or "if (0) {19 statements}". As for python3.9.7, more than 800 expressions/statements.

Here is just a workaround for 3.10rc2 on windows.
==================================================
--- Python/ceval.c
+++ Python/ceval.c
@@ -1306,9 +1306 @@
-#define DISPATCH() \
-    { \
-        if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
-            goto tracing_dispatch; \
-        } \
-        f->f_lasti = INSTR_OFFSET(); \
-        NEXTOPARG(); \
-        DISPATCH_GOTO(); \
-    }
+#define DISPATCH() goto tracing_dispatch
@@ -1782,4 +1774,9 @@
     tracing_dispatch:
     {
+        if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
+            f->f_lasti = INSTR_OFFSET();
+            NEXTOPARG();
+            DISPATCH_GOTO();
+        }
         int instr_prev = f->f_lasti;
         f->f_lasti = INSTR_OFFSET();
==================================================

This patch becomes ineffective just adding one expression to DISPATCH macro as below

   #define DISPATCH() {if (1) goto tracing_dispatch;}

And this approach is not sufficient for 3.11 with bigger eval-func.
I don't know a cl/link option to lift such restriction of function size.


3.10rc2 x86 pgo : 1.00
        patched : 1.09x faster (slower  5, faster 48, not significant 5)

3.10rc2 x64 pgo : 1.00         (roughly the same speed as official bin)
        patched : 1.07x faster (slower  5, faster 47, not significant 6)
  patched(/Ob3) : 1.07x faster (slower  7, faster 45, not significant 6)

x64 results are posted.

Fixing inlining rejection also made __forceinline buildable with normal processing time and memory usage.
msg401964 - (view) Author: neonene (neonene) * Date: 2021-09-16 15:44
I reported this issue to developercommunity of microsoft.

https://developercommunity.visualstudio.com/t/1531987
msg401970 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-16 16:26
These should be changed back to macros where inlining is guaranteed.
msg401972 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-09-16 16:42
I agree with Raymond. Let's stop throwing more code at this until we've figured out what's going on and revert the change for now.
msg402025 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-17 09:58
@neonene
Thanks for the truly excellent investigation!

@Raymond and @Steve,
If I understood OP (neonene) properly, changing Py_DECREF to a macro won't get back the entire 7% lost performance in pyperformance. neonene's investigations suggest that the entire eval function is now too big for PGO on MSVC. Fixing Py_DECREF may get us a few %, but all the other hot functions in eval are likely not being inlined as well. Their suggested patch in https://bugs.python.org/msg401743 works, but IMO, _may_ slow down DISPATCH() _slightly_ since it seems to introduce another jump.

I suggest using their patch only for MSVC until we think of something better in 3.11. The additional jump may not matter after PGO (and it surely isn't a 7% slowdown :). WDYT?
msg402040 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 11:47
> the entire eval function is now too big for PGO on MSVC

I don't think that the issue is specific to MSVC. If a function becomes too big, it becomes less efficient for CPU caches.

One idea would be to move the least common opcodes into a slow-path, in a separated function, and make sure that this function is *not* inlined (use Py_NO_INLINE macro).

@Mark: What do you think?

Maybe we can keep current targets in the big switch, and call the function there. Something like:

TARGET(DUP_TOP):
TARGET(DUP_TOP_TWO):
(...)
   ceval_slow_path();
   break;

_PyEval_EvalFrameDefault() takes around 3500 lines of C code.
msg402043 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 12:09
New changeset 6b413551284a94cfe31377c9c607ff890aa06c26 by Victor Stinner in branch 'main':
bpo-45116: Add the Py_ALWAYS_INLINE macro (GH-28390)
https://github.com/python/cpython/commit/6b413551284a94cfe31377c9c607ff890aa06c26
msg402044 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 12:11
I added Py_ALWAYS_INLINE to run benchmarks more easily. Even if Py_INCREF() is converted back to a macro, there are now multiple static inline functions which are short and performance critical.

Using Py_ALWAYS_INLINE *may* speed up the Python debug builds and the PGO build on Windows if I understood correctly.

Right now, I'm not sure. The heuristic to decide if a function is inlined or not seems to depend a lot on the compiler and the compiler options.
msg402063 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-17 15:35
> Right now, I'm not sure. The heuristic to decide
> if a function is inlined or not seems to depend
> a lot on the compiler and the compiler options.

That is exactly correct.  And it is why we should
use the macro form which is certain to be inlined. 

Please do as Steve asked  and revert back to the
previous stable, reliable code.
msg402064 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-17 15:39
Pablo, should this be a release blocker?
msg402065 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-09-17 16:36
FWIW: Back in the days of Python 1.5.2, the ceval loop was too big for CPU caches as well and one of the things I experimented with at the time was rearranging the opcodes based on how often they were used and splitting the whole switch statement we had back then in two parts. This results in a 10-20% speedup.

CPU caches have since gotten much larger, but the size of the loop still is something to keep in mind and optimize for, as more and more logic gets added to the inner loop of Python.

IMO, we should definitely keep forced inlines / macros where they are used inside hot loops, perhaps even in all of the CPython code, since the conversion to inline functions is mostly for hiding internals from extensions, not to hide them from CPython itself.

@neonene: Could you provide more details about the CPU you're using to run the tests ?

BTW: Perhaps the PSF could get a few sponsors to add more hosts to speed.python.org, to provide a better overview. It looks as if the system is only compiling on Ubuntu 14.04 and running on an 11 year old system (https://speed.python.org/about/). If that's the case, the system uses a server CPU with 12MB cache (https://www.intel.com/content/www/us/en/products/sku/47916/intel-xeon-processor-x5680-12m-cache-3-33-ghz-6-40-gts-intel-qpi/specifications.html).
msg402067 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-17 17:08
> Pablo, should this be a release blocker?

How severe is the regression? If is severe enough we can mark it as a release blocker, but a conclusion needs to be reached ASAP because I don't want to change a fundamental macro a few days before the release
msg402068 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-17 17:24
> How severe is the regression?

OP provided pyperformance of current 3.10 vs their patched version at https://bugs.python.org/file50280/310rc2_benchmarks.txt. The patch is at https://bugs.python.org/msg401743.
msg402071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 18:32
Raymond: "Please do as Steve asked  and revert back to the previous stable, reliable code."

Is this issue really about Py_INCREF() being a static inline macro? Or is it more about the increased size of the _PyEval_EvalFrameDefault() function?

neonene's analyzis seems to show that the the PGO optimizer of the MSC compiler has thresholds depending on the function size, and _PyEval_EvalFrameDefault() crossed these thresholds. Py_INCREF() static inline function only seems to be the top of the iceberg, it's more a "side effect" than the root issue, no?

neonene showed in msg401743 that even adding *dead code* changes Python performance. So for me, it sounds weird to decide to change Py_INCREF() implementation only based on this analysis. Or maybe I missed something.

neonene's analyzis starts with the commit 28d28e053db6b69d91c2dfd579207cd8ccbc39e7 of PR 25244. Are you suggesting to revert this change? Mark Shannon is pushing many changes in ceval.c and the frame object. Multiple changes have been pushed on top of it since this commit, I don't think that this commit can be easily reverted.

What I understood is that adding __forceinline on some static inline functions can make the performance regression of the commit 28d28e053db6b69d91c2dfd579207cd8ccbc39e7 less bad. But I would prefer to validate that, since neonene's comparisons are not what I'm looking for: compare main to main+Py_ALWAYS_INLINE.
msg402090 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-17 20:41
If https://bugs.python.org/file50280/310rc2_benchmarks.txt is correct, this means that we have a 7% slowdown in Windows PGO builds for 3.10, which I don't think is acceptable.

I am marking this as a release blocker until there is some agreement. I remind you that if an agreement cannot be reached, you may reach the Steering Council for help reaching some conclusion.
msg402091 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 20:46
New changeset e4044e9f893350b4623677c048d33414a77edf55 by Victor Stinner in branch 'main':
bpo-45116: Py_DEBUG ignores Py_ALWAYS_INLINE (GH-28419)
https://github.com/python/cpython/commit/e4044e9f893350b4623677c048d33414a77edf55
msg402092 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 20:56
> If https://bugs.python.org/file50280/310rc2_benchmarks.txt is correct, this means that we have a 7% slowdown in Windows PGO builds for 3.10, which I don't think is acceptable.

What I understood is that the https://bugs.python.org/msg401743 patch makes 1.07x faster.

But https://bugs.python.org/issue45116#msg401143 compares Python 3.10b1 to Python 3.10a7: Python 3.10b1 is slower (32-bit: "1.07", 64-bit: "1.14": "higher the slower" wrote neonene).
msg402098 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 21:32
I created a draft PR to mark functions like Py_INCREF() and Py_IS_TYPE() with __forceinline: PR 28427.
msg402099 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 21:43
Can someone compare the main branch (commit e4044e9f893350b4623677c048d33414a77edf55) to the main branch + PR 28427 patch?

You can download the patch from:
https://patch-diff.githubusercontent.com/raw/python/cpython/pull/28427.patch

How can I build Python with PGO on Windows? Visual Studio has two profiles: PGinstrument and PGupdate.

I cannot find "PGO", "PGinstrument" or "PGupdate" in the https://devguide.python.org/
msg402117 - (view) Author: neonene (neonene) * Date: 2021-09-18 02:01
> (32-bit: "1.07", 64-bit: "1.14": "higher the slower" wrote neonene)

32-bit and 64-bit are in reverse. I compared b1 and a7 because this can be confirmed by anyone with official binary. If 7% of my patch has little to do with the gap, then I will be happy that 3.10 can be far faster.

>How can I build Python with PGO on Windows?

Try the following,

   PCbuild\build.bat -p x64 --no-tkinter --pgo

Before building, your object.h needs to replace
    static inline int Py_ALWAYS_INLINE
with
    static Py_ALWAYS_INLINE int 

In my case, pgo got stuck on linking with the object.h.


I'm waiting the reply from developercommunity.
msg402135 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-18 16:26
@Pablo,
> If <benchmark_link> is correct ...

For some verification, I benched pyperformance on Win10 AMD64, with the Python 3.10a7 and 3.10rc2 x64 binaries downloaded directly from python.org website release pages. The results corroborate with neonene's (please see the attached file). In short, there was a 1.09x slowdown on average.

FYI, `pyperf system tune` doesn't work on Windows. So I manually disabled turbo boost and intel speedstep, but I didn't have time to research setting core affinity and the other stabilizations. Nonetheless, most of the benches were stable.

> I am marking this as a release blocker until there is some agreement.
Got it. Setting as advised.
msg402143 - (view) Author: Ma Lin (malin) * Date: 2021-09-19 03:27
> In my case, pgo got stuck on linking with the object.h.

Me too. Since commit 28d28e0 (the first commit to slow down the PGO build), if add `__forceinline` attribute to _Py_DECREF() function in object.h, the PGO build hangs (>50 minutes).

So PR 28427 may not be a short-term solution.
msg402189 - (view) Author: Ma Lin (malin) * Date: 2021-09-20 02:37
Like OP's benchmark, if convert the inline functions to macros in object.h, the 3.10 branch is 1.03x faster, but still 1.07x slower than 28d28e0~1.
@vstinner could you prepare such a PR as a candidate fix.

There seem to be two ways to solve it in short-term.
1, Split the giant function.
2, Contact MSVC team to see if there is a quick solution, such as undocumented options.

But the release time is too close. The worst result is to release with the performance regression, and note  in the download page that there is a performance regression, if you care about performance please use 3.9.
msg402190 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-20 05:30
I concur with Ma Lin.
msg402217 - (view) Author: neonene (neonene) * Date: 2021-09-20 10:12
>release with the performance regression

I'm OK with the option. The limitation of PGO seems to me a bit weird and it might be unexpected for MSVC team.
msg402229 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-09-20 13:26
If we are hitting a size limit for PGO, then we need to reduce the size of _PyEval_EvalFrameDefault, to let the compiler do its job.
Force inlining stuff is not going to help.

Reverting https://github.com/python/cpython/pull/25244 for 3.10 seems to be the cleanest way to do this.

Would someone with a reliable way to test performance on Windows test the effect of https://github.com/python/cpython/pull/28475, please?

Longer term we need get PGO in MSVC working on larger functions, but I doubt that will be possible for 3.10.
msg402230 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-20 13:28
Can someone repeat the benchmarks with https://github.com/python/cpython/pull/28475 ?
msg402287 - (view) Author: neonene (neonene) * Date: 2021-09-21 06:01
I built 3.10rc2 PGO with PR28475 applied, and posted the inliner's log.
In the log, the 4-callees mentioned above are now inlined, which were "hard reject"ed before.

As for the performance, a few reporters may be needed, but it's not necessary for them to care about noises in the apparent gap.

310rc2 x64 PGO           : 1.00
 + PR28475 build1 bench1 : 1.05x faster (slower  7, faster 43, nochange  8)
                  bench2 : 1.05x faster (slower  2, faster 43, nochange 13)
           build2        : 1.05x faster (slower  4, faster 45, nochange  9)

310rc2 x64 release       : 1.00
 + PR28475               : 1.01x faster (slower 14, faster 25, nochange 19)


Is Windows involved in the faster-cpython project? If so, the project should be provided with Windows machines for validation.
msg402289 - (view) Author: Ma Lin (malin) * Date: 2021-09-21 06:47
PR28475:
64-bit build is 1.03x slower than 28d28e0~1
32-bit build is 1.04x slower than 28d28e0~1

28d28e0~1 is the last good commit.
msg402296 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-21 09:10
Like what Ma Lin and neonene have mentioned above, PR28475 recovered half of the lost performance. It's unfortunately still 4% slower than 3.10a7.

>pyperf compare_to 310a7.json 310rc2.json 310rc2patched.json

Geometric mean (versus 3.10a7)
==============

310rc2: 1.09x slower
310rc2patched: 1.04x slower

Attached pyperf benchmark comparisons file.
msg402307 - (view) Author: neonene (neonene) * Date: 2021-09-21 09:58
To be fair, the slowdowns between PR25244 and b1 seems to be an accumulation of "1.00x slower" of every commit. I don't know after b1.
msg402308 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-09-21 10:02
The only other change of any obvious significance to _PyEval_EvalFrameDefault since 3.10a7 are the changes to MATCH_MAPPING and MATCH_SEQUENCE and those make _PyEval_EvalFrameDefault smaller.

We may need to look elsewhere for the remaining ~4% performance.
msg402320 - (view) Author: neonene (neonene) * Date: 2021-09-21 15:05
PR28475 PGO is 2% slower than the patch I pasted on msg401743.
The function sizes are almost the same (+1:goto,+1:label), and there is no performance gap between release builds.

I suspect the following.

1. PGO is too sensitive to a function size at near the limit.
2. PR28475 is not fully covered by 44 tests. (msg401346)
msg402480 - (view) Author: neonene (neonene) * Date: 2021-09-23 07:35
3.10rc2 Python/ceval.c
1306: #define DISPATCH() \
1307:     { \
1308:         if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
1309:             goto tracing_dispatch; \

Among the 44 pgo-tests, only test_patma.TestTracing hits the condition above. On Windows, it seems that skipping it tightens the profile of PR28475 a bit. Additional tests such as test_threading(.ThreadTests.test_frame_tstate_tracing) might also cause some amount of variation or vice versa.

3.10rc2 x64 PGO    : 1.00
+ PR28475 
  with TestTracing : 1.05x faster (slow  3, fast 46, same  9)
  without          : 1.06x faster (slow  5, fast 52, same  1)

  with TestTracing : 1.00
  without          : 1.01x faster (slow 19, fast 27, same 12)

(Details: PR28475_skip1test_bench.txt)


Does test_patma.TestTracing need training for match-case performance?
History
Date User Action Args
2021-09-23 07:35:57neonenesetfiles: + PR28475_skip1test_bench.txt

messages: + msg402480
2021-09-21 15:05:39neonenesetmessages: + msg402320
2021-09-21 10:02:22Mark.Shannonsetmessages: + msg402308
2021-09-21 09:58:47neonenesetmessages: + msg402307
2021-09-21 09:10:54kjsetfiles: + PR28475_vs_310rc2_vs_310a7.txt

messages: + msg402296
2021-09-21 06:47:24malinsetmessages: + msg402289
2021-09-21 06:01:53neonenesetfiles: + PR28475_inline.log

messages: + msg402287
2021-09-20 19:59:28gvanrossumsetnosy: + gvanrossum
2021-09-20 13:28:50pablogsalsetmessages: + msg402230
2021-09-20 13:26:32Mark.Shannonsetmessages: + msg402229
2021-09-20 13:16:46Mark.Shannonsetpull_requests: + pull_request26873
2021-09-20 10:12:03neonenesetmessages: + msg402217
2021-09-20 05:30:48rhettingersetmessages: + msg402190
2021-09-20 02:37:22malinsetmessages: + msg402189
2021-09-19 08:24:22erlendaaslandsetnosy: + erlendaasland
2021-09-19 03:27:14malinsetmessages: + msg402143
2021-09-18 16:26:55kjsetpriority: high -> release blocker
files: + 310a7_vs_310rc2_bench.txt
messages: + msg402135
2021-09-18 02:01:52neonenesetmessages: + msg402117
2021-09-17 21:43:41vstinnersetmessages: + msg402099
2021-09-17 21:32:04vstinnersetmessages: + msg402098
2021-09-17 21:30:59vstinnersetpull_requests: + pull_request26837
2021-09-17 20:56:56vstinnersetmessages: + msg402092
2021-09-17 20:46:48vstinnersetmessages: + msg402091
2021-09-17 20:41:03pablogsalsetmessages: + msg402090
2021-09-17 18:32:03vstinnersetmessages: + msg402071
2021-09-17 17:24:54kjsetmessages: + msg402068
2021-09-17 17:08:01pablogsalsetmessages: + msg402067
2021-09-17 16:36:16lemburgsetmessages: + msg402065
2021-09-17 15:39:03rhettingersetnosy: + lemburg, pablogsal
messages: + msg402064
2021-09-17 15:35:25rhettingersetmessages: + msg402063
2021-09-17 14:39:45vstinnersetpull_requests: + pull_request26831
2021-09-17 12:11:52vstinnersetmessages: + msg402044
2021-09-17 12:09:28vstinnersetmessages: + msg402043
2021-09-17 11:47:58vstinnersetmessages: + msg402040
2021-09-17 09:58:33kjsetpriority: normal -> high

messages: + msg402025
2021-09-16 16:42:54steve.dowersetmessages: + msg401972
2021-09-16 16:26:53rhettingersetmessages: + msg401970
2021-09-16 16:08:11vstinnersetstage: patch review
pull_requests: + pull_request26803
2021-09-16 15:44:50neonenesetmessages: + msg401964
2021-09-16 15:14:42kjsetnosy: + kj
2021-09-13 23:37:47neonenesetfiles: + 310rc2_benchmarks.txt

messages: + msg401743
2021-09-13 09:29:46vstinnersettitle: Performance regression 3.10b1 and later on Windows -> Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build
2021-09-11 03:30:27neonenesetmessages: + msg401628
2021-09-11 02:45:22malinsetmessages: + msg401624
2021-09-11 02:17:57neonenesetfiles: + x64_b98e.log
2021-09-11 02:17:40neonenesetfiles: + x64_28d2.log
2021-09-11 02:17:00neonenesetfiles: + pyproject_inlinestat.patch
2021-09-11 02:16:06neonenesetfiles: + b98e-no-inline-in-the-others.diff
2021-09-11 02:15:52neonenesetfiles: + b98e-no-inline-in-eval.diff
2021-09-11 02:15:36neonenesetfiles: + b98e-no-inline-in-all.diff

messages: + msg401623
2021-09-08 09:48:24malinsetnosy: + malin
messages: + msg401364
2021-09-07 21:57:07steve.dowersetmessages: + msg401346
2021-09-07 19:15:36vstinnersetmessages: + msg401329
2021-09-07 18:13:04neonenesetmessages: + msg401319
2021-09-06 21:52:54vstinnersetmessages: + msg401183
2021-09-06 21:49:38vstinnersetmessages: + msg401182
2021-09-06 16:43:19rhettingersetnosy: + rhettinger
messages: + msg401154
2021-09-06 16:35:18vstinnersetmessages: + msg401152
2021-09-06 16:07:30pablogsalsetnosy: - pablogsal
2021-09-06 15:29:03neonenesetfiles: + ceval_310rc1_patched.c
2021-09-06 15:27:18neonenecreate