This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)
Type: performance Stage: patch review
Components: C API, Interpreter Core, Windows Versions: Python 3.11, Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, brandtbucher, erlendaasland, gvanrossum, kj, lemburg, malin, neonene, pablogsal, paul.moore, rhettinger, steve.dower, tim.golden, vstinner, zach.ware
Priority: Keywords: patch

Created on 2021-09-06 15:27 by neonene, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
310rc1_confirm_overhead.patch neonene, 2021-09-06 15:27
ceval_310rc1_patched.c neonene, 2021-09-06 15:29
b98e-no-inline-in-all.diff neonene, 2021-09-11 02:15
b98e-no-inline-in-eval.diff neonene, 2021-09-11 02:15
b98e-no-inline-in-the-others.diff neonene, 2021-09-11 02:16
pyproject_inlinestat.patch neonene, 2021-09-11 02:17
x64_28d2.log neonene, 2021-09-11 02:17
x64_b98e.log neonene, 2021-09-11 02:17
310rc2_benchmarks.txt neonene, 2021-09-13 23:37
310a7_vs_310rc2_bench.txt kj, 2021-09-18 16:26
PR28475_inline.log neonene, 2021-09-21 06:01
PR28475_vs_310rc2_vs_310a7.txt kj, 2021-09-21 09:10
PR28475_skip1test_bench.txt neonene, 2021-09-23 07:35
310rc2patched_vs_310rc2notrace.txt kj, 2021-09-29 14:25
switch-case_unarranged_bench.txt neonene, 2021-10-16 15:26
ceval_PR29565_split_func.c neonene, 2021-11-19 18:54
Pull Requests
URL Status Linked Edit
PR 28390 merged vstinner, 2021-09-16 16:08
PR 28419 merged vstinner, 2021-09-17 14:39
PR 28427 closed vstinner, 2021-09-17 21:30
PR 28475 merged Mark.Shannon, 2021-09-20 13:16
PR 28630 closed neonene, 2021-09-29 16:54
PR 28631 closed neonene, 2021-09-29 16:54
PR 31436 closed neonene, 2022-02-20 03:30
PR 31459 open neonene, 2022-02-21 04:16
PR 32387 open gvanrossum, 2022-04-07 00:09
Messages (82)
msg401143 - (view) Author: neonene (neonene) * Date: 2021-09-06 15:27
pyperformance on Windows shows some gap between 3.10a7 and 3.10b1.
The following are the ratios compared with 3.10a7 (the higher the slower).

-------------------------------------------------
Windows x64     |  PGO   release  official-binary
----------------+--------------------------------
20210405        |
    3.10a7      |  1.00   1.24    1.00 (PGO?)
20210408-07:58  |
    b98eba5     |  0.98
20210408-10:22  |
  * PR25244     |  1.04
20210503        |
    3.10b1      |  1.07   1.21    1.07
-------------------------------------------------
Windows x86     |  PGO   release  official-binary
----------------+--------------------------------
20210405        |
    3.10a7      |  1.00   1.25    1.27 (release?)
20210408-07:58  |
    b98eba5bc   |  1.00
20210408-10:22  |
  * PR25244     |  1.11
20210503        |
    3.10b1      |  1.14   1.28    1.29

Since PR25244 (28d28e053db6b69d91c2dfd579207cd8ccbc39e7),
_PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
At least the functions below have become un-inlined there at all.

  (1) _Py_DECREF()                 (from Py_DECREF,Py_CLEAR,Py_SETREF)
  (2) _Py_XDECREF()                (from Py_XDECREF,SETLOCAL)
  (3) _Py_IS_TYPE()                (from PyXXX_CheckExact)
  (4) _Py_atomic_load_32bit_impl() (from CHECK_EVAL_BREAKER)

I tried in vain other linker options like thread-safe-profiling, agressive-code-generation, /OPT:NOREF.
3.10a7 can inline them in the eval-loop even if profiling only test_array.py.

I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

-----------------------------------------------------------------
Windows x64     |  PGO   patched  overhead in eval-loop
----------------+------------------------------------------------
    3.10a7      |  1.00
20210802        |
    3.10rc1     |  1.09   1.05    4%  (slow 43, fast  5, same 10)
20210831-20:42  |
    863154c     |  0.95   0.90    5%  (slow 48, fast  3, same  7)
   (3.11a0+)    |
-----------------------------------------------------------------
Windows x86     |  PGO   patched  overhead in eval-loop
----------------+------------------------------------------------
    3.10a7      |  1.00
20210802        |
    3.10rc1     |  1.15   1.13    2%  (slow 29, fast 14, same 15)
20210831-20:42  |
    863154c     |  1.05   1.02    3%  (slow 44, fast  7, same  7)
   (3.11a0+)    |
msg401152 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-06 16:35
Rather than defining again functions as macro, you should consider using __forceinline function attribute: see bpo-45094.
msg401154 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-06 16:43
Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.
msg401182 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-06 21:49
Raymond:
> Perhaps these critical code sections should have been left as macros. It is difficult to assuring system wide inlining across modules.

These functions were not converted recently to static inline function. For example, Py_INCREF() was already a static inline function in Python 3.9. I don't think that any decision should be taken before performances have been analyzed in depth.

I'm not convinced that there is really a performance regression. I'm not sure how benchmarks are run on Windows.

neonene:
> I measured overheads of (1)~(4) on my own build whose eval-loop uses macros instead of them.

I don't understand how to read the table.

Usually, I expect a comparison between a reference build and a patch build, but here you seem to use 3.10a7 as the reference to compare results. I'm not sure that geometric means can be compared this way.
msg401183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-06 21:52
> Since PR25244 (28d28e053db6b69d91c2dfd579207cd8ccbc39e7),
_PyEval_EvalFrameDefault() in ceval.c has seemed to be unoptimized with PGO (msvc14.29.16.10).
> At least the functions below have become un-inlined there at all.

I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

The compiler is free to inline or not depending on the pressure on registers and stack memory. I'm not sure that inlining always make the code faster, since inlining have many side effects.

I don't know well MSC compiler.
msg401319 - (view) Author: neonene (neonene) * Date: 2021-09-07 18:13
@vstinner: __forceinline suggestion

Since PR25244 (mentioned above), it seems link.exe has got to get stuck on python310.dll.
Before the PR, it took 10x~ longer to link than without __forceinline function.
I can confirm with _Py_DECREF() and _Py_XDECREF() and one training-job (the more fucntions forced/jobs used, the slower to link).
Have you tried __forceinline on PGO ?


> I don't understand how to read the table.

Overhead field is the output of pyperf command, not subtraction (the answers are the same just luckily).

ex) 3.10rc1x86 PGO: 
     PGO      : pyperf compare_to 3.10a7 left
     patched  : pyperf compare_to 3.10a7 right
     overhead : pyperf compare_to right  left 
  are
     1.15x slower (slower 52, faster  4, not significant  2)
     1.13x slower (slower 50, faster  4, not significant  4)
     1.02x slower (slower 29, faster 14, not significant 15)


> I'm not sure if PGO builds are reproducible,

MSVC does not produce the same code. Inlining (all or nothing) might be a quite special case in the hottest section.
I suspect the profiler doesn't work well only for _PyEval_EvalFrameDefault(), including branch/align optimization.
So my posted macro or inlining is just for a mesureing, not the solution.
msg401329 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-07 19:15
I don't understand if adding __forceinline on the mentioned static inline functions has an impact on the performance of a PGO build on Windows.
msg401346 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-09-07 21:57
> I'm not sure if PGO builds are reproducible, since there is a training step which requires to run a non-deterministic workload (the Python test suite which is somehow randomized by I/O timings and other stuffs).

I'm 99% sure it's a tracing PGO rather than sampling, so provided everything runs in the same order and follows the same codepaths, wall-clock timings shouldn't have a significant effect.

Without looking at the generated code, it may be more effective to try and force the functions called from the macro to never be inlined. The optimiser may be missing that the calls are uncommon, and trying to inline them first, then deciding that the whole merged function is too big to inline in places where it would be useful.

There's also no particular reason we need to use these tests as the profile, it's just that nobody has taken the time to come up with anything better. I'd rather see us invest time in generating a good profile rather than trying to hand-optimise inlining. (Though the test suite is good in many ways, because it makes sure all the extension modules are covered. We definitely want as close to full code coverage as we can get, at least for happy paths, but may need more eval loop in the profile if that's what needs help.)
msg401364 - (view) Author: Ma Lin (malin) * Date: 2021-09-08 09:48
This article briefly introduces the inlining decisions in MSVC. 
https://devblogs.microsoft.com/cppblog/inlining-decisions-in-visual-studio/
msg401623 - (view) Author: neonene (neonene) * Date: 2021-09-11 02:15
Thanks for all suggestions. I focused on my bisected commit and the previous.

I run pyperformance with 4 functions never inlined in the sections below.

   _Py_DECREF()
   _Py_XDECREF()
   _Py_IS_TYPE()
   _Py_atomic_load_32bit_impl()

are

   (1) never inlined in _PyEval_EvalFrameDefault().
   (2) never inlined in the other funcitons.
   (3) never inlined in all functions.


slow downs                     [4-funcs never inlined section]
--------------------------------------------------------------
Windows x64 PGO (44job)            (*)    (1)    (2)    (3)
rebuild                            none   eval  others  all
--------------------------------------------------------------
b98eba5 (4 funcs inlined in eval)  1.00   1.05   1.09   1.14
PR25244 (    not inlined in eval)  1.06   1.07   1.18   1.17

pyperf compare_to upper lower:
   (*) 1.06x slower  (slower 45, faster  4, not not significant  9)
   (1) 1.02x slower  (slower 33, faster 13, not not significant 12)
   (2) 1.08x slower  (slower 48, faster  6, not not significant  4)
   (3) 1.03x slower  (slower 39, faster  6, not not significant 13)


--------------------------------------------------------------
Windows x86 PGO (44job)            (*)    (1)    (2)    (3)
rebuild                            none   eval  others  all
--------------------------------------------------------------
b98eba5 (4 funcs inlined in eval)  1.00   1.03   1.06   1.15
PR25244 (    not inlined in eval)  1.13   1.13   1.22   1.24

pyperf compare_to upper lower:
   (*) 1.13x slower  (slower 54, faster  2, not not significant  2)
   (1) 1.10x slower  (slower 47, faster  3, not not significant  8)
   (2) 1.14x slower  (slower 54, faster  1, not not significant  3)
   (3) 1.08x slower  (slower 43, faster  3, not not significant 12)


In both x64 and x86, it looks column (2) and (*) has similar gaps.
So, I would like to simply focus on the eval-loop.

I built PGO with "/d2inlinestats" and "/d2inlinelogfull:_PyEval_EvalFrameDefault" according to the blog.

I posted logs. As for PR25244, the logsize is 3x smaller than the previous and pgo rejects the 4 funcs above. I will look into it later.


Collecting:
> Before the PR, it took 10x~ longer to link than without __forceinline function.

Current build is 10x~ shorter than before to link.
Before the PR, __forceinline had no impact to me.
msg401624 - (view) Author: Ma Lin (malin) * Date: 2021-09-11 02:45
MSVC 2019 has a /Ob3 option:
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion

From the experience of another project, I conjecture /Ob3 increase the "global budget" mentioned in the blog.
I used /Ob3 for the 3.10 branch, and there seems to be no significant performance change. If you have time, neonene welcome to verify this.
msg401628 - (view) Author: neonene (neonene) * Date: 2021-09-11 03:30
According to:
https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-160

PGO seems to override /Ob3.
Around this issue, I posted benchmark on issue44381.
On python building, /Ob3 works for only non-pgo-supported dlls like,

_ctypes_test
_freeze_importlib
_msi
_testbuffer
_testcapi
_testconsole
_testembed
_testimportmultiple
_testinternalcapi
_testmultiphase
_uuid
liblzma
pylauncher
pyshellext
pywlauncher
sqlite3
venvlauncher
venvwlauncher
winsound

I use this option in _msvccompiler.py for my pyd.
I will try and report when PGO with /Ob3 makes difference in the log.
msg401743 - (view) Author: neonene (neonene) * Date: 2021-09-13 23:37
With msvc 16.10.3 and 16.11.2 (latest),
PR25244 told me the amount of code in _PyEval_EvalFrameDefault() is over the limit of PGO.
In the old version of _PyEval_EvalFrameDefault (b98eba5), the same issue can be caused adding any-code anywhere with more than 20 expressions/statements. For example, at the top/middle/end of the function, repeating "if (0) {}" 10times, or "if (0) {19 statements}". As for python3.9.7, more than 800 expressions/statements.

Here is just a workaround for 3.10rc2 on windows.
==================================================
--- Python/ceval.c
+++ Python/ceval.c
@@ -1306,9 +1306 @@
-#define DISPATCH() \
-    { \
-        if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
-            goto tracing_dispatch; \
-        } \
-        f->f_lasti = INSTR_OFFSET(); \
-        NEXTOPARG(); \
-        DISPATCH_GOTO(); \
-    }
+#define DISPATCH() goto tracing_dispatch
@@ -1782,4 +1774,9 @@
     tracing_dispatch:
     {
+        if (!(trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE)) {
+            f->f_lasti = INSTR_OFFSET();
+            NEXTOPARG();
+            DISPATCH_GOTO();
+        }
         int instr_prev = f->f_lasti;
         f->f_lasti = INSTR_OFFSET();
==================================================

This patch becomes ineffective just adding one expression to DISPATCH macro as below

   #define DISPATCH() {if (1) goto tracing_dispatch;}

And this approach is not sufficient for 3.11 with bigger eval-func.
I don't know a cl/link option to lift such restriction of function size.


3.10rc2 x86 pgo : 1.00
        patched : 1.09x faster (slower  5, faster 48, not significant 5)

3.10rc2 x64 pgo : 1.00         (roughly the same speed as official bin)
        patched : 1.07x faster (slower  5, faster 47, not significant 6)
  patched(/Ob3) : 1.07x faster (slower  7, faster 45, not significant 6)

x64 results are posted.

Fixing inlining rejection also made __forceinline buildable with normal processing time and memory usage.
msg401964 - (view) Author: neonene (neonene) * Date: 2021-09-16 15:44
I reported this issue to developercommunity of microsoft.

https://developercommunity.visualstudio.com/t/1531987
msg401970 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-16 16:26
These should be changed back to macros where inlining is guaranteed.
msg401972 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-09-16 16:42
I agree with Raymond. Let's stop throwing more code at this until we've figured out what's going on and revert the change for now.
msg402025 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-17 09:58
@neonene
Thanks for the truly excellent investigation!

@Raymond and @Steve,
If I understood OP (neonene) properly, changing Py_DECREF to a macro won't get back the entire 7% lost performance in pyperformance. neonene's investigations suggest that the entire eval function is now too big for PGO on MSVC. Fixing Py_DECREF may get us a few %, but all the other hot functions in eval are likely not being inlined as well. Their suggested patch in https://bugs.python.org/msg401743 works, but IMO, _may_ slow down DISPATCH() _slightly_ since it seems to introduce another jump.

I suggest using their patch only for MSVC until we think of something better in 3.11. The additional jump may not matter after PGO (and it surely isn't a 7% slowdown :). WDYT?
msg402040 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 11:47
> the entire eval function is now too big for PGO on MSVC

I don't think that the issue is specific to MSVC. If a function becomes too big, it becomes less efficient for CPU caches.

One idea would be to move the least common opcodes into a slow-path, in a separated function, and make sure that this function is *not* inlined (use Py_NO_INLINE macro).

@Mark: What do you think?

Maybe we can keep current targets in the big switch, and call the function there. Something like:

TARGET(DUP_TOP):
TARGET(DUP_TOP_TWO):
(...)
   ceval_slow_path();
   break;

_PyEval_EvalFrameDefault() takes around 3500 lines of C code.
msg402043 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 12:09
New changeset 6b413551284a94cfe31377c9c607ff890aa06c26 by Victor Stinner in branch 'main':
bpo-45116: Add the Py_ALWAYS_INLINE macro (GH-28390)
https://github.com/python/cpython/commit/6b413551284a94cfe31377c9c607ff890aa06c26
msg402044 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 12:11
I added Py_ALWAYS_INLINE to run benchmarks more easily. Even if Py_INCREF() is converted back to a macro, there are now multiple static inline functions which are short and performance critical.

Using Py_ALWAYS_INLINE *may* speed up the Python debug builds and the PGO build on Windows if I understood correctly.

Right now, I'm not sure. The heuristic to decide if a function is inlined or not seems to depend a lot on the compiler and the compiler options.
msg402063 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-17 15:35
> Right now, I'm not sure. The heuristic to decide
> if a function is inlined or not seems to depend
> a lot on the compiler and the compiler options.

That is exactly correct.  And it is why we should
use the macro form which is certain to be inlined. 

Please do as Steve asked  and revert back to the
previous stable, reliable code.
msg402064 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-17 15:39
Pablo, should this be a release blocker?
msg402065 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2021-09-17 16:36
FWIW: Back in the days of Python 1.5.2, the ceval loop was too big for CPU caches as well and one of the things I experimented with at the time was rearranging the opcodes based on how often they were used and splitting the whole switch statement we had back then in two parts. This results in a 10-20% speedup.

CPU caches have since gotten much larger, but the size of the loop still is something to keep in mind and optimize for, as more and more logic gets added to the inner loop of Python.

IMO, we should definitely keep forced inlines / macros where they are used inside hot loops, perhaps even in all of the CPython code, since the conversion to inline functions is mostly for hiding internals from extensions, not to hide them from CPython itself.

@neonene: Could you provide more details about the CPU you're using to run the tests ?

BTW: Perhaps the PSF could get a few sponsors to add more hosts to speed.python.org, to provide a better overview. It looks as if the system is only compiling on Ubuntu 14.04 and running on an 11 year old system (https://speed.python.org/about/). If that's the case, the system uses a server CPU with 12MB cache (https://www.intel.com/content/www/us/en/products/sku/47916/intel-xeon-processor-x5680-12m-cache-3-33-ghz-6-40-gts-intel-qpi/specifications.html).
msg402067 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-17 17:08
> Pablo, should this be a release blocker?

How severe is the regression? If is severe enough we can mark it as a release blocker, but a conclusion needs to be reached ASAP because I don't want to change a fundamental macro a few days before the release
msg402068 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-17 17:24
> How severe is the regression?

OP provided pyperformance of current 3.10 vs their patched version at https://bugs.python.org/file50280/310rc2_benchmarks.txt. The patch is at https://bugs.python.org/msg401743.
msg402071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 18:32
Raymond: "Please do as Steve asked  and revert back to the previous stable, reliable code."

Is this issue really about Py_INCREF() being a static inline macro? Or is it more about the increased size of the _PyEval_EvalFrameDefault() function?

neonene's analyzis seems to show that the the PGO optimizer of the MSC compiler has thresholds depending on the function size, and _PyEval_EvalFrameDefault() crossed these thresholds. Py_INCREF() static inline function only seems to be the top of the iceberg, it's more a "side effect" than the root issue, no?

neonene showed in msg401743 that even adding *dead code* changes Python performance. So for me, it sounds weird to decide to change Py_INCREF() implementation only based on this analysis. Or maybe I missed something.

neonene's analyzis starts with the commit 28d28e053db6b69d91c2dfd579207cd8ccbc39e7 of PR 25244. Are you suggesting to revert this change? Mark Shannon is pushing many changes in ceval.c and the frame object. Multiple changes have been pushed on top of it since this commit, I don't think that this commit can be easily reverted.

What I understood is that adding __forceinline on some static inline functions can make the performance regression of the commit 28d28e053db6b69d91c2dfd579207cd8ccbc39e7 less bad. But I would prefer to validate that, since neonene's comparisons are not what I'm looking for: compare main to main+Py_ALWAYS_INLINE.
msg402090 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-17 20:41
If https://bugs.python.org/file50280/310rc2_benchmarks.txt is correct, this means that we have a 7% slowdown in Windows PGO builds for 3.10, which I don't think is acceptable.

I am marking this as a release blocker until there is some agreement. I remind you that if an agreement cannot be reached, you may reach the Steering Council for help reaching some conclusion.
msg402091 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 20:46
New changeset e4044e9f893350b4623677c048d33414a77edf55 by Victor Stinner in branch 'main':
bpo-45116: Py_DEBUG ignores Py_ALWAYS_INLINE (GH-28419)
https://github.com/python/cpython/commit/e4044e9f893350b4623677c048d33414a77edf55
msg402092 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 20:56
> If https://bugs.python.org/file50280/310rc2_benchmarks.txt is correct, this means that we have a 7% slowdown in Windows PGO builds for 3.10, which I don't think is acceptable.

What I understood is that the https://bugs.python.org/msg401743 patch makes 1.07x faster.

But https://bugs.python.org/issue45116#msg401143 compares Python 3.10b1 to Python 3.10a7: Python 3.10b1 is slower (32-bit: "1.07", 64-bit: "1.14": "higher the slower" wrote neonene).
msg402098 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 21:32
I created a draft PR to mark functions like Py_INCREF() and Py_IS_TYPE() with __forceinline: PR 28427.
msg402099 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-17 21:43
Can someone compare the main branch (commit e4044e9f893350b4623677c048d33414a77edf55) to the main branch + PR 28427 patch?

You can download the patch from:
https://patch-diff.githubusercontent.com/raw/python/cpython/pull/28427.patch

How can I build Python with PGO on Windows? Visual Studio has two profiles: PGinstrument and PGupdate.

I cannot find "PGO", "PGinstrument" or "PGupdate" in the https://devguide.python.org/
msg402117 - (view) Author: neonene (neonene) * Date: 2021-09-18 02:01
> (32-bit: "1.07", 64-bit: "1.14": "higher the slower" wrote neonene)

32-bit and 64-bit are in reverse. I compared b1 and a7 because this can be confirmed by anyone with official binary. If 7% of my patch has little to do with the gap, then I will be happy that 3.10 can be far faster.

>How can I build Python with PGO on Windows?

Try the following,

   PCbuild\build.bat -p x64 --no-tkinter --pgo

Before building, your object.h needs to replace
    static inline int Py_ALWAYS_INLINE
with
    static Py_ALWAYS_INLINE int 

In my case, pgo got stuck on linking with the object.h.


I'm waiting the reply from developercommunity.
msg402135 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-18 16:26
@Pablo,
> If <benchmark_link> is correct ...

For some verification, I benched pyperformance on Win10 AMD64, with the Python 3.10a7 and 3.10rc2 x64 binaries downloaded directly from python.org website release pages. The results corroborate with neonene's (please see the attached file). In short, there was a 1.09x slowdown on average.

FYI, `pyperf system tune` doesn't work on Windows. So I manually disabled turbo boost and intel speedstep, but I didn't have time to research setting core affinity and the other stabilizations. Nonetheless, most of the benches were stable.

> I am marking this as a release blocker until there is some agreement.
Got it. Setting as advised.
msg402143 - (view) Author: Ma Lin (malin) * Date: 2021-09-19 03:27
> In my case, pgo got stuck on linking with the object.h.

Me too. Since commit 28d28e0 (the first commit to slow down the PGO build), if add `__forceinline` attribute to _Py_DECREF() function in object.h, the PGO build hangs (>50 minutes).

So PR 28427 may not be a short-term solution.
msg402189 - (view) Author: Ma Lin (malin) * Date: 2021-09-20 02:37
Like OP's benchmark, if convert the inline functions to macros in object.h, the 3.10 branch is 1.03x faster, but still 1.07x slower than 28d28e0~1.
@vstinner could you prepare such a PR as a candidate fix.

There seem to be two ways to solve it in short-term.
1, Split the giant function.
2, Contact MSVC team to see if there is a quick solution, such as undocumented options.

But the release time is too close. The worst result is to release with the performance regression, and note  in the download page that there is a performance regression, if you care about performance please use 3.9.
msg402190 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-09-20 05:30
I concur with Ma Lin.
msg402217 - (view) Author: neonene (neonene) * Date: 2021-09-20 10:12
>release with the performance regression

I'm OK with the option. The limitation of PGO seems to me a bit weird and it might be unexpected for MSVC team.
msg402229 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-09-20 13:26
If we are hitting a size limit for PGO, then we need to reduce the size of _PyEval_EvalFrameDefault, to let the compiler do its job.
Force inlining stuff is not going to help.

Reverting https://github.com/python/cpython/pull/25244 for 3.10 seems to be the cleanest way to do this.

Would someone with a reliable way to test performance on Windows test the effect of https://github.com/python/cpython/pull/28475, please?

Longer term we need get PGO in MSVC working on larger functions, but I doubt that will be possible for 3.10.
msg402230 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-20 13:28
Can someone repeat the benchmarks with https://github.com/python/cpython/pull/28475 ?
msg402287 - (view) Author: neonene (neonene) * Date: 2021-09-21 06:01
I built 3.10rc2 PGO with PR28475 applied, and posted the inliner's log.
In the log, the 4-callees mentioned above are now inlined, which were "hard reject"ed before.

As for the performance, a few reporters may be needed, but it's not necessary for them to care about noises in the apparent gap.

310rc2 x64 PGO           : 1.00
 + PR28475 build1 bench1 : 1.05x faster (slower  7, faster 43, nochange  8)
                  bench2 : 1.05x faster (slower  2, faster 43, nochange 13)
           build2        : 1.05x faster (slower  4, faster 45, nochange  9)

310rc2 x64 release       : 1.00
 + PR28475               : 1.01x faster (slower 14, faster 25, nochange 19)


Is Windows involved in the faster-cpython project? If so, the project should be provided with Windows machines for validation.
msg402289 - (view) Author: Ma Lin (malin) * Date: 2021-09-21 06:47
PR28475:
64-bit build is 1.03x slower than 28d28e0~1
32-bit build is 1.04x slower than 28d28e0~1

28d28e0~1 is the last good commit.
msg402296 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-21 09:10
Like what Ma Lin and neonene have mentioned above, PR28475 recovered half of the lost performance. It's unfortunately still 4% slower than 3.10a7.

>pyperf compare_to 310a7.json 310rc2.json 310rc2patched.json

Geometric mean (versus 3.10a7)
==============

310rc2: 1.09x slower
310rc2patched: 1.04x slower

Attached pyperf benchmark comparisons file.
msg402307 - (view) Author: neonene (neonene) * Date: 2021-09-21 09:58
To be fair, the slowdowns between PR25244 and b1 seems to be an accumulation of "1.00x slower" of every commit. I don't know after b1.
msg402308 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-09-21 10:02
The only other change of any obvious significance to _PyEval_EvalFrameDefault since 3.10a7 are the changes to MATCH_MAPPING and MATCH_SEQUENCE and those make _PyEval_EvalFrameDefault smaller.

We may need to look elsewhere for the remaining ~4% performance.
msg402320 - (view) Author: neonene (neonene) * Date: 2021-09-21 15:05
PR28475 PGO is 2% slower than the patch I pasted on msg401743.
The function sizes are almost the same (+1:goto,+1:label), and there is no performance gap between release builds.

I suspect the following.

1. PGO is too sensitive to a function size at near the limit.
2. PR28475 is not fully covered by 44 tests. (msg401346)
msg402480 - (view) Author: neonene (neonene) * Date: 2021-09-23 07:35
3.10rc2 Python/ceval.c
1306: #define DISPATCH() \
1307:     { \
1308:         if (trace_info.cframe.use_tracing OR_DTRACE_LINE OR_LLTRACE) { \
1309:             goto tracing_dispatch; \

Among the 44 pgo-tests, only test_patma.TestTracing hits the condition above. On Windows, it seems that skipping it tightens the profile of PR28475 a bit. Additional tests such as test_threading(.ThreadTests.test_frame_tstate_tracing) might also cause some amount of variation or vice versa.

3.10rc2 x64 PGO    : 1.00
+ PR28475 
  with TestTracing : 1.05x faster (slow  3, fast 46, same  9)
  without          : 1.06x faster (slow  5, fast 52, same  1)

  with TestTracing : 1.00
  without          : 1.01x faster (slow 19, fast 27, same 12)

(Details: PR28475_skip1test_bench.txt)


Does test_patma.TestTracing need training for match-case performance?
msg402856 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-29 12:12
If someone wants this issue to be solved in 3.10.0 it must be resolved ASAP. I am going to start freezing the release branch in one day or two to start testing the final candidate as much as possible so this issue has 24h at max to be merged into 3.10 and cherry picked into the release branch
msg402857 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-29 12:15
I'm landing PR 28475 for now as it improves the situation from 7%  to 4% slowdown and is contained enough.
msg402858 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-29 12:15
This means that if anyone wants to pursue the 4% that is left the fix must be committed within 24 hours.
msg402864 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-29 12:42
> If someone wants this issue to be solved in 3.10.0 it must be resolved ASAP.

neonene suggested that the tracing tests for pattern matching (added in 3.10b4/rc1) caused PGO to wrongly optimize the more uncommon tracing paths in ceval. I will verify their one-line fix to disable that test for PGO and report back within the next 48 hours.

Related issue: https://bugs.python.org/issue44600
msg402867 - (view) Author: neonene (neonene) * Date: 2021-09-29 13:12
I have another fix.
msg402871 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-29 13:23
> I have another fix.

If you have another fix, please create a PR ASAN and get it reviewed and merged by a core dev in the next 24 hours, otherwise it will need to wait until 3.10.1
msg402878 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-29 14:25
Sadly, I can't reproduce the speedups OP reported from disabling test_patma.TestTracing. It's not any faster than what we have with PR28475. (See attached pyperformance).

I'm looking forward to their other fix :). Even if it comes in 3.10.1 that's still a huge win. I don't think everyone immediately upgrades when a new Python version arrives.

IMO, we should note in What's New that only for Windows, 3.10.0 has a slight slowdown. Some use cases are slower (by >10%!), while some people won't feel a thing. (Then again, maybe this is offset by the LOAD_ATTR opcache in 3.10 and we get a net zero effect?). I'll submit a PR soon if the full fix misses 3.10.0.
msg402886 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-29 15:38
> IMO, we should note in What's New that only for Windows, 3.10.0 has a slight slowdown.

I disagree. This is a regression/bug and we don't advertise "known bugs" in the what's new, the same for any other bugfix that has been delayed until 3.10.1

>  Some use cases are slower (by >10%!)

Can you still reproduce this with PR 28475?
msg402891 - (view) Author: neonene (neonene) * Date: 2021-09-29 17:10
I submitted 2 drafts in a hurry. Sorry for short explanations.
I'll add more reports.
msg402893 - (view) Author: neonene (neonene) * Date: 2021-09-29 17:33
@pablogsal
I'm OK with more effective fixes in 3.10.1 and later.

Thanks all, thanks kj and malin for many help.
msg402928 - (view) Author: Ma Lin (malin) * Date: 2021-09-30 01:14
I think this is a bug of MSVC2019, not a really regression of CPython. So changing the code of CPython is just a workaround, maybe the right direction is to prompt MSVC to fix the bug, otherwise there will be more trouble when 3.11 is released a year later.

Seeing MSVC's reply, it seems they didn't realize that it was a bug, but suggested to adjust the training samples and use `__forceinline`. They don't know `__forceinline` hangs the build process since 28d28e0.
msg402930 - (view) Author: neonene (neonene) * Date: 2021-09-30 01:48
_PyEval_EvalFrameDefault() may also need to be divided.
msg402954 - (view) Author: Ken Jin (kj) * (Python committer) Date: 2021-09-30 12:53
@Pablo
> I disagree. This is a regression/bug and we don't advertise "known bugs" in the what's new, the same for any other bugfix that has been delayed until 3.10.1

Alright, in hindsight 3.10 What's New was a bad suggestion on my part. I wonder if there's a better location for such news though.

>>  Some use cases are slower (by >10%!)
> Can you still reproduce this with PR 28475?

Yes that number is *with* PR28475. Without that PR it was worse. The second pyperformance comparison in this file is 3.10a7 vs PR28475 https://bugs.python.org/file50293/PR28475_vs_310rc2_vs_310a7.txt. Omitting python_startup (unstable on Windows) and unpack_sequence (microbenchmark):

- logging_silent: 250 ns +- 7 ns -> 291 ns +- 10 ns: 1.16x slower
- hexiom: 14.0 ms +- 0.3 ms -> 15.7 ms +- 3.0 ms: 1.12x slower
- logging_simple: 16.1 us +- 0.2 us -> 18.0 us +- 0.5 us: 1.12x slower
- nbody: 215 ms +- 7 ms -> 235 ms +- 4 ms: 1.09x slower
- logging_format: 17.8 us +- 0.3 us -> 19.4 us +- 0.5 us: 1.09x slower
- richards: 104 ms +- 6 ms -> 112 ms +- 3 ms: 1.08x slower
- xml_etree_parse: 218 ms +- 3 ms -> 235 ms +- 3 ms: 1.08x slower
- sqlalchemy_imperative: 34.5 ms +- 0.9 ms -> 37.1 ms +- 1.1 ms: 1.08x slower
- xml_etree_iterparse: 158 ms +- 2 ms -> 168 ms +- 2 ms: 1.06x slower
- pathlib: 255 ms +- 6 ms -> 271 ms +- 3 ms: 1.06x slower
- pyflate: 963 ms +- 10 ms -> 1.02 sec +- 0.02 sec: 1.06x slower
- unpickle_pure_python: 446 us +- 11 us -> 471 us +- 9 us: 1.06x slower
---- anything <= 1.05x slower is snipped since it could be noise -----

At this point I don't know if we have any quick fixes left. So maybe we should open another issue for 3.11 and consider factoring out uncommon opcodes into functions like Victor and Mark suggested. We could make use of the opcode stats the faster-cpython folks have collected https://github.com/faster-cpython/tools.
msg403403 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-10-07 13:00
Sadly the MSVC team are claiming that this isn't a bug in their compiler.
Not sure how we convince them that it is. The website rejects any attempt to reopen the issue.

How feasible would it be to use Clang or GCC on Windows?
msg403409 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-10-07 13:54
> How feasible would it be to use Clang or GCC on Windows?

clang seems to have a good Windows support and tries to the ABI compatible with MSC which is a must have to keep wheel package support (especially for the stable ABI, used by PyQt on Windows for example).

Moreover, there are ways to cross-build Python from another platform to Windows which can be convenient ;-)

I don't know the Windows ecosystem. Do people want to get VS debugger for example? Is clang compatible with the VS debugger?

See the discussion of 2014: "Status of C compilers for Python on Windows"
https://mail.python.org/archives/list/python-dev@python.org/thread/SYWDJ23AQDPWQN7HD6M6YCSGXERCHWA2/
msg403430 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-10-07 18:52
I would very much appreciate any new compiler be compatible with the 
standard Windows debuggers (windbg primarily, but I imagine most 
contributors would like it to keep working from VS).

Last I heard, clang is fine as a compiler for debugging if you use the 
MSVC linker to generate debug info, though it still isn't as complete as 
MSVC (ultimately by definition, since MSVC is the 
standard-by-implementation for this stuff). And I've got no idea 
how/whether link-time optimisation works when you mix tools, but I'd 
have to assume it doesn't.

Switching compiler may prevent me from being able to analyse crash 
reports (and by me, I mean the automated internal tools that do it for 
me), and certainly parts of the Windows build rely on MSVC-specific 
functionality right now (not in the main DLL) so we'd end up needing 
both for a full build.

Also, just to put it out there, I'm not volunteering to rewrite the 
build system :) If the steering council signs off on switching, I won't 
block it, but I have more interesting things to work on.
msg403432 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-10-07 19:03
If we know which parts of the function are critical, perhaps we should 
be designing a PGO profile that actually hits them all? The current 
profile is very arbitrary, basically just waiting for someone motivated 
enough to figure out a better one.
msg403464 - (view) Author: Ma Lin (malin) * Date: 2021-10-08 09:21
Today I tested with msvc2022-preview, `__forceinline` attribute will not hang the build.

64-bit PGO builds:

    28d28e0~1,vc2022   : baseline
    28d28e0~1+F,vc2022 : 1.02x slower  <1>
    28d28e0,vc2022     : 1.03x slower  <2>
    28d28e0+F,vc2022   : 1.03x slower
    3.10 final,vc2022  : 1.03x slower
    3.10 final+F,vc2022: 1.03x slower
    28d28e0~1,vc2019   : 1.00x slower  <3>

28d28e0~1 is the last fast commit, 28d28e0 is the first slow commit.
`+F` means add `__forceinline` attribute to all inline functions in object.h
vc2019 and vc2022 are the latest version.

<1> Forcing inline is slower.
<2> 28d28e0 is still slow, but not that much.
<3> Normally, msvc2019 and msvc2022 have the same performance.

Is it possible to write a PGO profile for 28d28e0? 
https://github.com/python/cpython/commit/28d28e053db6b69d91c2dfd579207cd8ccbc39e7

msvc2022 will be released in November this year, and maybe subsequent versions can be built with msvc2022.
msg403559 - (view) Author: neonene (neonene) * Date: 2021-10-09 23:02
PR28475 is not in the official source archive.
https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tar.xz

I'll check later whether official binary has the fix.
msg403587 - (view) Author: neonene (neonene) * Date: 2021-10-10 14:04
3.10.0 official binary is as slow as rc2.

Many files are not updated in the source archive or b494f5935c92951e75597bfe1c8b1f3112fec270, so I'm not sure if the delay is intentional or not.

We have no choice except waiting for 3.10.1.
msg403609 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-10-10 20:42
Someone whose name I don't recognize (MagzoB Mall) just changed the issue type to "crash" without explaining why. That user was created today, and has no triage permissions. Mind if I change it back? It feels like vandalism.
msg404089 - (view) Author: neonene (neonene) * Date: 2021-10-16 15:26
msg402954
>https://github.com/faster-cpython/tools

According to the suggested stats and pgomgr.exe, I experimentally moved LOAD_FAST and LOAD_CONST cases out of switch as below.

        if (opcode == LOAD_FAST) {
            ...
            DISPATCH();
        }

        if (opcode == LOAD_CONST) {
            ...
            DISPATCH();
        }

        switch (opcode) {


x64 performance results after patched (msvc2019)

Good inliner ver.
    3.10.0+    1.03x faster than before
    28d28e0~1  1.04x faster
    3.8.12     1.03x faster

Bad inliner ver. (too big evalfunc. Has msvc2022 increased the capacity?)
    3.10.0/rc2 1.00x faster
    3.11a1+    1.02x faster


It seems to me since quite a while ago the optimizer has stopped at some place after successful inlining. So the performance may be sensitive to code changes and it could be possible to detect where the optimization is aborted.

(Benchmarks: switch-case_unarranged_bench.txt)
msg406354 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-11-15 16:57
The total size of the main interpreter loop was recently reduced somewhat by an unrelated change:
 
https://github.com/python/cpython/commit/9178f533ff5ea7462a2ca22cfa67afd78dad433b

I wonder if this issue still exists?
msg406386 - (view) Author: neonene (neonene) * Date: 2021-11-16 08:06
I still have the issue in current main and PR29565 with msvc2022 (v142 or v143 toolset).
msg406407 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-11-16 14:55
Hm. If removing 26 opcodes didn't fix this, then maybe the size of _PyEval_EvalFrameDefault isn't really the issue?
msg406416 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-11-16 16:44
I'd like to know how to reproduce this. @neonene can you write down the
steps I should do to get the results you get? I have VS 2019, if I need VS
2022 I can install that.
msg406471 - (view) Author: neonene (neonene) * Date: 2021-11-17 14:26
Here are the 3 steps to reproduce with minimal pgo training. (vs2019)

1. Download the source archive of PR29565 and extract.
   https://github.com/python/cpython/archive/6a84d61c55f2e543cf5fa84522d8781a795bba33.zip

2. Apply the following patch.

==============================
--- PCbuild/build.bat
+++ PCbuild/build.bat
@@ -66 +66 @@
-set pgo_job=-m test --pgo
+set pgo_job=-c"pass"
--- PCbuild/pyproject.props
+++ PCbuild/pyproject.props
@@ -47,2 +47,3 @@
       <AdditionalOptions>/utf-8 %(AdditionalOptions)</AdditionalOptions>
+      <AdditionalOptions Condition="$(SupportPGO) and $(Configuration) == 'PGUpdate'">/d2inlinelogfull:_PyEval_EvalFrameDefault %(AdditionalOptions)</AdditionalOptions>
     </ClCompile>
==============================

3. Build [Rebuild]

   PCbuild\build --no-tkinter --pgo > build.log [-r]

   According to the inlining section in the log, any function that has one or more conditional expressions got "reject" from inliner.

   > Inlinee for function _PyEval_EvalFrameDefault 
   >  -_Py_EnsureFuncTstateNotNULL (pgo hard reject)
   >  ...
   >  _Py_INCREF (pgu decision)
   >  _Py_INCREF (pgu decision)
   >  -_Py_XDECREF (pgo hard reject)
   >  -_Py_XDECREF (pgo hard reject)
   >  -_Py_DECREF (pgo hard reject)
   >  -_Py_DECREF (pgo hard reject)
   >  ...


Profiling scores can be shown on VS2019 Command Prompt.

   pgomgr PCbuild\amd64\python311.pgd /summary [/detail] > largefile.txt

   * pgomgr.exe (or profile itself) has an issue.
     https://developercommunity.visualstudio.com/t/1560909


Unused opcodes in this training

   ROT_THREE, DUP_TOP_TWO, UNARY_POSITIVE, UNARY_NEGATIVE,
   BINARY_OP_ADD_FLOAT, UNARY_INVERT, BINARY_OP_MULTIPLY_INT,
   BINARY_OP_MULTIPLY_FLOAT, GET_LEN, MATCH_MAPPING, MATCH_SEQUENCE,
   MATCH_KEYS, LOAD_ATTR_SLOT, LOAD_METHOD_CLASS, GET_AITER, GET_ANEXT,
   BEFORE_ASYNC_WITH, END_ASYNC_FOR, STORE_ATTR_SLOT,
   STORE_ATTR_WITH_HINT, GET_YIELD_FROM_ITER, PRINT_EXPR, YIELD_FROM,
   GET_AWAITABLE, LOAD_ASSERTION_ERROR, SETUP_ANNOTATIONS, UNPACK_EX,
   DELETE_ATTR, DELETE_GLOBAL, ROT_N, COPY, DELETE_DEREF,
   LOAD_CLASSDEREF, MATCH_CLASS, SET_UPDATE, DO_TRACING

   I managed to activate inliner experimentally by removing the 36 op-cases from switch and merging/removing many macros.


Static instruction counts of _PyEval_EvalFrameDefault()

   PR29565   : 6882 (down to 4400 with above change)

   PR29482   : 7035
   PR29482~1 : 7742
   3.10.0+   : 3980 (well inlined sharing DISPATCH macro)
   3.10.0    : 5559
   3.10b1    : 5680
   3.10a7    : 4117 (well inlined)
msg406474 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-11-17 16:16
> -set pgo_job=-m test --pgo
> +set pgo_job=-c"pass"

This essentially disables PGO. You won't get anything valid or useful from analysing its results if you don't give it a somewhat reasonable profile (preferably one that exercises the interpreter loop, which "pass" does not).
msg406479 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-11-17 17:00
@neonene what's the importance of PR29565?
msg406487 - (view) Author: neonene (neonene) * Date: 2021-11-17 19:34
>This essentially disables PGO.

Thank you for the suggestion. I'll take another experimental aproach to reduce the size of 3.11 evalfunc for stronger validation.


>@neonene what's the importance of PR29565?

While we are talking about function size, I would like to use around PR29565 for consistent reporting. I think any commit is okay to reproduce the issue.

And please ignore the patch to build.bat.
msg406613 - (view) Author: neonene (neonene) * Date: 2021-11-19 18:54
In the eval-loop of PR29565, inlining seems to be enabled within about 70 op-brahches, trained with 44 tests.

log & source: ceval_PR29565_split_func.c  (not for performance)
msg407188 - (view) Author: neonene (neonene) * Date: 2021-11-28 05:03
I requested the MSVC team to reconsider the inlining issues, including __forceinline.
https://developercommunity.visualstudio.com/t/1595341


The stuck at link due to __forceinline can be avoided by completing the _Py_DECREF optimization outside _PyEval_EvalFrameDefault:

    static inline void         // no __forceinline
    _Py_DECREF_impl(...) {
        ...
    }
    static __forceinline void
    _Py_DECREF(...) {          // no conditional branch in the function
        _Py_DECREF_impl(...);    
    }


In _PyEval_EvalFrameDefault, wrapping the callees like above seems better for performance than just specifying __forceinline under the current MSVC.
msg415378 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-03-17 00:36
I can't yet confirm a regression in 3.11 (the main branch, currently) compared to 3.10. See my adventures in https://github.com/faster-cpython/ideas/discussions/315.
msg416911 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-04-06 23:30
>  -_Py_DECREF (pgo hard reject)

What exactly does "pgo hard reject" mean? I Googled it and found no hits besides this very issue.

I am trying to redefine the top three from this error log as macros, but since I still don't have stable benchmark results it's hard to know if this has any effect.
msg416950 - (view) Author: neonene (neonene) * Date: 2022-04-07 23:41
>What exactly does "pgo hard reject" mean?

In my recognition, "pgo hard reject" is based on the PGOptimizer's heuristic, "reject" is related to the probe count (hot/cold).

  https://developercommunity.visualstudio.com/t/1531987#T-N1535774


And there was a reply from MSVC team, closing the issue. MSVC won't be fixed in the near future.

  https://developercommunity.visualstudio.com/t/1595341#T-N1695626

From the reply and my investigation, 3.11 would need the following:

1. Some callsites such as tp_* pointer should not inline its fastpaths in the eval switch-case. They often conflict. Each pointer needs to be wrapped with a function or maybe _PyEval_EvalFrameDefault needs to be enclosed with "inline_depth(0)" pragma.

2. __assume(0) should be replaced with other function, inside the eval switch-case or in the inlined paths of callees. This is critical with PGO.

3. For inlining, use __forceinline / macro / const function pointer.

   MSVC's stuck can be avoided in many ways, when force-inlining in the evalloop a ton of Py_DECREF()s, unless tp_dealloc does not create a inlined callsite:

     void
     _Py_Dealloc(PyObject *op)
     {
      ...
     #pragma inline_depth(0) // effects from here, PGO accepts only 0.
         (*dealloc)(op);     // conflicts when inlined.
     }
     #pragma inline_depth()  // can be reset only outside the func.



* Virtual Call Speculation:
  https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-170#optimizations-performed-by-pgo


* The profiler runs under /GENPROFILE:PATH option, but at the big ceval-func, the optimizer merges the profiles into one like /GENPROFILE:NOPATH mode.
https://docs.microsoft.com/en-us/cpp/build/reference/genprofile-fastgenprofile-generate-profiling-instrumented-build?view=msvc-170#arguments


* __assume(0) (Py_UNREACHABLE):
  https://devblogs.microsoft.com/cppblog/visual-studio-2017-throughput-improvements-and-advice/#remove-usages-of-__assume
msg416977 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2022-04-08 11:04
> __assume(0) should be replaced with other function, inside the eval switch-case or in the inlined paths of callees. This is critical with PGO.

Out of interest, have you done other experiments confirming this? The reference linked is talking about compiler throughput (i.e. how long it takes to compile), and while it hints that using __assume(0) may interfere with other optimisations, that isn't supported with any detail or analysis in the post.
History
Date User Action Args
2022-04-11 14:59:49adminsetgithub: 89279
2022-04-08 11:04:12steve.dowersetmessages: + msg416977
2022-04-07 23:41:04neonenesetmessages: + msg416950
2022-04-07 00:09:15gvanrossumsetpull_requests: + pull_request30425
2022-04-06 23:30:14gvanrossumsetmessages: + msg416911
2022-03-17 00:36:11gvanrossumsetmessages: + msg415378
2022-02-21 04:16:47neonenesetpull_requests: + pull_request29588
2022-02-20 03:30:25neonenesetpull_requests: + pull_request29570
2021-11-28 05:03:09neonenesetmessages: + msg407188
2021-11-19 18:54:51neonenesetfiles: + ceval_PR29565_split_func.c

messages: + msg406613
2021-11-17 19:34:47neonenesetmessages: + msg406487
2021-11-17 17:00:42gvanrossumsetmessages: + msg406479
2021-11-17 16:16:42steve.dowersetmessages: + msg406474
2021-11-17 14:26:10neonenesetmessages: + msg406471
2021-11-16 16:44:41gvanrossumsetmessages: + msg406416
2021-11-16 14:55:49brandtbuchersetmessages: + msg406407
2021-11-16 08:06:23neonenesetmessages: + msg406386
2021-11-15 16:57:08brandtbuchersetmessages: + msg406354
2021-10-16 15:26:06neonenesetfiles: + switch-case_unarranged_bench.txt

messages: + msg404089
2021-10-15 15:08:28vstinnersettitle: Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build -> Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC)
2021-10-11 02:55:00gvanrossumsettype: crash -> performance
2021-10-10 20:42:46gvanrossumsetmessages: + msg403609
2021-10-10 19:30:14magzobmallsettype: performance -> crash
2021-10-10 16:25:40brandtbucher2setnosy: + brandtbucher, - brandtbucher2
2021-10-10 16:04:23brandtbucher2setnosy: + brandtbucher2
2021-10-10 14:04:40neonenesetmessages: + msg403587
2021-10-09 23:02:46neonenesetmessages: + msg403559
2021-10-08 09:21:13malinsetmessages: + msg403464
2021-10-07 19:03:01steve.dowersetmessages: + msg403432
2021-10-07 18:52:59steve.dowersetmessages: + msg403430
2021-10-07 13:54:29vstinnersetmessages: + msg403409
2021-10-07 13:00:40Mark.Shannonsetmessages: + msg403403
2021-09-30 12:53:32kjsetmessages: + msg402954
2021-09-30 01:48:59neonenesetmessages: + msg402930
2021-09-30 01:14:50malinsetmessages: + msg402928
2021-09-29 17:33:59neonenesetmessages: + msg402893
2021-09-29 17:10:14neonenesetmessages: + msg402891
2021-09-29 16:54:47neonenesetpull_requests: + pull_request27001
2021-09-29 16:54:17neonenesetpull_requests: + pull_request27000
2021-09-29 15:38:07pablogsalsetmessages: + msg402886
2021-09-29 14:25:54kjsetfiles: + 310rc2patched_vs_310rc2notrace.txt

messages: + msg402878
2021-09-29 13:23:17pablogsalsetmessages: + msg402871
2021-09-29 13:12:45neonenesetmessages: + msg402867
2021-09-29 12:42:14kjsetmessages: + msg402864
2021-09-29 12:15:55pablogsalsetpriority: release blocker ->

messages: + msg402858
2021-09-29 12:15:18pablogsalsetmessages: + msg402857
2021-09-29 12:12:37pablogsalsetmessages: + msg402856
2021-09-23 07:35:57neonenesetfiles: + PR28475_skip1test_bench.txt

messages: + msg402480
2021-09-21 15:05:39neonenesetmessages: + msg402320
2021-09-21 10:02:22Mark.Shannonsetmessages: + msg402308
2021-09-21 09:58:47neonenesetmessages: + msg402307
2021-09-21 09:10:54kjsetfiles: + PR28475_vs_310rc2_vs_310a7.txt

messages: + msg402296
2021-09-21 06:47:24malinsetmessages: + msg402289
2021-09-21 06:01:53neonenesetfiles: + PR28475_inline.log

messages: + msg402287
2021-09-20 19:59:28gvanrossumsetnosy: + gvanrossum
2021-09-20 13:28:50pablogsalsetmessages: + msg402230
2021-09-20 13:26:32Mark.Shannonsetmessages: + msg402229
2021-09-20 13:16:46Mark.Shannonsetpull_requests: + pull_request26873
2021-09-20 10:12:03neonenesetmessages: + msg402217
2021-09-20 05:30:48rhettingersetmessages: + msg402190
2021-09-20 02:37:22malinsetmessages: + msg402189
2021-09-19 08:24:22erlendaaslandsetnosy: + erlendaasland
2021-09-19 03:27:14malinsetmessages: + msg402143
2021-09-18 16:26:55kjsetpriority: high -> release blocker
files: + 310a7_vs_310rc2_bench.txt
messages: + msg402135
2021-09-18 02:01:52neonenesetmessages: + msg402117
2021-09-17 21:43:41vstinnersetmessages: + msg402099
2021-09-17 21:32:04vstinnersetmessages: + msg402098
2021-09-17 21:30:59vstinnersetpull_requests: + pull_request26837
2021-09-17 20:56:56vstinnersetmessages: + msg402092
2021-09-17 20:46:48vstinnersetmessages: + msg402091
2021-09-17 20:41:03pablogsalsetmessages: + msg402090
2021-09-17 18:32:03vstinnersetmessages: + msg402071
2021-09-17 17:24:54kjsetmessages: + msg402068
2021-09-17 17:08:01pablogsalsetmessages: + msg402067
2021-09-17 16:36:16lemburgsetmessages: + msg402065
2021-09-17 15:39:03rhettingersetnosy: + lemburg, pablogsal
messages: + msg402064
2021-09-17 15:35:25rhettingersetmessages: + msg402063
2021-09-17 14:39:45vstinnersetpull_requests: + pull_request26831
2021-09-17 12:11:52vstinnersetmessages: + msg402044
2021-09-17 12:09:28vstinnersetmessages: + msg402043
2021-09-17 11:47:58vstinnersetmessages: + msg402040
2021-09-17 09:58:33kjsetpriority: normal -> high

messages: + msg402025
2021-09-16 16:42:54steve.dowersetmessages: + msg401972
2021-09-16 16:26:53rhettingersetmessages: + msg401970
2021-09-16 16:08:11vstinnersetstage: patch review
pull_requests: + pull_request26803
2021-09-16 15:44:50neonenesetmessages: + msg401964
2021-09-16 15:14:42kjsetnosy: + kj
2021-09-13 23:37:47neonenesetfiles: + 310rc2_benchmarks.txt

messages: + msg401743
2021-09-13 09:29:46vstinnersettitle: Performance regression 3.10b1 and later on Windows -> Performance regression 3.10b1 and later on Windows: Py_DECREF() not inlined in PGO build
2021-09-11 03:30:27neonenesetmessages: + msg401628
2021-09-11 02:45:22malinsetmessages: + msg401624
2021-09-11 02:17:57neonenesetfiles: + x64_b98e.log
2021-09-11 02:17:40neonenesetfiles: + x64_28d2.log
2021-09-11 02:17:00neonenesetfiles: + pyproject_inlinestat.patch
2021-09-11 02:16:06neonenesetfiles: + b98e-no-inline-in-the-others.diff
2021-09-11 02:15:52neonenesetfiles: + b98e-no-inline-in-eval.diff
2021-09-11 02:15:36neonenesetfiles: + b98e-no-inline-in-all.diff

messages: + msg401623
2021-09-08 09:48:24malinsetnosy: + malin
messages: + msg401364
2021-09-07 21:57:07steve.dowersetmessages: + msg401346
2021-09-07 19:15:36vstinnersetmessages: + msg401329
2021-09-07 18:13:04neonenesetmessages: + msg401319
2021-09-06 21:52:54vstinnersetmessages: + msg401183
2021-09-06 21:49:38vstinnersetmessages: + msg401182
2021-09-06 16:43:19rhettingersetnosy: + rhettinger
messages: + msg401154
2021-09-06 16:35:18vstinnersetmessages: + msg401152
2021-09-06 16:07:30pablogsalsetnosy: - pablogsal
2021-09-06 15:29:03neonenesetfiles: + ceval_310rc1_patched.c
2021-09-06 15:27:18neonenecreate