classification
Title: Inline PyEval_EvalFrameEx() in callers
Type: performance Stage:
Components: Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: brett.cannon, dino.viehland, haypo, serhiy.storchaka, steve.dower
Priority: normal Keywords: patch

Created on 2016-12-09 18:00 by haypo, last changed 2016-12-15 14:37 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
inline_evalframeex.patch haypo, 2016-12-09 18:00 review
Messages (11)
msg282794 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-09 18:00
Inline PyEval_EvalFrameEx() in callers. The PEP 523 modified PyEval_EvalFrameEx(): it's now an indirection to interp->eval_frame().

Inline the call in performance critical code. Leave PyEval_EvalFrame() unchanged, this function is only kept for backward compatibility (and so not important for performance).

I pushed directly my change as the revision 99c34e47348b, but it broke test_gdb. So now I doubt that it's 100% "safe" to inline the code. Outside test_gdb, does something else rely on PyEval_EvalFrameEx()? So I chose to open an issue to discuss the change. By "something", I'm thinking to Pyjion :-)

Attached patch updates also python-gdb.py to fix test_gdb.
msg282868 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2016-12-10 19:50
Inlining wouldn't break Pyjion since all of its smarts would be in the trampoline function in PyInterpreterState. It may break other debuggers like Python Tools for Visual Studio, though (+steve.dower for that).

But is the overhead honestly that high to warrant inlining? What kind of perf gain do you see from doing this? My worry is that if the perf isn't that much better that inlining will simply make it harder to tweak that function in the future.
msg282919 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-12-11 13:53
MSVC automatically inlines it on Windows (and yes, it broke some of our debugging support in Visual Studio, but we can fix it by setting the eval func).

IMHO, inlining is best left to profiling optimizers. If you notice a regression, add a test case that drives it to the point where it gets inlined. But it's not worth reducing maintainability of the code base for this.
msg283249 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-15 08:03
> It may break other debuggers like Python Tools for Visual Studio, though (+steve.dower for that).

Do you mean debuggers expecting that bytecode is run in PyEval_EvalFrameEx() rather than PyEval_EvalFrameDefault()? Can't we fix these debuggers?


> IMHO, inlining is best left to profiling optimizers.

The problem is that PGO compilation is not used by everyone yet. The patch is a minor enhancement to make sure that regular builds are also fast.
msg283252 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-15 08:27
May be just replace inlined call with _PyEval_EvalFrameDefault?
msg283253 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-15 08:28
Serhiy: "May be just replace inlined call with _PyEval_EvalFrameDefault?"

Do you mean replacing PyEval_EvalFrameEx() with _PyEval_EvalFrameDefault()? It would defeat the purpose of the PEP 523, no?
msg283258 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-15 08:40
Ah, now I see. I was confused by the fact that eval_frame is set only to _PyEval_EvalFrameDefault.

But how large the gain of inlining PyEval_EvalFrameEx()? Is it worth cluttering the code?

Since almost all calls of PyEval_EvalFrameEx() are in the same file as its definition, I suppose the compiler can inline it even without PGO.
msg283263 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-15 09:17
Serhiy: "Since almost all calls of PyEval_EvalFrameEx() are in the same file as its definition, I suppose the compiler can inline it even without PGO."

Let me check. My patch changes gen_send_ex() of genobject.c and _PyEval_EvalCodeWithName() of ceval.c. "gcc -O3" is able to inline PyEval_EvalFrameEx() in the same file (ceval.c), but not in a different file (genobject.c).

My patch also avoids an useless call to PyThreadState_GET()... Well, I'm not sure that it's worth it. I propose the change because the indirection now looks completely useless to me, but I didn't expect that it could break anything. With my patch, python-gdb.py is fixed. Steve says that Visual Studio is already fixed as well. So I don't know, it's up to you ;-)

--

gen_send_ex().

C code:
188	    gen->gi_running = 1;
189	    result = PyEval_EvalFrameEx(f, exc);
190	    gen->gi_running = 0;

x86_64 assembler (gcc -03):
=> 0x000000000047fe37 <gen_iternext+103>:	movb   $0x1,0x18(%rbp)
   0x000000000047fe3b <gen_iternext+107>:	callq  0x54b870 <PyEval_EvalFrameEx>
   0x000000000047fe40 <gen_iternext+112>:	movb   $0x0,0x18(%rbp)
   0x000000000047fe44 <gen_iternext+116>:	mov    %rax,%r12
   0x000000000047fe47 <gen_iternext+119>:	mov    0x18(%rbx),%rdi

I still see the call to PyEval_EvalFrameEx().

--

_PyEval_EvalCodeWithName().

C code:
4169        retval = PyEval_EvalFrameEx(f,0);

x86_64 assembler (gcc -03):
=> 0x000000000054ae30 <+2544>:  mov    0x3827f9(%rip),%rax        # 0x8cd630 <_PyThreadState_Current>
   0x000000000054ae37 <+2551>:  xor    %esi,%esi
   0x000000000054ae39 <+2553>:  mov    0x40(%rsp),%rdi
   0x000000000054ae3e <+2558>:  mov    0x10(%rax),%rax
   0x000000000054ae42 <+2562>:  callq  *0x70(%rax)

In this file, PyEval_EvalFrameEx() is inlined. I understand that the first instruction is the PyThreadState_GET() call which is an atomic read (single x86 instruction). My patch avoids the useless PyThreadState_GET(), since tstate must not change.
msg283310 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-12-15 14:27
At least define EvalFrameEx in a header as an inline func, rather than copying the body.

VS expected to walk the native stack and locate the f parameter in EvalFrameEx. Since the function gets inlined, it couldn't find the frame. I use the JIT hook to insert my own frame onto the stack which is easier to find, so yes, the debugger has been fixed, yes the function gets inlined, and yes this has an impact other than performance.
msg283311 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-15 14:35
> At least define EvalFrameEx in a header as an inline func, rather than copying the body.

Hum, it would break the stable ABI, no?
msg283312 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-12-15 14:37
Well, since the impact on performance is likely non-existent, whereas drawbacks are real, I close the issue.

Thanks for your feedback ;-)
History
Date User Action Args
2016-12-15 14:37:06hayposetstatus: open -> closed
resolution: rejected
messages: + msg283312
2016-12-15 14:35:39hayposetmessages: + msg283311
2016-12-15 14:27:35steve.dowersetmessages: + msg283310
2016-12-15 09:17:05hayposetmessages: + msg283263
2016-12-15 08:40:35serhiy.storchakasetmessages: + msg283258
2016-12-15 08:28:41hayposetmessages: + msg283253
2016-12-15 08:27:23serhiy.storchakasetmessages: + msg283252
2016-12-15 08:03:01hayposetmessages: + msg283249
2016-12-11 13:53:57steve.dowersetmessages: + msg282919
2016-12-10 19:50:34brett.cannonsetnosy: + dino.viehland, steve.dower
messages: + msg282868
2016-12-09 18:00:50haypocreate