AST Optimization: inlining of function calls #54608
In msg#120541 of issue#1346238, Raymond suggested that we "aim high", so here goes... I'm opening this as a separate bug since it's a very different approach to the patches in that bug; I'm adding the people on the nosy list for that bug. Sorry in advance about the epic length of this comment.

I've been experimenting with AST optimizations, and have a (very) primitive implementation of function-call inlining. As is, it's deeply flawed, but in the spirit of "Release Early, Release Often" I thought I'd post what I have so far, and try to enumerate the rest of the work that would need doing to get this into, say, Python 3.3.

The attached patch adds an AST optimization pass to Python/compiler.c. It does this by adding an __optimizer__.py module: the compiler converts the AST to a Python representation using ast2mod, and calls out to __optimizer__.optimize_ast(). This can then (potentially) apply a series of manipulations to the tree. The result is then converted back from Python objects to ast objects, and compilation proceeds as normal on the modified AST tree. I initially experimented with a .c implementation, but moving to .py makes things _much_ easier to develop and debug. In particular, I'm using graphviz's "dot" tool to generate before/after visualizations of the AST.

The optimizer doesn't try to optimize itself (or anything that it imports), to avoid infinite recursion when we have a cold .pyo cache.

Currently I'm doing the AST optimization before symbol table generation. This means that the inlining is deeply flawed, as it has no knowledge of the scope of names. A robust implementation would compare the scope of the call site with that of the function body, and remap locals accordingly. The current implementation renames all name references in the function body, which is clearly wrong for e.g. references to globals. See below for notes on that.

Here's my test code::

Here's what it compiles to after going through the inliner (clearly, line numbering needs some work). Note the removal of the CALL_FUNCTION of our target call site; the remaining CALL_FUNCTION is to "print"::

      4           9 LOAD_CONST               1 (3)
    259          27 LOAD_CONST               4 (2)
    260          45 LOAD_NAME                5 (print)

The idea is that a further optimization pass would go through and elide the unnecessary store-to-local/load-from-local instructions, followed by const folding, getting this down to:

The biggest issue here is dealing with runtime differences between the function that was inlined and the function that is actually being called, whether via monkeypatching or in method calls: we can inline intra-class method calls, but we have to cope with those methods being overridden in subclasses. Thinking aloud about a way of solving this (this isn't implemented yet): where you have some interpretation of a name (e.g. "self.do_something"), carry two different implementations, so that e.g.:

Similarly, this would need a new bytecode, say "JUMP_IF_SPECIALIZABLE"::
    ; method self.bar is now on top of stack
    label_inline_body:
    label_after:
    ; etc
    JUMP_IF_SPECIALIZABLE
    Action:

I'm not sure if this covers all cases (e.g. specializing a CFunction such as the builtins), or if this is a sane approach that actually works; we'd need to have a standard interpretation of what a name lookup means, and to inject a LOAD_CONST of that value into the bytecode, for use by the JUMP_IF_SPECIALIZABLE operation.

As I understand it, functions are "name-resolved" before the arguments are evaluated, so if argument evaluation leads to the name being bound to a different function (as a side effect), the status quo in CPython is that the old function is used for that call; JUMP_IF_SPECIALIZABLE would preserve that behavior. It could also slightly slow down overridden method calls, as it would add an "am I overridden" test that would generally take the "overridden" branch; perhaps we could detect classes with subclasses in the same AST and suppress inlining for this case. Similarly, it could spot the

Other issues:
Notes:
Thanks to Red Hat for giving me time to experiment on this.
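To make the shape of such an optimize_ast() pass concrete, here is a toy, self-contained sketch of call inlining over the stdlib ast module. It handles only functions whose body is a single return expression, ignores all of the scoping and rebinding issues described above, and shares no code with the patch; all names are illustrative (ast.unparse needs Python 3.9+)::

    import ast

    class SimpleInliner(ast.NodeTransformer):
        """Toy inliner: rewrite calls to module-level functions whose body
        is a single `return <expr>` into that expression, substituting the
        argument expressions for the parameter names. Deliberately ignores
        scoping, globals, *args/keywords, and rebinding; those are exactly
        the hard parts discussed above."""

        def __init__(self, functions):
            self.functions = functions  # name -> ast.FunctionDef

        def visit_Call(self, node):
            self.generic_visit(node)
            target = self.functions.get(getattr(node.func, "id", None))
            if (target is not None
                    and len(target.body) == 1
                    and isinstance(target.body[0], ast.Return)
                    and not node.keywords
                    and len(node.args) == len(target.args.args)):
                mapping = {p.arg: a for p, a in
                           zip(target.args.args, node.args)}
                # Work on a fresh copy of the body expression so the
                # original FunctionDef is left untouched.
                body = ast.parse(ast.unparse(target.body[0].value),
                                 mode="eval").body

                class Substitute(ast.NodeTransformer):
                    def visit_Name(self, n):
                        return mapping.get(n.id, n)

                return ast.copy_location(Substitute().visit(body), node)
            return node

    src = "def double(x):\n    return x + x\n\ny = double(3)\n"
    tree = ast.parse(src)
    funcs = {f.name: f for f in tree.body
             if isinstance(f, ast.FunctionDef)}
    tree = ast.fix_missing_locations(SimpleInliner(funcs).visit(tree))
    print(ast.unparse(tree))  # the call site becomes: y = 3 + 3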
> As I understand it, functions are "name-resolved" before the arguments are evaluated

There is no difference between the resolution of function names and of other names. For example, global names (LOAD_GLOBAL) are resolved entirely at runtime (just before the function gets called). The specialization issue means this would cooperate well with a globals cache (I've got a lingering patch for this; will post it in a separate issue).
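A small self-contained demonstration of both points (illustrative only, nothing to do with the patches here): the global name is looked up at call time, but before the argument expressions run::

    def old():
        return "old"

    def new():
        return "new"

    def rebind():
        # Side effect: rebind the global name `f` during argument evaluation.
        global f
        f = new
        return ()

    f = old
    print(f(*rebind()))  # "old": f was loaded (LOAD_GLOBAL) before rebind() ran
    print(f())           # "new": the next lookup sees the new binding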
Globals cache patch posted on bpo-10401.
Just a thought: it's possible to cover all the cases properly with inlining (locals, globals (even with cross-module inlining), tracebacks, sys._getframe(), etc.); however, I think you're going to find that manually managing all of those will quickly drive you up a tree if you also want meaningful speedups (from removing frame allocation, argument parsing, etc.). My 2¢.
While I have nothing to say directly about the inline optimization, I do have some things to say about moving to AST optimizations. First, doing it in Python is a good thing. It not only makes prototyping easier, but it allows other VMs to use the optimizations without having to re-implement them themselves. Second, the symtable pass does need to eventually get exposed (most likely as an optional pass one can do to an AST); I am actually in the middle of an AST-heavy project that will end up wanting the symbol table info as well. Third, for that Graphviz output, was anything special required? If so, I would toss the code into Tools for others to benefit from.
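For what it's worth, the stdlib's symtable module already exposes read-only symbol table information; a minimal sketch of querying scope info for a function (independent of the patch here)::

    import symtable

    src = "def callsite():\n    x = 1\n    print(x, g)\n"
    mod = symtable.symtable(src, "<example>", "exec")
    func = mod.get_children()[0]  # the table for callsite()
    for name in sorted(func.get_identifiers()):
        sym = func.lookup(name)
        # 'x' is local; 'g' and 'print' resolve outside the function
        print(name, "local" if sym.is_local() else "global/free")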
No, it's rather too Linux- and tool-specific to go into ast.py. But adding it to the Tools/ directory makes sense.
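For illustration, a rough sketch of the kind of helper involved: dumping an AST as Graphviz DOT text, suitable for rendering with `dot -Tpng` (hypothetical code, not the actual script from my tree)::

    import ast
    from itertools import count

    def ast_to_dot(tree):
        """Emit a Graphviz DOT description of an AST: one node per AST
        node, one edge per parent/child relationship."""
        counter = count()
        lines = ["digraph ast {"]

        def walk(node):
            my_id = next(counter)
            lines.append('  n%d [label="%s"];' % (my_id, type(node).__name__))
            for child in ast.iter_child_nodes(node):
                lines.append("  n%d -> n%d;" % (my_id, walk(child)))
            return my_id

        walk(tree)
        lines.append("}")
        return "\n".join(lines)

    print(ast_to_dot(ast.parse("print(1 + 2)")))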
Sorry again for another epic-length comment... I'm attaching the latest work in progress on this. The code is still fairly messy (embarrassingly so in places), but it's better to have it out in public in this tracker than stuck on my hard drive.

Symbol tables
Specialization

I've added a new AST node: "Specialize". It's an expression, but contains statements. The idea is that:
I've added a JUMP_IF_SPECIALIZABLE opcode, mostly as described in the initial comment. This allows us to implement "specialized" bytecode that makes an assumption, but has a fallback ("generalized") for the case when the assumption fails. To wire this up for function inlining, we need some way of recording the "original" version of a function versus what it's now bound to. For now, I'm simply adding an assignment after each function def, equivalent to::

    def foo():
        print('hello world')
    __saved__foo = foo

    def callsite():
        foo()

and the "Specialize" in "callsite" is compiled to::

    LOAD_GLOBAL (foo)    ; dynamic lookup
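In plain Python terms, the specialized call site behaves roughly like the following hand-written sketch. It illustrates the intended semantics only; the real thing is bytecode using JUMP_IF_SPECIALIZABLE, not a Python-level if::

    def foo():
        print('hello world')

    __saved__foo = foo  # snapshot of the binding at def time

    def callsite():
        # Take the inlined body only while `foo` is still the function
        # that was inlined; otherwise fall back to a normal call.
        if foo is __saved__foo:
            print('hello world')  # inlined body of foo()
        else:
            foo()                 # generalized fallback

    callsite()                        # fast path: prints 'hello world'
    foo = lambda: print('patched')
    callsite()                        # monkeypatched: takes the fallback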
Additional optimizations

These combine nicely with the function inlining. For example, given this code (from "test_copy_propagation")::

then "callsite" compiles down to this bytecode::

      5           0 LOAD_GLOBAL              0 (print)

Note how the specialized inline implementation is able to do:
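As a standalone illustration of the constant-folding side of this, here is a minimal folding pass over the stdlib ast module (a sketch, not the patch's implementation; it handles only a few arithmetic operators)::

    import ast
    import operator

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul}

    class ConstantFolder(ast.NodeTransformer):
        def visit_BinOp(self, node):
            self.generic_visit(node)  # fold innermost expressions first
            if (isinstance(node.left, ast.Constant)
                    and isinstance(node.right, ast.Constant)
                    and type(node.op) in OPS):
                try:
                    value = OPS[type(node.op)](node.left.value,
                                               node.right.value)
                except Exception:
                    return node  # e.g. unsupported operand types
                return ast.copy_location(ast.Constant(value), node)
            return node

    tree = ast.parse("print(3 * (2 + 5))")
    tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
    print(ast.unparse(tree))  # print(21)   (ast.unparse needs Python 3.9+)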
Other stuff

Correctness

Performance

The optimizer is currently slow, and adds a noticeable delay when compiling a .py file. Potentially it could be optimized. Alternatively, we could gain a command-line flag to tell Python to more heavily optimize its bytecode. This is particularly relevant with my downstream-distributor hat on: in Fedora and RHEL we ship hundreds of Python packages (and .pyc/.pyo) via RPM, and it would be simple to turn up the optimizations when we do a build.

bm_call_simple.py gets significantly faster: a 49% speedup (albeit for a do-nothing benchmark, and this isn't counting the start-up time)::
Other benchmarks show no change, or slight slowdowns (or crashes). I hope that this will improve upon implementation of method-call inlining.
If this is a work in progress, you could create an SVN branch in the
There are a couple of places where you mention not doing the optimization when specific functions are used (e.g. dir, globals, locals). How exactly do you verify that, given that those functions could be bound to any name?
Thanks for reading through this.
I don't: I'm only looking at explicit references to these global names; I haven't attempted to track these at the object level, just at the name level. This means that if someone is passing these around bound to a different name, the results could be "wrong": the optimizations that I'm doing synthesize new local and global variables (in each case with a '__' prefix), and sometimes eliminate local variables. However, it should work as expected for the simple and common case of code like this::

    def foo():
        ...
        pprint(locals())
        ...

since the reference to "locals" can be seen at the name level. But it won't work for something like this::

    def inlined_fn(a):
        print(a())

    def not_inlinable_for_some_reason(b):
        inlined_fn(b)  # assume this callsite is inlined

    def some_other_fn_perhaps_in_another_module():
        hidden_ref_to_locals = locals
        not_inlinable_for_some_reason(hidden_ref_to_locals)

in that (if inlining happens) the local printed will be named "b" rather than "a". But this seems to me like a contrived example: how often in real code do people pass these builtins around, rather than calling them directly? I myself use them for debugging, but I only ever call them directly.

At a high level, what I'm interested in is providing additional benefits for Python 3 relative to Python 2 (to encourage migration), and I picked speed-ups. I think it's reasonable to provide some kind of "optimized" mode that is allowed to take (limited) liberties with the language. Note that with other languages, people seem to accept some mild degradation of debuggability in return for compiler optimizations. I'd like to be able to provide a faster Python under similar kinds of tradeoffs. Currently, Python's optimized mode doesn't seem to do much (just compile away assertions, IIRC). Perhaps these optimizations could be associated with the pre-existing optimizer flag; alternatively, perhaps a third level could be provided? This seems like a PEP-level request, but here's an outline of my thoughts: I'd like the highly-optimized mode to be permitted the following::
I believe that the above covers all of the places where my patch is taking liberties with existing Python semantics (modulo bugs): I'm still supporting runtime rebinding of other names, and (I believe) preserving existing behavior. (Please let me know if I'm missing something here!) There seems to be some precedent for this:
and for locals() it warns:
Does this sound sane? (Obviously, actually approving this would be a matter for the BDFL.) Perhaps all synthesized vars could have a "__internal__" prefix to signify that they should be left alone.

The other PEP-level request would be for the highly-optimized mode to be permitted to take human-noticeable time to generate its bytecode (rather than being "instantaneous"). I don't know if that's needed yet: the inliner is currently just a prototype, and I don't have any heuristics yet for deciding whether to inline a callsite (beyond an arbitrary cutoff I put in to stop bm_simple_call from "exploding"); I'm probably introducing various O(N^2) time/space behaviors (and worse) right now.
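For illustration, a minimal sketch of the two checks mentioned above: the name-level test for frame-introspecting builtins, plus an arbitrary size cutoff (the helper names and the threshold here are made up)::

    import ast

    FRAME_BUILTINS = {"locals", "globals", "vars", "dir", "eval", "exec"}

    def mentions_frame_builtin(func_def):
        # Name-level check only: a hidden alias such as
        # `hidden_ref_to_locals = locals` in another scope is NOT caught,
        # exactly the limitation described above.
        return any(isinstance(n, ast.Name) and n.id in FRAME_BUILTINS
                   for n in ast.walk(func_def))

    def should_inline(func_def, max_nodes=50):
        # Arbitrary cutoff to keep inlining from exploding code size.
        return (not mentions_frame_builtin(func_def)
                and sum(1 for _ in ast.walk(func_def)) <= max_nodes)

    tree = ast.parse("def foo():\n    pprint(locals())")
    print(should_inline(tree.body[0]))  # False: 'locals' appears by name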
There is some precedent for allowing minor differences in -O mode. I think we should push this to see how practical we can make it.
Good idea; I've created branch "dmalcolm-ast-optimization-branch" (of
py3k-ast-pyoptimize-2010-11-19-006.patch fixed up and committed to the branch as r86715; I'll work on that branch for the time being.
From experience developing PyPy, every argument that goes "this theoretically breaks obscure code, but who writes it that way?" is inherently broken: there *is* code out there that uses any and all Python strangenesses. The only trade-off you can make is in how much existing code you are going to break; the alternative is to make absolutely sure that you don't change semantics in any case.
"Can we resurrect this, perhaps by taking it up on python-dev?" I created a new "FAT Python" project to reimplement such kind of optimizations with a similar design (similar but different ;-)): |
This looks like an interesting research project that was abandoned 12 years ago. Is there any point in leaving the issue open?
I don't think so.
I closed the issue. CPython's code has evolved a lot since 2010, and the optimization work in Python 3.11 changed ceval.c significantly.