Author dmalcolm
Recipients alex, belopolsky, benjamin.peterson, dmalcolm, jhylton, nnorwitz, rhettinger, sdahlbac, thomaslee, titanstar
Date 2010-11-12.20:16:48
In msg#120541 of issue#1346238, Raymond suggested that we "aim high", so here goes...

I'm opening this as a separate bug as it's a very different approach to the patches in that bug; adding those on the nosy list for that bug.  Sorry in advance about the epic length of this comment.

I've been experimenting with AST optimizations, and have a (very) primitive implementation of function-call inlining.  As is, it's deeply flawed, but in the spirit of "Release Early, Release Often" I thought I'd post what I have so far, and try to enumerate the rest of the work that would need doing to get this into, say, Python 3.3.

The attached patch adds an AST optimization pass to Python/compiler.c.  It does this by adding a new module: the compiler converts the AST to a Python representation using ast2mod and calls out to __optimizer__.optimize_ast().  This can then (potentially) apply a series of manipulations to the tree.  The result is then converted back from Python objects to C ast objects, and compilation proceeds as normal on the modified AST tree.
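
To make the shape of the hook concrete, here's roughly what I mean (the module/function names are as described above; the body and pass list are illustrative, not necessarily what's in the patch):

    # Lib/__optimizer__.py (sketch)
    ALL_PASSES = []   # e.g. [InlineSmallFunctions(), PropagateConstLocals()]

    def optimize_ast(tree):
        # 'tree' is the Python-level mirror of the C AST built by ast2mod;
        # each pass rewrites it, and the compiler converts the result back
        # to C ast objects afterwards.
        for optimization_pass in ALL_PASSES:
            tree = optimization_pass.visit(tree)
        return tree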

I initially was experimenting with a .c implementation, but moving to .py makes things _much_ easier to develop and debug.  In particular, I'm using graphviz's "dot" tool to generate before/after visualizations of the AST.
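
The visualization doesn't need much machinery; a minimal version of the dot-emitter (not the patch's actual tool, just the idea) looks like:

    import ast
    from itertools import count

    def ast_to_dot(tree):
        # Emit a graphviz "dot" digraph with one node per AST node,
        # for rendering before/after snapshots of a tree.
        counter = count()
        lines = ['digraph ast {']
        def visit(node):
            my_id = next(counter)
            lines.append('  n%d [label="%s"];' % (my_id, type(node).__name__))
            for child in ast.iter_child_nodes(node):
                lines.append('  n%d -> n%d;' % (my_id, visit(child)))
            return my_id
        visit(tree)
        lines.append('}')
        return '\n'.join(lines)

    print(ast_to_dot(ast.parse("print((2 * x * y) + z)")))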

The optimizer doesn't try to optimize itself (or anything that it imports), to avoid infinite recursion when we have a cold .pyo cache.

Currently I'm doing the AST optimization before symbol table generation.  This means that the inlining is deeply flawed, as it has no knowledge of the scope of names.  A robust implementation would compare the scopes of the callsite and that of the function body, and remap locals accordingly.  The current implementation renames all name references in the function body, which is clearly wrong for e.g. references to globals.  See below for notes on that.
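
A sketch of what the scope-aware version might look like, assuming we can get at the inlined function's symtable entry (the class name and prefix argument here are invented):

    import ast

    class RenameLocals(ast.NodeTransformer):
        # Rename only names the symbol table says are local to the
        # inlined function; globals and free variables keep their names.
        def __init__(self, func_symtable, prefix):
            self.local_names = set(func_symtable.get_locals())
            self.prefix = prefix   # e.g. '__inline1f22840__'

        def visit_Name(self, node):
            if node.id in self.local_names:
                node.id = self.prefix + node.id
            return node

    # usage sketch:
    #   mod = symtable.symtable(src, '<src>', 'exec')
    #   func = mod.lookup('function_to_be_inlined').get_namespace()
    #   new_body = [RenameLocals(func, '__inline1f22840__').visit(stmt)
    #               for stmt in funcdef_ast.body]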

Here's my test code:
    def function_to_be_inlined(x, y, z):
        return (2 * x * y) + z
    print(function_to_be_inlined(3, 4, 5))

Here's what it compiles to after going through the inliner (clearly, line numbering needs some work).  Note the removal of the CALL_FUNCTION of our target call site; the remaining CALL_FUNCTION is to "print":
  2           0 LOAD_CONST               0 (<code object function_to_be_inlined at 0x1e9b998, file "<dis>", line 2>) 
              3 MAKE_FUNCTION            0 
              6 STORE_NAME               0 (function_to_be_inlined) 

  4           9 LOAD_CONST               1 (3) 
             12 STORE_NAME               1 (__inline1f22840__x) 
             15 LOAD_CONST               2 (4) 
             18 STORE_NAME               2 (__inline1f22840__y) 
             21 LOAD_CONST               3 (5) 
             24 STORE_NAME               3 (__inline1f22840__z) 

259          27 LOAD_CONST               4 (2) 
             30 LOAD_NAME                1 (__inline1f22840__x) 
             33 BINARY_MULTIPLY      
             34 LOAD_NAME                2 (__inline1f22840__y) 
             37 BINARY_MULTIPLY      
             38 LOAD_NAME                3 (__inline1f22840__z) 
             41 BINARY_ADD           
             42 STORE_NAME               4 (__inline1f22840____returnval__) 

260          45 LOAD_NAME                5 (print) 
             48 LOAD_NAME                4 (__inline1f22840____returnval__) 
             51 CALL_FUNCTION            1 
             54 POP_TOP              
             55 LOAD_CONST               5 (None) 
             58 RETURN_VALUE         

The idea is that a further optimization pass would go through and elide the unnecessary store-to-local/load-from-local instructions, followed by const folding, getting this down to:
   LOAD_CONST (<code object>)
   STORE_NAME (function_to_be_inlined)
; inlined callsite:
   LOAD_NAME  (print)
   LOAD_CONST (29)
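
At the AST level, that pass could be as simple as the following sketch (names invented; it assumes the inliner only synthesizes single-assignment locals bound to constants):

    import ast

    class PropagateConstLocals(ast.NodeTransformer):
        # Replace loads of the synthesized __inline...__ locals with the
        # constants stored into them; the stores then become dead code,
        # and ordinary constant folding can finish the job (here, to 29).
        def __init__(self, bindings):
            self.bindings = bindings   # e.g. {'__inline1f22840__x': 3, ...}

        def visit_Name(self, node):
            if isinstance(node.ctx, ast.Load) and node.id in self.bindings:
                return ast.copy_location(ast.Num(n=self.bindings[node.id]), node)
            return node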

The biggest issue here is the possibility of a runtime mismatch between the function that was inlined and the function that would actually have been called, whether through monkeypatching or through method dispatch: we can inline intra-class method calls, but we have to cope with those methods being overridden in subclasses.

Thinking aloud about a way of solving this (not implemented yet):
   add to AST:
     Specialize(expr name,
                stmt* specialized,
                stmt* generalized)

where you have some interpretation of name (e.g. "self.do_something"), and carry two different implementations.

  so that e.g. in
  class Something:
     def foo(self, x, y, z):
           ... # some code
           self.bar(x, y, z)
           ... # more code

  the call site, which parses to:
     Call(Attribute(Name(id='self'), attr='bar'), args=[etc])
  could be rewritten by the optimizer to:
     Specialize(name=Attribute(Name(id='self'), attr='bar'),
                specialized=[inlined implementation of "bar" for "self"],
                generalized=[Call(Attribute(Name(id='self'), attr='bar'), args=[etc])])

Similarly, we would need a new bytecode, say "JUMP_IF_SPECIALIZABLE":

      LOAD_NAME (self)
      GET_ATTR  ('bar')
  ; method is now on top of stack
      LOAD_CONST (<code object for the regular implementation of "bar">)
      JUMP_IF_SPECIALIZABLE -> label_inline_body
  ; Start of non-inlined call
      eval and push args
      CALL_FUNCTION
      JUMP_ABSOLUTE -> label_after
  ; ...end of non-inlined call; return value is top-of-stack

  label_inline_body:
      eval args (potentially optimizing with knowledge of what's to come)
      inlined call
      push "return value"
  ; ...end of inlined call; "return value" is top-of-stack

  label_after:
   ; etc

TOS    : A: code object we inlined for (via a LOAD_CONST, injected by the inliner)
TOS -1 : B: function object via runtime lookup (LOAD_GLOBAL, or e.g. LOAD_LOCAL "self"; GET_ATTR "do_something")

  if B's __code__ is A, we can specialize:
     PC += oparg
  otherwise:
     PC += 1
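
In Python-ish pseudocode (the real thing would be a case in ceval.c's switch; the frame object and its stack handling here are my guess):

    def JUMP_IF_SPECIALIZABLE(frame, oparg):
        a = frame.pop()    # A: code object injected by the inliner's LOAD_CONST
        b = frame.top()    # B: callable found by the runtime lookup
        if getattr(b, '__code__', None) is a:
            frame.pc += oparg    # take the specialized (inlined) path
        else:
            frame.pc += 1        # fall through to the generic call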

I'm not sure if this covers all cases (e.g. specializing a CFunction such as the builtins), or if this is a sane approach that actually works; we'd need a standard interpretation of what a name lookup means, and to inject a LOAD_CONST of that value into the bytecode for use by the JUMP_IF_SPECIALIZABLE operation.

As I understand it, functions are "name-resolved" before their arguments are evaluated, so if argument evaluation rebinds the name to a different function as a side effect, the status quo in CPython is that the old function is used for that call; JUMP_IF_SPECIALIZABLE would preserve that behavior.
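
A quick demonstration of that existing behavior (this prints "old: 42", showing that rebinding the name during argument evaluation doesn't affect the in-flight call):

    def f(x):
        return "old: %r" % x

    def rebind_f():
        global f
        f = lambda x: "new: %r" % x   # rebind during argument evaluation
        return 42

    print(f(rebind_f()))   # f was resolved before the args ran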

It could also slightly slow down overridden method calls, since it adds an "am I overridden?" test that would generally take the "overridden" branch; perhaps we could detect classes that have subclasses within the same AST and suppress inlining in that case.  Similarly, it could spot the
    def override_me(self):
         raise NotImplementedError
pattern and not try to inline.
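
Detecting that pattern in the AST is straightforward; something like this heuristic sketch (function name invented):

    import ast

    def looks_like_abstract_stub(funcdef):
        # A method whose body is just "raise NotImplementedError",
        # optionally after a docstring, probably exists to be
        # overridden, so skip inlining it.
        body = list(funcdef.body)
        if body and isinstance(body[0], ast.Expr) and isinstance(body[0].value, ast.Str):
            body = body[1:]   # skip the docstring
        if len(body) != 1 or not isinstance(body[0], ast.Raise):
            return False
        exc = body[0].exc
        if isinstance(exc, ast.Call):
            exc = exc.func    # handles "raise NotImplementedError(...)"
        return isinstance(exc, ast.Name) and exc.id == 'NotImplementedError'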

Other issues:
  - Somehow need to tie together symbol tables with ast2mod.  One approach: do the symtable generation first, then adjust ast2mod so that it adds PySymtableEntry references to the generated Python tree as appropriate.  This would expose the name scope structure.  But after the tree changes, do we need to regenerate the symbol table?  Alternatively, we could annotate the Python-level ast.Name objects with scope information (see the sketch below), making the AST optimizer responsible for keeping it correct.  The compiler could then rerun the symtable analysis on the returned AST, or could trust the optimizer's info.
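
    For the annotation approach, something along these lines (sketch; this handles a single table, where a real version would recurse through table.get_children() in step with the function bodies):

        import ast, symtable

        def annotate_scopes(tree, table):
            # Stamp each ast.Name with the scope the symbol table assigned
            # it, so the optimizer can consult (and maintain) the annotation.
            for node in ast.walk(tree):
                if isinstance(node, ast.Name) and node.id in table.get_identifiers():
                    sym = table.lookup(node.id)
                    node.scope = ('local' if sym.is_local() else
                                  'global' if sym.is_global() else
                                  'free' if sym.is_free() else 'unknown')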

  - What to do with the "locals()" builtin?  Inlined functions could save a copy of locals with duplicate names, keeping their existing names for the locals in the inlined copy.  This would lead to "locals()" having additional items when called from an inlined function.
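
    Concretely, using the renamed locals from the disassembly above, the caller's frame would behave roughly like:

        def caller():
            # what the inlined call site amounts to after the rewrite:
            __inline1f22840__x = 3
            __inline1f22840__y = 4
            __inline1f22840__z = 5
            __inline1f22840____returnval__ = ((2 * __inline1f22840__x
                                               * __inline1f22840__y)
                                              + __inline1f22840__z)
            print(locals())   # the __inline...__ entries show up as extras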

  - Many optimizations would have to assume that locals don't change value.  Can we assume that, in an optimized build, it's acceptable for code that uses the inspect module and manipulates a frame's f_locals to see surprising behavior?  (e.g. I'd like to be able to optimize away locals entirely where we can prove the absence of side effects)

  - PyAST_obj2mod enforces checking for a particular top-level element.  To cope with this, I needed to pass this around in a few places, so I introduced "enum PyCompilationMode" to avoid having the magic 0, 1, or 2 to designate the grammar variant.  This better distinguishes these values from other places which use the ID of the start token: semantically equivalent, but with different constants.  I haven't yet fixed this up in Modules/parsermodule.c

  - Plenty of FIXMEs in the code, many of which are major issues in themselves.

  - I'm not yet checking the optimization flag; it should probably need to be on for the optimizer to be invoked.

  - Is "__optimizer__" a sane name?  perhaps just "optimizer"

  - Too unstable to benchmark as yet.

  - There are probably other issues I haven't thought of yet.

  - the patch also contains a fix for issue 10391 (error handling is important when manipulating the AST)
  - potentially could add an optimize_cfg() hook, and expose the CFG struct basicblock and struct instr as Python objects in a similar way, as a later optimization pass

Thanks to Red Hat for giving me time to experiment on this.