Author dmalcolm
Recipients alex, belopolsky, benjamin.peterson, brett.cannon, dmalcolm, jhylton, nnorwitz, orsenthil, pitrou, rhettinger, sdahlbac, thomaslee, titanstar
Date 2010-11-20.01:10:35
SpamBayes Score 0.0
Marked as misclassified No
Message-id <>
Sorry again for another epic-length comment...

I'm attaching the latest work-in-progress on this.

The code is still fairly messy (embarrasingly so in places), but it's better to have it out in public in this tracker than stuck on my hard drive.

Symbol tables
  - I've moved the optimization pass to after symbol table generation.  Pass the symtable to mod2obj et al, adding an "ste" attribute to all objects that have their own PySTEntryObject (adding a PySymtable_TryLookup function to Python/symtable.c to help with this).  The symtable pass is rerun after optimization.
    - the "ste" attribute isn't exposed anywhere in the grammar (do other VMs implement symbol table entries?)
  - in the graphviz debug output, add boxes around scopes (i.e. those nodes with an "ste")
  - use the symbol table info: only inline functions that purely reference locals and globals, not free/cellvars.
  - only rename locals when inlining, not globals
  - don't inline functions that use the "locals" or "dir" builtin

I've implemented a first pass at the specialization ideas in the original comment.

I've added a new AST node: "Specialize".  It's an expression, but contains statements.
            -- Generated by the optimizer:
            | Specialize(expr name,
                          stmt* specialized_body,
                          expr specialized_result,
                          expr generalized) -- must be a Call

The idea is that:
  - "name" should be dynamically evaluated, then compared in some (not yet well specified) way against the expected value during inlining
  - If we have a match:
     - then execute the statements in "specialized_body", then evaluate "specialized_result", the latter giving the overall value of this expression
  - If we don't have a match:
       - then evaluate "generalized", which currently must be a Call, and this gives the overall value of the expression.

I've added  a JUMP_IF_SPECIALIZABLE opcode, mostly as described in the initial comment.

This allows us to implement "specialized" bytecode that makes an assumption, but has a fallback ("generalized") for the case when the assumption fails.

To wire this up for function inlining, we need some way of recording the "original" version of a function versus what it's now bound to.  For now, I'm simply adding an assignment after each function def, equivalent to:

   def foo():
       print('hello world')
   __saved__foo = foo

   def callsite():

and the "Specialize" in "callsite" is compiled to:

   LOAD_GLOBAL   (foo)              ; dynamic lookup
   LOAD_GLOBAL (__saved__foo__)   ; get saved version
   JUMP_IF_SPECIALIZABLE -> inline impl
; inline implementation
   LOAD_GLOBAL ('print')
   LOAD_CONST ('hello world')
   LOAD_CONST (None)   

Additional optimizations
I've added:
  - a very simple copy-propagation pass, which only runs on functions with a single basic block (to keep dataflow ananalysis simple: no edges in the CFG, since we don't have an explicit CFG yet)
  - a pass which removes locals that are only ever written to, never read from, provided that the function doesn't use "globals", "locals", or "dir"

These combine nicely with the function inlining.  For example, given this code (from "test_copy_propagation")::
    def function_to_be_inlined(a, b, c, d):
        return(a + b + c + d)
    def callsite():
        print(function_to_be_inlined('fee', 'fi', 'fo', 'fum'))

then "callsite" compiles down to this bytecode::

  5           0 LOAD_GLOBAL              0 (print) 
              3 LOAD_GLOBAL              1 (function_to_be_inlined) 
              6 LOAD_GLOBAL              2 (__saved__function_to_be_inlined) 
              9 JUMP_IF_SPECIALIZABLE    30 
             12 LOAD_CONST               1 ('fee') 
             15 LOAD_CONST               2 ('fi') 
             18 LOAD_CONST               3 ('fo') 
             21 LOAD_CONST               4 ('fum') 
             24 CALL_FUNCTION            4 
             27 JUMP_FORWARD            15 (to 45) 
        >>   30 LOAD_CONST               0 (None) 
             33 STORE_FAST               0 (__inline_function_to_be_inlined_1976300____returnval__) 
             36 LOAD_CONST               7 ('feefifofum') 
             39 STORE_FAST               0 (__inline_function_to_be_inlined_1976300____returnval__) 
             42 LOAD_FAST                0 (__inline_function_to_be_inlined_1976300____returnval__) 
        >>   45 CALL_FUNCTION            1 
             48 POP_TOP              
             49 LOAD_CONST               0 (None) 
             52 RETURN_VALUE         

Note how the specialized inline implementation is able to do:
               LOAD_CONST               7 ('feefifofum') 
rather than have to compute the sum at runtime.  (the return value handling is still a little messy, alas).

Other stuff
- don't try to optimize anything until the end of Py_InitializeEx, since we can't run arbitrary python code until e.g. "sys" and "codecs" etc are brought up
- don't inline for functions that have rebound names
- co_lnotab's implementation of line numbers described in Objects/lnotab_notes.txt only supports monotonically-increasing line numbers within a code object's bytecode, but with inlining, we can jump around arbitrarily within one file.  I've disabled the check for now, and line numbers sometimes come out wrong.

Just a proof-of-concept for now.  There's a "safety switch" in Lib/ which restricts optimizations to just a few places.  I've been occasionally turning it off and runnning regrtest: many tests pass, many fail; I have fixed some failures as I see them.


The optimizer is currently slow, and adds a noticable delay whey compiling a .py file.  Potentially it could be optimized.  Alternatively, we could gain a command-line flag to tell python to more heavily optimize its bytecode.  This is particularly relevant with my downstream-distributor hat on: in Fedora and RHEL we ship hundreds of Python packages (and .pyc/.pyo) via RPM, and it would be simple to turn up the optimizations when we do a build. gets significantly faster: a 49% speedup (albeit for a do-nothing benchmark, and this isn't counting the start-up time)::

    ### call_simple ###
    Min: 0.320839 -> 0.214237: 1.4976x faster
    Avg: 0.322125 -> 0.215405: 1.4954x faster
    Significant (t=155.641488)
    Stddev: 0.00246 -> 0.00099: 2.4894x smaller
    Timeline: b''

Other benchmarks show no change, or slight slowdowns (or crashes).  I hope that this will improve upon implementation method-call inlining.
Date User Action Args
2010-11-20 01:10:46dmalcolmsetrecipients: + dmalcolm, jhylton, nnorwitz, brett.cannon, rhettinger, belopolsky, sdahlbac, orsenthil, titanstar, pitrou, thomaslee, benjamin.peterson, alex
2010-11-20 01:10:42dmalcolmsetmessageid: <>
2010-11-20 01:10:41dmalcolmlinkissue10399 messages
2010-11-20 01:10:40dmalcolmcreate