Message 79051 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	skip.montanaro
Recipients	ajaksu2, alexandre.vassalotti, bboissin, blaisorblade, christian.heimes, djc, facundobatista, lemburg, pitrou, rhettinger, skip.montanaro, theatrus
Date	2009-01-04.13:14:14
SpamBayes Score	3.0238516e-07
Marked as misclassified	No
Message-id	<1231074857.79.0.91001529339.issue4753@psf.upfronthosting.co.za>
In-reply-to

Content
I'm sure this is the wrong place to bring this up, but I had a thought about simple JIT compilation coupled with the opcode dispatch changes in this issue. Consider this silly function: >>> def f(a, b): ... result = 0 ... while b: ... result += a ... b -= 1 ... return result ... which compiles to 2 0 LOAD_CONST 1 (0) 3 STORE_FAST 2 (result) 3 6 SETUP_LOOP 32 (to 41) >> 9 LOAD_FAST 1 (b) 12 JUMP_IF_FALSE 24 (to 39) 15 POP_TOP 4 16 LOAD_FAST 2 (result) 19 LOAD_FAST 0 (a) 22 INPLACE_ADD 23 STORE_FAST 2 (result) 5 26 LOAD_FAST 1 (b) 29 LOAD_CONST 2 (1) 32 INPLACE_SUBTRACT 33 STORE_FAST 1 (b) 36 JUMP_ABSOLUTE 9 >> 39 POP_TOP 40 POP_BLOCK 6 >> 41 LOAD_FAST 2 (result) 44 RETURN_VALUE What if you built and compiled a "Mini Me" version of PyEval_EvalFrameEx on-the-fly which only contained the prologue and epilog of the real function and a small switch statement which only knew about the the byte-code instructions used by f()? Would the compiler be better able to optimize the code? Would the instructions' placement nearer to each other provide better cache behavior? Would branch prediction by CPU be improved? Another possibility would be to eliminate the for(;;) ... switch altogether and just inline the code for the individual instructions. It would help if the body of each bytecode instruction was implemented as a macro, e.g.: #define _POP_TOP() \ PREDICTED(POP_TOP); \ TARGET(POP_TOP) \ v = POP(); \ Py_DECREF(v); \ FAST_DISPATCH(); The above function could (lots of hand-waving here) be "compiled" to something like PyObject * _MiniMe(PyFrameObject *f, int throwflag) { _PyEVAL_FRAMEEX_PROLOG _LOAD_CONST(1) _STORE_FAST(2) _SETUP_LOOP(_41) _9: _LOAD_FAST(1) _JUMP_IF_FALSE(_39) _POP_TOP() _LOAD_FAST(2) _LOAD_FAST(0) _INPLACE_ADD() _STORE_FAST(2) _26: _LOAD_FAST(1) _LOAD_CONST(2) _INPLACE_SUBTRACT() _STORE_FAST(1) _JUMP_ABSOLUTE(_9) _39: _POP_TOP() _POP_BLOCK() _LOAD_FAST(2) _RETURN_VALUE() _PyEVAL_FRAMEEX_EPILOG } and the resulting binary code saved as an attribute of the code object. Presumably there would be some decision made about whether to compile a function into this form (maybe only after it's been called N times?).

I'm sure this is the wrong place to bring this up, but I had a
thought about simple JIT compilation coupled with the opcode
dispatch changes in this issue.

Consider this silly function:

    >>> def f(a, b):
    ...   result = 0
    ...   while b:
    ...     result += a
    ...     b -= 1
    ...   return result
    ... 

which compiles to

      2           0 LOAD_CONST               1 (0)
		  3 STORE_FAST               2 (result)

      3           6 SETUP_LOOP              32 (to 41)
	    >>    9 LOAD_FAST                1 (b)
		 12 JUMP_IF_FALSE           24 (to 39)
		 15 POP_TOP             

      4          16 LOAD_FAST                2 (result)
		 19 LOAD_FAST                0 (a)
		 22 INPLACE_ADD         
		 23 STORE_FAST               2 (result)

      5          26 LOAD_FAST                1 (b)
		 29 LOAD_CONST               2 (1)
		 32 INPLACE_SUBTRACT    
		 33 STORE_FAST               1 (b)
		 36 JUMP_ABSOLUTE            9
	    >>   39 POP_TOP             
		 40 POP_BLOCK           

      6     >>   41 LOAD_FAST                2 (result)
		 44 RETURN_VALUE        

What if you built and compiled a "Mini Me" version of
PyEval_EvalFrameEx on-the-fly which only contained the prologue and
epilog of the real function and a small switch statement which only
knew about the the byte-code instructions used by f()?  Would the
compiler be better able to optimize the code?  Would the
instructions' placement nearer to each other provide better cache
behavior?  Would branch prediction by CPU be improved?

Another possibility would be to eliminate the for(;;) ... switch
altogether and just inline the code for the individual instructions.
It would help if the body of each bytecode instruction was
implemented as a macro, e.g.:

    #define _POP_TOP() \
	PREDICTED(POP_TOP); \
	TARGET(POP_TOP) \
	v = POP(); \
	Py_DECREF(v); \
	FAST_DISPATCH();

The above function could (lots of hand-waving here) be "compiled" to
something like

    PyObject *
    _MiniMe(PyFrameObject *f, int throwflag)
    {
	_PyEVAL_FRAMEEX_PROLOG

	_LOAD_CONST(1)
	_STORE_FAST(2)
	_SETUP_LOOP(_41)
    _9:
	_LOAD_FAST(1)
	_JUMP_IF_FALSE(_39)
	_POP_TOP()
	_LOAD_FAST(2)
	_LOAD_FAST(0)
	_INPLACE_ADD()
	_STORE_FAST(2)
    _26:
	_LOAD_FAST(1)
	_LOAD_CONST(2)
	_INPLACE_SUBTRACT()
	_STORE_FAST(1)
	_JUMP_ABSOLUTE(_9)
    _39:
	_POP_TOP()
	_POP_BLOCK()
	_LOAD_FAST(2)
	_RETURN_VALUE()

	_PyEVAL_FRAMEEX_EPILOG
    }

and the resulting binary code saved as an attribute of the code
object.  Presumably there would be some decision made about whether
to compile a function into this form (maybe only after it's been
called N times?).

History
Date	User	Action	Args
2009-01-04 13:14:18	skip.montanaro	set	recipients: + skip.montanaro, lemburg, rhettinger, facundobatista, pitrou, christian.heimes, ajaksu2, alexandre.vassalotti, djc, bboissin, blaisorblade, theatrus
2009-01-04 13:14:17	skip.montanaro	set	messageid: <1231074857.79.0.91001529339.issue4753@psf.upfronthosting.co.za>
2009-01-04 13:14:16	skip.montanaro	link	issue4753 messages
2009-01-04 13:14:14	skip.montanaro	create