Title: LOOKUP_METHOD and CALL_METHOD optimization
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.2
Status: closed Resolution:
Dependencies: Superseder: Speedup method calls 1.2x
View: 26110
Assigned To: Nosy List: benjamin.peterson, collinwinter, jyasskin, pitrou, r.david.murray, rnk
Priority: low Keywords: patch

Created on 2009-05-16 01:43 by benjamin.peterson, last changed 2016-01-14 17:55 by yselivanov. This issue is now closed.

File name Uploaded Description Edit
call_method.patch benjamin.peterson, 2009-05-16 01:43
Messages (10)
msg87850 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-16 01:43
This is an optimization ported from PyPy. [1] It tries to prevent bound
methods from being created by using the stack as a cache. I couldn't
apply this to builtin methods because those use a method-wrapper
descriptor. The results were not very impressive. However, I'm attaching
the patch to see if anyone else wants to look at it.


Test                             minimum run-time        average  run-time
                                 this    other   diff    this    other 
          BuiltinFunctionCalls:   342ms   340ms   +0.6%   378ms   361ms
           BuiltinMethodLookup:   315ms   308ms   +2.3%   333ms   319ms
                 CompareFloats:   250ms   251ms   -0.3%   257ms   258ms
         CompareFloatsIntegers:   266ms   265ms   +0.4%   273ms   273ms
               CompareIntegers:   233ms   232ms   +0.4%   238ms   238ms
        CompareInternedStrings:   279ms   275ms   +1.3%   285ms   284ms
                  CompareLongs:   225ms   223ms   +1.1%   231ms   229ms
                CompareStrings:   238ms   235ms   +1.3%   244ms   243ms
                CompareUnicode:   243ms   246ms   -0.9%   252ms   252ms
    ComplexPythonFunctionCalls:   307ms   301ms   +2.0%   315ms   309ms
                 ConcatStrings:   372ms   366ms   +1.6%   376ms   385ms
                 ConcatUnicode:   260ms   259ms   +0.4%   266ms   269ms
               CreateInstances:   351ms   336ms   +4.7%   365ms   346ms
            CreateNewInstances:   265ms   256ms   +3.6%   281ms   264ms
       CreateStringsWithConcat:   290ms   289ms   +0.1%   304ms   301ms
       CreateUnicodeWithConcat:   220ms   219ms   +0.8%   227ms   223ms
                  DictCreation:   201ms   200ms   +0.4%   204ms   206ms
             DictWithFloatKeys:   400ms   418ms   -4.4%   410ms   424ms
           DictWithIntegerKeys:   298ms   294ms   +1.2%   306ms   304ms
            DictWithStringKeys:   260ms   264ms   -1.5%   270ms   275ms
                      ForLoops:   224ms   223ms   +0.2%   232ms   232ms
                    IfThenElse:   160ms   160ms   +0.0%   168ms   182ms
                   ListSlicing:   293ms   292ms   +0.5%   302ms   306ms
                NestedForLoops:   301ms   300ms   +0.2%   305ms   308ms
      NestedListComprehensions:   323ms   328ms   -1.7%   331ms   335ms
          NormalClassAttribute:   313ms   314ms   -0.1%   323ms   330ms
       NormalInstanceAttribute:   284ms   283ms   +0.4%   289ms   288ms
           PythonFunctionCalls:   259ms   278ms   -6.7%   274ms   289ms
             PythonMethodCalls:   358ms   357ms   +0.3%   371ms   365ms
                     Recursion:   389ms   398ms   -2.1%   395ms   407ms
                  SecondImport:   335ms   319ms   +5.2%   346ms   380ms
           SecondPackageImport:   338ms   326ms   +3.6%   350ms   337ms
         SecondSubmoduleImport:   413ms   403ms   +2.5%   426ms   411ms
       SimpleComplexArithmetic:   341ms   345ms   -1.2%   351ms   355ms
        SimpleDictManipulation:   288ms   298ms   -3.7%   293ms   303ms
         SimpleFloatArithmetic:   272ms   275ms   -1.1%   279ms   286ms
      SimpleIntFloatArithmetic:   211ms   204ms   +3.3%   216ms   215ms
       SimpleIntegerArithmetic:   207ms   203ms   +1.7%   214ms   213ms
      SimpleListComprehensions:   275ms   273ms   +0.6%   281ms   281ms
        SimpleListManipulation:   224ms   229ms   -2.5%   234ms   241ms
          SimpleLongArithmetic:   252ms   253ms   -0.6%   263ms   266ms
                    SmallLists:   290ms   301ms   -3.8%   299ms   311ms
                   SmallTuples:   254ms   253ms   +0.3%   261ms   266ms
         SpecialClassAttribute:   311ms   309ms   +0.7%   320ms   321ms
      SpecialInstanceAttribute:   358ms   358ms   +0.1%   370ms   371ms
                StringMappings:   817ms   833ms   -1.9%   823ms   852ms
              StringPredicates:   488ms   538ms   -9.2%   495ms   547ms
                 StringSlicing:   295ms   296ms   -0.2%   306ms   323ms
                     TryExcept:   282ms   280ms   +1.0%   291ms   288ms
                    TryFinally:   290ms   255ms  +14.1%   300ms   263ms
                TryRaiseExcept:   261ms   256ms   +1.7%   271ms   263ms
                  TupleSlicing:   281ms   270ms   +4.3%   289ms   277ms
               UnicodeMappings:   329ms   337ms   -2.4%   337ms   348ms
             UnicodePredicates:   295ms   329ms  -10.4%   303ms   338ms
             UnicodeProperties:   272ms   307ms  -11.4%   284ms   313ms
                UnicodeSlicing:   258ms   261ms   -1.2%   266ms   275ms
                   WithFinally:   384ms   398ms   -3.6%   394ms   408ms
               WithRaiseExcept:   320ms   294ms   +8.9%   336ms   304ms
msg87870 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-16 09:13
Can you give results of the "richards" benchmark?
msg87889 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-05-16 12:49
With the patch:

Richards benchmark (Python) starting... [<function entry_point at 0x63b430>]
Total time for 10 iterations: 8.49 secs
Average time per iteration: 848.90 ms


Richards benchmark (Python) starting... [<function entry_point at 0x637530>]
Total time for 10 iterations: 10.46 secs
Average time per iteration: 1045.88 ms
msg87896 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-16 14:25
Similar results here.
With the patch:

Richards benchmark (Python) starting... [<function entry_point at
Total time for 4 iterations: 1.78 secs
Average time per iteration: 443.90 ms

Without the patch:

Richards benchmark (Python) starting... [<function entry_point at
Total time for 4 iterations: 2.02 secs
Average time per iteration: 503.79 ms
msg95707 - (view) Author: Reid Kleckner (rnk) (Python committer) Date: 2009-11-25 02:21
One thing I was wondering about the current patch is what about objects
that have attributes that shadow methods?  For example:

class C(object):
    def foo(self):
        return 1
c = c()
print = lambda: 2

Shouldn't the above print 1 and 2?  With the current patch, it seems
that you might still print 1.

There's also the possible performance drawback where you're loading
builtin C methods, so the optimization fails, but you end up calling
_PyType_Lookup twice.  :(

I'm doing the same optimization for unladen swallow, and these were some
of the things I ran into.  I think I'm going to write a
PyObject_GetMethod that tries to get a method without binding it, but if
it can't for any reason, it behaves just like PyObject_GetAttr and sets
a status code.
msg95710 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-11-25 03:27
Yes, my patch introduces that regression you mention. PyPy solves this
by having the instances dictionary keep track of shadowing of the type
dictionary. Not easy for CPython... I wish you luck on your patch!
msg110253 - (view) Author: Reid Kleckner (rnk) (Python committer) Date: 2010-07-14 04:56
I have an patch for unladen-swallow out for review here:

It resolves the correctness issues I mentioned previously by emitting guards if necessary.  If the type is predictable and uses slots, then we don't need to check the instance dict.

It gives a 5% speedup on the unpickle benchmark.  Presumably the other benchmarks do not do as many method calls.
msg110272 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-14 12:08
First, are these benchmark results jitted or non-jitted? Right now, non-jitted results are a stronger motivation for inclusion in main CPython, IMHO. 

Second, 2.7 won't receive any features / performance improvements anymore. It would be nice to have 3.2 (non-jitted) benchmark results.

Third, if removing intermediate allocations is the kind of optimizations a JIT will do anyway, does it make sense or not to make these optimizations explicit at the bytecode level? (this is really an open question, not a rhetorical one)
msg110298 - (view) Author: Reid Kleckner (rnk) (Python committer) Date: 2010-07-14 16:45
Sorry, I was just posting it so Benjamin could see what this bought us.  I'm not pushing to get this in CPython.

The results are for JITed code.  I forget what the interpreted results are.  I think they are good for the microbenchmarks, but not as good for the macro.
msg168170 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-08-14 03:39
Benjamin confirms a regression in his patch, and the other patch was not intended for CPython.  So I'm closing this issue.
Date User Action Args
2016-01-14 17:55:24yselivanovsetsuperseder: Speedup method calls 1.2x
2012-08-14 03:39:29r.david.murraysetstatus: open -> closed

nosy: + r.david.murray
messages: + msg168170

stage: patch review -> resolved
2010-07-14 16:45:22rnksetmessages: + msg110298
2010-07-14 12:11:47pitrousetstage: patch review
versions: - Python 2.7
2010-07-14 12:08:10pitrousetnosy: + collinwinter, jyasskin
messages: + msg110272
2010-07-14 04:56:43rnksetmessages: + msg110253
2009-11-25 03:27:40benjamin.petersonsetmessages: + msg95710
2009-11-25 02:21:16rnksetmessages: + msg95707
2009-11-24 17:15:04rnksetnosy: + rnk
2009-05-16 14:25:05pitrousetmessages: + msg87896
2009-05-16 12:49:25benjamin.petersonsetmessages: + msg87889
2009-05-16 09:13:54pitrousetnosy: + pitrou
messages: + msg87870
2009-05-16 01:43:58benjamin.petersoncreate