This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Title: Speed up PyEval_EvalFrameEx when tracing is off.
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.0, Python 2.7, Python 2.6
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: jyasskin Nosy List: benjamin.peterson, brett.cannon, jyasskin, loewis, rhettinger, skip.montanaro
Priority: normal Keywords: patch

Created on 2008-12-01 04:33 by jyasskin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

File name Uploaded Description Edit
fast-tracing.diff jyasskin, 2008-12-01 04:33
Messages (9)
msg76672 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-12-01 04:33
Tracing support shows up fairly heavily an a Python profile, even
though it's nearly always turned off. The attached patch against the
trunk speeds up PyBench by 2% for me. All tests pass. I have 2

1) Can other people corroborate this speedup on their machines? I'm
running on a Macbook Pro (Intel Core2 processor, probably Merom) with
a 32-bit build from Apple's gcc-4.0.1. (Apple's gcc consistently
produces a faster python than gcc-4.3.)

2) Assuming this speeds things up for most people, should I check it
in anywhere besides the trunk? I assume it's out for 3.0; is it in for
2.6.1 or 3.0.1?

Pybench output:

* using CPython 2.7a0 (trunk:67458M, Nov 30 2008, 17:14:10) [GCC 4.0.1
(Apple Inc. build 5488)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.time

Benchmark: pybench.out

   Rounds: 10
   Warp:   10
   Timer:  time.time

   Machine Details:
      Platform ID:    Darwin-9.5.0-i386-32bit
      Processor:      i386

      Implementation: CPython
      Version:        2.7.0
      Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
      Bits:           32bit
      Build:          Nov 30 2008 17:14:10 (#trunk:67458M)
      Unicode:        UCS2

Comparing with: ../build_orig/pybench.out

   Rounds: 10
   Warp:   10
   Timer:  time.time

   Machine Details:
      Platform ID:    Darwin-9.5.0-i386-32bit
      Processor:      i386

      Implementation: CPython
      Version:        2.7.0
      Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
      Bits:           32bit
      Build:          Nov 30 2008 13:51:09 (#trunk:67458)
      Unicode:        UCS2

Test                             minimum run-time        average  run-time
                                this    other   diff    this    other   diff
         BuiltinFunctionCalls:   127ms   130ms   -2.4%   129ms   132ms 
          BuiltinMethodLookup:    90ms    93ms   -3.2%    91ms    94ms 
                CompareFloats:    88ms    91ms   -3.3%    89ms    93ms 
        CompareFloatsIntegers:    97ms    99ms   -2.1%    97ms   100ms 
              CompareIntegers:    79ms    82ms   -4.2%    79ms    85ms 
       CompareInternedStrings:    90ms    92ms   -2.4%    94ms    94ms 
                 CompareLongs:    86ms    83ms   +3.6%    87ms    84ms 
               CompareStrings:    80ms    82ms   -3.1%    81ms    83ms 
               CompareUnicode:   103ms   105ms   -2.3%   106ms   108ms 
   ComplexPythonFunctionCalls:   139ms   137ms   +1.3%   140ms   139ms 
                ConcatStrings:   142ms   151ms   -6.0%   156ms   154ms 
                ConcatUnicode:    87ms    92ms   -5.4%    89ms    94ms 
              CreateInstances:   142ms   144ms   -1.4%   144ms   145ms 
           CreateNewInstances:   107ms   109ms   -2.3%   108ms   111ms 
      CreateStringsWithConcat:   114ms   137ms  -17.1%   117ms   139ms 
      CreateUnicodeWithConcat:    92ms   101ms   -9.2%    95ms   102ms 
                 DictCreation:    77ms    81ms   -4.4%    80ms    85ms 
            DictWithFloatKeys:    91ms   107ms  -14.5%    93ms   109ms 
          DictWithIntegerKeys:    95ms    94ms   +1.4%   108ms    96ms 
           DictWithStringKeys:    83ms    88ms   -5.8%    84ms    88ms 
                     ForLoops:    72ms    72ms   -0.1%    79ms    74ms 
                   IfThenElse:    83ms    80ms   +3.9%    85ms    80ms 
                  ListSlicing:   117ms   118ms   -0.7%   118ms   121ms 
               NestedForLoops:   116ms   119ms   -2.4%   121ms   121ms 
         NormalClassAttribute:   106ms   115ms   -7.7%   108ms   117ms 
      NormalInstanceAttribute:    96ms    98ms   -2.3%    97ms   100ms 
          PythonFunctionCalls:    92ms    95ms   -3.7%    94ms    99ms 
            PythonMethodCalls:   147ms   147ms   +0.1%   152ms   149ms 
                    Recursion:   135ms   136ms   -0.3%   140ms   144ms 
                 SecondImport:   101ms    99ms   +2.1%   103ms   101ms 
          SecondPackageImport:   107ms   103ms   +3.5%   108ms   104ms 
        SecondSubmoduleImport:   134ms   134ms   +0.3%   136ms   136ms 
      SimpleComplexArithmetic:   105ms   111ms   -5.0%   110ms   112ms 
       SimpleDictManipulation:    95ms   106ms  -10.6%    96ms   109ms 
        SimpleFloatArithmetic:    90ms    99ms   -9.3%    93ms   102ms 
     SimpleIntFloatArithmetic:    78ms    76ms   +2.3%    79ms    77ms 
      SimpleIntegerArithmetic:    78ms    77ms   +1.8%    79ms    77ms 
       SimpleListManipulation:    80ms    78ms   +2.4%    80ms    79ms 
         SimpleLongArithmetic:   110ms   113ms   -2.0%   111ms   113ms 
                   SmallLists:   128ms   117ms   +9.5%   130ms   124ms 
                  SmallTuples:   115ms   114ms   +1.7%   117ms   114ms 
        SpecialClassAttribute:   101ms   112ms  -10.3%   104ms   114ms 
     SpecialInstanceAttribute:   173ms   177ms   -1.9%   176ms   179ms 
               StringMappings:   165ms   167ms   -1.2%   168ms   169ms 
             StringPredicates:   126ms   134ms   -5.7%   127ms   134ms 
                StringSlicing:   125ms   123ms   +1.9%   131ms   130ms 
                    TryExcept:    79ms    80ms   -0.6%    80ms    80ms 
                   TryFinally:   110ms   107ms   +3.0%   111ms   112ms 
               TryRaiseExcept:    99ms   101ms   -1.6%   100ms   102ms 
                 TupleSlicing:   127ms   127ms   +0.6%   137ms   137ms 
              UnicodeMappings:   144ms   144ms   -0.3%   145ms   145ms 
            UnicodePredicates:   116ms   114ms   +1.3%   117ms   115ms 
            UnicodeProperties:   106ms   102ms   +3.6%   107ms   104ms 
               UnicodeSlicing:    95ms   111ms  -14.0%    99ms   112ms 
                  WithFinally:   157ms   152ms   +3.3%   159ms   154ms 
              WithRaiseExcept:   123ms   125ms   -1.1%   125ms   126ms 
Totals:                          6043ms  6182ms   -2.2%  6185ms  6301ms

(this=pybench.out, other=../build_orig/pybench.out)

2to3 times:

$ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
real    0m56.685s
user    0m55.620s
sys     0m0.380s

$ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
real    0m55.067s
user    0m53.843s
sys     0m0.376s

== 3% faster

Gory details:

The meat of the patch is:
@@ -884,11 +891,12 @@
               f->f_lasti = INSTR_OFFSET();

               /* line-by-line tracing support */

-               if (tstate->c_tracefunc != NULL && !tstate->tracing) {
+               if (_Py_TracingPossible &&
+                   tstate->c_tracefunc != NULL && !tstate->tracing) {

This converts the generated assembly (produced with `gcc -S -dA ...`,
then manually annotated a bit) from:

       # basic block 17
       # ../Python/ceval.c:885
       movl    8(%ebp), %ecx
       subl    -316(%ebp), %edx
       movl    %edx, 60(%ecx)
       # ../Python/ceval.c:889
# %esi = tstate
       movl    -336(%ebp), %esi
# %eax = tstate->c_tracefunc
       movl    28(%esi), %eax
# if tstate->c_tracefunc == 0
       testl   %eax, %eax
# goto past-if ()
       je      L567
# more if conditions here


       # basic block 17
       # ../Python/ceval.c:889
       movl    8(%ebp), %ecx
       subl    -316(%ebp), %edx
       movl    %edx, 60(%ecx)
       # ../Python/ceval.c:893
# %eax = _Py_TracingPossible
       movl    __Py_TracingPossible-"L00000000033$pb"(%ebx), %eax
# if _Py_TracingPossible != 0
       testl   %eax, %eax
# goto rest-of-if (nearby)
       jne     L2321
# opcode = NEXTOP(); continues here

The branch should be predicted accurately either way, so there are 2
things that may be contributing to the performance change.

First, adding the global caching variable halves the amount of memory
that has to be read to check the prediction. The memory that is read
is still read one instruction before it's used, but adding a local
variable to read the memory earlier doesn't affect the performance.

Without the global variable, the compiler puts the tracing code
immediately after the if; with the global, it moves it away and puts the
non-tracing code immediately after the first test in the if. This
may affect branch prediction and may affect the icache. I tried using
gcc's __builtin_expect() to ensure that the tracing code is always
out-of-line. This moved it much farther away and cost about 1% in
performance (i.e. 1% instead of 2% faster than "before"). I don't know
why the __builtin_expect() version would be slower. If anyone feels
inspired to test this out on another processor or compiler version,
let me know how it goes.

msg76677 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-12-01 05:08
I don't think anything performance-related should be checked in before 3.0.
msg76700 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-12-01 16:12
Martin, does your comment mean you think this should stay out of the
trunk(s) until 3.0 is released? That's fine. How about after that:
should it go in the 2.6.x and 3.0.x branches too?
msg76705 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-12-01 18:52
I think it should stay out of the trunk until 3.0 is released only
because that simplifies merging and 3.0 is only 3 days away.
msg76710 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-12-01 22:45
Before it is accepted, I'd like to see some independent analysis (which
I might have to carry out myself). It would also be good if some deeper
understanding of the effect could be provided - I'm fairly skeptical
that such a minor change should have any notable effect.
msg76711 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-01 22:49
+1 on the patch.  It makes good sense to me.
msg76810 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-03 03:46
Jeffrey, I've run my own timings and verified the improvement.  Please
go ahead and apply the trunk (2.7 head).  

It's too late for 3.0 but once that is out the door, Barry has approved
adding the patch for 3.0.1.
msg76816 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-12-03 06:54
Thanks for the review, Raymond. Committed to the trunk in r67494. I'll
svnmerge it over to 3.0.1 once Barry unfreezes the branches.

Martin, let me know if you analyze this any more and find anything
msg77688 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-12-12 21:27
Merged to 3.1 in r67611, 3.0.x in r67721, and 2.6.x in r67722.
Date User Action Args
2022-04-11 14:56:41adminsetgithub: 48727
2008-12-12 21:27:10jyasskinsetstatus: open -> closed
keywords: - needs review
messages: + msg77688
versions: + Python 2.6
2008-12-03 06:54:14jyasskinsetmessages: + msg76816
stage: commit review -> resolved
2008-12-03 03:46:31rhettingersetassignee: rhettinger -> jyasskin
messages: + msg76810
resolution: accepted
stage: patch review -> commit review
2008-12-02 22:15:50rhettingersetassignee: rhettinger
versions: + Python 3.0
2008-12-01 22:49:58rhettingersetnosy: + rhettinger
messages: + msg76711
2008-12-01 22:45:18loewissetmessages: + msg76710
2008-12-01 18:52:06benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg76705
2008-12-01 16:12:05jyasskinsetmessages: + msg76700
2008-12-01 10:27:29skip.montanarosetnosy: + skip.montanaro
2008-12-01 05:08:34loewissetnosy: + loewis
messages: + msg76677
2008-12-01 04:39:44brett.cannonsetnosy: + brett.cannon
2008-12-01 04:33:20jyasskincreate