Message 76672 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	jyasskin
Recipients	jyasskin
Date	2008-12-01.04:33:11
SpamBayes Score	2.794644e-10
Marked as misclassified	No
Message-id	<1228106006.06.0.406632816791.issue4477@psf.upfronthosting.co.za>
In-reply-to

Content
Tracing support shows up fairly heavily an a Python profile, even though it's nearly always turned off. The attached patch against the trunk speeds up PyBench by 2% for me. All tests pass. I have 2 questions: 1) Can other people corroborate this speedup on their machines? I'm running on a Macbook Pro (Intel Core2 processor, probably Merom) with a 32-bit build from Apple's gcc-4.0.1. (Apple's gcc consistently produces a faster python than gcc-4.3.) 2) Assuming this speeds things up for most people, should I check it in anywhere besides the trunk? I assume it's out for 3.0; is it in for 2.6.1 or 3.0.1? Pybench output: ------------------------------------------------------------------------------- PYBENCH 2.0 ------------------------------------------------------------------------------- * using CPython 2.7a0 (trunk:67458M, Nov 30 2008, 17:14:10) [GCC 4.0.1 (Apple Inc. build 5488)] * disabled garbage collection * system check interval set to maximum: 2147483647 * using timer: time.time ------------------------------------------------------------------------------- Benchmark: pybench.out ------------------------------------------------------------------------------- Rounds: 10 Warp: 10 Timer: time.time Machine Details: Platform ID: Darwin-9.5.0-i386-32bit Processor: i386 Python: Implementation: CPython Executable: /Users/jyasskin/src/python/trunk-fast-tracing/build/python.exe Version: 2.7.0 Compiler: GCC 4.0.1 (Apple Inc. build 5488) Bits: 32bit Build: Nov 30 2008 17:14:10 (#trunk:67458M) Unicode: UCS2 ------------------------------------------------------------------------------- Comparing with: ../build_orig/pybench.out ------------------------------------------------------------------------------- Rounds: 10 Warp: 10 Timer: time.time Machine Details: Platform ID: Darwin-9.5.0-i386-32bit Processor: i386 Python: Implementation: CPython Executable: /Users/jyasskin/src/python/trunk-fast-tracing/build_orig/python.exe Version: 2.7.0 Compiler: GCC 4.0.1 (Apple Inc. build 5488) Bits: 32bit Build: Nov 30 2008 13:51:09 (#trunk:67458) Unicode: UCS2 Test minimum run-time average run-time this other diff this other diff ------------------------------------------------------------------------------- BuiltinFunctionCalls: 127ms 130ms -2.4% 129ms 132ms -2.1% BuiltinMethodLookup: 90ms 93ms -3.2% 91ms 94ms -3.1% CompareFloats: 88ms 91ms -3.3% 89ms 93ms -4.3% CompareFloatsIntegers: 97ms 99ms -2.1% 97ms 100ms -2.4% CompareIntegers: 79ms 82ms -4.2% 79ms 85ms -6.1% CompareInternedStrings: 90ms 92ms -2.4% 94ms 94ms -0.9% CompareLongs: 86ms 83ms +3.6% 87ms 84ms +3.5% CompareStrings: 80ms 82ms -3.1% 81ms 83ms -2.3% CompareUnicode: 103ms 105ms -2.3% 106ms 108ms -1.5% ComplexPythonFunctionCalls: 139ms 137ms +1.3% 140ms 139ms +0.1% ConcatStrings: 142ms 151ms -6.0% 156ms 154ms +1.1% ConcatUnicode: 87ms 92ms -5.4% 89ms 94ms -5.7% CreateInstances: 142ms 144ms -1.4% 144ms 145ms -1.1% CreateNewInstances: 107ms 109ms -2.3% 108ms 111ms -2.1% CreateStringsWithConcat: 114ms 137ms -17.1% 117ms 139ms -16.0% CreateUnicodeWithConcat: 92ms 101ms -9.2% 95ms 102ms -7.2% DictCreation: 77ms 81ms -4.4% 80ms 85ms -5.9% DictWithFloatKeys: 91ms 107ms -14.5% 93ms 109ms -14.6% DictWithIntegerKeys: 95ms 94ms +1.4% 108ms 96ms +12.3% DictWithStringKeys: 83ms 88ms -5.8% 84ms 88ms -4.7% ForLoops: 72ms 72ms -0.1% 79ms 74ms +5.8% IfThenElse: 83ms 80ms +3.9% 85ms 80ms +5.3% ListSlicing: 117ms 118ms -0.7% 118ms 121ms -1.8% NestedForLoops: 116ms 119ms -2.4% 121ms 121ms +0.0% NormalClassAttribute: 106ms 115ms -7.7% 108ms 117ms -7.7% NormalInstanceAttribute: 96ms 98ms -2.3% 97ms 100ms -3.1% PythonFunctionCalls: 92ms 95ms -3.7% 94ms 99ms -5.2% PythonMethodCalls: 147ms 147ms +0.1% 152ms 149ms +2.1% Recursion: 135ms 136ms -0.3% 140ms 144ms -2.9% SecondImport: 101ms 99ms +2.1% 103ms 101ms +2.2% SecondPackageImport: 107ms 103ms +3.5% 108ms 104ms +3.3% SecondSubmoduleImport: 134ms 134ms +0.3% 136ms 136ms -0.0% SimpleComplexArithmetic: 105ms 111ms -5.0% 110ms 112ms -1.4% SimpleDictManipulation: 95ms 106ms -10.6% 96ms 109ms -12.0% SimpleFloatArithmetic: 90ms 99ms -9.3% 93ms 102ms -8.2% SimpleIntFloatArithmetic: 78ms 76ms +2.3% 79ms 77ms +2.0% SimpleIntegerArithmetic: 78ms 77ms +1.8% 79ms 77ms +2.0% SimpleListManipulation: 80ms 78ms +2.4% 80ms 79ms +1.9% SimpleLongArithmetic: 110ms 113ms -2.0% 111ms 113ms -2.1% SmallLists: 128ms 117ms +9.5% 130ms 124ms +4.9% SmallTuples: 115ms 114ms +1.7% 117ms 114ms +2.2% SpecialClassAttribute: 101ms 112ms -10.3% 104ms 114ms -8.9% SpecialInstanceAttribute: 173ms 177ms -1.9% 176ms 179ms -1.6% StringMappings: 165ms 167ms -1.2% 168ms 169ms -0.5% StringPredicates: 126ms 134ms -5.7% 127ms 134ms -5.6% StringSlicing: 125ms 123ms +1.9% 131ms 130ms +0.7% TryExcept: 79ms 80ms -0.6% 80ms 80ms -0.8% TryFinally: 110ms 107ms +3.0% 111ms 112ms -1.1% TryRaiseExcept: 99ms 101ms -1.6% 100ms 102ms -1.7% TupleSlicing: 127ms 127ms +0.6% 137ms 137ms +0.0% UnicodeMappings: 144ms 144ms -0.3% 145ms 145ms -0.4% UnicodePredicates: 116ms 114ms +1.3% 117ms 115ms +1.1% UnicodeProperties: 106ms 102ms +3.6% 107ms 104ms +3.1% UnicodeSlicing: 95ms 111ms -14.0% 99ms 112ms -11.8% WithFinally: 157ms 152ms +3.3% 159ms 154ms +3.3% WithRaiseExcept: 123ms 125ms -1.1% 125ms 126ms -1.2% ------------------------------------------------------------------------------- Totals: 6043ms 6182ms -2.2% 6185ms 6301ms -1.9% (this=pybench.out, other=../build_orig/pybench.out) 2to3 times: Before: $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null real 0m56.685s user 0m55.620s sys 0m0.380s After: $ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null real 0m55.067s user 0m53.843s sys 0m0.376s == 3% faster Gory details: The meat of the patch is: @@ -884,11 +891,12 @@ fast_next_opcode: f->f_lasti = INSTR_OFFSET(); /* line-by-line tracing support */ - if (tstate->c_tracefunc != NULL && !tstate->tracing) { + if (_Py_TracingPossible && + tstate->c_tracefunc != NULL && !tstate->tracing) { This converts the generated assembly (produced with `gcc -S -dA ...`, then manually annotated a bit) from: # basic block 17 # ../Python/ceval.c:885 LM541: movl 8(%ebp), %ecx LVL319: subl -316(%ebp), %edx movl %edx, 60(%ecx) # ../Python/ceval.c:889 LM542: # %esi = tstate movl -336(%ebp), %esi LVL320: # %eax = tstate->c_tracefunc movl 28(%esi), %eax LVL321: # if tstate->c_tracefunc == 0 testl %eax, %eax # goto past-if () je L567 # more if conditions here to: # basic block 17 # ../Python/ceval.c:889 LM542: movl 8(%ebp), %ecx LVL319: subl -316(%ebp), %edx movl %edx, 60(%ecx) # ../Python/ceval.c:893 LM543: # %eax = _Py_TracingPossible movl __Py_TracingPossible-"L00000000033$pb"(%ebx), %eax LVL320: # if _Py_TracingPossible != 0 testl %eax, %eax # goto rest-of-if (nearby) jne L2321 # opcode = NEXTOP(); continues here The branch should be predicted accurately either way, so there are 2 things that may be contributing to the performance change. First, adding the global caching variable halves the amount of memory that has to be read to check the prediction. The memory that is read is still read one instruction before it's used, but adding a local variable to read the memory earlier doesn't affect the performance. Without the global variable, the compiler puts the tracing code immediately after the if; with the global, it moves it away and puts the non-tracing code immediately after the first test in the if. This may affect branch prediction and may affect the icache. I tried using gcc's __builtin_expect() to ensure that the tracing code is always out-of-line. This moved it much farther away and cost about 1% in performance (i.e. 1% instead of 2% faster than "before"). I don't know why the __builtin_expect() version would be slower. If anyone feels inspired to test this out on another processor or compiler version, let me know how it goes. Jeffrey

Tracing support shows up fairly heavily an a Python profile, even
though it's nearly always turned off. The attached patch against the
trunk speeds up PyBench by 2% for me. All tests pass. I have 2
questions:

1) Can other people corroborate this speedup on their machines? I'm
running on a Macbook Pro (Intel Core2 processor, probably Merom) with
a 32-bit build from Apple's gcc-4.0.1. (Apple's gcc consistently
produces a faster python than gcc-4.3.)

2) Assuming this speeds things up for most people, should I check it
in anywhere besides the trunk? I assume it's out for 3.0; is it in for
2.6.1 or 3.0.1?



Pybench output:

-------------------------------------------------------------------------------
PYBENCH 2.0
-------------------------------------------------------------------------------
* using CPython 2.7a0 (trunk:67458M, Nov 30 2008, 17:14:10) [GCC 4.0.1
(Apple Inc. build 5488)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.time

-------------------------------------------------------------------------------
Benchmark: pybench.out
-------------------------------------------------------------------------------

   Rounds: 10
   Warp:   10
   Timer:  time.time

   Machine Details:
      Platform ID:    Darwin-9.5.0-i386-32bit
      Processor:      i386

   Python:
      Implementation: CPython
      Executable:
/Users/jyasskin/src/python/trunk-fast-tracing/build/python.exe
      Version:        2.7.0
      Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
      Bits:           32bit
      Build:          Nov 30 2008 17:14:10 (#trunk:67458M)
      Unicode:        UCS2


-------------------------------------------------------------------------------
Comparing with: ../build_orig/pybench.out
-------------------------------------------------------------------------------

   Rounds: 10
   Warp:   10
   Timer:  time.time

   Machine Details:
      Platform ID:    Darwin-9.5.0-i386-32bit
      Processor:      i386

   Python:
      Implementation: CPython
      Executable:
/Users/jyasskin/src/python/trunk-fast-tracing/build_orig/python.exe
      Version:        2.7.0
      Compiler:       GCC 4.0.1 (Apple Inc. build 5488)
      Bits:           32bit
      Build:          Nov 30 2008 13:51:09 (#trunk:67458)
      Unicode:        UCS2


Test                             minimum run-time        average  run-time
                                this    other   diff    this    other   diff
-------------------------------------------------------------------------------
         BuiltinFunctionCalls:   127ms   130ms   -2.4%   129ms   132ms 
 -2.1%
          BuiltinMethodLookup:    90ms    93ms   -3.2%    91ms    94ms 
 -3.1%
                CompareFloats:    88ms    91ms   -3.3%    89ms    93ms 
 -4.3%
        CompareFloatsIntegers:    97ms    99ms   -2.1%    97ms   100ms 
 -2.4%
              CompareIntegers:    79ms    82ms   -4.2%    79ms    85ms 
 -6.1%
       CompareInternedStrings:    90ms    92ms   -2.4%    94ms    94ms 
 -0.9%
                 CompareLongs:    86ms    83ms   +3.6%    87ms    84ms 
 +3.5%
               CompareStrings:    80ms    82ms   -3.1%    81ms    83ms 
 -2.3%
               CompareUnicode:   103ms   105ms   -2.3%   106ms   108ms 
 -1.5%
   ComplexPythonFunctionCalls:   139ms   137ms   +1.3%   140ms   139ms 
 +0.1%
                ConcatStrings:   142ms   151ms   -6.0%   156ms   154ms 
 +1.1%
                ConcatUnicode:    87ms    92ms   -5.4%    89ms    94ms 
 -5.7%
              CreateInstances:   142ms   144ms   -1.4%   144ms   145ms 
 -1.1%
           CreateNewInstances:   107ms   109ms   -2.3%   108ms   111ms 
 -2.1%
      CreateStringsWithConcat:   114ms   137ms  -17.1%   117ms   139ms 
-16.0%
      CreateUnicodeWithConcat:    92ms   101ms   -9.2%    95ms   102ms 
 -7.2%
                 DictCreation:    77ms    81ms   -4.4%    80ms    85ms 
 -5.9%
            DictWithFloatKeys:    91ms   107ms  -14.5%    93ms   109ms 
-14.6%
          DictWithIntegerKeys:    95ms    94ms   +1.4%   108ms    96ms 
+12.3%
           DictWithStringKeys:    83ms    88ms   -5.8%    84ms    88ms 
 -4.7%
                     ForLoops:    72ms    72ms   -0.1%    79ms    74ms 
 +5.8%
                   IfThenElse:    83ms    80ms   +3.9%    85ms    80ms 
 +5.3%
                  ListSlicing:   117ms   118ms   -0.7%   118ms   121ms 
 -1.8%
               NestedForLoops:   116ms   119ms   -2.4%   121ms   121ms 
 +0.0%
         NormalClassAttribute:   106ms   115ms   -7.7%   108ms   117ms 
 -7.7%
      NormalInstanceAttribute:    96ms    98ms   -2.3%    97ms   100ms 
 -3.1%
          PythonFunctionCalls:    92ms    95ms   -3.7%    94ms    99ms 
 -5.2%
            PythonMethodCalls:   147ms   147ms   +0.1%   152ms   149ms 
 +2.1%
                    Recursion:   135ms   136ms   -0.3%   140ms   144ms 
 -2.9%
                 SecondImport:   101ms    99ms   +2.1%   103ms   101ms 
 +2.2%
          SecondPackageImport:   107ms   103ms   +3.5%   108ms   104ms 
 +3.3%
        SecondSubmoduleImport:   134ms   134ms   +0.3%   136ms   136ms 
 -0.0%
      SimpleComplexArithmetic:   105ms   111ms   -5.0%   110ms   112ms 
 -1.4%
       SimpleDictManipulation:    95ms   106ms  -10.6%    96ms   109ms 
-12.0%
        SimpleFloatArithmetic:    90ms    99ms   -9.3%    93ms   102ms 
 -8.2%
     SimpleIntFloatArithmetic:    78ms    76ms   +2.3%    79ms    77ms 
 +2.0%
      SimpleIntegerArithmetic:    78ms    77ms   +1.8%    79ms    77ms 
 +2.0%
       SimpleListManipulation:    80ms    78ms   +2.4%    80ms    79ms 
 +1.9%
         SimpleLongArithmetic:   110ms   113ms   -2.0%   111ms   113ms 
 -2.1%
                   SmallLists:   128ms   117ms   +9.5%   130ms   124ms 
 +4.9%
                  SmallTuples:   115ms   114ms   +1.7%   117ms   114ms 
 +2.2%
        SpecialClassAttribute:   101ms   112ms  -10.3%   104ms   114ms 
 -8.9%
     SpecialInstanceAttribute:   173ms   177ms   -1.9%   176ms   179ms 
 -1.6%
               StringMappings:   165ms   167ms   -1.2%   168ms   169ms 
 -0.5%
             StringPredicates:   126ms   134ms   -5.7%   127ms   134ms 
 -5.6%
                StringSlicing:   125ms   123ms   +1.9%   131ms   130ms 
 +0.7%
                    TryExcept:    79ms    80ms   -0.6%    80ms    80ms 
 -0.8%
                   TryFinally:   110ms   107ms   +3.0%   111ms   112ms 
 -1.1%
               TryRaiseExcept:    99ms   101ms   -1.6%   100ms   102ms 
 -1.7%
                 TupleSlicing:   127ms   127ms   +0.6%   137ms   137ms 
 +0.0%
              UnicodeMappings:   144ms   144ms   -0.3%   145ms   145ms 
 -0.4%
            UnicodePredicates:   116ms   114ms   +1.3%   117ms   115ms 
 +1.1%
            UnicodeProperties:   106ms   102ms   +3.6%   107ms   104ms 
 +3.1%
               UnicodeSlicing:    95ms   111ms  -14.0%    99ms   112ms 
-11.8%
                  WithFinally:   157ms   152ms   +3.3%   159ms   154ms 
 +3.3%
              WithRaiseExcept:   123ms   125ms   -1.1%   125ms   126ms 
 -1.2%
-------------------------------------------------------------------------------
Totals:                          6043ms  6182ms   -2.2%  6185ms  6301ms
  -1.9%

(this=pybench.out, other=../build_orig/pybench.out)


2to3 times:

Before:
$ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
real    0m56.685s
user    0m55.620s
sys     0m0.380s

After:
$ time ./python.exe ~/src/2to3/2to3 -f all ~/src/2to3/ >/dev/null
real    0m55.067s
user    0m53.843s
sys     0m0.376s

== 3% faster


Gory details:

The meat of the patch is:
@@ -884,11 +891,12 @@
       fast_next_opcode:
               f->f_lasti = INSTR_OFFSET();

               /* line-by-line tracing support */

-               if (tstate->c_tracefunc != NULL && !tstate->tracing) {
+               if (_Py_TracingPossible &&
+                   tstate->c_tracefunc != NULL && !tstate->tracing) {


This converts the generated assembly (produced with `gcc -S -dA ...`,
then manually annotated a bit) from:

       # basic block 17
       # ../Python/ceval.c:885
LM541:
       movl    8(%ebp), %ecx
LVL319:
       subl    -316(%ebp), %edx
       movl    %edx, 60(%ecx)
       # ../Python/ceval.c:889
LM542:
# %esi = tstate
       movl    -336(%ebp), %esi
LVL320:
# %eax = tstate->c_tracefunc
       movl    28(%esi), %eax
LVL321:
# if tstate->c_tracefunc == 0
       testl   %eax, %eax
# goto past-if ()
       je      L567
# more if conditions here

to:

       # basic block 17
       # ../Python/ceval.c:889
LM542:
       movl    8(%ebp), %ecx
LVL319:
       subl    -316(%ebp), %edx
       movl    %edx, 60(%ecx)
       # ../Python/ceval.c:893
LM543:
# %eax = _Py_TracingPossible
       movl    __Py_TracingPossible-"L00000000033$pb"(%ebx), %eax
LVL320:
# if _Py_TracingPossible != 0
       testl   %eax, %eax
# goto rest-of-if (nearby)
       jne     L2321
# opcode = NEXTOP(); continues here


The branch should be predicted accurately either way, so there are 2
things that may be contributing to the performance change.

First, adding the global caching variable halves the amount of memory
that has to be read to check the prediction. The memory that is read
is still read one instruction before it's used, but adding a local
variable to read the memory earlier doesn't affect the performance.

Without the global variable, the compiler puts the tracing code
immediately after the if; with the global, it moves it away and puts the
non-tracing code immediately after the first test in the if. This
may affect branch prediction and may affect the icache. I tried using
gcc's __builtin_expect() to ensure that the tracing code is always
out-of-line. This moved it much farther away and cost about 1% in
performance (i.e. 1% instead of 2% faster than "before"). I don't know
why the __builtin_expect() version would be slower. If anyone feels
inspired to test this out on another processor or compiler version,
let me know how it goes.

Jeffrey

History
Date	User	Action	Args
2008-12-01 04:33:27	jyasskin	set	recipients: + jyasskin
2008-12-01 04:33:26	jyasskin	set	messageid: <1228106006.06.0.406632816791.issue4477@psf.upfronthosting.co.za>
2008-12-01 04:33:19	jyasskin	link	issue4477 messages
2008-12-01 04:33:13	jyasskin	create