Message 256236 - Python tracker


Author	rhettinger
Recipients	mark.dickinson, rhettinger, serhiy.storchaka, skrah, tim.peters, vstinner
Date	2015-12-11.21:44:17
Message-id	<1449870259.01.0.351863599839.issue25823@psf.upfronthosting.co.za>

Content

I verified that Clang and GCC both give the expected disassembly with Serhiy's patch.   We ought to restrict the #if to just the compilers that are known to optimize away the memcpy.

Clang (for 'BUILD_LIST_UNPACK')
-------------------------------
      .loc    10 2525 9               ## Python/ceval.c:2525:9
      movzwl  (%r13), %r9d
      addq    $2, %r13
  Ltmp2042:
      ##DEBUG_VALUE: PyEval_EvalFrameEx:next_instr <- R13

GCC (for 'BUILD_LIST_UNPACK')
----------------------------- 
  LM1275:
      movzwl  (%rdx), %r8d
  LVL1147:
      leaq    2(%rdx), %rbp
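For readers without ceval.c open: both listings above boil down to one 16-bit zero-extending load (movzwl) plus a pointer bump of 2. As a rough model only, here is the same idea in Python. It assumes the oparg is stored as two little-endian bytes after the opcode and that the unpatched code assembles it from two separate byte reads. The function names are made up for illustration and are not taken from the patch.

```python
import struct


def read_oparg_bytewise(code, i):
    """Model of the byte-at-a-time fetch: two loads, a shift, and an add."""
    oparg = code[i] + (code[i + 1] << 8)
    return oparg, i + 2


def read_oparg_single_load(code, i):
    """Model of the memcpy-style fetch: one 16-bit little-endian load."""
    (oparg,) = struct.unpack_from("<H", code, i)
    return oparg, i + 2


# Hypothetical two-byte oparg 0x1234, stored low byte first.
code = bytes([0x34, 0x12, 0xFF, 0x00])
assert read_oparg_bytewise(code, 0) == read_oparg_single_load(code, 0)
```

Both functions return the same value and the same advanced index. The only difference the patch cares about is how many machine instructions the compiler needs to get there.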

[Mark]
> Benchmarks showing dramatic real-world speed improvements ...

Much of the doubling of speed for core Python over the last decade has come one little step at a time, none of them being individually "dramatic".  In general, if we have a chance to reduce the workload in the ceval inner loop, we should take it.

A simple benchmark on clang shows a speedup of a little over 10% in code exercising simple and common opcodes that have an oparg (there is no point in benchmarking the effect on opcodes like IMPORT_NAME, where the total eval-loop overhead is already an insignificant proportion of the total work).
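The exercise_oparg.py script itself is not attached to this message. A minimal sketch of that kind of benchmark, assuming a tight loop made mostly of local loads, stores, and constant loads (all opcodes that carry an oparg), might look like the following. The function name and loop body are hypothetical.

```python
import timeit


def exercise(n=100000):
    """Loop dominated by cheap opcodes that take an oparg (local loads/stores)."""
    x = 0
    y = 1
    for i in range(n):
        x = y   # load a local, store a local
        y = x
        x = i
        y = 1   # load a constant, store a local
    return x, y


if __name__ == "__main__":
    # Report the best of three repeats, ten calls each.
    print(min(timeit.repeat(exercise, number=10, repeat=3)))
```

The loop body is kept trivial on purpose so that eval-loop overhead, rather than the work done by each opcode, dominates the timing.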

Baseline version with CLANG Apple LLVM version 7.0.2 (clang-700.1.81)
  $ ./python.exe exercise_oparg.py 
  0.22484053499647416
  $ ./python.exe exercise_oparg.py 
  0.22687773499637842
  $ ./python.exe exercise_oparg.py 
  0.22026274001109414

Patched version with CLANG Apple LLVM version 7.0.2 (clang-700.1.81)
  $ ./python.exe exercise_oparg.py 
  0.19516360601119231
  $ ./python.exe exercise_oparg.py 
  0.20087355599389412
  $ ./python.exe exercise_oparg.py 
  0.1980393300036667
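As a quick check of the "10+%" figure, the six timings above can be averaged and compared. This is just arithmetic on the numbers already shown. The helper name is made up.

```python
from statistics import mean

# Timings copied from the runs above (seconds).
baseline = [0.22484053499647416, 0.22687773499637842, 0.22026274001109414]
patched = [0.19516360601119231, 0.20087355599389412, 0.1980393300036667]


def time_reduction(before, after):
    """Fraction of run time saved, comparing the means of two sets of runs."""
    return 1.0 - mean(after) / mean(before)


saved = time_reduction(baseline, patched)
print("time saved: {:.1%}".format(saved))
print("speedup:    {:.2f}x".format(mean(baseline) / mean(patched)))
```

With only three runs per build this is a rough estimate, but it comes out a bit under 12% less run time for the patched build.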

To better isolate the effect, I suppose you could enable the READ_TIMESTAMP macros to precisely measure the effect of replacing five sequentially dependent instructions with two independent instructions, but likely all it would show you is that the two are cheaper than the five.
