This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Tell GCC Py_DECREF is unlikely to call the destructor
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.0, Python 3.1
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: ajaksu2, blaisorblade, collinwinter, jyasskin, loewis, pitrou
Priority: normal Keywords: patch

Created on 2009-01-14 02:10 by ajaksu2, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
pybench.txt pitrou, 2009-01-14 02:42
likely_decref.diff ajaksu2, 2009-01-20 22:44 New version of the Py_DECREF patch
likely_object.diff ajaksu2, 2009-01-20 22:44 This one only defines likely/unlikely
ceval_exception.diff ajaksu2, 2009-01-20 22:46 Adds hints to END_FINALLY
ceval_tuple_unpack.diff ajaksu2, 2009-01-20 22:48 Adds hints to UNPACK_SEQUENCE, probably bogus outside of pybench
ceval_function.diff ajaksu2, 2009-01-20 22:49 Adds hints to CALL_FUNCTION_*, certainly bogus outside pybench
Messages (11)
msg79821 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-01-14 02:10
As suggested by Paolo Giarrusso, add a "likely()" to Py_DECREF, telling
GCC it doesn't call the destructor in the common path.

This optimization seems to interact well with #4896 and to be slower on
top of #4753 on my system. Patching ceval.c instead of object.h seems to
give the same result, FWIW. Benchmarks most welcome :)

Paolo Giarrusso wrote:
"Anyway, yesterday I was looking at the assembly and I realized that
GCC optimizes the case where Py_DECREF calls the destructor! Since I
don't have time for this, can you hint GCC for the better by adding
unlikely() to Py_DECREF, at least (or maybe only) for ceval.c, or
maybe even only for the main interpreter?"

pybench:
    Rounds: 10
    Warp:   10
    Timer:  time.time

    Machine Details:
       Platform ID:    Linux-2.6.24-23-generic-i686-with-debian-lenny-sid
       Processor:      Intel Celeron M 410

    Python:
       Implementation: CPython
       Executable:     /home/ajaksu/py3k/python
       Version:        3.1.0
       Compiler:       GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)
       Bits:           32bit
       Build:          Jan 13 2009 23:40:05 (#py3k:68571M)
       Unicode:        UCS2


Test                             minimum run-time        average  run-time
                                 this    other   diff    this    other 
 diff
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:   147ms   163ms   -9.7%   148ms   164ms
 -10.0%
           BuiltinMethodLookup:   138ms   154ms  -10.3%   139ms   155ms
 -10.6%
                 CompareFloats:   185ms   191ms   -3.4%   185ms   192ms
  -3.4%
         CompareFloatsIntegers:   304ms   294ms   +3.1%   304ms   296ms
  +2.7%
               CompareIntegers:   287ms   281ms   +2.0%   288ms   281ms
  +2.3%
        CompareInternedStrings:   189ms   217ms  -12.7%   192ms   217ms
 -11.7%
                  CompareLongs:   167ms   162ms   +3.3%   168ms   162ms
  +3.3%
                CompareStrings:   152ms   178ms  -14.2%   156ms   179ms
 -12.7%
    ComplexPythonFunctionCalls:   192ms   202ms   -5.1%   193ms   206ms
  -6.2%
                 ConcatStrings:   325ms   327ms   -0.7%   326ms   330ms
  -1.1%
               CreateInstances:   200ms   209ms   -4.4%   204ms   210ms
  -2.9%
            CreateNewInstances:   149ms   156ms   -4.4%   151ms   161ms
  -5.8%
       CreateStringsWithConcat:   255ms   272ms   -6.1%   258ms   275ms
  -6.3%
                  DictCreation:   145ms   151ms   -4.0%   146ms   154ms
  -5.3%
             DictWithFloatKeys:   209ms   224ms   -6.8%   209ms   224ms
  -6.7%
           DictWithIntegerKeys:   170ms   185ms   -8.0%   171ms   187ms
  -8.3%
            DictWithStringKeys:   155ms   172ms  -10.1%   156ms   173ms
 -10.2%
                      ForLoops:   162ms   183ms  -11.7%   162ms   183ms
 -11.4%
                    IfThenElse:   189ms   217ms  -12.7%   190ms   219ms
 -13.1%
                   ListSlicing:   156ms   156ms   -0.3%   156ms   158ms
  -1.0%
                NestedForLoops:   199ms   219ms   -9.1%   201ms   220ms
  -8.6%
      NestedListComprehensions:   222ms   245ms   -9.5%   228ms   251ms
  -9.2%
          NormalClassAttribute:   352ms   376ms   -6.2%   353ms   378ms
  -6.6%
       NormalInstanceAttribute:   188ms   212ms  -11.1%   191ms   215ms
 -11.3%
           PythonFunctionCalls:   173ms   192ms   -9.6%   174ms   193ms
  -9.7%
             PythonMethodCalls:   213ms   226ms   -5.6%   214ms   227ms
  -5.8%
                     Recursion:   275ms   295ms   -7.0%   276ms   296ms
  -7.0%
                  SecondImport:   158ms   158ms   +0.2%   160ms   160ms
  +0.1%
           SecondPackageImport:   164ms   168ms   -2.2%   166ms   170ms
  -2.8%
         SecondSubmoduleImport:   221ms   226ms   -2.4%   223ms   229ms
  -2.9%
       SimpleComplexArithmetic:   143ms   166ms  -13.5%   144ms   166ms
 -13.4%
        SimpleDictManipulation:   269ms   296ms   -9.2%   270ms   297ms
  -9.0%
         SimpleFloatArithmetic:   150ms   177ms  -15.3%   151ms   196ms
 -23.2%
      SimpleIntFloatArithmetic:   200ms   226ms  -11.4%   203ms   232ms
 -12.4%
       SimpleIntegerArithmetic:   200ms   226ms  -11.4%   202ms   235ms
 -13.9%
      SimpleListComprehensions:   186ms   211ms  -12.0%   193ms   218ms
 -11.5%
        SimpleListManipulation:   161ms   180ms  -10.6%   161ms   181ms
 -10.9%
          SimpleLongArithmetic:   137ms   159ms  -14.0%   137ms   162ms
 -15.2%
                    SmallLists:   197ms   209ms   -5.6%   201ms   210ms
  -4.2%
                   SmallTuples:   215ms   234ms   -8.2%   216ms   236ms
  -8.2%
         SpecialClassAttribute:   544ms   566ms   -4.0%   549ms   571ms
  -3.7%
      SpecialInstanceAttribute:   188ms   211ms  -11.0%   190ms   214ms
 -11.0%
                StringMappings:   432ms   439ms   -1.6%   433ms   441ms
  -1.9%
              StringPredicates:   183ms   198ms   -7.8%   183ms   199ms
  -7.9%
                 StringSlicing:   304ms   317ms   -4.2%   310ms   321ms
  -3.3%
                     TryExcept:   138ms   154ms  -10.5%   139ms   155ms
 -10.4%
                    TryFinally:   149ms   156ms   -4.8%   151ms   159ms
  -5.1%
                TryRaiseExcept:    76ms    78ms   -2.3%    77ms    82ms
  -6.3%
                  TupleSlicing:   254ms   254ms   -0.2%   254ms   257ms
  -1.0%
                   WithFinally:   204ms   207ms   -1.1%   206ms   209ms
  -1.1%
               WithRaiseExcept:   187ms   190ms   -2.0%   188ms   192ms
  -2.2%
-------------------------------------------------------------------------------
Totals:                         10557ms 11266ms   -6.3% 10647ms 11397ms
  -6.6%

(this=likely_object.pybench, other=classic_opcode.pybench)
msg79822 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-14 02:42
I get a slowdown here (results attached). Many objects are shortlived in
Python, so I'm not sure the approach is valid anyway.
msg79831 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-14 05:23
If speedup on other machines are confirmed, a slowdown of less than 2%
should be acceptable (it's also similar to the statistic noise I have on
my machine - it's a pity pybench does not calculate the standard
deviation of the execution time).

It is true that many objects are short lived, and that makes
generational GC work, but the patch is about another point. Do most
refcount decrements lead to garbage collection? That's a completely
different question. And I don't think that popping a function call
argument usually leads to a destructor invocation, for instance, unless
the argument is a temporary. The same reasoning holds for most other
opcodes.

The simplest way to check this is to look at the compilation output with
Profile-Guided Optimization active and verify which branch is optimized.

Having said that, the slowdown you get (on your Athlon X2 3600+) might
be due to a higher cost in case of failed static prediction, since that
leads to 2 jumps.

This code leads to 0 or 1 jump:

if (likely(cond)) {
  destructor
}

while with unlikely, one runs through either 0 or 2 jumps. The second
jump is unconditional, so it has a much smaller delay (the decode unit
has to tell the fetch unit to fetch another instruction). However, in
that case one has to invoke anyway the destructor through a virtual
call, which is much more expensive.
msg79834 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-01-14 06:28
I dislike about this patch that there are two different else cases
(giving the trivial macro definition); there should only be one such case.
msg79851 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-14 11:45
Also, GCC 2.95 does not support the construct, GCC 2.96 is required.
So, I'd suggest defining likely/unlikely unconditionally and using this,
which leads to less code overall:

#  if (__GNUC__ < 2) || ((__GNUC__ == 2) && (__GNUC_MINOR__ <= 95))
#  define __builtin_expect(exp, prediction)  (exp)
#  endif

This is the simplest way to code this, assuming that ICC defines
__GNUC__, and my reference guide tells "yes, it defines __GNUC__ and
__INTEL_COMPILER".
msg79983 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-16 23:02
You could also test this principle with a few opcodes in ceval.c.
Especially, the error cases in LOAD_FAST, LOAD_GLOBAL etc. could be
flagged as unlikely.
msg79987 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-16 23:41
Yep, agreed. It could be quite cool to rely on __attribute__((cold)) to
mark format_exc_check_arg(), but that only works with newer gcc's. I
guess I'll add many likely() by hand - in some cases GCC might already
get it right, but the heuristics used to choose the likely path
statically are quite arguable and subject to change. Also, on other
archs such hints might have a bigger impact on the generated code.

I'll give a look at that not earlier than February, or you're welcome to
try this. However, at least on x86, an "if (successCondition) {
success(); dispatch(); } failure();" will be probably compiled to
assembly like follows, equivalent to adding likely to the test:

evaluate successCondition
if_false goto err:
call to success();
dispatch next instrutions.

err:
call to failure();
msg80288 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-01-20 22:44
New version of the patch, as well as one that doesn't touch Py_DECREF
but defines likely and unlikely. Then, three ceval patches that result
in small speedups (2% to 8% here), but don't play well together (neither
with the Py_DECREF one).

Since I get speedups for some changes and slowdowns when more than one
is applied, I suspect my system is hitting some limitation (register
stavation?) and contributing more noise that signal :/
msg80290 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2009-01-20 22:57
Minor comments on likely_object.diff:
* Py_LIKELY() and Py_UNLIKELY() would be better spellings than likely()
and unlikely().
* The definitions should go in pyport.h instead of object.h
* Usually don't put spaces after the "#".
* Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of
__builtin_expect so new compilers can easily add their own definitions.
msg80569 - (view) Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade) Date: 2009-01-26 11:53
>Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of
__builtin_expect so new compilers can easily add their own definitions.

This was done in the first version, but with the currently supported
compilers it's simpler to do like that, because both GCC and ICC support
the same __builtin_expect syntax, so you get less code this way.

Anyway, the code was "inspired" from the Linux kernel which only
supports those two compilers, so anybody more knowledgeable is welcome
to suggest how to express this with other compilers.
msg115644 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-05 12:59
The whole approach doesn't seem to bear much fruit. I tried to apply again likely_decref.diff and got a 0% performance change on 3.2 (on a Core i3 processor).
History
Date User Action Args
2022-04-11 14:56:44adminsetgithub: 49191
2010-09-05 12:59:26pitrousetstatus: open -> closed
resolution: rejected
messages: + msg115644
2009-01-26 11:53:13blaisorbladesetmessages: + msg80569
2009-01-20 22:57:12jyasskinsetmessages: + msg80290
2009-01-20 22:49:33ajaksu2setfiles: + ceval_function.diff
2009-01-20 22:48:18ajaksu2setfiles: + ceval_tuple_unpack.diff
2009-01-20 22:46:30ajaksu2setfiles: + ceval_exception.diff
2009-01-20 22:44:50ajaksu2setfiles: + likely_object.diff
2009-01-20 22:44:00ajaksu2setfiles: + likely_decref.diff
messages: + msg80288
2009-01-20 22:38:46ajaksu2setfiles: - likely_decref.diff
2009-01-16 23:41:44blaisorbladesetmessages: + msg79987
2009-01-16 23:02:35pitrousetmessages: + msg79983
2009-01-16 17:10:14collinwintersetnosy: + collinwinter, jyasskin
2009-01-14 11:45:12blaisorbladesetmessages: + msg79851
2009-01-14 06:28:56loewissetnosy: + loewis
messages: + msg79834
2009-01-14 05:23:01blaisorbladesetmessages: + msg79831
2009-01-14 02:42:30pitrousetfiles: + pybench.txt
messages: + msg79822
2009-01-14 02:10:30ajaksu2create