Issue 4941: Tell GCC Py_DECREF is unlikely to call the destructor

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49191

classification

Title:	Tell GCC Py_DECREF is unlikely to call the destructor
Type:	performance	Stage:
Components:	Interpreter Core	Versions:	Python 3.0, Python 3.1

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	ajaksu2, blaisorblade, collinwinter, jyasskin, loewis, pitrou
Priority:	normal	Keywords:	patch

Created on 2009-01-14 02:10 by ajaksu2, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
pybench.txt	pitrou, 2009-01-14 02:42
likely_decref.diff	ajaksu2, 2009-01-20 22:44	New version of the Py_DECREF patch
likely_object.diff	ajaksu2, 2009-01-20 22:44	This one only defines likely/unlikely
ceval_exception.diff	ajaksu2, 2009-01-20 22:46	Adds hints to END_FINALLY
ceval_tuple_unpack.diff	ajaksu2, 2009-01-20 22:48	Adds hints to UNPACK_SEQUENCE, probably bogus outside of pybench
ceval_function.diff	ajaksu2, 2009-01-20 22:49	Adds hints to CALL_FUNCTION_*, certainly bogus outside pybench

Messages (11)
msg79821 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-01-14 02:10
As suggested by Paolo Giarrusso, add a "likely()" to Py_DECREF, telling GCC it doesn't call the destructor in the common path. This optimization seems to interact well with #4896 and to be slower on top of #4753 on my system. Patching ceval.c instead of object.h seems to give the same result, FWIW. Benchmarks most welcome :) Paolo Giarrusso wrote: "Anyway, yesterday I was looking at the assembly and I realized that GCC optimizes the case where Py_DECREF calls the destructor! Since I don't have time for this, can you hint GCC for the better by adding unlikely() to Py_DECREF, at least (or maybe only) for ceval.c, or maybe even only for the main interpreter?" pybench: Rounds: 10 Warp: 10 Timer: time.time Machine Details: Platform ID: Linux-2.6.24-23-generic-i686-with-debian-lenny-sid Processor: Intel Celeron M 410 Python: Implementation: CPython Executable: /home/ajaksu/py3k/python Version: 3.1.0 Compiler: GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3) Bits: 32bit Build: Jan 13 2009 23:40:05 (#py3k:68571M) Unicode: UCS2 Test minimum run-time average run-time this other diff this other diff ------------------------------------------------------------------------------- BuiltinFunctionCalls: 147ms 163ms -9.7% 148ms 164ms -10.0% BuiltinMethodLookup: 138ms 154ms -10.3% 139ms 155ms -10.6% CompareFloats: 185ms 191ms -3.4% 185ms 192ms -3.4% CompareFloatsIntegers: 304ms 294ms +3.1% 304ms 296ms +2.7% CompareIntegers: 287ms 281ms +2.0% 288ms 281ms +2.3% CompareInternedStrings: 189ms 217ms -12.7% 192ms 217ms -11.7% CompareLongs: 167ms 162ms +3.3% 168ms 162ms +3.3% CompareStrings: 152ms 178ms -14.2% 156ms 179ms -12.7% ComplexPythonFunctionCalls: 192ms 202ms -5.1% 193ms 206ms -6.2% ConcatStrings: 325ms 327ms -0.7% 326ms 330ms -1.1% CreateInstances: 200ms 209ms -4.4% 204ms 210ms -2.9% CreateNewInstances: 149ms 156ms -4.4% 151ms 161ms -5.8% CreateStringsWithConcat: 255ms 272ms -6.1% 258ms 275ms -6.3% DictCreation: 145ms 151ms -4.0% 146ms 154ms -5.3% DictWithFloatKeys: 209ms 224ms -6.8% 209ms 224ms -6.7% DictWithIntegerKeys: 170ms 185ms -8.0% 171ms 187ms -8.3% DictWithStringKeys: 155ms 172ms -10.1% 156ms 173ms -10.2% ForLoops: 162ms 183ms -11.7% 162ms 183ms -11.4% IfThenElse: 189ms 217ms -12.7% 190ms 219ms -13.1% ListSlicing: 156ms 156ms -0.3% 156ms 158ms -1.0% NestedForLoops: 199ms 219ms -9.1% 201ms 220ms -8.6% NestedListComprehensions: 222ms 245ms -9.5% 228ms 251ms -9.2% NormalClassAttribute: 352ms 376ms -6.2% 353ms 378ms -6.6% NormalInstanceAttribute: 188ms 212ms -11.1% 191ms 215ms -11.3% PythonFunctionCalls: 173ms 192ms -9.6% 174ms 193ms -9.7% PythonMethodCalls: 213ms 226ms -5.6% 214ms 227ms -5.8% Recursion: 275ms 295ms -7.0% 276ms 296ms -7.0% SecondImport: 158ms 158ms +0.2% 160ms 160ms +0.1% SecondPackageImport: 164ms 168ms -2.2% 166ms 170ms -2.8% SecondSubmoduleImport: 221ms 226ms -2.4% 223ms 229ms -2.9% SimpleComplexArithmetic: 143ms 166ms -13.5% 144ms 166ms -13.4% SimpleDictManipulation: 269ms 296ms -9.2% 270ms 297ms -9.0% SimpleFloatArithmetic: 150ms 177ms -15.3% 151ms 196ms -23.2% SimpleIntFloatArithmetic: 200ms 226ms -11.4% 203ms 232ms -12.4% SimpleIntegerArithmetic: 200ms 226ms -11.4% 202ms 235ms -13.9% SimpleListComprehensions: 186ms 211ms -12.0% 193ms 218ms -11.5% SimpleListManipulation: 161ms 180ms -10.6% 161ms 181ms -10.9% SimpleLongArithmetic: 137ms 159ms -14.0% 137ms 162ms -15.2% SmallLists: 197ms 209ms -5.6% 201ms 210ms -4.2% SmallTuples: 215ms 234ms -8.2% 216ms 236ms -8.2% SpecialClassAttribute: 544ms 566ms -4.0% 549ms 571ms -3.7% SpecialInstanceAttribute: 188ms 211ms -11.0% 190ms 214ms -11.0% StringMappings: 432ms 439ms -1.6% 433ms 441ms -1.9% StringPredicates: 183ms 198ms -7.8% 183ms 199ms -7.9% StringSlicing: 304ms 317ms -4.2% 310ms 321ms -3.3% TryExcept: 138ms 154ms -10.5% 139ms 155ms -10.4% TryFinally: 149ms 156ms -4.8% 151ms 159ms -5.1% TryRaiseExcept: 76ms 78ms -2.3% 77ms 82ms -6.3% TupleSlicing: 254ms 254ms -0.2% 254ms 257ms -1.0% WithFinally: 204ms 207ms -1.1% 206ms 209ms -1.1% WithRaiseExcept: 187ms 190ms -2.0% 188ms 192ms -2.2% ------------------------------------------------------------------------------- Totals: 10557ms 11266ms -6.3% 10647ms 11397ms -6.6% (this=likely_object.pybench, other=classic_opcode.pybench)
msg79822 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-01-14 02:42
I get a slowdown here (results attached). Many objects are shortlived in Python, so I'm not sure the approach is valid anyway.
msg79831 - (view)	Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade)	Date: 2009-01-14 05:23
If speedup on other machines are confirmed, a slowdown of less than 2% should be acceptable (it's also similar to the statistic noise I have on my machine - it's a pity pybench does not calculate the standard deviation of the execution time). It is true that many objects are short lived, and that makes generational GC work, but the patch is about another point. Do most refcount decrements lead to garbage collection? That's a completely different question. And I don't think that popping a function call argument usually leads to a destructor invocation, for instance, unless the argument is a temporary. The same reasoning holds for most other opcodes. The simplest way to check this is to look at the compilation output with Profile-Guided Optimization active and verify which branch is optimized. Having said that, the slowdown you get (on your Athlon X2 3600+) might be due to a higher cost in case of failed static prediction, since that leads to 2 jumps. This code leads to 0 or 1 jump: if (likely(cond)) { destructor } while with unlikely, one runs through either 0 or 2 jumps. The second jump is unconditional, so it has a much smaller delay (the decode unit has to tell the fetch unit to fetch another instruction). However, in that case one has to invoke anyway the destructor through a virtual call, which is much more expensive.
msg79834 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-01-14 06:28
I dislike about this patch that there are two different else cases (giving the trivial macro definition); there should only be one such case.
msg79851 - (view)	Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade)	Date: 2009-01-14 11:45
Also, GCC 2.95 does not support the construct, GCC 2.96 is required. So, I'd suggest defining likely/unlikely unconditionally and using this, which leads to less code overall: # if (__GNUC__ < 2) \|\| ((__GNUC__ == 2) && (__GNUC_MINOR__ <= 95)) # define __builtin_expect(exp, prediction) (exp) # endif This is the simplest way to code this, assuming that ICC defines __GNUC__, and my reference guide tells "yes, it defines __GNUC__ and __INTEL_COMPILER".
msg79983 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-01-16 23:02
You could also test this principle with a few opcodes in ceval.c. Especially, the error cases in LOAD_FAST, LOAD_GLOBAL etc. could be flagged as unlikely.
msg79987 - (view)	Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade)	Date: 2009-01-16 23:41
Yep, agreed. It could be quite cool to rely on __attribute__((cold)) to mark format_exc_check_arg(), but that only works with newer gcc's. I guess I'll add many likely() by hand - in some cases GCC might already get it right, but the heuristics used to choose the likely path statically are quite arguable and subject to change. Also, on other archs such hints might have a bigger impact on the generated code. I'll give a look at that not earlier than February, or you're welcome to try this. However, at least on x86, an "if (successCondition) { success(); dispatch(); } failure();" will be probably compiled to assembly like follows, equivalent to adding likely to the test: evaluate successCondition if_false goto err: call to success(); dispatch next instrutions. err: call to failure();
msg80288 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-01-20 22:44
New version of the patch, as well as one that doesn't touch Py_DECREF but defines likely and unlikely. Then, three ceval patches that result in small speedups (2% to 8% here), but don't play well together (neither with the Py_DECREF one). Since I get speedups for some changes and slowdowns when more than one is applied, I suspect my system is hitting some limitation (register stavation?) and contributing more noise that signal :/
msg80290 - (view)	Author: Jeffrey Yasskin (jyasskin) *	Date: 2009-01-20 22:57
Minor comments on likely_object.diff: * Py_LIKELY() and Py_UNLIKELY() would be better spellings than likely() and unlikely(). * The definitions should go in pyport.h instead of object.h * Usually don't put spaces after the "#". * Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of __builtin_expect so new compilers can easily add their own definitions.
msg80569 - (view)	Author: Paolo 'Blaisorblade' Giarrusso (blaisorblade)	Date: 2009-01-26 11:53
>Probably #if the definitions of Py_LIKELY and Py_UNLIKELY instead of __builtin_expect so new compilers can easily add their own definitions. This was done in the first version, but with the currently supported compilers it's simpler to do like that, because both GCC and ICC support the same __builtin_expect syntax, so you get less code this way. Anyway, the code was "inspired" from the Linux kernel which only supports those two compilers, so anybody more knowledgeable is welcome to suggest how to express this with other compilers.
msg115644 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-09-05 12:59
The whole approach doesn't seem to bear much fruit. I tried to apply again likely_decref.diff and got a 0% performance change on 3.2 (on a Core i3 processor).

History
Date	User	Action	Args
2022-04-11 14:56:44	admin	set	github: 49191
2010-09-05 12:59:26	pitrou	set	status: open -> closed resolution: rejected messages: + msg115644
2009-01-26 11:53:13	blaisorblade	set	messages: + msg80569
2009-01-20 22:57:12	jyasskin	set	messages: + msg80290
2009-01-20 22:49:33	ajaksu2	set	files: + ceval_function.diff
2009-01-20 22:48:18	ajaksu2	set	files: + ceval_tuple_unpack.diff
2009-01-20 22:46:30	ajaksu2	set	files: + ceval_exception.diff
2009-01-20 22:44:50	ajaksu2	set	files: + likely_object.diff
2009-01-20 22:44:00	ajaksu2	set	files: + likely_decref.diff messages: + msg80288
2009-01-20 22:38:46	ajaksu2	set	files: - likely_decref.diff
2009-01-16 23:41:44	blaisorblade	set	messages: + msg79987
2009-01-16 23:02:35	pitrou	set	messages: + msg79983
2009-01-16 17:10:14	collinwinter	set	nosy: + collinwinter, jyasskin
2009-01-14 11:45:12	blaisorblade	set	messages: + msg79851
2009-01-14 06:28:56	loewis	set	nosy: + loewis messages: + msg79834
2009-01-14 05:23:01	blaisorblade	set	messages: + msg79831
2009-01-14 02:42:30	pitrou	set	files: + pybench.txt messages: + msg79822
2009-01-14 02:10:30	ajaksu2	create