
classification
Title: Armin's method cache optimization updated for Python 2.6
Type: Stage:
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, arigo, bioinformed, collinwinter, georg.brandl, jcea, jyasskin, nnorwitz, ntoronto, peaker, rhettinger
Priority: normal Keywords: patch

Created on 2007-04-13 19:16 by bioinformed, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
python26-classattrcache.diff bioinformed, 2007-04-13 19:16
fastattr_test.py ntoronto, 2007-12-07 03:53
Messages (23)
msg52431 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-04-13 19:16
I've forward-ported and slightly cleaned up Armin's method cache patch (see #1685986).  I've attempted to clarify and tighten up several loose ends in the code, so hopefully I haven't mucked anything up.

My performance results are not quite as good as Armin's, though still very encouraging.  I see a typical speedup of 10%, with heavily object-oriented application code seeing speedups of 15-20%.  Given these rather significant results, the major task is to verify correctness.
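
For context, the scheme works roughly as follows.  This is a minimal sketch of the approach, not the patch's actual code: the cache size, the entry layout, and the names (cache_entry, cached_lookup, slow_mro_lookup) are illustrative, and tp_version_tag stands for the per-type version counter the patch introduces.

#include "Python.h"

#define CACHE_SIZE_EXP 10                  /* 1024 entries; a tuning choice */
#define CACHE_MASK     ((1 << CACHE_SIZE_EXP) - 1)

struct cache_entry {
    unsigned int version;  /* the type's version tag when the entry was filled */
    PyObject *name;        /* attribute name (owned reference) */
    PyObject *value;       /* borrowed; the version check keeps it valid */
};

static struct cache_entry cache[1 << CACHE_SIZE_EXP];

/* Hypothetical stand-in for the original uncached MRO walk. */
static PyObject *slow_mro_lookup(PyTypeObject *type, PyObject *name);

static PyObject *
cached_lookup(PyTypeObject *type, PyObject *name)
{
    unsigned int tag = type->tp_version_tag;  /* bumped on any MRO change */
    long h = PyObject_Hash(name);             /* cheap for interned strings */
    struct cache_entry *e = &cache[(tag ^ (unsigned int)h) & CACHE_MASK];

    /* Identity compare suffices because lookup names are interned. */
    if (e->version == tag && e->name == name)
        return e->value;                      /* hit: no MRO walk at all */

    /* Miss: do the full walk and overwrite this slot.  Collisions are
       never re-probed, which keeps the fast path tiny. */
    e->value = slow_mro_lookup(type, name);
    e->version = tag;
    Py_INCREF(name);
    Py_XDECREF(e->name);
    e->name = name;
    return e->value;
}

The crucial invariant is that every mutation of a type, or of any type on its MRO, must bump the version tags (or flush the cache); that invariant comes up again later in this thread.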
msg52432 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-04-14 06:21
I tried this test on parrotbench (b0.py in particular), and I'm not sure I could distinguish an improvement over the noise (best case was about 5%).  The variability is pretty high on my box (dual AMD64 Opterons, Ubuntu, gcc 4.1.2-ish).  At first it seemed a bit slower with the original patch, which uses unsigned ints.  I tried changing to unsigned longs.  It seemed a little better, though I'm not sure the improvement was really due to the change.  I think unsigned longs should be used for 64-bit boxes.

Did you use a benchmark/program that is open source?  I don't know that I have anything decent to test this with.  Raymond probably does though.  Also what platform did you test on?
msg52433 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-04-14 13:49
I benchmarked using pystone, pybench, and a bunch of my local scientific applications that have tight computational kernels still in pure Python.  I tested on a 64-bit Linux box, defining the version_tag as either int or unsigned long, with no appreciable difference in performance.  I'm trying to get parrotbench to run, but it doesn't seem to have been updated for modern Python versions.
msg52434 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2007-05-14 19:12
A further minor improvement could be achieved by re-doing the inlining of _PyType_Lookup() in object.c.  Ideally, only the fast path would be inlined, with the rest put in a separate function or simply left to _PyType_Lookup().
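
Concretely, the suggested shape is something like this sketch, reusing the illustrative names from the sketch above (lookup_fast is a hypothetical name for the code that would be inlined at the object.c call sites):

/* The few-instruction probe that would be inlined at the call site. */
static PyObject *
lookup_fast(PyTypeObject *type, PyObject *name)
{
    unsigned int tag = type->tp_version_tag;
    struct cache_entry *e =
        &cache[(tag ^ (unsigned int)PyObject_Hash(name)) & CACHE_MASK];

    if (e->version == tag && e->name == name)
        return e->value;                  /* hot path stays inline */
    return _PyType_Lookup(type, name);    /* cold path: out-of-line call */
}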
msg52435 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-05-14 20:18
I tried re-inlining the fast path from _PyType_Lookup() in object.c and found no measurable improvement on the simple benchmarks I tried.  I've also stress-tested the patch by disabling the fast-path return, always performing the slow-path lookup, and asserting that the cached result matches the slow-path result.  I then ran that modified interpreter on the Python test-suite, various benchmarks, and a range of my own applications.  While not a formal proof of correctness, it was encouraging that the cache remained consistent.
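
The stress test amounts to something like the following sketch, again reusing the illustrative helpers from above (checked_lookup is a hypothetical name):

#include <assert.h>

/* Debug-only variant: always do the slow walk and verify the cache. */
static PyObject *
checked_lookup(PyTypeObject *type, PyObject *name)
{
    PyObject *cached = cached_lookup(type, name);   /* probes and fills */
    PyObject *slow = slow_mro_lookup(type, name);   /* ground truth */
    assert(cached == slow);    /* any stale entry trips this immediately */
    return slow;
}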
msg52436 - (view) Author: Eyal Lotem (peaker) Date: 2007-06-10 01:16
Why is CPython copying functions to inline them?

Not only is this code duplication that makes the code easy to break; it also bloats the calling function beyond readability.

I believe that using a private .c #include file with an inline keyword, or, at worst, #include'ing the function code directly, is a better solution.

Is there any reason to use the method currently used by CPython?
msg52437 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-06-10 02:08
First off, all versions of this patch do away with the rather aggressively repeated inline code.  My previous comment about refactoring and testing an inlined form was purely an experiment with null results.

That aside, you do raise a good question.  However, it is unfortunately off-topic and irrelevant to the consideration of the current patch.  Please feel free to pursue it elsewhere, since I worry that it will only serve as a negative distraction from the much more interesting aims of the MRO optimization.  Since you are clearly worried about the performance of attribute lookup, please try it out and report your findings.  I'll be happy to review the results from your benchmarks and any suggestions you have.
msg52438 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-06-10 02:16
FWIW, I'm still reviewing this code. 
msg52439 - (view) Author: Eyal Lotem (peaker) Date: 2007-06-10 02:48
I may be missing something trivial here, but why is the cache global and not per-type-object?

As I saw it, the MRO cache was supposed to be a mapping (type, name) -> (descriptor/class attribute), based on the __mro__ of the type and on all of the dicts in the type and its bases.

I think the right way to do it is to use another Pythonic dict object in each type that holds this cache (name -> (descriptor/class attribute)).
This cache dict should essentially be equivalent to a merge of all the dicts in the MRO, and it would get updated whenever any of the dicts of the types in the MRO gets updated.
I'm not sure why one would re-implement a fast hashtable when one already exists.
Also, once the cache is associated with the type, it does not hold any references to any names that are not already referenced by someone in the type hierarchy, so there are no memory-management/GC issues either.

Am I missing something here?
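
A sketch of the per-type variant being proposed here (this is the alternative under discussion, not what the patch does; the otherwise-unused tp_cache slot is used as a plausible home for the merged dict):

/* Rebuild a type's merged attribute dict.  Would have to be re-run (or
   patched) whenever any dict along the MRO changes. */
static int
rebuild_merged_cache(PyTypeObject *type)
{
    PyObject *merged = PyDict_New();
    Py_ssize_t i;

    if (merged == NULL)
        return -1;
    /* Walk the MRO from base to most-derived so that derived entries
       overwrite inherited ones, matching normal lookup order. */
    for (i = PyTuple_GET_SIZE(type->tp_mro) - 1; i >= 0; i--) {
        PyTypeObject *b = (PyTypeObject *)PyTuple_GET_ITEM(type->tp_mro, i);
        if (PyDict_Update(merged, b->tp_dict) < 0) {
            Py_DECREF(merged);
            return -1;
        }
    }
    Py_XDECREF(type->tp_cache);
    type->tp_cache = merged;   /* a lookup is then one PyDict_GetItem */
    return 0;
}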
msg52440 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-06-10 03:19
The cache is global to make it fast, have a deterministic and low memory footprint, and be easy to invalidate.  It is distinct from the standard Python dict implementation since it need not be as general nor fully associative, as cache collisions are not subject to re-probing.  Neither does it even go through the motions of memory management, which would be rather hard to get the standard dict implementation to emulate.  It is kept drastically minimal in the interests of performance and goes to great pains to hide and minimize the impact to the rest of the interpreter.  In this regard it seems quite successful, as it even removes the need for the manually inlined code you commented on before.  It also survives harsh stress-testing and presents no detectable performance regressions in any of my benchmarks (to the contrary, many operations are significantly faster).

While a per-type-object cache sounds feasible, it is hard for me to see how it could be as efficient in the general case -- for one, the dict lookup fast path is just so much heavier.  Please feel free to take that as a challenge to produce a patch that contradicts my intuition.
msg52441 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2007-06-10 07:45
peaker: we tried several approaches (including ones similar to what you describe) in PyPy, and the one in this patch is the one that worked best there.  There is of course no guarantee that the same experiments would have given exactly the same results in CPython, so feel free to try them all again - I won't :-)
msg58262 - (view) Author: Neil Toronto (ntoronto) Date: 2007-12-07 03:53
I've attached the microbenchmarks I was using for my own version of
attribute caching. For list, tuple, dict, and a deep hierarchy, it tests
accessing type attributes (cls.__class__), class attributes
(cls.__init__), class attributes via instance (inst.__class__), and
instance attributes (inst.__init__), using LOAD_ATTR and hasattr. It
also tests hasattr with missing attributes.
msg58263 - (view) Author: Neil Toronto (ntoronto) Date: 2007-12-07 03:57
Attribute access that happens only once or very infrequently probably shouldn't go through the cache machinery -- it'll only slow things down (updating slots, for example).  Maybe _PyType_Lookup should remain the same, with _PyType_CachingLookup added for attribute access via type/instance getattr.
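
In sketch form, again with the illustrative helpers from earlier (_PyType_CachingLookup is the proposed addition, not an existing API):

/* Rare or cold callers (slot updates, for example) keep the plain walk: */
PyObject *
_PyType_Lookup(PyTypeObject *type, PyObject *name)
{
    return slow_mro_lookup(type, name);       /* no cache traffic at all */
}

/* Hot type/instance getattr paths opt in to the cache: */
PyObject *
_PyType_CachingLookup(PyTypeObject *type, PyObject *name)
{
    return cached_lookup(type, name);
}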
msg58932 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-21 01:44
Okay, this patch is accepted.  We'll likely do additional tweaks after
it goes in.

Please close the other tracker items for this same issue.

If no one applies this, I'll do it when I return from the holidays.
msg59818 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-12 13:48
Committed to trunk as r59931. Leaving open if you want to make more
adjustments, Raymond.
msg59854 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-13 15:04
Backed out again in r59940 -- test_ctypes fails in test_incomplete.py.

Armin reports that with his original patch on 2.4, this test passes.
msg59855 - (view) Author: Kevin Jacobs (bioinformed) Date: 2008-01-13 15:21
All tests passed when I first ported Armin's patch to 2.6.  I'll have a
chance to look into this later in the week.  If anyone gets to it first,
please summarize your findings here to avoid duplication of effort.
msg59857 - (view) Author: Kevin Jacobs (bioinformed) Date: 2008-01-13 17:01
Couldn't resist looking into this more, and I must admit that I am a bit stumped.  Instead of returning a value from the cache fast path in _PyType_Lookup, I modified the code to store the value and assert that it matches the non-cached result from traversing the MRO.  That assertion never fails.  However, test_incomplete works with this change, even though it ideally should not alter the semantics of _PyType_Lookup.

What's more, if I selectively don't cache the name "next" (with the caching fast path re-enabled), test_incomplete works.  For test_incomplete, _PyType_Lookup on "next" always returns NULL from both the cached and non-cached paths.

I'm beginning to suspect there is some black magic going on.  None of the relevant reference counts drop to zero, and the same return value (NULL) is generated, but the slow path seems to be doing something with a side effect that makes it work.

Wish I had more time to debug this, but I'm already overdrawn until
Wednesday...
msg59874 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-14 00:01
After some debug work, I found an explanation:
- The attribute "next" is the name of a type slot, so it becomes cached during type construction.
- In the function ctypes/_ctypes.c::StructUnionType_new(), the type's __dict__ is replaced by a "stgdict".
- This "stgdict" is then modified directly with PyDict_ functions, bypassing everything that would invalidate the method cache.
- The method cache becomes out of sync ==> boom.

I have come up with a patch which corrects the problem, and all ctypes
tests pass:

Index: Modules/_ctypes/stgdict.c
===================================================================
--- Modules/_ctypes/stgdict.c   (revision 59939)
+++ Modules/_ctypes/stgdict.c   (working copy)
@@ -470,7 +470,7 @@
                        Py_DECREF(pair);
                        return -1;
                }
-               if (-1 == PyDict_SetItem(realdict, name, prop)) {
+               if (-1 == PyObject_SetAttr(type, name, prop)) {
                        Py_DECREF(prop);
                        Py_DECREF(pair);
                        return -1;
Index: Modules/_ctypes/_ctypes.c
===================================================================
--- Modules/_ctypes/_ctypes.c   (revision 59939)
+++ Modules/_ctypes/_ctypes.c   (working copy)
@@ -410,7 +410,7 @@
 StructType_setattro(PyObject *self, PyObject *key, PyObject *value)
 {
        /* XXX Should we disallow deleting _fields_? */
-       if (-1 == PyObject_GenericSetAttr(self, key, value))
+       if (-1 == Py_TYPE(self)->tp_base->tp_setattro(self, key, value))
                return -1;

        if (value && PyString_Check(key) &&

I think that these changes are sensible: the first one deals with the type's attribute instead of updating its __dict__, and the second properly delegates "tp_setattro" to the base class ('type' in this case).
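
The invariant being restored is that every write to a type's attributes flows through a setattr slot that knows about the cache.  In sketch form (invalidate_cache is an illustrative name for whatever bumps the version tags or flushes entries):

/* Hypothetical invalidation hook provided by the caching patch. */
static void invalidate_cache(PyTypeObject *type);

/* The guarded path: type attribute writes invalidate the cache. */
static int
type_setattro_sketch(PyObject *type, PyObject *name, PyObject *value)
{
    if (PyObject_GenericSetAttr(type, name, value) < 0)
        return -1;
    invalidate_cache((PyTypeObject *)type);   /* keeps lookups honest */
    return 0;
}

/* What ctypes was effectively doing instead -- a direct write into
   tp_dict that no setattr slot (and hence no invalidation) ever sees:

       PyDict_SetItem(((PyTypeObject *)type)->tp_dict, name, prop);     */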
msg59875 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-01-14 00:11
Nice analysis.  Please apply.
msg59876 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-14 00:14
OK, I'll first apply my 2 lines, then the original patch.
msg59881 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-14 01:11
For the record:
It is very wrong to call self->ob_type->tp_base's slot: slots are often
copied from their base class, and you get infinite recursion.
Calling StructType.tp_base->tp_setattro is much better.
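
To spell out the trap (a sketch; SubMeta is a hypothetical metatype derived from StructType, and StructType_Type is assumed to be the metatype's C name, following _ctypes.c naming conventions):

extern PyTypeObject StructType_Type;   /* assumed name of the metatype */

static int
StructType_setattro_sketch(PyObject *self, PyObject *key, PyObject *value)
{
    /* Slot inheritance copies StructType_setattro into a sub-metatype's
       tp_setattro, so for Py_TYPE(self) == SubMeta:

           Py_TYPE(self)->tp_base->tp_setattro == StructType_setattro

       and delegating through Py_TYPE(self) would call this very function
       forever.  Naming the base statically always lands on 'type': */
    return StructType_Type.tp_base->tp_setattro(self, key, value);
}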
msg79912 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2009-01-15 18:54
Looks like this was re-applied in r59943 and r59944. Thanks for taking
care of this, Amaury.
History
Date                 User                Action  Args
2022-04-11 14:56:23  admin               set     github: 44842
2009-01-15 18:54:51  collinwinter        set     status: open -> closed
                                                 nosy: + collinwinter, jyasskin
                                                 resolution: accepted -> fixed
                                                 messages: + msg79912
2008-10-13 11:43:31  jcea                set     nosy: + jcea
2008-07-10 13:29:37  rhettinger          set     assignee: rhettinger ->
2008-01-14 01:11:46  amaury.forgeotdarc  set     messages: + msg59881
2008-01-14 00:14:46  amaury.forgeotdarc  set     messages: + msg59876
2008-01-14 00:11:10  rhettinger          set     messages: + msg59875
2008-01-14 00:01:41  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc
                                                 messages: + msg59874
2008-01-13 17:01:55  bioinformed         set     messages: + msg59857
2008-01-13 15:21:20  bioinformed         set     messages: + msg59855
2008-01-13 15:04:44  georg.brandl        set     messages: + msg59854
2008-01-12 13:48:20  georg.brandl        set     nosy: + georg.brandl
                                                 messages: + msg59818
2008-01-12 13:16:06  georg.brandl        link    issue1685986 superseder
2007-12-21 01:44:50  rhettinger          set     resolution: accepted
                                                 messages: + msg58932
2007-12-07 03:57:25  ntoronto            set     messages: + msg58263
2007-12-07 03:53:54  ntoronto            set     files: + fastattr_test.py
                                                 nosy: + ntoronto
                                                 messages: + msg58262
2007-04-13 19:16:08  bioinformed         create