classification
Title: Armin's method cache optimization updated for Python 2.6
Type: Stage:
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, arigo, bioinformed, collinwinter, georg.brandl, jcea, jyasskin, nnorwitz, ntoronto, peaker, rhettinger
Priority: normal Keywords: patch

Created on 2007-04-13 19:16 by bioinformed, last changed 2009-01-15 18:54 by collinwinter. This issue is now closed.

Files
File name Uploaded Description Edit
python26-classattrcache.diff bioinformed, 2007-04-13 19:16
fastattr_test.py ntoronto, 2007-12-07 03:53
Messages (23)
msg52431 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-04-13 19:16
I've forward ported and slightly cleaned up Armin's method cache patch (see #1685986).  I've attempted to clarify and tighten up several loose ends in the code, so hopefully I haven't mucked anything up.

My performance results are not quite as good as Armin's, though still very encouraging.  I see a typical speed up of 10%, with heavily object oriented application code seeing speedups of 15-20%.  Given these rather significant results, the major task is to verify correctness.
msg52432 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-04-14 06:21
I tried this test on parrotbench (b0.py in particular) and I'm not sure I could distinguish an improvement over the noise (best case was about 5%).  The variability is pretty high on my box (dual amd64 opertons, ubuntu gcc 4.1.2-ish).  At first it seemed a bit slower with the original patch which uses unsigned ints.  I tried changing to unsigned longs.  It seemed a little better, though not sure if it was really due to the change.  I think unsigned longs should be used for 64-bit boxes.

Did you use a benchmark/program that is open source?  I don't know that I have anything decent to test this with.  Raymond probably does though.  Also what platform did you test on?
msg52433 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-04-14 13:49
I benchmarked using pystone, pybench, and a bunch of my local scientific applications that have tight computational kernels still in pure Python.  I tested on a 64bit Linux box, defining the version_tag as either int or unsigned long with no appreciable difference in performance.  I'm trying to get parrotbench to run, but it doesn't seem to have been updated for modern Python versions.
msg52434 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2007-05-14 19:12
A further minor improvement could be achieved by
re-doing the inlining of _PyType_Lookup() in
object.c.  Ideally, the fast path only could be
inlined, and the rest put in a separate function
or just using _PyType_Lookup().
msg52435 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-05-14 20:18
I tried re-inlining the fast path from _PyType_Lookup() in object.c and found no measurable improvement on the simple benchmarks I tried.  I've also stress-tested the patch by disabling the fast-path return, always performing the slow-path lookup, and asserting that the cached result matches the slow-path result.  I then ran that modified interpreter on the Python test-suite, various benchmarks, and a range of my own applications.  While not a formal proof of correctness, it was encouraging that the cache remained consistent.
msg52436 - (view) Author: Eyal Lotem (peaker) Date: 2007-06-10 01:16
Why is CPython copying functions to inline them?

This is not only code duplication that makes the code easy to break, it also bloats the caller function beyond readability.

I believe that using a private .c #include file with an inline keyword, or at worst, #include the function code directly is a better solution.

Is there any reason to use the method currently used by CPython?
msg52437 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-06-10 02:08
First off, all versions of this patch do away with the rather aggressively repeated inline code.  My previous comment about refactoring and testing an inlined form were purely an experiment with null results.

That aside, you do raise a good question.  However, given the current patch, it is unfortunately off-topic and irrelevant to the consideration of this patch.  Please feel free to pursue it elsewhere, since I worry that it will only serve as a negative distraction from the much more interesting aims of the mro optimization.  Since you are clearly worried about the performance of attribute lookup, please try it out and report your findings.  I'll be happy review the results from your benchmarks and any suggestions you have.
msg52438 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-06-10 02:16
FWIW, I'm still reviewing this code. 
msg52439 - (view) Author: Eyal Lotem (peaker) Date: 2007-06-10 02:48
I may be missing something trivial here, but why the cache global - and not per-typeobject?

As I saw it - the mro cache was supposed to be a mapping of (type,name)->(descriptor/classattr) that's based on the __mro__ of type, and all of the dicts in type and its bases.

I think the right way to do it is to use another Pythonic dict object in each type, that holds this cache (name->(descriptor/classattr)).
This cache dict should essentially be equivalent to a merge of all the dicts in the mro. and gets updated whenever any of the dicts of the types in the mro gets updated.
Not sure why to re-implement a fast-hashtable when one already exists.
Also - once it is associated with the type - it does not hold any references to any names that are not already referenced by someone in the type hierarchy, so there are no mm/gc issues either.

Am I missing something here?
msg52440 - (view) Author: Kevin Jacobs (bioinformed) Date: 2007-06-10 03:19
The cache is global to make it fast, have a deterministic and low memory footprint, and be easy to invalidate.  It is distinct from the standard Python dict implementation since it need not be as general nor fully associative, as cache collisions are not subject to re-probing.  Neither does it even go through the motions of memory management, which would be rather hard to get the standard dict implementation to emulate.  It is kept drastically minimal in the interests of performance and goes to great pains to hide and minimize the impact to the rest of the interpreter.  In this regard it seems quite successful, as it even removes the need for the manually inlined code you commented on before.  It also survives harsh stress-testing and presents no detectable performance regressions in any of my benchmarks (to the contrary, many operations are significantly faster).

While a per-type-object cache sounds feasible, it is hard for me to see how it could be as efficient in the general case -- for one, the dict look up fast path is just so much heavier.  Please feel free to take that as a challenge to produce a patch that contradicts my intuition.
msg52441 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2007-06-10 07:45
peaker: we tried several approaches (including ones similar to what you describe) in PyPy, and the one in this patch is the one that worked best there.  There is of course no guarantee that the same experiments would have given exactly the same results in CPython, so feel free to try them all again - I won't :-)
msg58262 - (view) Author: Neil Toronto (ntoronto) Date: 2007-12-07 03:53
I've attached the microbenchmarks I was using for my own version of
attribute caching. For list, tuple, dict, and a deep hierarchy, it tests
accessing type attributes (cls.__class__), class attributes
(cls.__init__), class attributes via instance (inst.__class__), and
instance attributes (inst.__init__), using LOAD_ATTR and hasattr. It
also tests hasattr with missing attributes.
msg58263 - (view) Author: Neil Toronto (ntoronto) Date: 2007-12-07 03:57
Attribute access that happens only once or very infrequently probably
shouldn't go through the cache machinery - it'll only slow things down.
(Updating slots for example.) Maybe _PyType_Lookup should remain the
same, with _PyType_CachingLookup added for attribute access via
type/instance getattr.
msg58932 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2007-12-21 01:44
Okay, this patch is accepted.  We'll likely do additional tweaks after
it goes in.

Please close the other tracker items for this same issue.

If no one applies this, I'll do it when I return from the holidays.
msg59818 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-12 13:48
Committed to trunk as r59931. Leaving open if you want to make more
adjustments, Raymond.
msg59854 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-01-13 15:04
Backed out again in r59940 -- test_ctypes fails in test_incomplete.py.

Armin reports that with his original patch on 2.4, this test passes.
msg59855 - (view) Author: Kevin Jacobs (bioinformed) Date: 2008-01-13 15:21
All tests passed when I first ported Armin's patch to 2.6.  I'll have a
chance to look into this later in the week.  If anyone gets to it first,
please summarize your findings here to avoid duplication of effort.
msg59857 - (view) Author: Kevin Jacobs (bioinformed) Date: 2008-01-13 17:01
Couldn't resist looking into this more and I must admit that I am a bit
stumped.  If instead of returning a value from the cache fast path in
_PyType_Lookup, I modified the code to store the value and assert that
it matches with the non-cached result from traversing the mro.  That
assertion never fails.  However, test_incomplete works with this change
that ideally should not alter the semantics of _PyType_Lookup.

More so, if I selectively don't cache the name "next" (with the caching
fast path reenabled), test_incomplete works.  For test_incomplete,
_PyType_Lookup on "next" always return NULL from both the cached and
non-cached paths.  

I'm beginning to suspect there is some black magic going on.  None of
the relevant reference counts drop to zero, the same return value (NULL)
is generated, but the slow patch seems to be doing something with a side
effect that makes it work.

Wish I had more time to debug this, but I'm already overdrawn until
Wednesday...
msg59874 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-14 00:01
After some debug work, I found an explanation:
- The attribute "name" is the name of a type slot. So it becomes cached
during the type construction.
- in function ctypes/_ctypes.c::StructUnionType_new(), the type's
__dict__ is replaced by a "stgdict".
- this "stgdict" is then modified with PyDict_ functions, without any
call to _PyType_Lookup()
- the method cache becomes out of sync ==> boom.

I have come with a patch which corrects the problem, and all ctypes
tests pass:

Index: Modules/_ctypes/stgdict.c
===================================================================
--- Modules/_ctypes/stgdict.c   (revision 59939)
+++ Modules/_ctypes/stgdict.c   (working copy)
@@ -470,7 +470,7 @@
                        Py_DECREF(pair);
                        return -1;
                }
-               if (-1 == PyDict_SetItem(realdict, name, prop)) {
+               if (-1 == PyObject_SetAttr(type, name, prop)) {
                        Py_DECREF(prop);
                        Py_DECREF(pair);
                        return -1;
Index: Modules/_ctypes/_ctypes.c
===================================================================
--- Modules/_ctypes/_ctypes.c   (revision 59939)
+++ Modules/_ctypes/_ctypes.c   (working copy)
@@ -410,7 +410,7 @@
 StructType_setattro(PyObject *self, PyObject *key, PyObject *value)
 {
        /* XXX Should we disallow deleting _fields_? */
-       if (-1 == PyObject_GenericSetAttr(self, key, value))
+       if (-1 == Py_TYPE(self)->tp_base->tp_setattro(self, key, value))
                return -1;

        if (value && PyString_Check(key) &&

I think that these changes are sensible: The first one deal with the
type's attribute instead of updating its __dict__, and the second
properly delegates "tp_setattro" to the base class ('type' in this case).
msg59875 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-01-14 00:11
Nice analysis.  Please apply.
msg59876 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-14 00:14
OK, I apply first my 2 lines, then the original patch.
msg59881 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-14 01:11
For the record:
It is very wrong to call self->ob_type->tp_base's slot: slots are often
copied from their base class, and you get infinite recursion.
Calling StructType.tp_base->tp_setattro is much better.
msg79912 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2009-01-15 18:54
Looks like this was re-applied in r59943 and r59944. Thanks for taking
care of this, Amaury.
History
Date User Action Args
2009-01-15 18:54:51collinwintersetstatus: open -> closed
nosy: + collinwinter, jyasskin
resolution: accepted -> fixed
messages: + msg79912
2008-10-13 11:43:31jceasetnosy: + jcea
2008-07-10 13:29:37rhettingersetassignee: rhettinger ->
2008-01-14 01:11:46amaury.forgeotdarcsetmessages: + msg59881
2008-01-14 00:14:46amaury.forgeotdarcsetmessages: + msg59876
2008-01-14 00:11:10rhettingersetmessages: + msg59875
2008-01-14 00:01:41amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg59874
2008-01-13 17:01:55bioinformedsetmessages: + msg59857
2008-01-13 15:21:20bioinformedsetmessages: + msg59855
2008-01-13 15:04:44georg.brandlsetmessages: + msg59854
2008-01-12 13:48:20georg.brandlsetnosy: + georg.brandl
messages: + msg59818
2008-01-12 13:16:06georg.brandllinkissue1685986 superseder
2007-12-21 01:44:50rhettingersetresolution: accepted
messages: + msg58932
2007-12-07 03:57:25ntorontosetmessages: + msg58263
2007-12-07 03:53:54ntorontosetfiles: + fastattr_test.py
nosy: + ntoronto
messages: + msg58262
2007-04-13 19:16:08bioinformedcreate