
Armin's method cache optimization updated for Python 2.6 #44842

Closed

bioinformed mannequin opened this issue Apr 13, 2007 · 23 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@bioinformed
Mannequin

bioinformed mannequin commented Apr 13, 2007

BPO 1700288
Nosy @arigo, @birkenfeld, @rhettinger, @jcea, @amauryfa
Files
  • python26-classattrcache.diff
  • fastattr_test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2009-01-15.18:54:51.780>
    created_at = <Date 2007-04-13.19:16:08.000>
    labels = ['interpreter-core']
    title = "Armin's method cache optimization updated for Python 2.6"
    updated_at = <Date 2009-01-15.18:54:51.745>
    user = 'https://bugs.python.org/bioinformed'

    bugs.python.org fields:

    activity = <Date 2009-01-15.18:54:51.745>
    actor = 'collinwinter'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-01-15.18:54:51.780>
    closer = 'collinwinter'
    components = ['Interpreter Core']
    creation = <Date 2007-04-13.19:16:08.000>
    creator = 'bioinformed'
    dependencies = []
    files = ['7940', '8887']
    hgrepos = []
    issue_num = 1700288
    keywords = ['patch']
    message_count = 23.0
    messages = ['52431', '52432', '52433', '52434', '52435', '52436', '52437', '52438', '52439', '52440', '52441', '58262', '58263', '58932', '59818', '59854', '59855', '59857', '59874', '59875', '59876', '59881', '79912']
    nosy_count = 11.0
    nosy_names = ['nnorwitz', 'arigo', 'georg.brandl', 'collinwinter', 'rhettinger', 'jcea', 'amaury.forgeotdarc', 'peaker', 'bioinformed', 'jyasskin', 'ntoronto']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1700288'
    versions = ['Python 2.6']

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Apr 13, 2007

    I've forward ported and slightly cleaned up Armin's method cache patch (see bpo-1685986). I've attempted to clarify and tighten up several loose ends in the code, so hopefully I haven't mucked anything up.

    My performance results are not quite as good as Armin's, though still very encouraging. I see a typical speedup of 10%, with heavily object-oriented application code seeing speedups of 15-20%. Given these rather significant results, the major task is to verify correctness.
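    For readers who haven't seen bpo-1685986, here is a minimal sketch of the idea behind the patch, assuming a global direct-mapped table plus the per-type version tag (tp_version_tag) that the patch adds to PyTypeObject. Every *_example identifier is invented for illustration; the attached python26-classattrcache.diff has the real code.

    #include <Python.h>

    /* A small, fixed-size, direct-mapped global table maps (type version,
     * attribute name) to the result of an MRO lookup.  A type's version
     * tag is bumped whenever its __dict__ or bases change (and for all of
     * its subclasses), so stale entries simply miss. */
    #define CACHE_SIZE_EXAMPLE 4096          /* power of two, so masking works */

    struct cache_entry_example {
        unsigned int version;                /* tp_version_tag when filled */
        PyObject *name;                      /* interned attribute name */
        PyObject *value;                     /* borrowed: MRO dicts keep it alive */
    };

    static struct cache_entry_example cache_example[CACHE_SIZE_EXAMPLE];

    #define CACHE_HASH_EXAMPLE(version, name_hash) \
        (((unsigned int)(version) ^ (unsigned int)(name_hash)) & \
         (CACHE_SIZE_EXAMPLE - 1))

    /* The uncached MRO walk (definition elided in this sketch). */
    static PyObject *slow_mro_lookup_example(PyTypeObject *type, PyObject *name);

    static PyObject *
    cached_lookup_example(PyTypeObject *type, PyObject *name)
    {
        long name_hash = PyObject_Hash(name);    /* error handling elided */
        unsigned int h = CACHE_HASH_EXAMPLE(type->tp_version_tag, name_hash);
        struct cache_entry_example *e = &cache_example[h];

        if (e->version == type->tp_version_tag && e->name == name)
            return e->value;                     /* hit: no MRO walk at all */

        /* Miss: do the full walk, then overwrite this one slot.  A
         * collision is not re-probed; the old entry is simply evicted. */
        PyObject *value = slow_mro_lookup_example(type, name);
        e->version = type->tp_version_tag;
        e->name = name;                          /* assumes interned, long-lived names */
        e->value = value;
        return value;
    }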

    @bioinformed bioinformed mannequin assigned rhettinger Apr 13, 2007
    @bioinformed bioinformed mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Apr 13, 2007
    @nnorwitz
    Mannequin

    nnorwitz mannequin commented Apr 14, 2007

    I tried this test on parrotbench (b0.py in particular) and I'm not sure I could distinguish an improvement over the noise (best case was about 5%). The variability is pretty high on my box (dual AMD64 Opterons, Ubuntu, gcc 4.1.2-ish). At first it seemed a bit slower with the original patch, which uses unsigned ints. I tried changing to unsigned longs; it seemed a little better, though I'm not sure that was really due to the change. I think unsigned longs should be used on 64-bit boxes.

    Did you use a benchmark/program that is open source? I don't know that I have anything decent to test this with. Raymond probably does, though. Also, what platform did you test on?

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Apr 14, 2007

    I benchmarked using pystone, pybench, and a bunch of my local scientific applications that have tight computational kernels still in pure Python. I tested on a 64-bit Linux box, defining the version_tag as either int or unsigned long with no appreciable difference in performance. I'm trying to get parrotbench to run, but it doesn't seem to have been updated for modern Python versions.

    @arigo
    Mannequin

    arigo mannequin commented May 14, 2007

    A further minor improvement could be achieved by re-doing the inlining of _PyType_Lookup() in object.c. Ideally, only the fast path would be inlined, with the rest put in a separate function or simply handled by calling _PyType_Lookup().
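    In code terms (reusing the invented names from the sketch in the first comment), the suggestion is to duplicate only the probe at the hot call sites in object.c and leave the refill behind a call. This is a sketch, not the patch's actual code:

    /* Only the cheap probe is inlined; a miss falls through to the
     * out-of-line lookup, which also refills the cache slot. */
    static inline PyObject *
    fast_lookup_example(PyTypeObject *type, PyObject *name)
    {
        unsigned int h = CACHE_HASH_EXAMPLE(type->tp_version_tag,
                                            PyObject_Hash(name));
        struct cache_entry_example *e = &cache_example[h];
        if (e->version == type->tp_version_tag && e->name == name)
            return e->value;                       /* inlined fast path */
        return cached_lookup_example(type, name);  /* slow path + refill */
    }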

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented May 14, 2007

    I tried re-inlining the fast path from _PyType_Lookup() in object.c and found no measurable improvement on the simple benchmarks I tried. I've also stress-tested the patch by disabling the fast-path return, always performing the slow-path lookup, and asserting that the cached result matches the slow-path result. I then ran that modified interpreter on the Python test-suite, various benchmarks, and a range of my own applications. While not a formal proof of correctness, it was encouraging that the cache remained consistent.
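    The consistency check described here amounts to something like the following sketch (again with the invented names from the first comment): the fast path records its answer instead of returning, every lookup also does the authoritative MRO walk, and the two results are compared.

    #include <assert.h>

    static PyObject *
    checked_lookup_example(PyTypeObject *type, PyObject *name)
    {
        long name_hash = PyObject_Hash(name);    /* error handling elided */
        unsigned int h = CACHE_HASH_EXAMPLE(type->tp_version_tag, name_hash);
        struct cache_entry_example *e = &cache_example[h];

        int hit = (e->version == type->tp_version_tag && e->name == name);
        PyObject *cached = hit ? e->value : NULL;

        PyObject *value = slow_mro_lookup_example(type, name);
        assert(!hit || cached == value);   /* cache must agree with the MRO walk */
        return value;
    }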

    @peaker
    Mannequin

    peaker mannequin commented Jun 10, 2007

    Why is CPython copying functions to inline them?

    This is not only code duplication that makes the code easy to break; it also bloats the caller function beyond readability.

    I believe that using a private .c #include file with an inline keyword, or at worst #include-ing the function code directly, would be a better solution.

    Is there any reason to use the method currently used by CPython?

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jun 10, 2007

    First off, all versions of this patch do away with the rather aggressively repeated inline code. My previous comment about refactoring and testing an inlined form described purely an experiment with null results.

    That aside, you do raise a good question. However, given the current patch, it is unfortunately off-topic and irrelevant to the consideration of this patch. Please feel free to pursue it elsewhere, since I worry that it will only serve as a negative distraction from the much more interesting aims of the mro optimization. Since you are clearly worried about the performance of attribute lookup, please try the patch out and report your findings. I'll be happy to review the results from your benchmarks and any suggestions you have.

    @rhettinger
    Contributor

    FWIW, I'm still reviewing this code.

    @peaker
    Mannequin

    peaker mannequin commented Jun 10, 2007

    I may be missing something trivial here, but why is the cache global, and not per-typeobject?

    As I saw it, the mro cache was supposed to be a mapping of (type, name) -> (descriptor/classattr) that's based on the __mro__ of the type and all of the dicts in the type and its bases.

    I think the right way to do it is to use another Pythonic dict object in each type that holds this cache (name -> (descriptor/classattr)). This cache dict should essentially be equivalent to a merge of all the dicts in the mro, and it gets updated whenever any of the dicts of the types in the mro is updated (see the sketch after this comment). I'm not sure why one would re-implement a fast hashtable when one already exists. Also, once the cache is associated with the type, it does not hold any references to any names that are not already referenced by someone in the type hierarchy, so there are no mm/gc issues either.

    Am I missing something here?
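    A sketch of the alternative described above, assuming a hypothetical per-type cache dict field (tp_cache_example) that is cleared for a type and all of its subclasses whenever any dict along the MRO changes. The names are invented; this is not code from the patch:

    /* Each type lazily accumulates name -> value results in its own
     * dict; invalidation (elided) clears the dict down the hierarchy. */
    static PyObject *
    per_type_lookup_example(PyTypeObject *type, PyObject *name)
    {
        PyObject *value = PyDict_GetItem(type->tp_cache_example, name);
        if (value != NULL)
            return value;                    /* borrowed reference */
        value = slow_mro_lookup_example(type, name);
        if (value != NULL)
            PyDict_SetItem(type->tp_cache_example, name, value);  /* error check elided */
        return value;
    }

    Note that even a hit pays for a full PyDict_GetItem here, which is the "heavier fast path" objection raised in the next comment.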

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jun 10, 2007

    The cache is global to make it fast, give it a deterministic and low memory footprint, and make it easy to invalidate. It is distinct from the standard Python dict implementation since it need not be as general nor fully associative: cache collisions are not subject to re-probing. Nor does it even go through the motions of memory management, which would be rather hard to get the standard dict implementation to emulate. It is kept drastically minimal in the interests of performance and goes to great pains to hide and minimize its impact on the rest of the interpreter. In this regard it seems quite successful, as it even removes the need for the manually inlined code you commented on before. It also survives harsh stress-testing and shows no detectable performance regressions in any of my benchmarks (to the contrary, many operations are significantly faster).

    While a per-type-object cache sounds feasible, it is hard for me to see how it could be as efficient in the general case -- for one, the dict lookup fast path is just so much heavier. Please feel free to take that as a challenge to produce a patch that contradicts my intuition.

    @arigo
    Mannequin

    arigo mannequin commented Jun 10, 2007

    peaker: we tried several approaches (including ones similar to what you describe) in PyPy, and the one in this patch is the one that worked best there. There is of course no guarantee that the same experiments would have given exactly the same results in CPython, so feel free to try them all again - I won't :-)

    @ntoronto
    Mannequin

    ntoronto mannequin commented Dec 7, 2007

    I've attached the microbenchmarks I was using for my own version of
    attribute caching. For list, tuple, dict, and a deep hierarchy, it tests
    accessing type attributes (cls.__class__), class attributes
    (cls.__init__), class attributes via instance (inst.__class__), and
    instance attributes (inst.__init__), using LOAD_ATTR and hasattr. It
    also tests hasattr with missing attributes.

    @ntoronto
    Mannequin

    ntoronto mannequin commented Dec 7, 2007

    Attribute access that happens only once or very infrequently probably
    shouldn't go through the cache machinery - it'll only slow things down.
    (Updating slots for example.) Maybe _PyType_Lookup should remain the
    same, with _PyType_CachingLookup added for attribute access via
    type/instance getattr.

    @rhettinger
    Contributor

    Okay, this patch is accepted. We'll likely do additional tweaks after
    it goes in.

    Please close the other tracker items for this same issue.

    If no one applies this, I'll do it when I return from the holidays.

    @birkenfeld
    Member

    Committed to trunk as r59931. Leaving open if you want to make more
    adjustments, Raymond.

    @birkenfeld
    Member

    Backed out again in r59940 -- test_ctypes fails in test_incomplete.py.

    Armin reports that with his original patch on 2.4, this test passes.

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jan 13, 2008

    All tests passed when I first ported Armin's patch to 2.6. I'll have a
    chance to look into this later in the week. If anyone gets to it first,
    please summarize your findings here to avoid duplication of effort.

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jan 13, 2008

    Couldn't resist looking into this more, and I must admit that I am a bit stumped. Instead of returning a value from the cache fast path in _PyType_Lookup, I modified the code to store the value and assert that it matches the non-cached result from traversing the mro. That assertion never fails. However, test_incomplete works with this change, which ideally should not alter the semantics of _PyType_Lookup.

    Moreover, if I selectively don't cache the name "next" (with the caching fast path re-enabled), test_incomplete works. For test_incomplete, _PyType_Lookup on "next" always returns NULL from both the cached and non-cached paths.

    I'm beginning to suspect there is some black magic going on. None of the relevant reference counts drop to zero, and the same return value (NULL) is generated, but the slow path seems to be doing something with a side effect that makes it work.

    Wish I had more time to debug this, but I'm already overdrawn until Wednesday...

    @amauryfa
    Member

    After some debug work, I found an explanation:

    • The attribute "next" is the name of a type slot, so it becomes cached
      during type construction.
    • In function Modules/_ctypes/_ctypes.c::StructUnionType_new(), the type's
      __dict__ is replaced by a "stgdict".
    • This "stgdict" is then modified with PyDict_ functions, without any
      call to _PyType_Lookup().
    • The method cache becomes out of sync ==> boom.

    I have come up with a patch which corrects the problem, and all ctypes
    tests pass:

    Index: Modules/_ctypes/stgdict.c
    ===================================================================
    --- Modules/_ctypes/stgdict.c   (revision 59939)
    +++ Modules/_ctypes/stgdict.c   (working copy)
    @@ -470,7 +470,7 @@
     			Py_DECREF(pair);
     			return -1;
     		}
    -		if (-1 == PyDict_SetItem(realdict, name, prop)) {
    +		if (-1 == PyObject_SetAttr(type, name, prop)) {
     			Py_DECREF(prop);
     			Py_DECREF(pair);
     			return -1;
    Index: Modules/_ctypes/_ctypes.c
    ===================================================================
    --- Modules/_ctypes/_ctypes.c   (revision 59939)
    +++ Modules/_ctypes/_ctypes.c   (working copy)
    @@ -410,7 +410,7 @@
     StructType_setattro(PyObject *self, PyObject *key, PyObject *value)
     {
     	/* XXX Should we disallow deleting _fields_? */
    -	if (-1 == PyObject_GenericSetAttr(self, key, value))
    +	if (-1 == Py_TYPE(self)->tp_base->tp_setattro(self, key, value))
     		return -1;
     	if (value && PyString_Check(key) &&

    I think that these changes are sensible: the first one deals with the
    type's attribute instead of updating its __dict__, and the second
    properly delegates "tp_setattro" to the base class ('type' in this case).

    @rhettinger
    Contributor

    Nice analysis. Please apply.

    @amauryfa
    Member

    OK, I'll apply my two lines first, then the original patch.

    @amauryfa
    Member

    For the record:
    It is very wrong to call self->ob_type->tp_base's slot: slots are often
    copied from their base class, and you get infinite recursion.
    Calling StructType.tp_base->tp_setattro is much better.
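    To illustrate the trap (hypothetical code, assuming StructType names the metatype object as in the comment above, that this function is installed as StructType's tp_setattro slot, and that some subtype of StructType inherited the slot through the usual slot copying in type_new()):

    #include <Python.h>

    extern PyTypeObject StructType;   /* the ctypes metatype */

    static int
    setattro_wrong_example(PyObject *self, PyObject *key, PyObject *value)
    {
        /* If Py_TYPE(self) is a subtype of StructType, then
         * Py_TYPE(self)->tp_base is &StructType, whose tp_setattro is
         * this very function: the call re-enters itself forever. */
        return Py_TYPE(self)->tp_base->tp_setattro(self, key, value);
    }

    static int
    setattro_right_example(PyObject *self, PyObject *key, PyObject *value)
    {
        /* Naming the metatype statically skips past every copied slot:
         * StructType.tp_base is 'type', whose generic tp_setattro
         * terminates. */
        return StructType.tp_base->tp_setattro(self, key, value);
    }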

    @collinwinter
    Mannequin

    collinwinter mannequin commented Jan 15, 2009

    Looks like this was re-applied in r59943 and r59944. Thanks for taking
    care of this, Amaury.

    @collinwinter collinwinter mannequin closed this as completed Jan 15, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022