
Armin's method cache optimization updated for Python 2.6 #44842

Closed

bioinformed mannequin opened this issue Apr 13, 2007 · 23 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@bioinformed
Mannequin

bioinformed mannequin commented Apr 13, 2007

BPO 1700288
Nosy @arigo, @birkenfeld, @rhettinger, @jcea, @amauryfa
Files
  • python26-classattrcache.diff
  • fastattr_test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2009-01-15.18:54:51.780>
    created_at = <Date 2007-04-13.19:16:08.000>
    labels = ['interpreter-core']
    title = "Armin's method cache optimization updated for Python 2.6"
    updated_at = <Date 2009-01-15.18:54:51.745>
    user = 'https://bugs.python.org/bioinformed'

    bugs.python.org fields:

    activity = <Date 2009-01-15.18:54:51.745>
    actor = 'collinwinter'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-01-15.18:54:51.780>
    closer = 'collinwinter'
    components = ['Interpreter Core']
    creation = <Date 2007-04-13.19:16:08.000>
    creator = 'bioinformed'
    dependencies = []
    files = ['7940', '8887']
    hgrepos = []
    issue_num = 1700288
    keywords = ['patch']
    message_count = 23.0
    messages = ['52431', '52432', '52433', '52434', '52435', '52436', '52437', '52438', '52439', '52440', '52441', '58262', '58263', '58932', '59818', '59854', '59855', '59857', '59874', '59875', '59876', '59881', '79912']
    nosy_count = 11.0
    nosy_names = ['nnorwitz', 'arigo', 'georg.brandl', 'collinwinter', 'rhettinger', 'jcea', 'amaury.forgeotdarc', 'peaker', 'bioinformed', 'jyasskin', 'ntoronto']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1700288'
    versions = ['Python 2.6']

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Apr 13, 2007

    I've forward ported and slightly cleaned up Armin's method cache patch (see bpo-1685986). I've attempted to clarify and tighten up several loose ends in the code, so hopefully I haven't mucked anything up.

    My performance results are not quite as good as Armin's, though still very encouraging. I see a typical speedup of 10%, with heavily object-oriented application code seeing speedups of 15-20%. Given these rather significant results, the major task is to verify correctness.
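    For readers who haven't seen bpo-1685986, here is a minimal sketch of the idea behind the patch, assuming a global direct-mapped table plus the per-type version tag (tp_version_tag) that the patch adds to PyTypeObject. Every *_example identifier is invented for illustration; the attached python26-classattrcache.diff has the real code.

    #include <Python.h>

    /* A small, fixed-size, direct-mapped global table maps (type version,
     * attribute name) to the result of an MRO lookup.  A type's version
     * tag is bumped whenever its __dict__ or bases change (and for all of
     * its subclasses), so stale entries simply miss. */
    #define CACHE_SIZE_EXAMPLE 4096          /* power of two, so masking works */

    struct cache_entry_example {
        unsigned int version;                /* tp_version_tag when filled */
        PyObject *name;                      /* interned attribute name */
        PyObject *value;                     /* borrowed: MRO dicts keep it alive */
    };

    static struct cache_entry_example cache_example[CACHE_SIZE_EXAMPLE];

    #define CACHE_HASH_EXAMPLE(version, name_hash) \
        (((unsigned int)(version) ^ (unsigned int)(name_hash)) & \
         (CACHE_SIZE_EXAMPLE - 1))

    /* The uncached MRO walk (definition elided in this sketch). */
    static PyObject *slow_mro_lookup_example(PyTypeObject *type, PyObject *name);

    static PyObject *
    cached_lookup_example(PyTypeObject *type, PyObject *name)
    {
        long name_hash = PyObject_Hash(name);    /* error handling elided */
        unsigned int h = CACHE_HASH_EXAMPLE(type->tp_version_tag, name_hash);
        struct cache_entry_example *e = &cache_example[h];

        if (e->version == type->tp_version_tag && e->name == name)
            return e->value;                     /* hit: no MRO walk at all */

        /* Miss: do the full walk, then overwrite this one slot.  A
         * collision is not re-probed; the old entry is simply evicted. */
        PyObject *value = slow_mro_lookup_example(type, name);
        e->version = type->tp_version_tag;
        e->name = name;                          /* assumes interned, long-lived names */
        e->value = value;
        return value;
    }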

    @bioinformed bioinformed mannequin assigned rhettinger Apr 13, 2007
    @bioinformed bioinformed mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Apr 13, 2007
    @nnorwitz
    Mannequin

    nnorwitz mannequin commented Apr 14, 2007

    I tried this test on parrotbench (b0.py in particular) and I'm not sure I could distinguish an improvement over the noise (best case was about 5%). The variability is pretty high on my box (dual AMD64 Opterons, Ubuntu, gcc 4.1.2-ish). At first it seemed a bit slower with the original patch, which uses unsigned ints. I tried changing to unsigned longs; it seemed a little better, though I'm not sure that was really due to the change. I think unsigned longs should be used on 64-bit boxes.

    Did you use a benchmark/program that is open source? I don't know that I have anything decent to test this with. Raymond probably does, though. Also, what platform did you test on?

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Apr 14, 2007

    I benchmarked using pystone, pybench, and a bunch of my local scientific applications that have tight computational kernels still in pure Python. I tested on a 64-bit Linux box, defining the version_tag as either int or unsigned long with no appreciable difference in performance. I'm trying to get parrotbench to run, but it doesn't seem to have been updated for modern Python versions.

    @arigo
    Mannequin

    arigo mannequin commented May 14, 2007

    A further minor improvement could be achieved by re-doing the inlining of _PyType_Lookup() in object.c. Ideally, only the fast path would be inlined, with the rest put in a separate function or simply handled by calling _PyType_Lookup().
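    In code terms (reusing the invented names from the sketch in the first comment), the suggestion is to duplicate only the probe at the hot call sites in object.c and leave the refill behind a call. This is a sketch, not the patch's actual code:

    /* Only the cheap probe is inlined; a miss falls through to the
     * out-of-line lookup, which also refills the cache slot. */
    static inline PyObject *
    fast_lookup_example(PyTypeObject *type, PyObject *name)
    {
        unsigned int h = CACHE_HASH_EXAMPLE(type->tp_version_tag,
                                            PyObject_Hash(name));
        struct cache_entry_example *e = &cache_example[h];
        if (e->version == type->tp_version_tag && e->name == name)
            return e->value;                       /* inlined fast path */
        return cached_lookup_example(type, name);  /* slow path + refill */
    }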

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented May 14, 2007

    I tried re-inlining the fast path from _PyType_Lookup() in object.c and found no measurable improvement on the simple benchmarks I tried. I've also stress-tested the patch by disabling the fast-path return, always performing the slow-path lookup, and asserting that the cached result matches the slow-path result. I then ran that modified interpreter on the Python test-suite, various benchmarks, and a range of my own applications. While not a formal proof of correctness, it was encouraging that the cache remained consistent.
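    The consistency check described here amounts to something like the following sketch (again with the invented names from the first comment): the fast path records its answer instead of returning, every lookup also does the authoritative MRO walk, and the two results are compared.

    #include <assert.h>

    static PyObject *
    checked_lookup_example(PyTypeObject *type, PyObject *name)
    {
        long name_hash = PyObject_Hash(name);    /* error handling elided */
        unsigned int h = CACHE_HASH_EXAMPLE(type->tp_version_tag, name_hash);
        struct cache_entry_example *e = &cache_example[h];

        int hit = (e->version == type->tp_version_tag && e->name == name);
        PyObject *cached = hit ? e->value : NULL;

        PyObject *value = slow_mro_lookup_example(type, name);
        assert(!hit || cached == value);   /* cache must agree with the MRO walk */
        return value;
    }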

    @peaker
    Mannequin

    peaker mannequin commented Jun 10, 2007

    Why is CPython copying functions to inline them?

    This is not only code duplication that makes the code easy to break; it also bloats the caller function beyond readability.

    I believe that using a private .c #include file with an inline keyword, or at worst #include-ing the function code directly, would be a better solution.

    Is there any reason to use the method currently used by CPython?

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jun 10, 2007

    First off, all versions of this patch do away with the rather aggressively repeated inline code. My previous comment about refactoring and testing an inlined form described purely an experiment with null results.

    That aside, you do raise a good question. However, given the current patch, it is unfortunately off-topic and irrelevant to the consideration of this patch. Please feel free to pursue it elsewhere, since I worry that it will only serve as a negative distraction from the much more interesting aims of the mro optimization. Since you are clearly worried about the performance of attribute lookup, please try the patch out and report your findings. I'll be happy to review the results from your benchmarks and any suggestions you have.

    @rhettinger
    Contributor

    FWIW, I'm still reviewing this code.

    @peaker
    Mannequin

    peaker mannequin commented Jun 10, 2007

    I may be missing something trivial here, but why is the cache global, and not per-typeobject?

    As I saw it, the mro cache was supposed to be a mapping of (type, name) -> (descriptor/classattr) that's based on the __mro__ of the type and all of the dicts in the type and its bases.

    I think the right way to do it is to use another Pythonic dict object in each type that holds this cache (name -> (descriptor/classattr)). This cache dict should essentially be equivalent to a merge of all the dicts in the mro, and it gets updated whenever any of the dicts of the types in the mro is updated (see the sketch after this comment). I'm not sure why one would re-implement a fast hashtable when one already exists. Also, once the cache is associated with the type, it does not hold any references to any names that are not already referenced by someone in the type hierarchy, so there are no mm/gc issues either.

    Am I missing something here?
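    A sketch of the alternative described above, assuming a hypothetical per-type cache dict field (tp_cache_example) that is cleared for a type and all of its subclasses whenever any dict along the MRO changes. The names are invented; this is not code from the patch:

    /* Each type lazily accumulates name -> value results in its own
     * dict; invalidation (elided) clears the dict down the hierarchy. */
    static PyObject *
    per_type_lookup_example(PyTypeObject *type, PyObject *name)
    {
        PyObject *value = PyDict_GetItem(type->tp_cache_example, name);
        if (value != NULL)
            return value;                    /* borrowed reference */
        value = slow_mro_lookup_example(type, name);
        if (value != NULL)
            PyDict_SetItem(type->tp_cache_example, name, value);  /* error check elided */
        return value;
    }

    Note that even a hit pays for a full PyDict_GetItem here, which is the "heavier fast path" objection raised in the next comment.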

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jun 10, 2007

    The cache is global to make it fast, give it a deterministic and low memory footprint, and make it easy to invalidate. It is distinct from the standard Python dict implementation since it need not be as general nor fully associative: cache collisions are not subject to re-probing. Nor does it even go through the motions of memory management, which would be rather hard to get the standard dict implementation to emulate. It is kept drastically minimal in the interests of performance and goes to great pains to hide and minimize its impact on the rest of the interpreter. In this regard it seems quite successful, as it even removes the need for the manually inlined code you commented on before. It also survives harsh stress-testing and shows no detectable performance regressions in any of my benchmarks (to the contrary, many operations are significantly faster).

    While a per-type-object cache sounds feasible, it is hard for me to see how it could be as efficient in the general case -- for one, the dict lookup fast path is just so much heavier. Please feel free to take that as a challenge to produce a patch that contradicts my intuition.

    @arigo
    Mannequin

    arigo mannequin commented Jun 10, 2007

    peaker: we tried several approaches (including ones similar to what you describe) in PyPy, and the one in this patch is the one that worked best there. There is of course no guarantee that the same experiments would have given exactly the same results in CPython, so feel free to try them all again - I won't :-)

    @ntoronto
    Mannequin

    ntoronto mannequin commented Dec 7, 2007

    I've attached the microbenchmarks I was using for my own version of
    attribute caching. For list, tuple, dict, and a deep hierarchy, it tests
    accessing type attributes (cls.__class__), class attributes
    (cls.__init__), class attributes via instance (inst.__class__), and
    instance attributes (inst.__init__), using LOAD_ATTR and hasattr. It
    also tests hasattr with missing attributes.

    @ntoronto
    Mannequin

    ntoronto mannequin commented Dec 7, 2007

    Attribute access that happens only once or very infrequently probably
    shouldn't go through the cache machinery - it'll only slow things down.
    (Updating slots for example.) Maybe _PyType_Lookup should remain the
    same, with _PyType_CachingLookup added for attribute access via
    type/instance getattr.

    @rhettinger
    Contributor

    Okay, this patch is accepted. We'll likely do additional tweaks after
    it goes in.

    Please close the other tracker items for this same issue.

    If no one applies this, I'll do it when I return from the holidays.

    @birkenfeld
    Member

    Committed to trunk as r59931. Leaving open if you want to make more
    adjustments, Raymond.

    @birkenfeld
    Member

    Backed out again in r59940 -- test_ctypes fails in test_incomplete.py.

    Armin reports that with his original patch on 2.4, this test passes.

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jan 13, 2008

    All tests passed when I first ported Armin's patch to 2.6. I'll have a
    chance to look into this later in the week. If anyone gets to it first,
    please summarize your findings here to avoid duplication of effort.

    @bioinformed
    Mannequin Author

    bioinformed mannequin commented Jan 13, 2008

    Couldn't resist looking into this more, and I must admit that I am a bit stumped. Instead of returning a value from the cache fast path in _PyType_Lookup, I modified the code to store the value and assert that it matches the non-cached result from traversing the mro. That assertion never fails. However, test_incomplete works with this change, which ideally should not alter the semantics of _PyType_Lookup.

    Moreover, if I selectively don't cache the name "next" (with the caching fast path re-enabled), test_incomplete works. For test_incomplete, _PyType_Lookup on "next" always returns NULL from both the cached and non-cached paths.

    I'm beginning to suspect there is some black magic going on. None of the relevant reference counts drop to zero, and the same return value (NULL) is generated, but the slow path seems to be doing something with a side effect that makes it work.

    Wish I had more time to debug this, but I'm already overdrawn until Wednesday...

    @amauryfa
    Member

    After some debug work, I found an explanation:

    • The attribute "next" is the name of a type slot, so it becomes cached
      during type construction.
    • In function Modules/_ctypes/_ctypes.c::StructUnionType_new(), the type's
      __dict__ is replaced by a "stgdict".
    • This "stgdict" is then modified with PyDict_ functions, without any
      call to _PyType_Lookup().
    • The method cache becomes out of sync ==> boom.

    I have come up with a patch which corrects the problem, and all ctypes
    tests pass:

    Index: Modules/_ctypes/stgdict.c
    ===================================================================
    --- Modules/_ctypes/stgdict.c   (revision 59939)
    +++ Modules/_ctypes/stgdict.c   (working copy)
    @@ -470,7 +470,7 @@
     			Py_DECREF(pair);
     			return -1;
     		}
    -		if (-1 == PyDict_SetItem(realdict, name, prop)) {
    +		if (-1 == PyObject_SetAttr(type, name, prop)) {
     			Py_DECREF(prop);
     			Py_DECREF(pair);
     			return -1;
    Index: Modules/_ctypes/_ctypes.c
    ===================================================================
    --- Modules/_ctypes/_ctypes.c   (revision 59939)
    +++ Modules/_ctypes/_ctypes.c   (working copy)
    @@ -410,7 +410,7 @@
     StructType_setattro(PyObject *self, PyObject *key, PyObject *value)
     {
     	/* XXX Should we disallow deleting _fields_? */
    -	if (-1 == PyObject_GenericSetAttr(self, key, value))
    +	if (-1 == Py_TYPE(self)->tp_base->tp_setattro(self, key, value))
     		return -1;
     	if (value && PyString_Check(key) &&

    I think that these changes are sensible: the first one deals with the
    type's attribute instead of updating its __dict__, and the second
    properly delegates "tp_setattro" to the base class ('type' in this case).

    @rhettinger
    Contributor

    Nice analysis. Please apply.

    @amauryfa
    Member

    OK, I'll apply my two lines first, then the original patch.

    @amauryfa
    Member

    For the record:
    It is very wrong to call self->ob_type->tp_base's slot: slots are often
    copied from their base class, and you get infinite recursion.
    Calling StructType.tp_base->tp_setattro is much better.
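    To illustrate the trap (hypothetical code, assuming StructType names the metatype object as in the comment above, that this function is installed as StructType's tp_setattro slot, and that some subtype of StructType inherited the slot through the usual slot copying in type_new()):

    #include <Python.h>

    extern PyTypeObject StructType;   /* the ctypes metatype */

    static int
    setattro_wrong_example(PyObject *self, PyObject *key, PyObject *value)
    {
        /* If Py_TYPE(self) is a subtype of StructType, then
         * Py_TYPE(self)->tp_base is &StructType, whose tp_setattro is
         * this very function: the call re-enters itself forever. */
        return Py_TYPE(self)->tp_base->tp_setattro(self, key, value);
    }

    static int
    setattro_right_example(PyObject *self, PyObject *key, PyObject *value)
    {
        /* Naming the metatype statically skips past every copied slot:
         * StructType.tp_base is 'type', whose generic tp_setattro
         * terminates. */
        return StructType.tp_base->tp_setattro(self, key, value);
    }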

    @collinwinter
    Mannequin

    collinwinter mannequin commented Jan 15, 2009

    Looks like this was re-applied in r59943 and r59944. Thanks for taking
    care of this, Amaury.

    @collinwinter collinwinter mannequin closed this as completed Jan 15, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022