
Fast globals/builtins access (patch) #45859

Closed
ntoronto mannequin opened this issue Nov 29, 2007 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@ntoronto
Mannequin

ntoronto mannequin commented Nov 29, 2007

BPO 1518
Nosy @gvanrossum, @gpshead, @pitrou, @larryhastings, @giampaolo, @tiran, @benjaminp
Files
  • fastglobals.patch.txt
  • fastglobals_test.py
  • fastglobals-1.patch.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-18.15:20:45.504>
    created_at = <Date 2007-11-29.09:43:57.063>
    labels = ['interpreter-core', 'performance']
    title = 'Fast globals/builtins access (patch)'
    updated_at = <Date 2013-02-12.07:21:38.703>
    user = 'https://bugs.python.org/ntoronto'

    bugs.python.org fields:

    activity = <Date 2013-02-12.07:21:38.703>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-18.15:20:45.504>
    closer = 'BreamoreBoy'
    components = ['Interpreter Core']
    creation = <Date 2007-11-29.09:43:57.063>
    creator = 'ntoronto'
    dependencies = []
    files = ['8821', '8823', '8846']
    hgrepos = []
    issue_num = 1518
    keywords = ['patch']
    message_count = 18.0
    messages = ['57927', '57929', '57933', '57941', '57965', '58071', '59324', '59325', '62972', '63933', '70467', '70746', '101237', '101238', '101239', '116795', '181942', '181946']
    nosy_count = 16.0
    nosy_names = ['gvanrossum', 'nnorwitz', 'collinwinter', 'gregory.p.smith', 'lpd', 'ggenellina', 'pitrou', 'larry', 'giampaolo.rodola', 'christian.heimes', 'arkanes', 'jyasskin', 'kevinwatters', 'ntoronto', 'benjamin.peterson', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1518'
    versions = ['Python 2.6']

    @ntoronto
    Mannequin Author

    ntoronto mannequin commented Nov 29, 2007

    I've attached a patch that reduces LOAD_GLOBAL access time by 19% for
    globals and 45% for builtins. In my tests, that's barely slower than
    LOAD_FAST in both cases.

    The approach is to cache pointers to dictionary entries rather than
    caching dictionary values as is usually suggested. Dictionaries keep a
    64-bit "version" internally that is incremented whenever at least one
    entry pointer would become invalid. PyFastGlobalsObject is a
    dictionary adapter, allowing quick access to it using an index instead
    of a key object. It ensures that its entry pointers are always valid
    (by comparing a stored version number against the dictionary's) before
    retrieving values from them.
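The mechanism can be sketched at the Python level. Below is a toy model with hypothetical names (the real patch works in C on PyDictObject entry pointers); mutable `[key, value]` cells stand in for entry pointers, and the version counter bumps only when a cell is created or destroyed, never on in-place value updates:

```python
class VersionedDict:
    """Toy dict whose entries are mutable [key, value] cells, standing
    in for C-level dict entry pointers.  The version counter bumps only
    when a cell is created or destroyed - i.e. when cached entry
    pointers could become invalid - not on in-place value updates."""

    def __init__(self):
        self._entries = {}        # key -> [key, value] cell
        self.version = 0

    def __setitem__(self, key, value):
        cell = self._entries.get(key)
        if cell is not None:
            cell[1] = value       # update in place: entries stay valid
        else:
            self._entries[key] = [key, value]
            self.version += 1     # new entry: caches must revalidate

    def __delitem__(self, key):
        del self._entries[key]
        self.version += 1

    def get_entry(self, key):
        return self._entries.get(key)


class FastGlobals:
    """Toy PyFastGlobalsObject: index-based access via cached entry
    cells, revalidated by comparing a stored version number against
    the dictionary's."""

    def __init__(self, d, names):
        self.d = d
        self.names = list(names)
        self.cached_version = -1          # force refresh on first use
        self.cells = [None] * len(self.names)

    def get(self, index):
        if self.cached_version != self.d.version:
            # Some entry pointer may be invalid: re-fetch them all.
            self.cells = [self.d.get_entry(n) for n in self.names]
            self.cached_version = self.d.version
        cell = self.cells[index]
        return None if cell is None else cell[1]


g = VersionedDict()
g["x"] = 1
fg = FastGlobals(g, ["x", "y"])
assert fg.get(0) == 1          # first access fills the cache
g["x"] = 5                     # no version bump, cache stays valid
assert fg.get(0) == 5          # reading through the cell sees the update
g["y"] = 2                     # insertion bumps the version...
assert fg.get(1) == 2          # ...so the cache is rebuilt here
```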

A script of microbenchmarks, fastglobals_test.py, is included in the
patch. My dual-core Intel T2300 1.66GHz on Ubuntu 7.04 (compiled
with -DNDEBUG -g -O3) gets the following results:

Test                     2.6a0 trunk  2.6a0 fastglobals  % time
----                     -----------  -----------------  ------
Dict ins/del (100000)          41.27              41.59    1.01
Dict get (100000)              21.37              21.35    1.00
Dict set (100000)              21.36              21.33    1.00
Local get (1000000)            15.64              15.60    1.00
Local set (1000000)            16.83              16.94    1.01
Global get (1000000)           21.09              17.04    0.81*
Global set (1000000)           34.15              22.80    0.67*
Builtin get (1000000)          30.99              17.04    0.55*
Function call (100000)         32.87              33.00    1.00
Listcomp (100000)              28.65              25.17    0.88*
Pystone 1.1 (500000)           12.46              11.68    0.94*
PYBENCH 2.0                     9.10               9.05    0.99
(* = probably significant)

All regression tests that aren't skipped pass, except test_gc. I've
probably got a refcount issue or an unbreakable cycle somewhere.

    @ntoronto ntoronto mannequin added type-feature A feature request or enhancement interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Nov 29, 2007
    @tiran
    Member

    tiran commented Nov 29, 2007

    When I run the code of test_gc.py test_function() in a shell I'm getting
    the expected result of 2 collected items:

    gc: collectable <dict 0xb78aa13c>
    gc: collectable <function 0xb78a9374>

    However the same code embedded in the test suite collects two additional
    objects:

    gc: collectable <dict 0x830061c>
    gc: collectable <function 0x82a3d54>
    gc: collectable <fastglobals 0x82a4244>
    gc: collectable <tuple 0x82a168c>

    I've used gc.set_debug(gc.DEBUG_LEAK) to get the information

    @arkanes
    Mannequin

    arkanes mannequin commented Nov 29, 2007

    In funcobject.c:PyFunction_New, the declarations of op and __name__ need
    to be moved to the top of the function to compile in MSVC, and if
    accepted the fastglobals.c file needs to be added to PCBuild.

    In the test script, the use of import * generates syntax warnings that
    make the output awkward to read, and the benchmark for dict_get is
    actually running dict_set. I'm attaching the fixed copy of the test
    script I used.

I see roughly the same speedups (MSVC 7.1, Windows XP, Intel Core2 Duo
@ 2.13GHz), but I see a 10% slowdown in the dict-insert/delete and
dict-set benchmarks, which I find very troubling.

Test             Trunk        fastglobals  Time difference
---------------  -----------  -----------  ---------------
Dict insert/del  2.002495495  2.207409125  1.102329134
Dict get         0.750253205  0.745576662  0.993766714
Dict set         0.982695921  1.114997766  1.13463152
Local get        0.533387029  0.51337118   0.96247406
Local set        0.596565774  0.614124914  1.029433703
Global get       0.935605073  0.731136584  0.78145855
Global set       1.48638532   1.03868462   0.69879903
Builtin get      1.392606367  0.735180673  0.52791707
Function call    1.938705781  1.716233004  0.885246756
List comp        1.547780105  1.188215756  0.767690289

    PyBench shows an overall slowdown - String mappings in particular,
    string/unicode predicates, and new instance creation all show
    significant slowdowns. The results are fairly verbose so I've uploaded
    them as a google docs spreadsheet at
    http://spreadsheets.google.com/ccc?key=p7g0z40g_NpvH5UpPTpr-Ag&hl=en

    I notice that you create a new PyFastGlobals object in every call to
    PyEval_EvalCode. This might explain some of the general case slowdown,
    is this really what you want to do?

    @arkanes
    Mannequin

    arkanes mannequin commented Nov 29, 2007

I may have had some old code in the build or something - I did a clean
rebuild to look at some of the slowdowns, and the fastglobals_test
benchmarks are now even better than Neil's. PyBench is still slower
overall, but it's mostly in float operations, which seems odd; I'd
discount it unless someone else can recreate it. I've updated the
spreadsheet I linked before with the updated timings, along with my
microbenchmark and pystone results. Very impressive speedup on pystone,
no doubt because of the heavy use of globals.

    @ntoronto
    Mannequin Author

    ntoronto mannequin commented Nov 29, 2007

    Christian: Thanks for that, I'll have to remember DEBUG_LEAK in the
    future. :)

    It looks like it may be acting correctly since there *is* now a
    fastglobals object that needs collecting, and also a new tuple built
    from co_names + "__builtins__". I don't know why it wouldn't collect
    those in the shell, though. The shell does create a new stack frame for
    every command (where a function's commands are all run in a single stack
    frame), but I don't know why that would matter. I'll look further into it.

    Chris: The build problems should be fixed in the next patch. Thanks for
    spending so much time on benchmarking this.

    Regarding PyEval_EvalCode creating a PyFastGlobalsObject: I'm not sure
whether it's the right thing to do, but I think it is. PyEval_EvalCode gets
    called for eval, exec, running an imported module's code, and everything
    in the interactive prompt - basically, running any code that's not in a
    function or generator. (I think that covers everything.) It seems like
    the patch has PyEval_EvalCode doing the right thing in general, but it
    may suffer a bit in benchmarks.

    @ntoronto
    Mannequin Author

    ntoronto mannequin commented Dec 1, 2007

    I've attached the latest patch, which should fix the build and compile
    problems on Windows. It also passes test_gc - I changed test_gc after
    verifying that it was working correctly. (There are now four objects to
    collect rather than two.) On my systems this passes all regression tests
    that aren't skipped.

    The patch also includes a new fastglobals_test.py, which has a few extra
    benchmarks.

    Benchmark results for both of my systems are here:

    http://spreadsheets.google.com/ccc?key=pHIJrYc_pnIVGXyUxWYFkLw&hl=en_US

    This covers pystones, pyfastglobals_test, and pybench. In the latter
    two, operations that have been most directly affected by the patch are
    in red.

Pystones is significantly faster. The other two show that the operations
that aren't supposed to be affected aren't significantly changed in
either direction (at least on these two systems). In pybench, an
average run is just a bit faster. I tested -O2 as well to verify that
the changes in performance aren't due to accidental optimization
differences.

    I added the three new parser minibenchmarks to see what would happen to
    normal, idiomatic Python code. I was surprised that it wasn't much
    better, what with all the xranges and lens and accessing a global
    variable all the time. Since this patch makes globals and builtins
    almost as fast as locals and doesn't slow anything else down, we might
conclude that global and builtin lookup has a lower impact on overall
    performance than most of us assumed. However, I'm still keen on the
    patch for two reasons (besides the fact that I wrote it :D):

    1. It gets rid of the "_len = len" anti-pattern. The obvious way to
      access globals and builtins is just as fast.

    2. It lays down a framework for speeding up class attribute access.
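For reference, the anti-pattern in point 1 looks like this (hypothetical function names; the aliasing trick trades readability for a LOAD_FAST in the loop, which this patch makes unnecessary):

```python
def count_short_aliased(words):
    # The "_len = len" anti-pattern: bind the builtin to a local so
    # the hot loop does LOAD_FAST instead of a LOAD_GLOBAL lookup.
    _len = len
    return sum(1 for w in words if _len(w) < 4)

def count_short_obvious(words):
    # With cached globals/builtins access, the obvious spelling is
    # nearly as fast, so the alias buys essentially nothing.
    return sum(1 for w in words if len(w) < 4)

assert count_short_aliased(["a", "bb", "cccc"]) == 2
assert count_short_obvious(["a", "bb", "cccc"]) == 2
```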

    What I mean by #2 is that _PyType_Lookup (looking up a method attribute)
    can be very slow when looking up inherited methods. I think I can make
    it take a near constant amount of time no matter how deep the
    inheritance hierarchy is. I think it would only take a per-class adapter
    (like PyFastGlobalsObject) that lazily caches class dict entry pointers
    to methods and clears itself if the MRO changes.

    Not today, though. :)

    @lpd
    Mannequin

    lpd mannequin commented Jan 5, 2008

    The proposed approach to speeding up lookup of inherited methods is not
    quite sound, given that class attributes can be added and removed
    dynamically. Consider:

    class A:
      def f(x): ...
    class B(A):
      pass
    class C(B):
      pass

    If C caches a pointer to A.f, the wrong thing will happen if B.f is
    defined dynamically, even though C's pointer will still point to a valid
    and up-to-date entry for f in A's dict, and C's MRO will not have changed.
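This scenario is easy to reproduce at the Python level; here, holding on to the entry in A.__dict__ stands in for the hypothetical cached entry pointer:

```python
class A:
    def f(self):
        return "A.f"

class B(A):
    pass

class C(B):
    pass

# Stand-in for C's cached pointer to the entry for "f" in A's dict.
cached_f = A.__dict__["f"]
assert C().f() == cached_f(C())      # cache and real lookup agree

# Define B.f dynamically: A's dict and C's MRO are both untouched,
# yet the correct resolution for C changes.
B.f = lambda self: "B.f"
assert C().f() == "B.f"              # real lookup: shadowed by B.f
assert cached_f(C()) == "A.f"        # cached entry: stale answer
```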

    I thought a sufficient fix would be for classes to increment not only
    their own 64-bit dict "version" but that of all classes in their MRO if
    an entry is ever added or removed in their dict. But even this is not
    sufficient. Consider:

    class A:
      def f(x): ...
    class B(A):
      pass
    class C(B):
      pass
    class D:
      pass
    class E(D,C):
      pass

    If D.f is defined dynamically, E's cached pointer to C.f will retrieve
    the wrong value. But C is not in D's MRO.
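This second scenario is also runnable directly, and shows why bumping the versions of the classes in one's own MRO is not enough:

```python
class A:
    def f(self):
        return "A.f"

class B(A):
    pass

class C(B):
    pass

class D:
    pass

class E(D, C):
    pass

# Stand-in for E's cached pointer to the entry for "f" in A's dict.
cached_f = A.__dict__["f"]
assert E().f() == cached_f(E())      # both resolve to A.f initially

# D precedes C in E's MRO, so D.f shadows the cached resolution -
# even though C is not in D's MRO, so C's version is never bumped.
D.f = lambda self: "D.f"
assert E().f() == "D.f"
assert cached_f(E()) == "A.f"        # stale
```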

    I haven't encountered this issue before in a system with multiple base
    classes (my extensive experience is with classic Smalltalk, and
    PostScript), so I don't have an off-the-cuff solution.

    @lpd
    Mannequin

    lpd mannequin commented Jan 5, 2008

    Sorry, I wrote "E's cached pointer to C.f", which of course should be
    "E's cached pointer to A.f".

    @nnorwitz
    Mannequin

    nnorwitz mannequin commented Feb 25, 2008

    I've gone over this at a high-level and looked for various outstanding
    issues. Below are my comments. I didn't delve into this in detail.
    There are a lot of questions too.

    I think this is very interesting and hope that we can get it working,
    100% robust, and verify the improved perf.

    In Include/fastglobals.h, num_entries says it is len(names). But
    there is also an entries field. This is confusing. The name should
    be num_names if it is the length of names.

    isbuiltin is an array of integers that hold boolean values? So most
    of the memory is wasted? How many entries do you expect in isbuiltin?
    This field should probably be packed using each bit. Its type should then
    be unsigned. If there were a small number of PyFastGlobalsObjects it
    wouldn't be a big deal. But from the comment it seems there will be
    one of these objects for each function.

    Did you measure the difference in memory size? For example the size
    at startup with and without the patch. Did you measure the startup
    time? How big is python at the end of a regression test run with and
    without this patch?

    Is there ever an access to the old globals in the old code? I wonder
    if it's worthwhile to change the names in ceval.c, etc.

    Does PyFastGlobals_GET_ITEM_INPLACE really need to be a macro? It's a
    pain to debug macros. If it were a static function in the header
    file, it would almost definitely get inlined.
    PyFastGlobals_ENTRY(op, index) should be captured in a local variable.

    I'm thinking that entryversion should be unsigned. Signed integer
    overflow is not defined (we had some problems recently with this).
    PyFastGlobals_VERSION_SET(op) can overflow.

    I like the removal of one pointer from the frame object.

    How does this interact with multiple interpreters? I see the static
    of __builtins__ in fastglobal.c. See PyInterpreterState in
    Include/pystate.h. (I see that __builtins__ is a string, so this
    access should be fine, but the question is still valid. I just want
    to make sure there aren't any bad interactions.)

    Regarding:
    /* XXX A clever adversary could prevent this from terminating */

    How about the loop is limited to 10 runs or something reasonable?
    That way there cannot be a DoS attack. You can raise a RuntimeError
    if the max iterations is reached.
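The suggested bound might look like this (a Python sketch with hypothetical names; the actual loop lives in the C patch):

```python
MAX_UPDATE_ATTEMPTS = 10   # "10 runs or something reasonable"

def update_until_stable(refresh, is_stable):
    """Retry a cache-refresh step until the dict stops changing, but
    cap the attempts: a hostile __eq__/__hash__ can mutate the dict
    during lookup, so an unbounded loop need never terminate."""
    for _ in range(MAX_UPDATE_ATTEMPTS):
        refresh()
        if is_stable():
            return
    raise RuntimeError("cache failed to stabilize; dict kept mutating")

# A refresh that stabilizes after three rounds succeeds...
state = {"rounds": 0}
update_until_stable(
    lambda: state.__setitem__("rounds", state["rounds"] + 1),
    lambda: state["rounds"] >= 3,
)
assert state["rounds"] == 3

# ...while an adversarial one that never stabilizes is cut off.
try:
    update_until_stable(lambda: None, lambda: False)
except RuntimeError:
    pass
else:
    raise AssertionError("expected RuntimeError")
```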

    Since PyFastGlobals_Update() is public, you should not assert that the
    op is a fast globals, but rather check and set
    PyErr_BadInternalCall(). That's true of all public APIs. Internal
    APIs are fine to have asserts.

    In fg_check_builtins() you may be able to speed this up by checking
    the most common paths first. I know this code is taken from
    frameobject. See
    http://coverage.livinglogic.de/Objects/frameobject.c.html . Hmmm,
    that isn't showing any data currently. I'll ping Walter and ask him
    if he can get that working again.

    I don't understand this comment in PyFastGlobals_CheckBuiltins:
    /* Make sure fg->builtins_entry is updated first */

    It says what the code does, but I don't understand *why* it does it.

    Instead of:
    		PyErr_NoMemory();
    		Py_DECREF(newnames);
    		return NULL;
    
    You can do:
    		Py_DECREF(newnames);
    		return PyErr_NoMemory();

    In PyFastGlobals_New(), you should verify that num_entries doesn't
    overflow before creating a new tuple by increasing the size by 1.

    In this code:
    	isbuiltin = PyMem_NEW(int, num_entries);
    	if (isbuiltin == NULL) {
    		PyErr_NoMemory();
    		Py_DECREF(newnames);
    		PyMem_FREE(isbuiltin);
    		return NULL;
    	}

    The free of isbuiltin should be entries.

If these arrays will be small (< 256 bytes), it would be better
(faster) to use pymalloc rather than the PyMem interface.

    PyDict_GetEntry should be prefixed with an underscore. I don't think
    this should become part of the public API. How about passing an int
    pointer to GetEntry that will be set if the key is found or clear if not
    found?

    This needs to be addressed:
    + /* This is a bad idea now: if gc breaks a cycle by clearing
    + f_fastglobals, when a generator is finally collected it does this
    + sequence of calls: gen_del->gen_close->gen_send_ex->
    + PyEval_EvalFrameEx. But PyEval_EvalFrameEx references
+ f_fastglobals->globals, which will be cleared by then, resulting
    + in a segfault.
    + ??? Is there a way to preserve this traversal? */
    + /* Py_VISIT(f->f_fastglobals); */

    @gvanrossum
    Member

    Making sure I look at this at least once carefully before releasing.

    @gvanrossum gvanrossum self-assigned this Mar 18, 2008
    @benjaminp
    Contributor

    Ping.

    @gvanrossum
    Member

    Won't have time myself.

    @gvanrossum gvanrossum removed their assignment Aug 5, 2008
    @collinwinter collinwinter mannequin added performance Performance or resource usage and removed type-feature A feature request or enhancement labels Jan 23, 2009
    @pitrou
    Member

    pitrou commented Mar 17, 2010

    There probably isn't any point in such a complication, until perhaps we have a JIT that magnifies the improvement.
    (but I assume the JIT will be able to treat references to globals and builtins as "frozen" using its own logic)

    @collinwinter
    Mannequin

    collinwinter mannequin commented Mar 17, 2010

    Antoine: yes, this optimization is already implemented in the Unladen JIT.

    @pitrou
    Member

    pitrou commented Mar 17, 2010

    Thanks Collin, recommend closing then.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Sep 18, 2010

    Closing as recommended in msg101239.

    @BreamoreBoy BreamoreBoy mannequin closed this as completed Sep 18, 2010
    @larryhastings
    Contributor

    It sort of looks like this was closed because we assumed we were moving to Unladen Swallow. We're not. Should this be reopened?

    @pitrou
    Member

    pitrou commented Feb 12, 2013

    See bpo-10401.
Quoting myself: “Here is the Nth patch for a globals/builtins cache. As with other caches of the same kind, it shows very small to no gain on non-micro benchmarks, showing that contrary to popular belief, globals/builtins lookups are not a major roadblock in today's Python performance.”

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022