classification
Title: Fast globals/builtins access (patch)
Type: performance Stage:
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, arkanes, benjamin.peterson, christian.heimes, collinwinter, ggenellina, giampaolo.rodola, gregory.p.smith, gvanrossum, jyasskin, kevinwatters, larry, lpd, nnorwitz, ntoronto, pitrou
Priority: normal Keywords: patch

Created on 2007-11-29 09:43 by ntoronto, last changed 2013-02-12 07:21 by pitrou. This issue is now closed.

Files
File name Uploaded Description
fastglobals.patch.txt ntoronto, 2007-11-29 09:43
fastglobals_test.py arkanes, 2007-11-29 16:07
fastglobals-1.patch.txt ntoronto, 2007-12-01 23:00
Messages (18)
msg57927 - (view) Author: Neil Toronto (ntoronto) Date: 2007-11-29 09:43
I've attached a patch that reduces LOAD_GLOBAL access time by 19% for 
globals and 45% for builtins. In my tests, that's barely slower than 
LOAD_FAST in both cases.

The approach is to cache pointers to dictionary entries rather than 
caching dictionary values as is usually suggested. Dictionaries keep a 
64-bit "version" internally that is incremented whenever at least one 
entry pointer would become invalid. PyFastGlobalsObject is a 
dictionary adapter, allowing quick access to it using an index instead 
of a key object. It ensures that its entry pointers are always valid 
(by comparing a stored version number against the dictionary's) before 
retrieving values from them.
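The mechanism can be modeled in pure Python (a toy sketch with hypothetical names; the actual patch caches C dict entry pointers, not Python objects): in-place value updates leave cached entries valid, while insertions and deletions bump the version and force revalidation.

```python
class Entry:
    """Stands in for a C dict entry that a pointer could be cached to."""
    __slots__ = ("value",)

class VersionedDict:
    """Dict that bumps a version counter whenever entries may move."""
    def __init__(self):
        self._entries = {}            # key -> Entry
        self.version = 0
    def __setitem__(self, key, value):
        e = self._entries.get(key)
        if e is None:                 # new key: entry layout may change
            e = Entry()
            self._entries[key] = e
            self.version += 1
        e.value = value               # in-place update: version unchanged
    def __delitem__(self, key):
        del self._entries[key]
        self.version += 1             # deletion invalidates cached entries
    def get_entry(self, key):
        return self._entries.get(key)

class FastGlobals:
    """Adapter giving index-based access via cached, versioned entries."""
    def __init__(self, d, names):
        self.d = d
        self.names = names
        self.entries = [None] * len(names)
        self.cached_version = -1      # stale: forces the first refresh
    def get(self, index):
        if self.cached_version != self.d.version:
            # Dict changed shape: re-look-up all cached entries.
            self.entries = [self.d.get_entry(n) for n in self.names]
            self.cached_version = self.d.version
        e = self.entries[index]
        return None if e is None else e.value

g = VersionedDict()
g["x"] = 1
fg = FastGlobals(g, ["x", "y"])
assert fg.get(0) == 1       # first access fills the cache
g["x"] = 42                 # in-place update: cached entry still valid
assert fg.get(0) == 42
g["y"] = 7                  # insertion bumps version: cache refreshed
assert fg.get(1) == 7
```

In the C patch the fast path is just a version compare and a pointer dereference, which is what makes it competitive with LOAD_FAST.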

A script of microbenchmarks, fastglobals_test.py, is included in the 
patch. My dual-core Intel T2300 1.66GHz on Ubuntu 7.04 (compiled 
with -DNDEBUG -g -O3) gets the following results:

Test                     2.6a0 trunk  2.6a0 fastglobals    % time
----                     -----------  -----------------  ----------
Dict ins/del    (100000)    41.27              41.59        1.01
Dict get        (100000)    21.37              21.35        1.00
Dict set        (100000)    21.36              21.33        1.00
Local get      (1000000)    15.64              15.60        1.00
Local set      (1000000)    16.83              16.94        1.01
Global get     (1000000)    21.09              17.04        0.81*
Global set     (1000000)    34.15              22.80        0.67*
Builtin get    (1000000)    30.99              17.04        0.55*
Function call   (100000)    32.87              33.00        1.00
Listcomp        (100000)    28.65              25.17        0.88*
Pystone 1.1     (500000)    12.46              11.68        0.94*
PYBENCH 2.0                  9.10               9.05        0.99
(* = probably significant)

All regression tests that aren't skipped pass, except test_gc. I've 
probably got a refcount issue or an unbreakable cycle somewhere.
msg57929 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-11-29 12:46
When I run the code of test_gc.py test_function() in a shell I'm getting
the expected result of 2 collected items:

gc: collectable <dict 0xb78aa13c>
gc: collectable <function 0xb78a9374>

However the same code embedded in the test suite collects two additional
objects:

gc: collectable <dict 0x830061c>
gc: collectable <function 0x82a3d54>
gc: collectable <fastglobals 0x82a4244>
gc: collectable <tuple 0x82a168c>

I've used gc.set_debug(gc.DEBUG_LEAK) to get this information.
msg57933 - (view) Author: Chris Mellon (arkanes) Date: 2007-11-29 16:07
In funcobject.c:PyFunction_New, the declarations of op and __name__ need
to be moved to the top of the function to compile in MSVC, and if
accepted the fastglobals.c file needs to be added to PCBuild.

In the test script, the use of import * generates syntax warnings that
make the output awkward to read, and the benchmark for dict_get is
actually running dict_set. I'm attaching the fixed copy of the test
script I used.

I see roughly the same speed ups (MSVC 7.1, Windows XP, Intel Core2 Duo
@ 2.13Ghz), but I see a 10% slowdown in the dict-insert/delete and
dict-set benchmarks which I find very troubling.

Test            Trunk          fastglobals    Time difference
--------------------------------------------------------------
Dict insert/del 2.002495495    2.207409125    1.102329134
Dict get        0.750253205    0.745576662    0.993766714
Dict set        0.982695921    1.114997766    1.13463152
Local get       0.533387029    0.51337118     0.96247406
Local set       0.596565774    0.614124914    1.029433703
Global get      0.935605073    0.731136584    0.78145855
Global set      1.48638532     1.03868462     0.69879903
Builtin get     1.392606367    0.735180673    0.52791707
Function call   1.938705781    1.716233004    0.885246756
List comp       1.547780105    1.188215756    0.767690289


PyBench shows an overall slowdown - string mappings in particular,
string/unicode predicates, and new instance creation all show
significant slowdowns. The results are fairly verbose so I've uploaded
them as a google docs spreadsheet at
http://spreadsheets.google.com/ccc?key=p7g0z40g_NpvH5UpPTpr-Ag&hl=en

I notice that you create a new PyFastGlobals object in every call to
PyEval_EvalCode. This might explain some of the general case slowdown,
is this really what you want to do?
msg57941 - (view) Author: Chris Mellon (arkanes) Date: 2007-11-29 18:05
I may have had some old code in the build or something - I did a clean
rebuild to look at some of the slowdowns and the fastglobals_test
benchmarks are now even better than Neil's. Pybench is still overall
slower but it's mostly in float operations, which seems odd and I'd
discount it unless someone else can recreate it. I've updated the
spreadsheet I linked before with the updated timings, along with my
microbenchmark and pystone results. Very impressive speedup on pystone,
no doubt because of the heavy use of globals.
msg57965 - (view) Author: Neil Toronto (ntoronto) Date: 2007-11-29 23:27
Christian: Thanks for that, I'll have to remember DEBUG_LEAK in the
future. :)

It looks like it may be acting correctly since there *is* now a
fastglobals object that needs collecting, and also a new tuple built
from co_names + "__builtins__". I don't know why it wouldn't collect
those in the shell, though. The shell does create a new stack frame for
every command (where a function's commands are all run in a single stack
frame), but I don't know why that would matter. I'll look further into it.

Chris: The build problems should be fixed in the next patch. Thanks for
spending so much time on benchmarking this.

Regarding PyEval_EvalCode creating a PyFastGlobalsObject: I'm not sure
whether it's right thing to do, but I think it is. PyEval_EvalCode gets
called for eval, exec, running an imported module's code, and everything
in the interactive prompt - basically, running any code that's not in a
function or generator. (I think that covers everything.) It seems like
the patch has PyEval_EvalCode doing the right thing in general, but it
may suffer a bit in benchmarks.
msg58071 - (view) Author: Neil Toronto (ntoronto) Date: 2007-12-01 23:00
I've attached the latest patch, which should fix the build and compile
problems on Windows. It also passes test_gc - I changed test_gc after
verifying that it was working correctly. (There are now four objects to
collect rather than two.) On my systems this passes all regression tests
that aren't skipped.

The patch also includes a new fastglobals_test.py, which has a few extra
benchmarks.

Benchmark results for both of my systems are here:

http://spreadsheets.google.com/ccc?key=pHIJrYc_pnIVGXyUxWYFkLw&hl=en_US

This covers pystones, pyfastglobals_test, and pybench. In the latter
two, operations that have been most directly affected by the patch are
in red.

Pystones is significantly faster. The other two show that the operations
that aren't supposed to be affected aren't significantly changed in either
direction (at least on these two systems). In pybench, an
average run is just a bit faster. I tested -O2 as well to verify that
changes in performance aren't due to accidental optimization differences.

I added the three new parser minibenchmarks to see what would happen to
normal, idiomatic Python code. I was surprised that it wasn't much
better, what with all the xranges and lens and accessing a global
variable all the time. Since this patch makes globals and builtins
almost as fast as locals and doesn't slow anything else down, we might
conclude that global and builtin lookup has a lower impact on overall
performance than most of us assumed. However, I'm still keen on the
patch for two reasons (besides the fact that I wrote it :D):

1. It gets rid of the "_len = len" anti-pattern. The obvious way to
access globals and builtins is just as fast.

2. It lays down a framework for speeding up class attribute access.
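For context, the "_len = len" anti-pattern mentioned in point 1 binds a builtin to a default argument (or local) so that a hot loop uses fast local lookup instead of global/builtin lookup:

```python
def total_len(strings, _len=len):        # the "_len = len" trick
    return sum(_len(s) for s in strings)

def total_len_plain(strings):            # the obvious spelling the patch
    return sum(len(s) for s in strings)  # aims to make equally fast

assert total_len(["ab", "cde"]) == total_len_plain(["ab", "cde"]) == 5
```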

What I mean by #2 is that _PyType_Lookup (looking up a method attribute)
can be very slow when looking up inherited methods. I think I can make
it take a near constant amount of time no matter how deep the
inheritance hierarchy is. I think it would only take a per-class adapter
(like PyFastGlobalsObject) that lazily caches class dict entry pointers
to methods and clears itself if the MRO changes.

Not today, though. :)
msg59324 - (view) Author: L. Peter Deutsch (lpd) Date: 2008-01-05 19:59
The proposed approach to speeding up lookup of inherited methods is not
quite sound, given that class attributes can be added and removed
dynamically. Consider:

class A:
  def f(x): ...
class B(A):
  pass
class C(B):
  pass

If C caches a pointer to A.f, the wrong thing will happen if B.f is
defined dynamically, even though C's pointer will still point to a valid
and up-to-date entry for f in A's dict, and C's MRO will not have changed.
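The hazard runs exactly as described (a runnable demonstration of the stale cache):

```python
class A:
    def f(self): return "A.f"
class B(A):
    pass
class C(B):
    pass

cached = A.__dict__["f"]          # what a naive cache would hold for C.f
B.f = lambda self: "B.f"          # added dynamically; C's MRO is unchanged
assert C().f() == "B.f"           # correct lookup now finds B.f
assert cached(C()) == "A.f"       # the cached pointer returns a stale result
```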

I thought a sufficient fix would be for classes to increment not only
their own 64-bit dict "version" but that of all classes in their MRO if
an entry is ever added or removed in their dict. But even this is not
sufficient. Consider:

class A:
  def f(x): ...
class B(A):
  pass
class C(B):
  pass
class D:
  pass
class E(D,C):
  pass

If D.f is defined dynamically, E's cached pointer to C.f will retrieve
the wrong value. But C is not in D's MRO.
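Again runnable: D.f shadows A.f for E even though C's MRO and C's dict are untouched:

```python
class A:
    def f(self): return "A.f"
class B(A):
    pass
class C(B):
    pass
class D:
    pass
class E(D, C):
    pass

# E's MRO is (E, D, C, B, A); a naive cache for E.f would hold A.f.
cached = A.__dict__["f"]
D.f = lambda self: "D.f"          # D is not in C's MRO, yet it shadows A.f for E
assert E().f() == "D.f"           # correct lookup finds D.f via E's MRO
assert cached(E()) == "A.f"       # the cached pointer is stale
```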

I haven't encountered this issue before in a system with multiple base
classes (my extensive experience is with classic Smalltalk, and
PostScript), so I don't have an off-the-cuff solution.
msg59325 - (view) Author: L. Peter Deutsch (lpd) Date: 2008-01-05 20:00
Sorry, I wrote "E's cached pointer to C.f", which of course should be
"E's cached pointer to A.f".
msg62972 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2008-02-25 05:39
I've gone over this at a high-level and looked for various outstanding
issues.  Below are my comments.  I didn't delve into this in detail. 
There are a lot of questions too.

I think this is very interesting and hope that we can get it working,
100% robust, and verify the improved perf.

In Include/fastglobals.h, num_entries says it is len(names).  But
there is also an entries field.  This is confusing.  The name should
be num_names if it is the length of names.

isbuiltin is an array of integers that hold boolean values?  So most
of the memory is wasted?  How many entries do you expect in isbuiltin?
This field should probably be packed using each bit.  Its type should then
be unsigned.  If there were a small number of PyFastGlobalsObjects it
wouldn't be a big deal.  But from the comment it seems there will be
one of these objects for each function.
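The packed representation being suggested is just a bit array; an illustrative sketch in Python (the patch itself would do this in C with bit operations on an unsigned array):

```python
class BitArray:
    """N booleans stored in ceil(N/8) bytes instead of N machine ints."""
    def __init__(self, n):
        self.bits = bytearray((n + 7) // 8)
    def set(self, i, value):
        if value:
            self.bits[i >> 3] |= 1 << (i & 7)
        else:
            self.bits[i >> 3] &= ~(1 << (i & 7))
    def get(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

isbuiltin = BitArray(10)
isbuiltin.set(3, True)
assert isbuiltin.get(3) and not isbuiltin.get(4)
```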

Did you measure the difference in memory size?  For example the size
at startup with and without the patch.  Did you measure the startup
time?  How big is python at the end of a regression test run with and
without this patch?

Is there ever an access to the old globals in the old code?  I wonder
if it's worthwhile to change the names in ceval.c, etc.

Does PyFastGlobals_GET_ITEM_INPLACE really need to be a macro?  It's a
pain to debug macros.  If it were a static function in the header
file, it would almost definitely get inlined.  
PyFastGlobals_ENTRY(op, index) should be captured in a local variable.

I'm thinking that entryversion should be unsigned.  Signed integer
overflow is not defined (we had some problems recently with this).
PyFastGlobals_VERSION_SET(op) can overflow.

I like the removal of one pointer from the frame object.

How does this interact with multiple interpreters?  I see the static
of __builtins__ in fastglobals.c.  See PyInterpreterState in
Include/pystate.h.  (I see that __builtins__ is a string, so this
access should be fine, but the question is still valid.  I just want
to make sure there aren't any bad interactions.)

Regarding:
/* XXX A clever adversary could prevent this from terminating */

How about limiting the loop to 10 runs or something reasonable?
That way there cannot be a DoS attack.  You can raise a RuntimeError
if the maximum number of iterations is reached.

Since PyFastGlobals_Update() is public, you should not assert that the
op is a fast globals, but rather check and set
PyErr_BadInternalCall().  That's true of all public APIs.  Internal
APIs are fine to have asserts.

In fg_check_builtins() you may be able to speed this up by checking
the most common paths first.  I know this code is taken from
frameobject.  See
http://coverage.livinglogic.de/Objects/frameobject.c.html .  Hmmm,
that isn't showing any data currently.  I'll ping Walter and ask him
if he can get that working again.

I don't understand this comment in PyFastGlobals_CheckBuiltins:
  /* Make sure fg->builtins_entry is updated first */

It says what the code does, but I don't understand *why* it does it.

Instead of:
		PyErr_NoMemory();
		Py_DECREF(newnames);
		return NULL;

You can do:
		Py_DECREF(newnames);
		return PyErr_NoMemory();

In PyFastGlobals_New(), you should verify that num_entries doesn't
overflow before creating a new tuple by increasing the size by 1.

In this code:
	isbuiltin = PyMem_NEW(int, num_entries);
	if (isbuiltin == NULL) {
		PyErr_NoMemory();
		Py_DECREF(newnames);
		PyMem_FREE(isbuiltin);
		return NULL;
	}

The free of isbuiltin should be entries.

If these arrays will be small (< 256 bytes), it would be better (faster)
to use
pymalloc rather than the PyMem interface.

PyDict_GetEntry should be prefixed with an underscore.  I don't think
this should become part of the public API.  How about passing an int
pointer to GetEntry that will be set if the key is found or clear if not
found?

This needs to be addressed:
+	/* This is a bad idea now: if gc breaks a cycle by clearing
+	f_fastglobals, when a generator is finally collected it does this
+	sequence of calls: gen_del->gen_close->gen_send_ex->
+	PyEval_EvalFrameEx. But PyEval_EvalFrameEx references
+	f_fastglobals->globals, which will be cleared by then, resulting
+	in a segfault.
+	??? Is there a way to preserve this traversal? */
+	/* Py_VISIT(f->f_fastglobals); */
msg63933 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-03-18 16:06
Making sure I look at this at least once carefully before releasing.
msg70467 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-07-31 02:03
Ping.
msg70746 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-05 16:01
Won't have time myself.
msg101237 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-17 17:36
There probably isn't any point in such a complication, until perhaps we have a JIT that magnifies the improvement.
(but I assume the JIT will be able to treat references to globals and builtins as "frozen" using its own logic)
msg101238 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2010-03-17 17:41
Antoine: yes, this optimization is already implemented in the Unladen JIT.
msg101239 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-17 17:43
Thanks Collin, recommend closing then.
msg116795 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-18 15:20
Closing as recommended in msg101239.
msg181942 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-02-12 06:02
It sort of looks like this was closed because we assumed we were moving to Unladen Swallow.  We're not.  Should this be reopened?
msg181946 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-12 07:21
See issue 10401.
Quoting myself: “Here is the Nth patch for a globals/builtins cache. As with other caches of the same kind, it shows very small to no gain on non-micro benchmarks, showing that contrary to popular belief, globals/builtins lookups are not a major roadblock in today's Python performance.”
History
Date User Action Args
2013-02-12 07:21:38pitrousetmessages: + msg181946
2013-02-12 06:02:08larrysetnosy: + larry
messages: + msg181942
2010-09-18 15:20:45BreamoreBoysetstatus: open -> closed
nosy: + BreamoreBoy
messages: + msg116795

2010-03-17 17:43:05pitrousetmessages: + msg101239
2010-03-17 17:41:55collinwintersetmessages: + msg101238
2010-03-17 17:36:34pitrousetmessages: + msg101237
2010-02-15 17:18:25kevinwatterssetnosy: + kevinwatters
2009-05-24 16:32:34pitrousetnosy: + pitrou
2009-02-21 06:52:55gregory.p.smithsetnosy: + gregory.p.smith
2009-01-23 19:48:26collinwintersetnosy: + collinwinter, jyasskin
type: enhancement -> performance
2008-08-05 16:01:47gvanrossumsetpriority: critical -> normal
assignee: gvanrossum ->
messages: + msg70746
2008-07-31 02:03:22benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg70467
2008-03-26 15:36:54giampaolo.rodolasetnosy: + giampaolo.rodola
2008-03-18 16:06:36gvanrossumsetpriority: critical
assignee: gvanrossum
messages: + msg63933
2008-02-25 05:39:42nnorwitzsetnosy: + nnorwitz
messages: + msg62972
2008-02-18 11:59:08ggenellinasetnosy: + ggenellina
2008-01-05 20:00:03lpdsetmessages: + msg59325
2008-01-05 19:59:03lpdsetnosy: + lpd
messages: + msg59324
2007-12-01 23:01:03ntorontosetfiles: + fastglobals-1.patch.txt
messages: + msg58071
2007-11-29 23:27:26ntorontosetmessages: + msg57965
2007-11-29 22:29:01gvanrossumsetkeywords: + patch
nosy: + gvanrossum
2007-11-29 18:05:55arkanessetmessages: + msg57941
2007-11-29 16:07:37arkanessetfiles: + fastglobals_test.py
nosy: + arkanes
messages: + msg57933
2007-11-29 12:46:37christian.heimessetnosy: + christian.heimes
messages: + msg57929
2007-11-29 09:43:57ntorontocreate