classification
Title: Fast globals/builtins access (patch)
Type: performance Stage:
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, arkanes, benjamin.peterson, christian.heimes, collinwinter, ggenellina, giampaolo.rodola, gregory.p.smith, gvanrossum, jyasskin, kevinwatters, larry, lpd, nnorwitz, ntoronto, pitrou
Priority: normal Keywords: patch

Created on 2007-11-29 09:43 by ntoronto, last changed 2013-02-12 07:21 by pitrou. This issue is now closed.

Files
File name Uploaded Description
fastglobals.patch.txt ntoronto, 2007-11-29 09:43
fastglobals_test.py arkanes, 2007-11-29 16:07
fastglobals-1.patch.txt ntoronto, 2007-12-01 23:00
Messages (18)
msg57927 - (view) Author: Neil Toronto (ntoronto) Date: 2007-11-29 09:43
I've attached a patch that reduces LOAD_GLOBAL access time by 19% for 
globals and 45% for builtins. In my tests, that's barely slower than 
LOAD_FAST in both cases.

The approach is to cache pointers to dictionary entries rather than 
caching dictionary values as is usually suggested. Dictionaries keep a 
64-bit "version" internally that is incremented whenever at least one 
entry pointer would become invalid. PyFastGlobalsObject is a 
dictionary adapter, allowing quick access to it using an index instead 
of a key object. It ensures that its entry pointers are always valid 
(by comparing a stored version number against the dictionary's) before 
retrieving values from them.
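The mechanism can be modeled in pure Python (a toy sketch with hypothetical names; the actual patch caches C dict entry pointers, not Python objects): in-place value updates leave cached entries valid, while insertions and deletions bump the version and force revalidation.

```python
class Entry:
    """Stands in for a C dict entry that a pointer could be cached to."""
    __slots__ = ("value",)

class VersionedDict:
    """Dict that bumps a version counter whenever entries may move."""
    def __init__(self):
        self._entries = {}            # key -> Entry
        self.version = 0
    def __setitem__(self, key, value):
        e = self._entries.get(key)
        if e is None:                 # new key: entry layout may change
            e = Entry()
            self._entries[key] = e
            self.version += 1
        e.value = value               # in-place update: version unchanged
    def __delitem__(self, key):
        del self._entries[key]
        self.version += 1             # deletion invalidates cached entries
    def get_entry(self, key):
        return self._entries.get(key)

class FastGlobals:
    """Adapter giving index-based access via cached, versioned entries."""
    def __init__(self, d, names):
        self.d = d
        self.names = names
        self.entries = [None] * len(names)
        self.cached_version = -1      # stale: forces the first refresh
    def get(self, index):
        if self.cached_version != self.d.version:
            # Dict changed shape: re-look-up all cached entries.
            self.entries = [self.d.get_entry(n) for n in self.names]
            self.cached_version = self.d.version
        e = self.entries[index]
        return None if e is None else e.value

g = VersionedDict()
g["x"] = 1
fg = FastGlobals(g, ["x", "y"])
assert fg.get(0) == 1       # first access fills the cache
g["x"] = 42                 # in-place update: cached entry still valid
assert fg.get(0) == 42
g["y"] = 7                  # insertion bumps version: cache refreshed
assert fg.get(1) == 7
```

In the C patch the fast path is just a version compare and a pointer dereference, which is what makes it competitive with LOAD_FAST.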

A script of microbenchmarks, fastglobals_test.py, is included in the 
patch. My dual-core Intel T2300 1.66GHz on Ubuntu 7.04 (compiled 
with -DNDEBUG -g -O3) gets the following results:

Test                     2.6a0 trunk  2.6a0 fastglobals    % time
----                     -----------  -----------------  ----------
Dict ins/del    (100000)    41.27              41.59        1.01
Dict get        (100000)    21.37              21.35        1.00
Dict set        (100000)    21.36              21.33        1.00
Local get      (1000000)    15.64              15.60        1.00
Local set      (1000000)    16.83              16.94        1.01
Global get     (1000000)    21.09              17.04        0.81*
Global set     (1000000)    34.15              22.80        0.67*
Builtin get    (1000000)    30.99              17.04        0.55*
Function call   (100000)    32.87              33.00        1.00
Listcomp        (100000)    28.65              25.17        0.88*
Pystone 1.1     (500000)    12.46              11.68        0.94*
PYBENCH 2.0                  9.10               9.05        0.99
(* = probably significant)

All regression tests that aren't skipped pass, except test_gc. I've 
probably got a refcount issue or an unbreakable cycle somewhere.
msg57929 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2007-11-29 12:46
When I run the code of test_gc.py test_function() in a shell I'm getting
the expected result of 2 collected items:

gc: collectable <dict 0xb78aa13c>
gc: collectable <function 0xb78a9374>

However the same code embedded in the test suite collects two additional
objects:

gc: collectable <dict 0x830061c>
gc: collectable <function 0x82a3d54>
gc: collectable <fastglobals 0x82a4244>
gc: collectable <tuple 0x82a168c>

I've used gc.set_debug(gc.DEBUG_LEAK) to get this information.
msg57933 - (view) Author: Chris Mellon (arkanes) Date: 2007-11-29 16:07
In funcobject.c:PyFunction_New, the declarations of op and __name__ need
to be moved to the top of the function to compile in MSVC, and if
accepted the fastglobals.c file needs to be added to PCBuild.

In the test script, the use of import * generates syntax warnings that
make the output awkward to read, and the benchmark for dict_get is
actually running dict_set. I'm attaching the fixed copy of the test
script I used.

I see roughly the same speed ups (MSVC 7.1, Windows XP, Intel Core2 Duo
@ 2.13Ghz), but I see a 10% slowdown in the dict-insert/delete and
dict-set benchmarks which I find very troubling.

Test            Trunk          fastglobals    Time difference
--------------------------------------------------------------
Dict insert/del 2.002495495    2.207409125    1.102329134
Dict get        0.750253205    0.745576662    0.993766714
Dict set        0.982695921    1.114997766    1.13463152
Local get       0.533387029    0.51337118     0.96247406
Local set       0.596565774    0.614124914    1.029433703
Global get      0.935605073    0.731136584    0.78145855
Global set      1.48638532     1.03868462     0.69879903
Builtin get     1.392606367    0.735180673    0.52791707
Function call   1.938705781    1.716233004    0.885246756
List comp       1.547780105    1.188215756    0.767690289


PyBench shows an overall slowdown - string mappings in particular,
string/unicode predicates, and new instance creation all show
significant slowdowns. The results are fairly verbose so I've uploaded
them as a google docs spreadsheet at
http://spreadsheets.google.com/ccc?key=p7g0z40g_NpvH5UpPTpr-Ag&hl=en

I notice that you create a new PyFastGlobals object in every call to
PyEval_EvalCode. This might explain some of the general case slowdown,
is this really what you want to do?
msg57941 - (view) Author: Chris Mellon (arkanes) Date: 2007-11-29 18:05
I may have had some old code in the build or something - I did a clean
rebuild to look at some of the slowdowns and the fastglobals_test
benchmarks are now even better than Neil's. Pybench is still overall
slower but it's mostly in float operations, which seems odd and I'd
discount it unless someone else can recreate it. I've updated the
spreadsheet I linked before with the updated timings, along with my
microbenchmark and pystone results. Very impressive speedup on pystone,
no doubt because of the heavy use of globals.
msg57965 - (view) Author: Neil Toronto (ntoronto) Date: 2007-11-29 23:27
Christian: Thanks for that, I'll have to remember DEBUG_LEAK in the
future. :)

It looks like it may be acting correctly since there *is* now a
fastglobals object that needs collecting, and also a new tuple built
from co_names + "__builtins__". I don't know why it wouldn't collect
those in the shell, though. The shell does create a new stack frame for
every command (where a function's commands are all run in a single stack
frame), but I don't know why that would matter. I'll look further into it.

Chris: The build problems should be fixed in the next patch. Thanks for
spending so much time on benchmarking this.

Regarding PyEval_EvalCode creating a PyFastGlobalsObject: I'm not sure
whether it's right thing to do, but I think it is. PyEval_EvalCode gets
called for eval, exec, running an imported module's code, and everything
in the interactive prompt - basically, running any code that's not in a
function or generator. (I think that covers everything.) It seems like
the patch has PyEval_EvalCode doing the right thing in general, but it
may suffer a bit in benchmarks.
msg58071 - (view) Author: Neil Toronto (ntoronto) Date: 2007-12-01 23:00
I've attached the latest patch, which should fix the build and compile
problems on Windows. It also passes test_gc - I changed test_gc after
verifying that it was working correctly. (There are now four objects to
collect rather than two.) On my systems this passes all regression tests
that aren't skipped.

The patch also includes a new fastglobals_test.py, which has a few extra
benchmarks.

Benchmark results for both of my systems are here:

http://spreadsheets.google.com/ccc?key=pHIJrYc_pnIVGXyUxWYFkLw&hl=en_US

This covers pystones, pyfastglobals_test, and pybench. In the latter
two, operations that have been most directly affected by the patch are
in red.

Pystones is significantly faster. The other two show that the operations
that aren't supposed to be affected aren't significantly changed in either
direction (at least on these two systems). In pybench, an
average run is just a bit faster. I tested -O2 as well to verify that
changes in performance aren't due to accidental optimization differences.

I added the three new parser minibenchmarks to see what would happen to
normal, idiomatic Python code. I was surprised that it wasn't much
better, what with all the xranges and lens and accessing a global
variable all the time. Since this patch makes globals and builtins
almost as fast as locals and doesn't slow anything else down, we might
conclude that global and builtin lookup has a lower impact on overall
performance than most of us assumed. However, I'm still keen on the
patch for two reasons (besides the fact that I wrote it :D):

1. It gets rid of the "_len = len" anti-pattern. The obvious way to
access globals and builtins is just as fast.

2. It lays down a framework for speeding up class attribute access.
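For context, the "_len = len" anti-pattern mentioned in point 1 binds a builtin to a default argument (or local) so that a hot loop uses fast local lookup instead of global/builtin lookup:

```python
def total_len(strings, _len=len):        # the "_len = len" trick
    return sum(_len(s) for s in strings)

def total_len_plain(strings):            # the obvious spelling the patch
    return sum(len(s) for s in strings)  # aims to make equally fast

assert total_len(["ab", "cde"]) == total_len_plain(["ab", "cde"]) == 5
```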

What I mean by #2 is that _PyType_Lookup (looking up a method attribute)
can be very slow when looking up inherited methods. I think I can make
it take a near constant amount of time no matter how deep the
inheritance hierarchy is. I think it would only take a per-class adapter
(like PyFastGlobalsObject) that lazily caches class dict entry pointers
to methods and clears itself if the MRO changes.

Not today, though. :)
msg59324 - (view) Author: L. Peter Deutsch (lpd) Date: 2008-01-05 19:59
The proposed approach to speeding up lookup of inherited methods is not
quite sound, given that class attributes can be added and removed
dynamically. Consider:

class A:
  def f(x): ...
class B(A):
  pass
class C(B):
  pass

If C caches a pointer to A.f, the wrong thing will happen if B.f is
defined dynamically, even though C's pointer will still point to a valid
and up-to-date entry for f in A's dict, and C's MRO will not have changed.
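The hazard runs exactly as described (a runnable demonstration of the stale cache):

```python
class A:
    def f(self): return "A.f"
class B(A):
    pass
class C(B):
    pass

cached = A.__dict__["f"]          # what a naive cache would hold for C.f
B.f = lambda self: "B.f"          # added dynamically; C's MRO is unchanged
assert C().f() == "B.f"           # correct lookup now finds B.f
assert cached(C()) == "A.f"       # the cached pointer returns a stale result
```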

I thought a sufficient fix would be for classes to increment not only
their own 64-bit dict "version" but that of all classes in their MRO if
an entry is ever added or removed in their dict. But even this is not
sufficient. Consider:

class A:
  def f(x): ...
class B(A):
  pass
class C(B):
  pass
class D:
  pass
class E(D,C):
  pass

If D.f is defined dynamically, E's cached pointer to C.f will retrieve
the wrong value. But C is not in D's MRO.
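Again runnable: D.f shadows A.f for E even though C's MRO and C's dict are untouched:

```python
class A:
    def f(self): return "A.f"
class B(A):
    pass
class C(B):
    pass
class D:
    pass
class E(D, C):
    pass

# E's MRO is (E, D, C, B, A); a naive cache for E.f would hold A.f.
cached = A.__dict__["f"]
D.f = lambda self: "D.f"          # D is not in C's MRO, yet it shadows A.f for E
assert E().f() == "D.f"           # correct lookup finds D.f via E's MRO
assert cached(E()) == "A.f"       # the cached pointer is stale
```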

I haven't encountered this issue before in a system with multiple base
classes (my extensive experience is with classic Smalltalk, and
PostScript), so I don't have an off-the-cuff solution.
msg59325 - (view) Author: L. Peter Deutsch (lpd) Date: 2008-01-05 20:00
Sorry, I wrote "E's cached pointer to C.f", which of course should be
"E's cached pointer to A.f".
msg62972 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2008-02-25 05:39
I've gone over this at a high-level and looked for various outstanding
issues.  Below are my comments.  I didn't delve into this in detail. 
There are a lot of questions too.

I think this is very interesting and hope that we can get it working,
100% robust, and verify the improved perf.

In Include/fastglobals.h, num_entries says it is len(names).  But
there is also an entries field.  This is confusing.  The name should
be num_names if it is the length of names.

isbuiltin is an array of integers that hold boolean values?  So most
of the memory is wasted?  How many entries do you expect in isbuiltin?
This field should probably be packed using each bit.  Its type should then
be unsigned.  If there were a small number of PyFastGlobalsObjects it
wouldn't be a big deal.  But from the comment it seems there will be
one of these objects for each function.
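The packed representation being suggested is just a bit array; an illustrative sketch in Python (the patch itself would do this in C with bit operations on an unsigned array):

```python
class BitArray:
    """N booleans stored in ceil(N/8) bytes instead of N machine ints."""
    def __init__(self, n):
        self.bits = bytearray((n + 7) // 8)
    def set(self, i, value):
        if value:
            self.bits[i >> 3] |= 1 << (i & 7)
        else:
            self.bits[i >> 3] &= ~(1 << (i & 7))
    def get(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

isbuiltin = BitArray(10)
isbuiltin.set(3, True)
assert isbuiltin.get(3) and not isbuiltin.get(4)
```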

Did you measure the difference in memory size?  For example the size
at startup with and without the patch.  Did you measure the startup
time?  How big is python at the end of a regression test run with and
without this patch?

Is there ever an access to the old globals in the old code?  I wonder
if it's worthwhile to change the names in ceval.c, etc.

Does PyFastGlobals_GET_ITEM_INPLACE really need to be a macro?  It's a
pain to debug macros.  If it were a static function in the header
file, it would almost definitely get inlined.  
PyFastGlobals_ENTRY(op, index) should be captured in a local variable.

I'm thinking that entryversion should be unsigned.  Signed integer
overflow is not defined (we had some problems recently with this).
PyFastGlobals_VERSION_SET(op) can overflow.

I like the removal of one pointer from the frame object.

How does this interact with multiple interpreters?  I see the static
of __builtins__ in fastglobals.c.  See PyInterpreterState in
Include/pystate.h.  (I see that __builtins__ is a string, so this
access should be fine, but the question is still valid.  I just want
to make sure there aren't any bad interactions.)

Regarding:
/* XXX A clever adversary could prevent this from terminating */

How about limiting the loop to 10 runs or something reasonable?
That way there cannot be a DoS attack.  You can raise a RuntimeError
if the maximum number of iterations is reached.

Since PyFastGlobals_Update() is public, you should not assert that the
op is a fast globals, but rather check and set
PyErr_BadInternalCall().  That's true of all public APIs.  Internal
APIs are fine to have asserts.

In fg_check_builtins() you may be able to speed this up by checking
the most common paths first.  I know this code is taken from
frameobject.  See
http://coverage.livinglogic.de/Objects/frameobject.c.html .  Hmmm,
that isn't showing any data currently.  I'll ping Walter and ask him
if he can get that working again.

I don't understand this comment in PyFastGlobals_CheckBuiltins:
  /* Make sure fg->builtins_entry is updated first */

It says what the code does, but I don't understand *why* it does it.

Instead of:
		PyErr_NoMemory();
		Py_DECREF(newnames);
		return NULL;

You can do:
		Py_DECREF(newnames);
		return PyErr_NoMemory();

In PyFastGlobals_New(), you should verify that num_entries doesn't
overflow before creating a new tuple by increasing the size by 1.

In this code:
	isbuiltin = PyMem_NEW(int, num_entries);
	if (isbuiltin == NULL) {
		PyErr_NoMemory();
		Py_DECREF(newnames);
		PyMem_FREE(isbuiltin);
		return NULL;
	}

The free of isbuiltin should be entries.

If these arrays will be small (< 256 bytes), it would be better (faster)
to use
pymalloc rather than the PyMem interface.

PyDict_GetEntry should be prefixed with an underscore.  I don't think
this should become part of the public API.  How about passing an int
pointer to GetEntry that will be set if the key is found or clear if not
found?

This needs to be addressed:
+	/* This is a bad idea now: if gc breaks a cycle by clearing
+	f_fastglobals, when a generator is finally collected it does this
+	sequence of calls: gen_del->gen_close->gen_send_ex->
+	PyEval_EvalFrameEx. But PyEval_EvalFrameEx references
+	f_fastglobals->globals, which will be cleared by then, resulting
+	in a segfault.
+	??? Is there a way to preserve this traversal? */
+	/* Py_VISIT(f->f_fastglobals); */
msg63933 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-03-18 16:06
Making sure I look at this at least once carefully before releasing.
msg70467 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-07-31 02:03
Ping.
msg70746 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-05 16:01
Won't have time myself.
msg101237 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-17 17:36
There probably isn't any point in such a complication, until perhaps we have a JIT that magnifies the improvement.
(but I assume the JIT will be able to treat references to globals and builtins as "frozen" using its own logic)
msg101238 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2010-03-17 17:41
Antoine: yes, this optimization is already implemented in the Unladen JIT.
msg101239 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-17 17:43
Thanks Collin, recommend closing then.
msg116795 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-09-18 15:20
Closing as recommended in msg101239.
msg181942 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2013-02-12 06:02
It sort of looks like this was closed because we assumed we were moving to Unladen Swallow.  We're not.  Should this be reopened?
msg181946 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-12 07:21
See issue 10401.
Quoting myself: “Here is the Nth patch for a globals/builtins cache. As with other caches of the same kind, it shows very small to no gain on non-micro benchmarks, showing that contrary to popular belief, globals/builtins lookups are not a major roadblock in today's Python performance.”
History
Date User Action Args
2013-02-12 07:21:38pitrousetmessages: + msg181946
2013-02-12 06:02:08larrysetnosy: + larry
messages: + msg181942
2010-09-18 15:20:45BreamoreBoysetstatus: open -> closed
nosy: + BreamoreBoy
messages: + msg116795

2010-03-17 17:43:05pitrousetmessages: + msg101239
2010-03-17 17:41:55collinwintersetmessages: + msg101238
2010-03-17 17:36:34pitrousetmessages: + msg101237
2010-02-15 17:18:25kevinwatterssetnosy: + kevinwatters
2009-05-24 16:32:34pitrousetnosy: + pitrou
2009-02-21 06:52:55gregory.p.smithsetnosy: + gregory.p.smith
2009-01-23 19:48:26collinwintersetnosy: + collinwinter, jyasskin
type: enhancement -> performance
2008-08-05 16:01:47gvanrossumsetpriority: critical -> normal
assignee: gvanrossum ->
messages: + msg70746
2008-07-31 02:03:22benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg70467
2008-03-26 15:36:54giampaolo.rodolasetnosy: + giampaolo.rodola
2008-03-18 16:06:36gvanrossumsetpriority: critical
assignee: gvanrossum
messages: + msg63933
2008-02-25 05:39:42nnorwitzsetnosy: + nnorwitz
messages: + msg62972
2008-02-18 11:59:08ggenellinasetnosy: + ggenellina
2008-01-05 20:00:03lpdsetmessages: + msg59325
2008-01-05 19:59:03lpdsetnosy: + lpd
messages: + msg59324
2007-12-01 23:01:03ntorontosetfiles: + fastglobals-1.patch.txt
messages: + msg58071
2007-11-29 23:27:26ntorontosetmessages: + msg57965
2007-11-29 22:29:01gvanrossumsetkeywords: + patch
nosy: + gvanrossum
2007-11-29 18:05:55arkanessetmessages: + msg57941
2007-11-29 16:07:37arkanessetfiles: + fastglobals_test.py
nosy: + arkanes
messages: + msg57933
2007-11-29 12:46:37christian.heimessetnosy: + christian.heimes
messages: + msg57929
2007-11-29 09:43:57ntorontocreate