
Fast globals/builtins access (patch) #45859

Closed
ntoronto mannequin opened this issue Nov 29, 2007 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@ntoronto
Mannequin

ntoronto mannequin commented Nov 29, 2007

BPO 1518
Nosy @gvanrossum, @gpshead, @pitrou, @larryhastings, @giampaolo, @tiran, @benjaminp
Files
  • fastglobals.patch.txt
  • fastglobals_test.py
  • fastglobals-1.patch.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-09-18.15:20:45.504>
    created_at = <Date 2007-11-29.09:43:57.063>
    labels = ['interpreter-core', 'performance']
    title = 'Fast globals/builtins access (patch)'
    updated_at = <Date 2013-02-12.07:21:38.703>
    user = 'https://bugs.python.org/ntoronto'

    bugs.python.org fields:

    activity = <Date 2013-02-12.07:21:38.703>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-09-18.15:20:45.504>
    closer = 'BreamoreBoy'
    components = ['Interpreter Core']
    creation = <Date 2007-11-29.09:43:57.063>
    creator = 'ntoronto'
    dependencies = []
    files = ['8821', '8823', '8846']
    hgrepos = []
    issue_num = 1518
    keywords = ['patch']
    message_count = 18.0
    messages = ['57927', '57929', '57933', '57941', '57965', '58071', '59324', '59325', '62972', '63933', '70467', '70746', '101237', '101238', '101239', '116795', '181942', '181946']
    nosy_count = 16.0
    nosy_names = ['gvanrossum', 'nnorwitz', 'collinwinter', 'gregory.p.smith', 'lpd', 'ggenellina', 'pitrou', 'larry', 'giampaolo.rodola', 'christian.heimes', 'arkanes', 'jyasskin', 'kevinwatters', 'ntoronto', 'benjamin.peterson', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue1518'
    versions = ['Python 2.6']

    @ntoronto
    Mannequin Author

    ntoronto mannequin commented Nov 29, 2007

    I've attached a patch that reduces LOAD_GLOBAL access time by 19% for
    globals and 45% for builtins. In my tests, that's barely slower than
    LOAD_FAST in both cases.

    The approach is to cache pointers to dictionary entries rather than
    caching dictionary values as is usually suggested. Dictionaries keep a
    64-bit "version" internally that is incremented whenever at least one
    entry pointer would become invalid. PyFastGlobalsObject is a
    dictionary adapter, allowing quick access to it using an index instead
    of a key object. It ensures that its entry pointers are always valid
    (by comparing a stored version number against the dictionary's) before
    retrieving values from them.
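The mechanism can be sketched at the Python level. Below is a toy model with hypothetical names (the real patch works in C on PyDictObject entry pointers); mutable `[key, value]` cells stand in for entry pointers, and the version counter bumps only when a cell is created or destroyed, never on in-place value updates:

```python
class VersionedDict:
    """Toy dict whose entries are mutable [key, value] cells, standing
    in for C-level dict entry pointers.  The version counter bumps only
    when a cell is created or destroyed - i.e. when cached entry
    pointers could become invalid - not on in-place value updates."""

    def __init__(self):
        self._entries = {}        # key -> [key, value] cell
        self.version = 0

    def __setitem__(self, key, value):
        cell = self._entries.get(key)
        if cell is not None:
            cell[1] = value       # update in place: entries stay valid
        else:
            self._entries[key] = [key, value]
            self.version += 1     # new entry: caches must revalidate

    def __delitem__(self, key):
        del self._entries[key]
        self.version += 1

    def get_entry(self, key):
        return self._entries.get(key)


class FastGlobals:
    """Toy PyFastGlobalsObject: index-based access via cached entry
    cells, revalidated by comparing a stored version number against
    the dictionary's."""

    def __init__(self, d, names):
        self.d = d
        self.names = list(names)
        self.cached_version = -1          # force refresh on first use
        self.cells = [None] * len(self.names)

    def get(self, index):
        if self.cached_version != self.d.version:
            # Some entry pointer may be invalid: re-fetch them all.
            self.cells = [self.d.get_entry(n) for n in self.names]
            self.cached_version = self.d.version
        cell = self.cells[index]
        return None if cell is None else cell[1]


g = VersionedDict()
g["x"] = 1
fg = FastGlobals(g, ["x", "y"])
assert fg.get(0) == 1          # first access fills the cache
g["x"] = 5                     # no version bump, cache stays valid
assert fg.get(0) == 5          # reading through the cell sees the update
g["y"] = 2                     # insertion bumps the version...
assert fg.get(1) == 2          # ...so the cache is rebuilt here
```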

A script of microbenchmarks, fastglobals_test.py, is included in the
patch. My dual-core Intel T2300 1.66GHz on Ubuntu 7.04 (compiled
with -DNDEBUG -g -O3) gets the following results:

Test                     2.6a0 trunk  2.6a0 fastglobals  % time
----                     -----------  -----------------  ------
Dict ins/del (100000)          41.27              41.59    1.01
Dict get (100000)              21.37              21.35    1.00
Dict set (100000)              21.36              21.33    1.00
Local get (1000000)            15.64              15.60    1.00
Local set (1000000)            16.83              16.94    1.01
Global get (1000000)           21.09              17.04    0.81*
Global set (1000000)           34.15              22.80    0.67*
Builtin get (1000000)          30.99              17.04    0.55*
Function call (100000)         32.87              33.00    1.00
Listcomp (100000)              28.65              25.17    0.88*
Pystone 1.1 (500000)           12.46              11.68    0.94*
PYBENCH 2.0                     9.10               9.05    0.99
(* = probably significant)

All regression tests that aren't skipped pass, except test_gc. I've
probably got a refcount issue or an unbreakable cycle somewhere.

    @ntoronto ntoronto mannequin added type-feature A feature request or enhancement interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Nov 29, 2007
    @tiran
    Member

    tiran commented Nov 29, 2007

    When I run the code of test_gc.py test_function() in a shell I'm getting
    the expected result of 2 collected items:

    gc: collectable <dict 0xb78aa13c>
    gc: collectable <function 0xb78a9374>

    However the same code embedded in the test suite collects two additional
    objects:

    gc: collectable <dict 0x830061c>
    gc: collectable <function 0x82a3d54>
    gc: collectable <fastglobals 0x82a4244>
    gc: collectable <tuple 0x82a168c>

    I've used gc.set_debug(gc.DEBUG_LEAK) to get the information

    @arkanes
    Mannequin

    arkanes mannequin commented Nov 29, 2007

    In funcobject.c:PyFunction_New, the declarations of op and __name__ need
    to be moved to the top of the function to compile in MSVC, and if
    accepted the fastglobals.c file needs to be added to PCBuild.

    In the test script, the use of import * generates syntax warnings that
    make the output awkward to read, and the benchmark for dict_get is
    actually running dict_set. I'm attaching the fixed copy of the test
    script I used.

I see roughly the same speedups (MSVC 7.1, Windows XP, Intel Core2 Duo
@ 2.13GHz), but I see a 10% slowdown in the dict-insert/delete and
dict-set benchmarks, which I find very troubling.

Test             Trunk        fastglobals  Time difference
---------------  -----------  -----------  ---------------
Dict insert/del  2.002495495  2.207409125  1.102329134
Dict get         0.750253205  0.745576662  0.993766714
Dict set         0.982695921  1.114997766  1.13463152
Local get        0.533387029  0.51337118   0.96247406
Local set        0.596565774  0.614124914  1.029433703
Global get       0.935605073  0.731136584  0.78145855
Global set       1.48638532   1.03868462   0.69879903
Builtin get      1.392606367  0.735180673  0.52791707
Function call    1.938705781  1.716233004  0.885246756
List comp        1.547780105  1.188215756  0.767690289

    PyBench shows an overall slowdown - String mappings in particular,
    string/unicode predicates, and new instance creation all show
    significant slowdowns. The results are fairly verbose so I've uploaded
    them as a google docs spreadsheet at
    http://spreadsheets.google.com/ccc?key=p7g0z40g_NpvH5UpPTpr-Ag&hl=en

    I notice that you create a new PyFastGlobals object in every call to
    PyEval_EvalCode. This might explain some of the general case slowdown,
    is this really what you want to do?

    @arkanes
    Mannequin

    arkanes mannequin commented Nov 29, 2007

I may have had some old code in the build or something - I did a clean
rebuild to look at some of the slowdowns, and the fastglobals_test
benchmarks are now even better than Neil's. PyBench is still slower
overall, but it's mostly in float operations, which seems odd; I'd
discount it unless someone else can recreate it. I've updated the
spreadsheet I linked before with the updated timings, along with my
microbenchmark and pystone results. Very impressive speedup on pystone,
no doubt because of the heavy use of globals.

    @ntoronto
    Mannequin Author

    ntoronto mannequin commented Nov 29, 2007

    Christian: Thanks for that, I'll have to remember DEBUG_LEAK in the
    future. :)

    It looks like it may be acting correctly since there *is* now a
    fastglobals object that needs collecting, and also a new tuple built
    from co_names + "__builtins__". I don't know why it wouldn't collect
    those in the shell, though. The shell does create a new stack frame for
    every command (where a function's commands are all run in a single stack
    frame), but I don't know why that would matter. I'll look further into it.

    Chris: The build problems should be fixed in the next patch. Thanks for
    spending so much time on benchmarking this.

    Regarding PyEval_EvalCode creating a PyFastGlobalsObject: I'm not sure
whether it's the right thing to do, but I think it is. PyEval_EvalCode gets
    called for eval, exec, running an imported module's code, and everything
    in the interactive prompt - basically, running any code that's not in a
    function or generator. (I think that covers everything.) It seems like
    the patch has PyEval_EvalCode doing the right thing in general, but it
    may suffer a bit in benchmarks.

    @ntoronto
    Mannequin Author

    ntoronto mannequin commented Dec 1, 2007

    I've attached the latest patch, which should fix the build and compile
    problems on Windows. It also passes test_gc - I changed test_gc after
    verifying that it was working correctly. (There are now four objects to
    collect rather than two.) On my systems this passes all regression tests
    that aren't skipped.

    The patch also includes a new fastglobals_test.py, which has a few extra
    benchmarks.

    Benchmark results for both of my systems are here:

    http://spreadsheets.google.com/ccc?key=pHIJrYc_pnIVGXyUxWYFkLw&hl=en_US

    This covers pystones, pyfastglobals_test, and pybench. In the latter
    two, operations that have been most directly affected by the patch are
    in red.

Pystones is significantly faster. The other two show that the operations
that aren't supposed to be affected aren't significantly changed in
either direction (at least on these two systems). In pybench, an
average run is just a bit faster. I tested -O2 as well to verify that
the changes in performance aren't due to accidental optimization
differences.

    I added the three new parser minibenchmarks to see what would happen to
    normal, idiomatic Python code. I was surprised that it wasn't much
    better, what with all the xranges and lens and accessing a global
    variable all the time. Since this patch makes globals and builtins
    almost as fast as locals and doesn't slow anything else down, we might
conclude that global and builtin lookup has a lower impact on overall
    performance than most of us assumed. However, I'm still keen on the
    patch for two reasons (besides the fact that I wrote it :D):

    1. It gets rid of the "_len = len" anti-pattern. The obvious way to
      access globals and builtins is just as fast.

    2. It lays down a framework for speeding up class attribute access.
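For reference, the anti-pattern in point 1 looks like this (hypothetical function names; the aliasing trick trades readability for a LOAD_FAST in the loop, which this patch makes unnecessary):

```python
def count_short_aliased(words):
    # The "_len = len" anti-pattern: bind the builtin to a local so
    # the hot loop does LOAD_FAST instead of a LOAD_GLOBAL lookup.
    _len = len
    return sum(1 for w in words if _len(w) < 4)

def count_short_obvious(words):
    # With cached globals/builtins access, the obvious spelling is
    # nearly as fast, so the alias buys essentially nothing.
    return sum(1 for w in words if len(w) < 4)

assert count_short_aliased(["a", "bb", "cccc"]) == 2
assert count_short_obvious(["a", "bb", "cccc"]) == 2
```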

    What I mean by #2 is that _PyType_Lookup (looking up a method attribute)
    can be very slow when looking up inherited methods. I think I can make
    it take a near constant amount of time no matter how deep the
    inheritance hierarchy is. I think it would only take a per-class adapter
    (like PyFastGlobalsObject) that lazily caches class dict entry pointers
    to methods and clears itself if the MRO changes.

    Not today, though. :)

    @lpd
    Mannequin

    lpd mannequin commented Jan 5, 2008

    The proposed approach to speeding up lookup of inherited methods is not
    quite sound, given that class attributes can be added and removed
    dynamically. Consider:

    class A:
      def f(x): ...
    class B(A):
      pass
    class C(B):
      pass

    If C caches a pointer to A.f, the wrong thing will happen if B.f is
    defined dynamically, even though C's pointer will still point to a valid
    and up-to-date entry for f in A's dict, and C's MRO will not have changed.
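This scenario is easy to reproduce at the Python level; here, holding on to the entry in A.__dict__ stands in for the hypothetical cached entry pointer:

```python
class A:
    def f(self):
        return "A.f"

class B(A):
    pass

class C(B):
    pass

# Stand-in for C's cached pointer to the entry for "f" in A's dict.
cached_f = A.__dict__["f"]
assert C().f() == cached_f(C())      # cache and real lookup agree

# Define B.f dynamically: A's dict and C's MRO are both untouched,
# yet the correct resolution for C changes.
B.f = lambda self: "B.f"
assert C().f() == "B.f"              # real lookup: shadowed by B.f
assert cached_f(C()) == "A.f"        # cached entry: stale answer
```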

    I thought a sufficient fix would be for classes to increment not only
    their own 64-bit dict "version" but that of all classes in their MRO if
    an entry is ever added or removed in their dict. But even this is not
    sufficient. Consider:

    class A:
      def f(x): ...
    class B(A):
      pass
    class C(B):
      pass
    class D:
      pass
    class E(D,C):
      pass

    If D.f is defined dynamically, E's cached pointer to C.f will retrieve
    the wrong value. But C is not in D's MRO.
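This second scenario is also runnable directly, and shows why bumping the versions of the classes in one's own MRO is not enough:

```python
class A:
    def f(self):
        return "A.f"

class B(A):
    pass

class C(B):
    pass

class D:
    pass

class E(D, C):
    pass

# Stand-in for E's cached pointer to the entry for "f" in A's dict.
cached_f = A.__dict__["f"]
assert E().f() == cached_f(E())      # both resolve to A.f initially

# D precedes C in E's MRO, so D.f shadows the cached resolution -
# even though C is not in D's MRO, so C's version is never bumped.
D.f = lambda self: "D.f"
assert E().f() == "D.f"
assert cached_f(E()) == "A.f"        # stale
```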

    I haven't encountered this issue before in a system with multiple base
    classes (my extensive experience is with classic Smalltalk, and
    PostScript), so I don't have an off-the-cuff solution.

    @lpd
    Mannequin

    lpd mannequin commented Jan 5, 2008

    Sorry, I wrote "E's cached pointer to C.f", which of course should be
    "E's cached pointer to A.f".

    @nnorwitz
    Mannequin

    nnorwitz mannequin commented Feb 25, 2008

    I've gone over this at a high-level and looked for various outstanding
    issues. Below are my comments. I didn't delve into this in detail.
    There are a lot of questions too.

    I think this is very interesting and hope that we can get it working,
    100% robust, and verify the improved perf.

    In Include/fastglobals.h, num_entries says it is len(names). But
    there is also an entries field. This is confusing. The name should
    be num_names if it is the length of names.

    isbuiltin is an array of integers that hold boolean values? So most
    of the memory is wasted? How many entries do you expect in isbuiltin?
    This field should probably be packed using each bit. Its type should then
    be unsigned. If there were a small number of PyFastGlobalsObjects it
    wouldn't be a big deal. But from the comment it seems there will be
    one of these objects for each function.

    Did you measure the difference in memory size? For example the size
    at startup with and without the patch. Did you measure the startup
    time? How big is python at the end of a regression test run with and
    without this patch?

    Is there ever an access to the old globals in the old code? I wonder
    if it's worthwhile to change the names in ceval.c, etc.

    Does PyFastGlobals_GET_ITEM_INPLACE really need to be a macro? It's a
    pain to debug macros. If it were a static function in the header
    file, it would almost definitely get inlined.
    PyFastGlobals_ENTRY(op, index) should be captured in a local variable.

    I'm thinking that entryversion should be unsigned. Signed integer
    overflow is not defined (we had some problems recently with this).
    PyFastGlobals_VERSION_SET(op) can overflow.

    I like the removal of one pointer from the frame object.

    How does this interact with multiple interpreters? I see the static
    of __builtins__ in fastglobal.c. See PyInterpreterState in
    Include/pystate.h. (I see that __builtins__ is a string, so this
    access should be fine, but the question is still valid. I just want
    to make sure there aren't any bad interactions.)

    Regarding:
    /* XXX A clever adversary could prevent this from terminating */

    How about the loop is limited to 10 runs or something reasonable?
    That way there cannot be a DoS attack. You can raise a RuntimeError
    if the max iterations is reached.
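The suggested bound might look like this (a Python sketch with hypothetical names; the actual loop lives in the C patch):

```python
MAX_UPDATE_ATTEMPTS = 10   # "10 runs or something reasonable"

def update_until_stable(refresh, is_stable):
    """Retry a cache-refresh step until the dict stops changing, but
    cap the attempts: a hostile __eq__/__hash__ can mutate the dict
    during lookup, so an unbounded loop need never terminate."""
    for _ in range(MAX_UPDATE_ATTEMPTS):
        refresh()
        if is_stable():
            return
    raise RuntimeError("cache failed to stabilize; dict kept mutating")

# A refresh that stabilizes after three rounds succeeds...
state = {"rounds": 0}
update_until_stable(
    lambda: state.__setitem__("rounds", state["rounds"] + 1),
    lambda: state["rounds"] >= 3,
)
assert state["rounds"] == 3

# ...while an adversarial one that never stabilizes is cut off.
try:
    update_until_stable(lambda: None, lambda: False)
except RuntimeError:
    pass
else:
    raise AssertionError("expected RuntimeError")
```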

    Since PyFastGlobals_Update() is public, you should not assert that the
    op is a fast globals, but rather check and set
    PyErr_BadInternalCall(). That's true of all public APIs. Internal
    APIs are fine to have asserts.

    In fg_check_builtins() you may be able to speed this up by checking
    the most common paths first. I know this code is taken from
    frameobject. See
    http://coverage.livinglogic.de/Objects/frameobject.c.html . Hmmm,
    that isn't showing any data currently. I'll ping Walter and ask him
    if he can get that working again.

    I don't understand this comment in PyFastGlobals_CheckBuiltins:
    /* Make sure fg->builtins_entry is updated first */

    It says what the code does, but I don't understand *why* it does it.

    Instead of:
    		PyErr_NoMemory();
    		Py_DECREF(newnames);
    		return NULL;
    
    You can do:
    		Py_DECREF(newnames);
    		return PyErr_NoMemory();

    In PyFastGlobals_New(), you should verify that num_entries doesn't
    overflow before creating a new tuple by increasing the size by 1.

    In this code:
    	isbuiltin = PyMem_NEW(int, num_entries);
    	if (isbuiltin == NULL) {
    		PyErr_NoMemory();
    		Py_DECREF(newnames);
    		PyMem_FREE(isbuiltin);
    		return NULL;
    	}

    The free of isbuiltin should be entries.

If these arrays will be small (< 256 bytes), it would be better
(faster) to use pymalloc rather than the PyMem interface.

    PyDict_GetEntry should be prefixed with an underscore. I don't think
    this should become part of the public API. How about passing an int
    pointer to GetEntry that will be set if the key is found or clear if not
    found?

    This needs to be addressed:
    + /* This is a bad idea now: if gc breaks a cycle by clearing
    + f_fastglobals, when a generator is finally collected it does this
    + sequence of calls: gen_del->gen_close->gen_send_ex->
    + PyEval_EvalFrameEx. But PyEval_EvalFrameEx references
+ f_fastglobals->globals, which will be cleared by then, resulting
    + in a segfault.
    + ??? Is there a way to preserve this traversal? */
    + /* Py_VISIT(f->f_fastglobals); */

    @gvanrossum
    Member

    Making sure I look at this at least once carefully before releasing.

    @gvanrossum gvanrossum self-assigned this Mar 18, 2008
    @benjaminp
    Contributor

    Ping.

    @gvanrossum
    Member

    Won't have time myself.

    @gvanrossum gvanrossum removed their assignment Aug 5, 2008
    @collinwinter collinwinter mannequin added performance Performance or resource usage and removed type-feature A feature request or enhancement labels Jan 23, 2009
    @pitrou
    Member

    pitrou commented Mar 17, 2010

    There probably isn't any point in such a complication, until perhaps we have a JIT that magnifies the improvement.
    (but I assume the JIT will be able to treat references to globals and builtins as "frozen" using its own logic)

    @collinwinter
    Mannequin

    collinwinter mannequin commented Mar 17, 2010

    Antoine: yes, this optimization is already implemented in the Unladen JIT.

    @pitrou
    Member

    pitrou commented Mar 17, 2010

    Thanks Collin, recommend closing then.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Sep 18, 2010

    Closing as recommended in msg101239.

    @BreamoreBoy BreamoreBoy mannequin closed this as completed Sep 18, 2010
    @larryhastings
    Contributor

    It sort of looks like this was closed because we assumed we were moving to Unladen Swallow. We're not. Should this be reopened?

    @pitrou
    Member

    pitrou commented Feb 12, 2013

    See bpo-10401.
Quoting myself: “Here is the Nth patch for a globals/builtins cache. As with other caches of the same kind, it shows very small to no gain on non-micro benchmarks, showing that contrary to popular belief, globals/builtins lookups are not a major roadblock in today's Python performance.”

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022