Title: Cached globals+builtins lookup optimization
Type: performance Stage: test needed
Components: Interpreter Core Versions: Python 3.1, Python 2.7
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ag6502, arigo, collinwinter, loewis, nnorwitz, rhettinger
Priority: normal Keywords: patch

Created on 2006-12-15 00:44 by ag6502, last changed 2012-07-07 13:46 by ag6502. This issue is now closed.

File name Uploaded Description Edit
cached_lookups_12.patch ag6502, 2006-12-30 12:47 Dict lookup caching optimization for LOAD_GLOBAL and LOAD_ATTR on modules review
Messages (11)
msg51536 - (view) Author: Andrea Griffini (ag6502) * Date: 2006-12-15 00:44
This patch implements a speed optimization: it introduces a timestamp on dictionaries and caches the result of lookups of constant names, so that a lookup doesn't need to be repeated if the dictionary didn't change.

Currently the patch implements the cached lookup only for the LOAD_GLOBAL opcode and stores the cache as two extra members in the code object. I'm going to investigate whether LOAD_ATTR in the special case of a module is worth caching too.

A cache entry for LOAD_GLOBAL is defined by two timestamps (Py_ssize_t) and a *borrowed* pointer to a PyObject. The use of a borrowed pointer is safe because the value is used only if the dictionaries didn't change, so the result of the lookup must still be alive: the reference in the dictionary that produced the cached value is still in place.

The assumptions are:
- a LOAD_GLOBAL opcode always looks for the same
  symbol (i.e. co_names never changes)
- the dictionary being searched (globals and
  builtins) are mostly constant
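The idea can be sketched in pure Python (hypothetical class and method names; the real patch implements this in C inside the LOAD_GLOBAL dispatch, with the cache stored in the code object):

```python
class VersionedDict(dict):
    """Dict that bumps a counter on every mutation (stands in for the
    patch's per-dictionary timestamp)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.version = 1

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.version += 1

    def __delitem__(self, key):
        super().__delitem__(key)
        self.version += 1

class CachedGlobalLookup:
    """One cache entry: a name plus the two timestamps described above."""
    def __init__(self, name):
        self.name = name
        self.globals_version = 0    # 0 means "cache empty"
        self.builtins_version = 0
        self.value = None

    def load(self, globals_d, builtins_d):
        # Cache hit: neither dictionary changed since the last lookup,
        # so the cached (borrowed, in the C version) value is still valid.
        if (self.globals_version == globals_d.version and
                self.builtins_version == builtins_d.version):
            return self.value
        # Cache miss: do the usual globals-then-builtins lookup and
        # remember both timestamps.
        try:
            value = globals_d[self.name]
        except KeyError:
            value = builtins_d[self.name]
        self.globals_version = globals_d.version
        self.builtins_version = builtins_d.version
        self.value = value
        return value
```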

On my system I got a speedup of 8%-10% on a real program (a tcp/ip server that handles an in-memory database of python objects) on the checkpoint load + rollforward steps (a few millions LOAD_GLOBAL operations in total) and the varying percentage depends on which compiler options are used to build python.

On another little test program that calls a function

import math

def f(x):
    return math.sin(x) + math.cos(x) + abs(x)

one million times the speedup with the default makefile
is about 7%.
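The micro-benchmark methodology can be reproduced with timeit (this only shows how to measure; the 7% figure comes from comparing a patched against an unpatched build):

```python
import math
import timeit

def f(x):
    # Three global lookups per call (math twice, abs once) -- exactly
    # the operations the cached-lookup patch is meant to speed up.
    return math.sin(x) + math.cos(x) + abs(x)

elapsed = timeit.timeit(lambda: f(1.0), number=1_000_000)
print("%.3f s for 1e6 calls" % elapsed)
```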

To enable the patch the makefile should define the CACHED_LOOKUPS symbol.

The change to dictobject also exports the timestamp at the Python level; the other changes are invisible at the Python level. I'm not at all sure exporting dict.timestamp() is a good idea, however...

Note that after applying the patch you may see an error in userdict if you simply run "make" (there's probably a dependency problem somewhere).
After "make clean" + "make" the test suite runs fine on my system.

I'm interested in hearing whether other platforms also see a real performance improvement.

msg51537 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-12-16 11:01
How does that deal with timestamp wraparound?
msg51538 - (view) Author: Andrea Griffini (ag6502) * Date: 2006-12-16 17:13
I am experimenting with long long timestamps (based on "unsigned long long" if available, or explicitly on two "unsigned long"s otherwise). An alternative to slowing down lookups by using 64 bits on 32-bit computers is to use a single 32-bit counter and trigger a "cache flush" when the timestamp rolls over, visiting all live dictionaries and code objects.
This would be IMO a *very* infrequent operation (for example on my P4/2.4GHz just doing nothing else but dict updates getting to 2**32 updates would require 14 minutes).
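The proposed flush-on-wraparound can be sketched like this (hypothetical names and registration lists; the patch would do the sweep in C over all live dicts and code objects):

```python
MAX_TIMESTAMP = 2**32 - 1

class TimestampClock:
    """Global 32-bit mutation counter with a flush on wraparound."""
    def __init__(self):
        self.value = 1
        self.dicts = []     # dict-like objects carrying a timestamp
        self.caches = []    # code-object lookup caches to invalidate

    def tick(self):
        if self.value == MAX_TIMESTAMP:
            # Rare: reached only after ~2**32 dict mutations.
            self.flush()
        else:
            self.value += 1
        return self.value

    def flush(self):
        # Reset the clock and invalidate everything holding an old
        # timestamp, so no stale comparison can ever match.
        self.value = 1
        for d in self.dicts:
            d.timestamp = 1
        for cache in self.caches:
            cache.clear()
```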
msg51539 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-12-16 22:57
I think restricting yourself to 32-bit counters is fine (it's also an issue of memory overhead as you have one of these per access to a global, and per dictionary); you have to deal with the overflow, though. I suggest to use an unsigned value, e.g. size_t.

Overflowing should be dealt with through visiting all objects, as you say. It could either happen right when it overflows, or it could get integrated into GC, and only reset when the GC runs. You'd then need to account for those periods when it already has overflowed, but GC hasn't run yet; in these periods, counting should be "on hold".
msg51540 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-12-23 03:33
I only reviewed the code in the patch.  I have not thought conceptually about this, whether it's a good idea, how to break it, etc.  I also did not look to verify all the necessary dict methods were updated to add TOUCHED.

Have you profiled the slow down to dicts in normal cases?  Things like:

   d = {}
   d[0] = 0

etc?  All dict operations are going to be a tiny bit slower due to an additional assignment.  It's probably not measurable, but you should still verify this.
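This concern can be checked directly with timeit; a minimal measurement of the plain dict-store path (this demonstrates the methodology only, run against patched and unpatched builds to compare):

```python
import timeit

# Time the bare dict store, the operation that would gain one extra
# assignment (the timestamp bump) under the patch.
n = 10_000_000
store = timeit.timeit("d[0] = 0", setup="d = {}", number=n)
print("%.1f ns per store" % (store / n * 1e9))
```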

What is LOAD_FAST_HACK in ceval.c?  It doesn't seem necessary for this patch.
Please revert any whitespace/formatting changes that aren't part of the patch (there are a couple in dictobject.c and codeobject.c).

It seems like a little structure would be better than parallel arrays.  This would reduce some calculations and reduce the number of mallocs.  Also there would only be a single assignment in the eval loop and the names would probably wind up being more clear.

Would it be better to use PyObject_Malloc rather than PyMem_Malloc?  It should be faster for sizes <= 256.  (Despite the name, it doesn't have to handle PyObjects.)

Why not use a size_t instead of Py_ssize_t for the timestamp?  This is conceptually unsigned, no?

Use the macro version PyTuple_GET_SIZE(names); in codeobject.c.  It will be a little faster.

If the cache allocs fail when allocating a code object, ceval will crash due to not checking for NULL.

As for the timestamp, there should probably be a macro to access the field for a dict.  This macro should be used in ceval.c/codeobject.c for getting/setting it.

When returning the timestamp in dictobject.c you shouldn't cast it to a long.  Use the appropriate method such as PyInt_FromSsize_t.
msg51541 - (view) Author: Andrea Griffini (ag6502) * Date: 2006-12-23 22:52
I was already changing the code in the direction you're pointing out. I attached the current version (still not complete!) of the patch; more specifically:
- Cached lookups are now stored in an array of structures instead of in two separate arrays
- Macros are used for handling of cache entries
- The timestamp is now unsigned and there is optional support for 64-bit timestamps
- Support for stalling timestamping at 32 bit wrap point
- Removed the LOAD_FAST_HACK (was just an experiment with oprofile; was left there by mistake)
- Fixed memory allocation failures handling in code object creation
- Removed exporting of timestamp value at the python level
- Used as cache size the largest LOAD_ATTR/LOAD_GLOBAL entry found

Still missing are the sorting of co_names to pack cache slots and
the cache reset at gc time if the dictionary timestamp is stalled.

The cached lookup can be used at two levels:
-DCACHED_LOOKUPS enables caching of LOAD_GLOBAL results
-DCACHED_MODULE_LOOKUPS (in addition to -DCACHED_LOOKUPS) also caches the result of a lookup in a module.

Speed results are sometimes very interesting and sometimes strange; I found measurable differences just by moving statements between logically equivalent places, after checking what gcc was doing with register allocation (probably those places were not equivalent once you take into account aliasing that will never actually happen, but that a C compiler must assume to be possible). I have no idea whether the speedups I've measured are better or worse on other processors/architectures.

File Added: cached_lookups_10.patch
msg51542 - (view) Author: Andrea Griffini (ag6502) * Date: 2006-12-30 12:47
I added the two missing parts, and the patch should now be functionally complete. There is now a co_names reordering pass during compilation that moves all names used in LOAD_GLOBAL/LOAD_ATTR to the beginning of the tuple; this allows the lookup cache to be allocated at only the needed size. The reordering is done *before* actually assembling basic blocks, to avoid dealing with changes in opcode size after reordering.
Also, the timestamp is now reset when rolling over to 0, and a sweep is done over all dictionaries and all code objects to clear cached lookups. I didn't implement it in the gc because: 1) I'm not sure all code objects and dictionaries are listed there (how do I find code objects?), 2) to do the traversal the code must live in gcmodule, or a richer interface must be exported, 3) what if garbage collection is disabled?
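The reordering pass can be sketched in Python (a hypothetical simplification: instructions are (opcode, arg) pairs and every arg here indexes co_names; the patch does this in C on the compiler's intermediate form):

```python
LOAD_GLOBAL, LOAD_ATTR, STORE_NAME = "LOAD_GLOBAL", "LOAD_ATTR", "STORE_NAME"
CACHED_OPS = {LOAD_GLOBAL, LOAD_ATTR}

def reorder_names(names, instructions):
    """Move names used by cached opcodes to the front of co_names.

    Returns the reordered name tuple, the remapped instructions, and
    the number of cache slots needed (= count of cached-name indices).
    """
    # Indices referenced by cached opcodes, in order of first use.
    cached = []
    for op, arg in instructions:
        if op in CACHED_OPS and arg not in cached:
            cached.append(arg)
    rest = [i for i in range(len(names)) if i not in cached]
    order = cached + rest
    remap = {old: new for new, old in enumerate(order)}
    new_names = tuple(names[i] for i in order)
    new_instructions = [(op, remap[arg]) for op, arg in instructions]
    return new_names, new_instructions, len(cached)
```

With this layout the cache array in the code object only needs len(cached) slots instead of one per entry of co_names.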

The extra cost for dictionary creation is normally (i.e. when a dictionary is recovered from the free pool) just setting the timestamp to 1; the extra cost for a dictionary update is higher, but within the "noise" of timeit.Timer on my system. The memory cost is three more 32-bit words for every dict, and 4 + n*3 more words for every code object, where n is the number of elements of co_names that are used in LOAD_GLOBAL/LOAD_ATTR opcodes. The patch as it stands doesn't need to change the marshalling of compiled code (loaded code objects will simply be "unoptimized", so n will be 1+max(oparg) over all arguments of LOAD_GLOBAL/LOAD_ATTR, wasting some cache slots).

The patch itself doesn't do anything unless the symbol CACHED_LOOKUPS is defined (which caches LOAD_GLOBALs) or both CACHED_LOOKUPS and CACHED_MODULE_LOOKUPS are defined (which caches both LOAD_GLOBALs and LOAD_ATTRs when the object being searched is a module). This makes it possible to ship the optimization as an optional build option (does this make sense?).
File Added: cached_lookups_12.patch
msg51543 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2007-02-25 17:46
In the patch, the initial and the reset value of
dict_timestamp_value are 0xFFFFFFFFU - 10000.
I guess that it's a value you used while debugging
the clearing code, and that it should now be 1 instead?
msg51544 - (view) Author: Andrea Griffini (ag6502) * Date: 2007-02-26 07:08
There are indeed two symbols used for debugging; one is CACHED_LOOKUP_STATS (Python/ceval.c) that outputs the hit/miss ratios for the cache when python exits and the other is LOOKUP_DEBUG_TIMESTAMP_LIMIT (Objects/dictobject.c) that forces the global counter to wrap around every 10000 dict changes (instead of every 2**32 dict changes).

Actually, on this patch I'm getting confusing results on my machine (for example, better results with statistics ON, which makes no sense to me at all). I got no results from anyone else, so I can't say whether this patch actually speeds up or slows down Python. On my machine (trunk Python) pystone takes 1.8 seconds without stats, 1.6 seconds with stats (!), and 1.7 seconds without the patch applied.

Does anyone have the time to check this on their setup/machines? So far I'm quite confused...
msg51545 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-02-26 15:19
What sort of results do you get with pybench (it's in Tools)?
msg164846 - (view) Author: Andrea Griffini (ag6502) * Date: 2012-07-07 13:46
Closing as it was a partial implementation of a bad idea with questionable gains.
Date User Action Args
2012-07-07 13:46:19ag6502setstatus: open -> closed
resolution: not a bug
messages: + msg164846
2010-07-12 14:48:12BreamoreBoysetnosy: + rhettinger
2009-03-30 18:02:07ajaksu2setnosy: + collinwinter
versions: + Python 3.1, Python 2.7, - Python 2.6

type: performance
stage: test needed
2006-12-15 00:44:46ag6502create