classification
Title: Denser dicts and linear probing
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.2
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Jim.Jewett, inada.naoki, jcon, mark.dickinson, neologix, pitrou, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2010-11-13 14:48 by pitrou, last changed 2017-03-09 01:17 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
dictopts.patch pitrou, 2010-11-13 14:48 review
dcbench-py3k.py pitrou, 2010-11-13 14:53
Messages (14)
msg121139 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-13 14:48
This is a patch experiment which does two things:
- make dicts denser by making the resize factor 2 instead of 4 for small dicts
- improve cache locality on collisions by using linear probing

It should be noted that these two changes are not independent. Improving cache locality on collisions makes probing cheaper, and in turn should allow to make small dicts denser.

Linear probing is motivated by the fact that collisions can happen frequently.  The comments in dictobject.c seem a bit mistaken:

“If we *usually* find the key we're looking for on the first try (and, it turns out, we usually do -- the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.”

According to http://www.cse.ust.hk/~yike/pods10-hashing.pdf, however, things are not so rosy. The average number of probes for successful lookups, depending on the load factor "alpha", is given by:

>>> c = lambda alpha: 0.5 * (1 + 1/(1-alpha))

while the average number of probes for unsuccessful lookups is:

>>> cp = lambda alpha: 0.5 * (1 + 1/(1-alpha)**2)

(note: this is with a perfectly random hash function; I intuitively assume an imperfect random function gives higher figures)

For a load factor of 2/3, we then get:

>>> c(2/3)
1.9999999999999998
>>> cp(2/3)
4.999999999999999

Since the current non-linear probing schemes guarantees that each probing will access a different cache line, the cache locality of a lookup becomes very poor.

The problem with linear probing, as noted in the comments, is that it degrades performance quite a lot when the hashing function clusters results. The solution I'm proposing is to apply an *initial* perturbation, by multiplying the hash() with a prime number. Multiplication is very fast on modern CPUs, so this doesn't adversely affect performance.
msg121140 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-13 14:53
Here is a benchmark adapted from another bug entry (I merely adapted the dict sizes in order to better exhibit the performance degradation when the CPU cache becomes too small to hold the whole dict).

Results without the patch:

   10000 words (    9092 keys), 2928369 inserts/s, 13911456 lookups/s, 86 bytes/key (0.8MB)
   20000 words (   17699 keys), 3290037 inserts/s, 12746707 lookups/s, 44 bytes/key (0.8MB)
   40000 words (   34490 keys), 2620007 inserts/s, 7723605 lookups/s, 91 bytes/key (3.0MB)
   80000 words (   67148 keys), 2698863 inserts/s, 6573500 lookups/s, 46 bytes/key (3.0MB)
  160000 words (  130897 keys), 2401067 inserts/s, 4886971 lookups/s, 48 bytes/key (6.0MB)
  320000 words (  254233 keys), 2077558 inserts/s, 5061763 lookups/s, 49 bytes/key (12.0MB)
  640000 words (  493191 keys), 1923967 inserts/s, 4490697 lookups/s, 51 bytes/key (24.0MB)
 1280000 words (  956820 keys), 1792729 inserts/s, 4353711 lookups/s, 52 bytes/key (48.0MB)

Results with the patch:

   10000 words (    9092 keys), 3324590 inserts/s, 13911456 lookups/s, 43 bytes/key (0.4MB)
   20000 words (   17699 keys), 3243603 inserts/s, 13202090 lookups/s, 44 bytes/key (0.8MB)
   40000 words (   34490 keys), 2858372 inserts/s, 10686124 lookups/s, 45 bytes/key (1.5MB)
   80000 words (   67148 keys), 2585146 inserts/s, 6917441 lookups/s, 46 bytes/key (3.0MB)
  160000 words (  130897 keys), 2395923 inserts/s, 6455817 lookups/s, 48 bytes/key (6.0MB)
  320000 words (  254233 keys), 2247141 inserts/s, 5529826 lookups/s, 49 bytes/key (12.0MB)
  640000 words (  493191 keys), 2064675 inserts/s, 5073732 lookups/s, 51 bytes/key (24.0MB)
 1280000 words (  956820 keys), 1997615 inserts/s, 4760878 lookups/s, 52 bytes/key (48.0MB)


Lookups become 10% faster when the dict is bigger than the cache, and even inserts seem to benefit a bit.
(not to mention that small dicts take almost half the memory)
msg121155 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-13 23:02
My previous experiments along these lines showed it was a dead-end.  The number of probes was the most important factor and beat-out any effort to improve cache utilization from increased density.  

Doing extra work (more probes) in order to improve cache effects is very difficult because most real programs have an uneven access pattern so that the most frequently accesses items are usually already in cache.  So, the attempted improvement only helps the less frequently accessed items and isn't worth the extra number of probes.

Another result from earlier experiments is that benchmarking the experiment is laden with pitfalls.  Tight timing loops don't mirror real world programs, nor do access patterns with uniform random distributions.
msg121156 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-13 23:12
> My previous experiments along these lines showed it was a dead-end.
> The number of probes was the most important factor and beat-out any
> effort to improve cache utilization from increased density.  

Can you describe your experiments? What workloads or benchmarks did you
use?

Do note that there are several levels of caches in modern CPUs. L1 is
very fast (latency is 3 or 4 cycles) but rather small (32 or 64KB). L2,
depending on the CPU, has a latency between 10 and 20+ cycles and can be
256KB to 1MB large. L3, when present, is quite larger but also quite
slower (latency sometimes up to 50 cycles).
So, even if access patterns are uneven, it is probably rare to have all
frequently accessed data in L1 (especially with Python since objects are
big).

> Another result from earlier experiments is that benchmarking the
> experiment is laden with pitfalls.  Tight timing loops don't mirror
> real world programs, nor do access patterns with uniform random
> distributions.

I can certainly understand that; can you suggest workloads approaching
"real world programs"?
msg121157 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-13 23:14
FWIW, one way to make a dict denser without increasing the number of probes is to use Brent's Variation of Algorithm D in Knuth.  That optimizes the insertion order to minimize the number of collisions and lets you pack well over two-thirds full without degradation.
msg121158 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-13 23:24
> FWIW, one way to make a dict denser without increasing the number of
> probes is to use Brent's Variation of Algorithm D in Knuth.  That
> optimizes the insertion order to minimize the number of collisions and
> lets you pack well over two-thirds full without degradation.

He, you suggested that several years ago on python-dev and I supplied
the code. Also, IIRC, it didn't bring any improvement - although I don't
remember which benchmarks, if any, were run.

But the experiment here is mostly to decrease the (direct and indirect)
cost of collisions by improving temporal locality of lookups.
msg121162 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-11-13 23:38
See Objects/dictnotes.txt for some of the results.
I spent about full month trying to optimize dict
performance either by tuning parameters or using
different algorithms.

There were a couple wins that were not implemented.
1) Allowing users to control insertion order or
at least specify which keys are frequently accessed
so that we could assure a first-time hit.
2) Allowing users to pre-size a dictionary so that
resizes wouldn't be needed or an so they could control
density.
Guido didn't want to expose these controls.

The PyPy guys published a paper on their results with
alternative dict implementations and specialized dicts.
You might want to look at that.  IIRC, they found only
minor wins.
msg121163 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-11-13 23:40
> See Objects/dictnotes.txt for some of the results.
> I spent about full month trying to optimize dict
> performance either by tuning parameters or using
> different algorithms.

Well, I've seen those results. I'm asking about which workloads or
benchmarks were used so that I can try to reproduce these experiments.
msg149147 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-12-10 11:28
This paper might be of interest:
http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf

Basically, it concludes that most of the time, there's no speedup to be gained from the increased cached locality incurred by linear probing when the input sample differ from a uniform distribution. However, figure 1 shows a clear win with a uniform distribution, and figure 2 shows that even with a non-uniform distribution, there's still a gain as the load factor increases.
So if experiments show measurable improvements, this might be an interesting idea.
However, the problem with linear probing is that the performance will be much more dependent on the input distribution that with quadratic probing/double hashing, so this will definitely degrade with some workloads.

> The problem with linear probing, as noted in the comments, is that it 
> degrades performance quite a lot when the hashing function clusters 
> results. The solution I'm proposing is to apply an *initial* 
> perturbation, by multiplying the hash() with a prime number. 

I fail to see how this would avoid primary clustering: IIUC, you replace the hash function h(k) by h'(k), but you still use linear probing: an empty slot preceded by i filled slots has a probablility (i+1)/N of getting filled next (N being the total number of slots).
msg157916 - (view) Author: Jim Jewett (Jim.Jewett) (Python triager) Date: 2012-04-09 22:38
FWIW, doing a linear probe only within a cache line (changing the 1's bit) before applying perturb might also be useful -- and the results may change if the size of a dictentry were reduced.  (Mark Shannon's now-integrated patch doesn't actually do that for the keys, but it would be doable.)
msg280010 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-03 19:23
What is the status of this issue in the light of compact dict implementation?
msg280036 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2016-11-04 03:16
> - make dicts denser by making the resize factor 2 instead of 4 for small dicts

This had been implemented already when I start compact dict.

> - improve cache locality on collisions by using linear probing

set does this. But dict doesn't do it for now.

In case of compact dict, liner probing only affects index table (dk_indices).
dk_indices is small (64byte when dk_size==64). One or two cache line
can contain whole dk_indices of small dicts.
So performance benefit of linear probing will be smaller than previous
dict implementation.

I'll re-evaluate it.
msg289224 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-08 09:50
I think this issue is outdated. Liner probing doesn't make much sense for current dict implementation.
msg289264 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-09 01:17
Agreed.  This doesn't make sense for the compact dict implementation.
History
Date User Action Args
2017-03-09 01:17:53rhettingersetstatus: pending -> closed
resolution: out of date
messages: + msg289264

stage: resolved
2017-03-08 09:50:01serhiy.storchakasetstatus: open -> pending

messages: + msg289224
2016-11-04 03:16:30inada.naokisetmessages: + msg280036
2016-11-03 19:23:01serhiy.storchakasetnosy: + inada.naoki
messages: + msg280010
2012-04-09 22:38:09Jim.Jewettsetnosy: + Jim.Jewett
messages: + msg157916
2012-04-07 17:17:03serhiy.storchakasetnosy: + serhiy.storchaka
2011-12-10 11:29:00neologixsetnosy: + neologix
messages: + msg149147
2011-08-24 06:34:37jconsetnosy: + jcon
2010-11-13 23:40:39pitrousetmessages: + msg121163
2010-11-13 23:38:23rhettingersetmessages: + msg121162
2010-11-13 23:24:06pitrousetmessages: + msg121158
2010-11-13 23:14:17rhettingersetmessages: + msg121157
2010-11-13 23:12:16pitrousetmessages: + msg121156
2010-11-13 23:02:33rhettingersetmessages: + msg121155
2010-11-13 14:53:05pitrousetfiles: + dcbench-py3k.py

messages: + msg121140
2010-11-13 14:48:06pitroucreate