
Author pitrou
Recipients mark.dickinson, pitrou, rhettinger, tim.peters
Date 2010-11-13.14:48:06
Message-id <1289659692.61.0.467153013048.issue10408@psf.upfronthosting.co.za>
Content
This patch is an experiment that does two things:
- make dicts denser by using a resize factor of 2 instead of 4 for small dicts
- improve cache locality on collisions by using linear probing

It should be noted that these two changes are not independent: improving cache locality on collisions makes probing cheaper, which in turn should allow small dicts to be made denser.
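
To make the density claim concrete, here is a rough sketch of mine (not the patch itself): assuming the new table size is the smallest power of two above factor * used, which is roughly what dictresize() does, it compares the load factor right after a resize for factors 4 and 2:

def load_after_resize(used, factor, minsize=8):
    # minsize mirrors PyDict_MINSIZE; grow to the smallest power of two
    # strictly above factor * used (an approximation of dictresize())
    size = minsize
    while size <= factor * used:
        size <<= 1
    return used / size

>>> [load_after_resize(n, 4) for n in (5, 20, 100)]
[0.15625, 0.15625, 0.1953125]
>>> [load_after_resize(n, 2) for n in (5, 20, 100)]
[0.3125, 0.3125, 0.390625]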

Linear probing is motivated by the fact that collisions can happen frequently.  The comments in dictobject.c seem a bit mistaken:

“If we *usually* find the key we're looking for on the first try (and, it turns out, we usually do -- the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.”

According to http://www.cse.ust.hk/~yike/pods10-hashing.pdf, however, things are not so rosy. The average number of probes for a successful lookup, as a function of the load factor "alpha", is given by:

>>> c = lambda alpha: 0.5 * (1 + 1/(1-alpha))

while the average number of probes for unsuccessful lookups is:

>>> cp = lambda alpha: 0.5 * (1 + 1/(1-alpha)**2)

(note: this assumes a perfectly random hash function; I would intuitively expect an imperfect hash function to give higher figures)

For a load factor of 2/3, we then get:

>>> c(2/3)
1.9999999999999998
>>> cp(2/3)
4.999999999999999
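
These analytic figures can be sanity-checked with a quick simulation. The sketch below is my own illustration, not part of the patch; the table size, key range and trial count are arbitrary choices. It fills a table to a 2/3 load factor with random keys, resolves collisions by linear probing, and measures average probe counts, which should come out close to 2 and 5:

import random

def simulate(size=1 << 16, load=2/3, trials=10000):
    table = [None] * size
    keys = random.sample(range(1 << 30), int(size * load))
    for k in keys:                        # insert with plain linear probing
        i = k % size
        while table[i] is not None:
            i = (i + 1) % size
        table[i] = k

    def probes_hit(k):                    # probe count for a successful lookup
        i, p = k % size, 1
        while table[i] != k:
            i, p = (i + 1) % size, p + 1
        return p

    def probes_miss(h):                   # probe count for an unsuccessful lookup
        i, p = h % size, 1
        while table[i] is not None:
            i, p = (i + 1) % size, p + 1
        return p

    hits = sum(probes_hit(k) for k in random.sample(keys, trials)) / trials
    misses = sum(probes_miss(random.getrandbits(30)) for _ in range(trials)) / trials
    return hits, misses

>>> simulate()   # roughly (2.0, 5.0), varying a little from run to run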

Since the current non-linear probing scheme practically guarantees that each successive probe accesses a different cache line, the cache locality of a lookup becomes very poor.
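
To illustrate the point, here is a small sketch of mine (not from the patch) that roughly mirrors the lookdict() recurrence -- slot = i & mask, then i = i*5 + perturb + 1 with perturb >>= PERTURB_SHIFT -- and compares the slots it visits with plain linear probing:

PERTURB_SHIFT = 5    # same value as in dictobject.c

def current_probe_slots(h, mask, n=5):
    # approximate the existing open-addressing probe sequence
    i, perturb = h & mask, h
    slots = [i]
    for _ in range(n - 1):
        i = i * 5 + perturb + 1
        perturb >>= PERTURB_SHIFT
        slots.append(i & mask)
    return slots

def linear_probe_slots(h, mask, n=5):
    # consecutive slots: successive probes stay on the same few cache lines
    return [(h + j) & mask for j in range(n)]

With, say, a 1024-slot table, the first sequence jumps all over the table (one likely cache miss per probe), while the second walks through adjacent slots.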

The problem with linear probing, as noted in the comments, is that it degrades performance quite a lot when the hash function clusters its results. The solution I'm proposing is to apply an *initial* perturbation, by multiplying the hash() by a prime number. Multiplication is very fast on modern CPUs, so this doesn't adversely affect performance.
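
As an illustration only, here is a minimal sketch of the idea as described above; it is not the actual patch, and the prime constant and (key, value) table layout are placeholders of mine:

PRIME = 2654435761   # placeholder prime; the patch may well use a different constant

def lookup_slot(table, key, mask):
    # initial perturbation: a single multiply scrambles the hash before masking
    i = (hash(key) * PRIME) & mask
    while True:
        entry = table[i]                      # slots hold (key, value) pairs or None
        if entry is None or entry[0] == key:
            return i                          # free slot, or the key we are looking for
        i = (i + 1) & mask                    # linear probing on collision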