
Author pitrou
Recipients mark.dickinson, pitrou, rhettinger, tim.peters
Date 2010-11-13.14:48:06
Message-id <1289659692.61.0.467153013048.issue10408@psf.upfronthosting.co.za>
Content
This patch is an experiment that does two things:
- make dicts denser by using a resize factor of 2 instead of 4 for small dicts
- improve cache locality on collisions by using linear probing

It should be noted that these two changes are not independent: improving cache locality on collisions makes probing cheaper, which in turn should allow small dicts to be made denser.
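
To make the density claim concrete, here is a rough sketch of mine (not the patch itself): assuming the new table size is the smallest power of two above factor * used, which is roughly what dictresize() does, it compares the load factor right after a resize for factors 4 and 2:

def load_after_resize(used, factor, minsize=8):
    # minsize mirrors PyDict_MINSIZE; grow to the smallest power of two
    # strictly above factor * used (an approximation of dictresize())
    size = minsize
    while size <= factor * used:
        size <<= 1
    return used / size

>>> [load_after_resize(n, 4) for n in (5, 20, 100)]
[0.15625, 0.15625, 0.1953125]
>>> [load_after_resize(n, 2) for n in (5, 20, 100)]
[0.3125, 0.3125, 0.390625]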

Linear probing is motivated by the fact that collisions can happen frequently.  The comments in dictobject.c seem a bit mistaken:

“If we *usually* find the key we're looking for on the first try (and, it turns out, we usually do -- the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.”

According to http://www.cse.ust.hk/~yike/pods10-hashing.pdf, however, things are not so rosy. The average number of probes for a successful lookup, as a function of the load factor "alpha", is given by:

>>> c = lambda alpha: 0.5 * (1 + 1/(1-alpha))

while the average number of probes for unsuccessful lookups is:

>>> cp = lambda alpha: 0.5 * (1 + 1/(1-alpha)**2)

(note: this assumes a perfectly random hash function; I would intuitively expect an imperfect hash function to give higher figures)

For a load factor of 2/3, we then get:

>>> c(2/3)
1.9999999999999998
>>> cp(2/3)
4.999999999999999
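
These analytic figures can be sanity-checked with a quick simulation. The sketch below is my own illustration, not part of the patch; the table size, key range and trial count are arbitrary choices. It fills a table to a 2/3 load factor with random keys, resolves collisions by linear probing, and measures average probe counts, which should come out close to 2 and 5:

import random

def simulate(size=1 << 16, load=2/3, trials=10000):
    table = [None] * size
    keys = random.sample(range(1 << 30), int(size * load))
    for k in keys:                        # insert with plain linear probing
        i = k % size
        while table[i] is not None:
            i = (i + 1) % size
        table[i] = k

    def probes_hit(k):                    # probe count for a successful lookup
        i, p = k % size, 1
        while table[i] != k:
            i, p = (i + 1) % size, p + 1
        return p

    def probes_miss(h):                   # probe count for an unsuccessful lookup
        i, p = h % size, 1
        while table[i] is not None:
            i, p = (i + 1) % size, p + 1
        return p

    hits = sum(probes_hit(k) for k in random.sample(keys, trials)) / trials
    misses = sum(probes_miss(random.getrandbits(30)) for _ in range(trials)) / trials
    return hits, misses

>>> simulate()   # roughly (2.0, 5.0), varying a little from run to run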

Since the current non-linear probing scheme practically guarantees that each successive probe accesses a different cache line, the cache locality of a lookup becomes very poor.
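
To illustrate the point, here is a small sketch of mine (not from the patch) that roughly mirrors the lookdict() recurrence -- slot = i & mask, then i = i*5 + perturb + 1 with perturb >>= PERTURB_SHIFT -- and compares the slots it visits with plain linear probing:

PERTURB_SHIFT = 5    # same value as in dictobject.c

def current_probe_slots(h, mask, n=5):
    # approximate the existing open-addressing probe sequence
    i, perturb = h & mask, h
    slots = [i]
    for _ in range(n - 1):
        i = i * 5 + perturb + 1
        perturb >>= PERTURB_SHIFT
        slots.append(i & mask)
    return slots

def linear_probe_slots(h, mask, n=5):
    # consecutive slots: successive probes stay on the same few cache lines
    return [(h + j) & mask for j in range(n)]

With, say, a 1024-slot table, the first sequence jumps all over the table (one likely cache miss per probe), while the second walks through adjacent slots.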

The problem with linear probing, as noted in the comments, is that it degrades performance quite a lot when the hash function clusters its results. The solution I'm proposing is to apply an *initial* perturbation, by multiplying the hash() by a prime number. Multiplication is very fast on modern CPUs, so this doesn't adversely affect performance.
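
As an illustration only, here is a minimal sketch of the idea as described above; it is not the actual patch, and the prime constant and (key, value) table layout are placeholders of mine:

PRIME = 2654435761   # placeholder prime; the patch may well use a different constant

def lookup_slot(table, key, mask):
    # initial perturbation: a single multiply scrambles the hash before masking
    i = (hash(key) * PRIME) & mask
    while True:
        entry = table[i]                      # slots hold (key, value) pairs or None
        if entry is None or entry[0] == key:
            return i                          # free slot, or the key we are looking for
        i = (i + 1) & mask                    # linear probing on collision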