Issue 10408
Created on 2010-11-13 14:48 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files

File name | Uploaded
---|---
dictopts.patch | pitrou, 2010-11-13 14:48
dcbench-py3k.py | pitrou, 2010-11-13 14:53
Messages (14)
msg121139 - Author: Antoine Pitrou (pitrou), Date: 2010-11-13 14:48
This is a patch experiment which does two things:

- make dicts denser by making the resize factor 2 instead of 4 for small dicts
- improve cache locality on collisions by using linear probing

It should be noted that these two changes are not independent. Improving cache locality on collisions makes probing cheaper, and in turn should allow making small dicts denser.

Linear probing is motivated by the fact that collisions can happen frequently. The comments in dictobject.c seem a bit mistaken:

“If we *usually* find the key we're looking for on the first try (and, it turns out, we usually do -- the table load factor is kept under 2/3, so the odds are solidly in our favor), then it makes best sense to keep the initial index computation dirt cheap.”

According to http://www.cse.ust.hk/~yike/pods10-hashing.pdf, however, things are not so rosy. The average number of probes for a successful lookup, as a function of the load factor alpha, is given by:

>>> c = lambda alpha: 0.5 * (1 + 1/(1-alpha))

while the average number of probes for an unsuccessful lookup is:

>>> cp = lambda alpha: 0.5 * (1 + 1/(1-alpha)**2)

(Note: this is with a perfectly random hash function; I intuitively assume an imperfect hash function gives higher figures.)

For a load factor of 2/3, we then get:

>>> c(2/3)
1.9999999999999998
>>> cp(2/3)
4.999999999999999

Since the current non-linear probing scheme guarantees that each probe accesses a different cache line, the cache locality of a lookup becomes very poor.

The problem with linear probing, as noted in the comments, is that it degrades performance quite a lot when the hash function clusters results. The solution I'm proposing is to apply an *initial* perturbation, by multiplying the hash() with a prime number. Multiplication is very fast on modern CPUs, so this doesn't adversely affect performance.
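
For concreteness, here is a minimal pure-Python sketch of the proposed scheme. It is only an illustration of the idea, not the actual dictopts.patch (which modifies dictobject.c in C), and the particular multiplier is an assumption:

```python
# Sketch: open addressing with an initial multiplicative perturbation
# followed by plain linear probing.  The multiplier is illustrative
# (a large Mersenne prime), not necessarily the one used in the patch.
MASK = 2**64 - 1
MULTIPLIER = 2**61 - 1  # assumed prime multiplier

def probe_sequence(h, mask):
    """Yield slot indices for hash h in a table of mask + 1 slots
    (table size assumed to be a power of two)."""
    i = ((h * MULTIPLIER) & MASK) & mask  # initial perturbation scatters
    while True:                           # clustered hash values
        yield i
        i = (i + 1) & mask  # linear probing: consecutive slots tend to
                            # share a cache line

def lookup(table, key):
    """table: list of (key, value) pairs or None for empty slots.
    The load factor must stay below 1 or the loop never terminates."""
    for i in probe_sequence(hash(key) & MASK, len(table) - 1):
        entry = table[i]
        if entry is None:
            return None          # unsuccessful lookup
        if entry[0] == key:
            return entry[1]      # successful lookup
```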

msg121140 - Author: Antoine Pitrou (pitrou), Date: 2010-11-13 14:53
Here is a benchmark adapted from another bug entry (I merely adapted the dict sizes in order to better exhibit the performance degradation when the CPU cache becomes too small to hold the whole dict).

Results without the patch:

      10000 words (   9092 keys), 2928369 inserts/s, 13911456 lookups/s, 86 bytes/key (0.8MB)
      20000 words (  17699 keys), 3290037 inserts/s, 12746707 lookups/s, 44 bytes/key (0.8MB)
      40000 words (  34490 keys), 2620007 inserts/s,  7723605 lookups/s, 91 bytes/key (3.0MB)
      80000 words (  67148 keys), 2698863 inserts/s,  6573500 lookups/s, 46 bytes/key (3.0MB)
     160000 words ( 130897 keys), 2401067 inserts/s,  4886971 lookups/s, 48 bytes/key (6.0MB)
     320000 words ( 254233 keys), 2077558 inserts/s,  5061763 lookups/s, 49 bytes/key (12.0MB)
     640000 words ( 493191 keys), 1923967 inserts/s,  4490697 lookups/s, 51 bytes/key (24.0MB)
    1280000 words ( 956820 keys), 1792729 inserts/s,  4353711 lookups/s, 52 bytes/key (48.0MB)

Results with the patch:

      10000 words (   9092 keys), 3324590 inserts/s, 13911456 lookups/s, 43 bytes/key (0.4MB)
      20000 words (  17699 keys), 3243603 inserts/s, 13202090 lookups/s, 44 bytes/key (0.8MB)
      40000 words (  34490 keys), 2858372 inserts/s, 10686124 lookups/s, 45 bytes/key (1.5MB)
      80000 words (  67148 keys), 2585146 inserts/s,  6917441 lookups/s, 46 bytes/key (3.0MB)
     160000 words ( 130897 keys), 2395923 inserts/s,  6455817 lookups/s, 48 bytes/key (6.0MB)
     320000 words ( 254233 keys), 2247141 inserts/s,  5529826 lookups/s, 49 bytes/key (12.0MB)
     640000 words ( 493191 keys), 2064675 inserts/s,  5073732 lookups/s, 51 bytes/key (24.0MB)
    1280000 words ( 956820 keys), 1997615 inserts/s,  4760878 lookups/s, 52 bytes/key (48.0MB)

Lookups become 10% faster when the dict is bigger than the cache, and even inserts seem to benefit a bit (not to mention that small dicts take almost half the memory).
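
The attached dcbench-py3k.py is the actual benchmark; purely to show the shape of this kind of throughput measurement, a rough sketch could look like the following (the key-generation scheme here is invented for illustration and does not reproduce the attached script):

```python
import random
import time

def bench(n_words):
    # Invented key set: random words with duplicates, so the number of
    # distinct keys ends up below n_words, as in the results above.
    words = ["w%d" % random.randrange(n_words) for _ in range(n_words)]

    t0 = time.perf_counter()
    d = {}
    for w in words:
        d[w] = 1                     # measure insert throughput
    t1 = time.perf_counter()
    for w in words:
        d[w]                         # measure lookup throughput
    t2 = time.perf_counter()

    print("%8d words (%7d keys), %9.0f inserts/s, %9.0f lookups/s"
          % (n_words, len(d), n_words / (t1 - t0), n_words / (t2 - t1)))

for n in (10000, 40000, 160000, 640000):
    bench(n)
```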

msg121155 - Author: Raymond Hettinger (rhettinger), Date: 2010-11-13 23:02
My previous experiments along these lines showed it was a dead end. The number of probes was the most important factor and beat out any effort to improve cache utilization from increased density.

Doing extra work (more probes) in order to improve cache effects is very difficult, because most real programs have an uneven access pattern, so that the most frequently accessed items are usually already in cache. So, the attempted improvement only helps the less frequently accessed items and isn't worth the extra number of probes.

Another result from earlier experiments is that benchmarking the experiment is laden with pitfalls. Tight timing loops don't mirror real-world programs, nor do access patterns with uniform random distributions.

msg121156 - Author: Antoine Pitrou (pitrou), Date: 2010-11-13 23:12
> My previous experiments along these lines showed it was a dead end.
> The number of probes was the most important factor and beat out any
> effort to improve cache utilization from increased density.

Can you describe your experiments? What workloads or benchmarks did you use?

Do note that there are several levels of caches in modern CPUs. L1 is very fast (a latency of 3 or 4 cycles) but rather small (32 or 64 KB). L2, depending on the CPU, has a latency between 10 and 20+ cycles and can be 256 KB to 1 MB in size. L3, when present, is quite a bit larger but also quite a bit slower (latency sometimes up to 50 cycles). So, even if access patterns are uneven, it is probably rare to have all frequently accessed data in L1 (especially with Python, since objects are big).

> Another result from earlier experiments is that benchmarking the
> experiment is laden with pitfalls. Tight timing loops don't mirror
> real-world programs, nor do access patterns with uniform random
> distributions.

I can certainly understand that; can you suggest workloads approaching "real world programs"?

msg121157 - Author: Raymond Hettinger (rhettinger), Date: 2010-11-13 23:14
FWIW, one way to make a dict denser without increasing the number of probes is to use Brent's Variation of Algorithm D in Knuth. That optimizes the insertion order to minimize the number of collisions and lets you pack well over two-thirds full without degradation.
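
For reference, here is a simplified sketch of the idea behind Brent's Variation (the full algorithm in Knuth's vol. 3 weighs every displacement candidate along the probe path; this version, built on an assumed double-hashing scheme, checks a single displacement only):

```python
# Simplified sketch of Brent's Variation with double hashing: on a
# collision, it can be cheaper to displace the *resident* key one step
# along its own probe sequence (costing its future lookups one extra
# probe) than to push the new key ever further along its sequence.
MASK = 2**64 - 1

def increment(h, mask):
    """Second hash: an odd step, so the probe sequence visits every
    slot of a power-of-two-sized table."""
    return ((h >> 7) | 1) & mask

def brent_insert(table, key, value):
    """Assumes len(table) is a power of two, the table is not full,
    and the key is not already present; no resize logic."""
    mask = len(table) - 1
    h = hash(key) & MASK
    i = h & mask
    inc = increment(h, mask)
    while table[i] is not None:
        rkey, rvalue = table[i]
        j = (i + increment(hash(rkey) & MASK, mask)) & mask
        if table[j] is None:
            table[j] = (rkey, rvalue)   # displace the resident one step
            break                       # new key gets a first-probe hit
        i = (i + inc) & mask            # otherwise keep probing
    table[i] = (key, value)
```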

msg121158 - Author: Antoine Pitrou (pitrou), Date: 2010-11-13 23:24
> FWIW, one way to make a dict denser without increasing the number of
> probes is to use Brent's Variation of Algorithm D in Knuth. That
> optimizes the insertion order to minimize the number of collisions and
> lets you pack well over two-thirds full without degradation.

Heh, you suggested that several years ago on python-dev and I supplied the code. Also, IIRC, it didn't bring any improvement - although I don't remember which benchmarks, if any, were run.

But the experiment here is mostly about decreasing the (direct and indirect) cost of collisions by improving the temporal locality of lookups.

msg121162 - Author: Raymond Hettinger (rhettinger), Date: 2010-11-13 23:38
See Objects/dictnotes.txt for some of the results. I spent about a full month trying to optimize dict performance, either by tuning parameters or by using different algorithms. There were a couple of wins that were not implemented:

1) Allowing users to control insertion order, or at least to specify which keys are frequently accessed, so that we could assure a first-time hit.

2) Allowing users to pre-size a dictionary, so that resizes wouldn't be needed or so that they could control density.

Guido didn't want to expose these controls.

The PyPy guys published a paper on their results with alternative dict implementations and specialized dicts. You might want to look at that. IIRC, they found only minor wins.

msg121163 - Author: Antoine Pitrou (pitrou), Date: 2010-11-13 23:40
> See Objects/dictnotes.txt for some of the results.
> I spent about a full month trying to optimize dict
> performance, either by tuning parameters or by using
> different algorithms.

Well, I've seen those results. I'm asking about which workloads or benchmarks were used, so that I can try to reproduce these experiments.

msg149147 - Author: Charles-François Natali (neologix), Date: 2011-12-10 11:28
This paper might be of interest: http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf

Basically, it concludes that most of the time there's no speedup to be gained from the increased cache locality of linear probing when the input sample differs from a uniform distribution. However, figure 1 shows a clear win with a uniform distribution, and figure 2 shows that even with a non-uniform distribution there's still a gain as the load factor increases. So if experiments show measurable improvements, this might be an interesting idea.

However, the problem with linear probing is that performance will be much more dependent on the input distribution than with quadratic probing/double hashing, so this will definitely degrade with some workloads.

> The problem with linear probing, as noted in the comments, is that it
> degrades performance quite a lot when the hash function clusters
> results. The solution I'm proposing is to apply an *initial*
> perturbation, by multiplying the hash() with a prime number.

I fail to see how this would avoid primary clustering: IIUC, you replace the hash function h(k) by h'(k), but you still use linear probing: an empty slot preceded by i filled slots has a probability (i+1)/N of getting filled next (N being the total number of slots). The simulation sketch below illustrates this.
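
A quick simulation of that effect, assuming nothing beyond uniformly random starting slots and plain linear probing:

```python
import random

# Even with perfectly uniform starting slots, linear probing grows long
# runs of filled slots ("primary clustering"): an empty slot right
# after a run of i filled slots is claimed by any of the i+1 starting
# positions that land inside or just after the run.

def run_lengths(table):
    """Lengths of maximal runs of filled slots (wraparound ignored)."""
    runs, current = [], 0
    for filled in table:
        if filled:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

N = 1 << 14
table = [False] * N
for _ in range(N * 2 // 3):      # fill to the usual 2/3 load factor
    i = random.randrange(N)      # perfectly uniform "hash"
    while table[i]:
        i = (i + 1) % N          # linear probing
    table[i] = True

print("longest runs:", sorted(run_lengths(table))[-5:])
```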

msg157916 - Author: Jim Jewett (Jim.Jewett), Date: 2012-04-09 22:38
FWIW, doing a linear probe only within a cache line (changing only the low-order bits) before applying perturb might also be useful -- and the results may change if the size of a dictentry were reduced. (Mark Shannon's now-integrated patch doesn't actually do that for the keys, but it would be doable.) A sketch of the idea follows.
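
A minimal sketch of that hybrid, assuming 8 index slots per 64-byte cache line and a table of at least 8 slots (both numbers are assumptions):

```python
SLOTS_PER_LINE = 8  # assumption: 64-byte cache line / 8-byte slot

def probe(h, mask):
    """Yield slot indices for a nonnegative hash h and a power-of-two
    table of mask + 1 >= 8 slots."""
    i = h & mask
    base = i & ~(SLOTS_PER_LINE - 1)    # start of i's cache line
    for k in range(SLOTS_PER_LINE):     # same-line slots first: the
        yield base | ((i + k) % SLOTS_PER_LINE)  # line is already loaded
    perturb = h                         # then escape clusters with
    while True:                         # CPython-style perturbation
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask
        yield i
```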

msg280010 - Author: Serhiy Storchaka (serhiy.storchaka), Date: 2016-11-03 19:23
What is the status of this issue in light of the compact dict implementation?

msg280036 - Author: Inada Naoki (methane), Date: 2016-11-04 03:16
> - make dicts denser by making the resize factor 2 instead of 4 for small dicts

This had already been implemented when I started on the compact dict.

> - improve cache locality on collisions by using linear probing

set does this, but dict doesn't for now.

In the case of the compact dict, linear probing only affects the index table (dk_indices). dk_indices is small (64 bytes when dk_size == 64), so one or two cache lines can contain the whole dk_indices of a small dict. The performance benefit of linear probing will therefore be smaller than with the previous dict implementation. I'll re-evaluate it.
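
A rough model of that split layout (schematic, not the C implementation; the probing recurrence is modeled on CPython's, and resize and existing-key handling are omitted):

```python
# Compact dict (CPython 3.6+): probing touches only the small index
# table; the entries live in a separate dense array.  For a 64-slot
# table with one-byte indices, the whole probed region is 64 bytes,
# about one cache line, so probe order matters much less.
DK_SIZE = 64
indices = [None] * DK_SIZE   # dk_indices: positions into `entries`
entries = []                 # dense (hash, key, value) records

def insert(key, value):
    mask = DK_SIZE - 1
    h = hash(key) & (2**64 - 1)
    i = h & mask
    perturb = h
    while indices[i] is not None:          # collisions scan indices only
        perturb >>= 5
        i = (5 * i + perturb + 1) & mask   # CPython-style recurrence
    indices[i] = len(entries)
    entries.append((h, key, value))
```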

msg289224 - Author: Serhiy Storchaka (serhiy.storchaka), Date: 2017-03-08 09:50
I think this issue is outdated. Linear probing doesn't make much sense for the current dict implementation.

msg289264 - Author: Raymond Hettinger (rhettinger), Date: 2017-03-09 01:17
Agreed. This doesn't make sense for the compact dict implementation.

History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:57:08 | admin | set | github: 54617
2017-03-09 01:17:53 | rhettinger | set | status: pending -> closed; resolution: out of date; messages: + msg289264; stage: resolved
2017-03-08 09:50:01 | serhiy.storchaka | set | status: open -> pending; messages: + msg289224
2016-11-04 03:16:30 | methane | set | messages: + msg280036
2016-11-03 19:23:01 | serhiy.storchaka | set | nosy: + methane; messages: + msg280010
2012-04-09 22:38:09 | Jim.Jewett | set | nosy: + Jim.Jewett; messages: + msg157916
2012-04-07 17:17:03 | serhiy.storchaka | set | nosy: + serhiy.storchaka
2011-12-10 11:29:00 | neologix | set | nosy: + neologix; messages: + msg149147
2011-08-24 06:34:37 | jcon | set | nosy: + jcon
2010-11-13 23:40:39 | pitrou | set | messages: + msg121163
2010-11-13 23:38:23 | rhettinger | set | messages: + msg121162
2010-11-13 23:24:06 | pitrou | set | messages: + msg121158
2010-11-13 23:14:17 | rhettinger | set | messages: + msg121157
2010-11-13 23:12:16 | pitrou | set | messages: + msg121156
2010-11-13 23:02:33 | rhettinger | set | messages: + msg121155
2010-11-13 14:53:05 | pitrou | set | files: + dcbench-py3k.py; messages: + msg121140
2010-11-13 14:48:06 | pitrou | create |