Message20593

Author | smst |
---|---|

Recipients | |

Date | 2004-04-27.11:41:34 |

SpamBayes Score | |

Marked as misclassified | |

Message-id | |

In-reply-to |

Content | |
---|---|

I've encountered some performance problems when constructing dictionaries with keys of a particular form, and have pinned the problem down to the hashing function. I've reproduced the effect in Python 1.5.2, Python 2.2 and Python 2.3.3. I came across this when loading a marshalled dictionary with about 40000 entries. Loading other dictionaries of this size has always been fast, but in this case I killed the interpreter after a few minutes of CPU thrashing. The performance problem was caused because every key in the dictionary had the same hash value. The problem is as follows: for hashable values X and Y, hash( (X, (X, Y)) ) == hash(Y). This is always true (except in the corner case where hash((X, Y)) is internally calculated to be -1 (the error value) and so -2 is forced as the return value). With data in this form where X varies more than Y (Y is constant, or chosen from relatively few values compared to X) chances of collision become high as X is effectively ignored. The hash algorithm for tuples starts with a seed value, then generates a new value for each item in the tuple by multiplying the iteration's starting value by a constant (keeping the lowest 32 bits) and XORing with the hash of the item. The final value is then XORed with the tuple's length. In Python (ignoring the careful business with -1): # assume 'my_mul' would multiply two numbers and return the low 32 bits value = seed for item in tpl: value = my_mul(const, value) ^ hash(item) value = value ^ len(tpl) The tuple (X, Y) therefore has hash value: my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash(Y) ^ 2 ...and the tuple (X, (X, Y)) has hash value: my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash((X, Y)) ^ 2 The outer multiplication is repeated, and is XORed with itself (cancelling both of them). The XORed 2s cancel also, leaving just hash(Y). Note that this cancellation property also means that the same hash value is shared by (X, (X, (X, (X, Y)))), and (X, (X, (X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z, Y)) and so on. I realise that choosing a hash function is a difficult task, so it may be that the behaviour seen here is a tradeoff against other desireable properties -- I don't have the expertise to tell. My naive suggestion would be that an extra multiplication is necessary, which presumably has a performance impact (multiplication being more expensive than XOR) but would reduce the possibility of cancellation. On the other hand, perhaps this particular case is rare enough that it's not worth the effort. For my own application I'm fortunate in that I can probably rearrange the data structure to avoid this case. |

History | |||
---|---|---|---|

Date | User | Action | Args |

2007-08-23 14:21:12 | admin | link | issue942952 messages |

2007-08-23 14:21:12 | admin | create |