Author lemburg
Recipients Arfrever, Huzaifa.Sidhpurwala, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, georg.brandl, gvanrossum, haypo, jcea, lemburg, merwok, pitrou, terry.reedy, v+python
Date 2012-01-06.12:49:15
SpamBayes Score 9.4369e-16
Marked as misclassified No
Message-id <>
In-reply-to <>
Before continuing down the road of adding randomness to hash
functions, please have a good read of the existing dictionary

Major subtleties ahead:  Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness.  Python doesn't:  its most
important hash functions (for strings and ints) are very regular in common

[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are "consecutive" strings.  So this
gives better-than-random behavior in common cases, and that's very desirable.

There's also a file called dictnotes.txt which has more interesting
details about how the implementation is designed.

Please note that the term "collision" is used in a slightly different
way: it refers to trying to find an empty slot in the dictionary
table. Having a collision implies that the hash values of two distinct
objects are the same, but you also get collisions in case two distinct
objects with different hash values get mapped to the same table entry.

An attack can be based on trying to find many objects with the same
hash value, or trying to find many objects that, as they get inserted
into a dictionary, very often cause collisions due to the collision
resolution algorithm not finding a free slot.

In both cases, the (slow) object comparisons needed to find an
empty slot is what makes the attack practical, if the application
puts too much trust into large blobs of input data - which is
the actual security issues we're trying to work around here...

Given the dictionary implementation notes, I'm even less certain
that the randomization change is a good idea. It will likely
introduce a performance hit due to both the added complexity in
calculating the hash as well as the reduced cache locality of
the data in the dict table.

I'll upload a patch that demonstrates the collisions counting
strategy to show that detecting the problem is easy. Whether
just raising an exception is a good idea, is another issue.

It may be better to change the tp_hash slot in Python 3.3
to take an argument, so that the dict implementation can
use the hash function as universal hash family function

The dict implementation could then alter the hash parameter
and recreate the dict table in case the number of collisions
exceeds a certain limit, thereby actively taking action
instead of just relying on randomness solving the issue in
most cases.
Date User Action Args
2012-01-06 12:49:17lemburgsetrecipients: + lemburg, gvanrossum, barry, georg.brandl, terry.reedy, jcea, pitrou, haypo, christian.heimes, benjamin.peterson, merwok, Arfrever, v+python, alex, dmalcolm, Mark.Shannon, Zhiping.Deng, Huzaifa.Sidhpurwala, PaulMcMillan
2012-01-06 12:49:16lemburglinkissue13703 messages
2012-01-06 12:49:15lemburgcreate