Frank Sievertsen wrote:
>
> Frank Sievertsen <python@sievertsen.de> added the comment:
>
>> The suffix only introduces a constant change in all hash values
>> output, so even if you don't know the suffix, you can still
>> generate data sets with collisions by just having the prefix.
>
> That's true. But without the suffix, I can pretty easy and efficient guess the prefix by just seeing the result of a few well-chosen and short repr(dict(X)). I suppose that's harder with the suffix.
Since the hash function is known, it doesn't make things much
harder. Without suffix you just need hash('') to find out what
the prefix is. With suffix, two values are enough.
Say P is your prefix and S your suffix. Let's say you can get the
hash values of A = hash('') and B = hash('\x00').
With Victor's hash function you have (IIRC):
A = hash('') = P ^ (0<<7) ^ 0 ^ S = P ^ S
B = hash('\x00') = ((P ^ (0<<7)) * 1000003) ^ 0 ^ 1 ^ S = (P * 1000003) ^ 1 ^ S
Let X = A ^ B, then
X = P ^ (P * 1000003) ^ 1
since S ^ S = 0 and 0 ^ Y = Y (for any Y), i.e. the suffix doesn't
make any difference.
For P < 500000, you can then easily calculate P from X
using:
P = X // 1000002
(things obviously get tricky once overflow kicks in)
Note that for number hashes the randomization doesn't work at all,
since there's no length or feedback loop involved.
With Victor's approach hash(0) would output the whole seed,
but even if the seed is not known, creating an attack data
set is trivial, since hash(x) = P ^ x ^ S. |