
Hash collisions for tuples #78932

Closed
jdemeyer opened this issue Sep 20, 2018 · 122 comments
Assignees: tim-one
Labels: 3.8 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

Comments

@jdemeyer
Contributor

BPO 34751
Nosy @tim-one, @rhettinger, @mdickinson, @ericvsmith, @jdemeyer, @sir-sigurd
PRs
  • bpo-34751: improved hash function for tuples #9471
  • bpo-34751: improved hash function for tuples #9534
Files
  • htest.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tim-one'
    closed_at = <Date 2018-10-28.00:07:08.358>
    created_at = <Date 2018-09-20.13:27:27.408>
    labels = ['interpreter-core', '3.8', 'performance']
    title = 'Hash collisions for tuples'
    updated_at = <Date 2018-11-05.08:30:09.074>
    user = 'https://github.com/jdemeyer'

    bugs.python.org fields:

    activity = <Date 2018-11-05.08:30:09.074>
    actor = 'jdemeyer'
    assignee = 'tim.peters'
    closed = True
    closed_date = <Date 2018-10-28.00:07:08.358>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2018-09-20.13:27:27.408>
    creator = 'jdemeyer'
    dependencies = []
    files = ['47857']
    hgrepos = []
    issue_num = 34751
    keywords = ['patch']
    message_count = 122.0
    messages = ['325870', '325883', '325891', '325936', '325949', '325950', '325953', '325955', '325956', '325957', '325959', '325960', '325964', '325981', '325982', '326016', '326018', '326021', '326023', '326024', '326026', '326029', '326032', '326047', '326067', '326081', '326083', '326086', '326110', '326115', '326117', '326118', '326119', '326120', '326147', '326173', '326193', '326194', '326198', '326200', '326203', '326212', '326214', '326217', '326218', '326219', '326236', '326278', '326290', '326302', '326306', '326309', '326331', '326332', '326333', '326334', '326337', '326352', '326375', '326396', '326435', '326505', '326507', '326508', '326510', '326512', '326583', '326602', '326603', '326615', '326640', '326653', '326667', '326753', '326765', '326820', '326835', '326872', '326874', '326877', '326878', '326883', '326892', '326893', '326894', '326898', '326903', '326905', '326906', '326907', '326908', '326910', '326917', '326918', '326927', '326929', '326947', '326949', '326961', '327005', '327014', '327038', '327042', '327043', '327044', '327045', '327055', '327061', '327063', '327078', '327258', '327260', '327298', '327311', '327319', '327380', '327391', '327394', '327423', '327460', '328665', '329287']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'eric.smith', 'jdemeyer', 'sir-sigurd']
    pr_nums = ['9471', '9534']
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue34751'
    versions = ['Python 3.8']

    @jdemeyer
    Contributor Author

    It's not hard to come up with a hash collision for tuples:

    >>> hash( (1,0,0) )
    2528505496374819208
    >>> hash( (1,-2,-2) )
    2528505496374819208

    The underlying reason is that the hashing code mixes ^ and *. This can create collisions because, for odd x, we have x ^ (-2) == -x and this minus operation commutes with multiplication.

    This can be fixed by replacing ^ with +. On top of that, the hashing code for tuples looks needlessly complicated. A simple Bernstein hash suffices.
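    For readers who want to see the mechanism, here is a minimal sketch (the bitwise identity is exact; the final comparison only reproduces the collision on builds without an improved tuple hash):

        # For any odd integer x, two's-complement arithmetic gives x ^ -2 == -x.
        for x in (1, 3, 7, 12345):
            assert x ^ -2 == -x

        # That sign flip commutes with the multiplication used in the mixing,
        # which is what produces collisions like the one reported above:
        print(hash((1, 0, 0)) == hash((1, -2, -2)))   # True before the fix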

    @rhettinger
    Contributor

    It's not hard to come up with a hash collision for tuples:

    Nor is it hard to come up with collisions for most simple hash functions. For typical use cases of tuples, what we have works out fine (it just combines the effects of the underlying hashes).

    The underlying reason is that the hashing code mixes ^ and *.

    Addition also has a relationship to multiplication. I think the XOR is fine.

    A simple Bernstein hash suffices.

    That is more suited to character data and small hash ranges (as opposed to 64-bit).

    On top of that, the hashing code for tuples looks
    needlessly complicated.

    The important thing is that it has passed our testing (see the comment above) and that it compiles nicely (see the tight disassembly below).

    ----------------------

    L82:
    xorq %r14, %rax
    imulq %rbp, %rax
    leaq 82518(%rbp,%r12,2), %rbp
    movq %r13, %r12
    movq %rax, %r14
    leaq -1(%r13), %rax
    cmpq $-1, %rax
    je L81
    movq %rax, %r13
    L73:
    addq $8, %rbx
    movq -8(%rbx), %rdi
    call _PyObject_Hash
    cmpq $-1, %rax
    jne L82

    @rhettinger rhettinger added 3.8 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Sep 20, 2018
    @rhettinger rhettinger added the performance Performance or resource usage label Sep 20, 2018
    @jdemeyer
    Contributor Author

    Nor is it hard to come up with collisions for most simple hash functions.

    The Bernstein hash algorithm is a simple algorithm for which it can be proven mathematically that collisions cannot be generated if the multiplier is unknown. That is an objective improvement over the current tuple hash. Of course, in practice the multiplier is not secret, but it suffices that it is random enough.

    Addition also has a relationship to multiplication.

    Of course, but not in a way that can be used to generate hash collisions. In fact, the simple relation between addition and multiplication makes this an actual provable mathematical statement.

    That is more suited to character data and small hash ranges (as opposed to 64-bit).

    Maybe you are thinking of the multiplier 33 which is classically used for character data? If you replace the multiplier 33 by a larger number like _PyHASH_MULTIPLIER == 1000003 (ideally it would be a truly random number), it works fine for arbitrary sequences and arbitrary bit lengths.
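    Concretely, the loop being referred to is nothing more than the following (a sketch with illustrative constants, assuming 64-bit wraparound; ideally the multiplier would be chosen at random):

        MASK = (1 << 64) - 1          # simulate C's wrapping unsigned arithmetic
        MULTIPLIER = 1_000_003        # illustrative; any large, random-enough odd value

        def bernstein_combine(item_hashes, h=0x345678):
            """Multiply-then-add combiner over the hashes of a tuple's items."""
            for z in item_hashes:
                h = (h * MULTIPLIER + z) & MASK
            return h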

    The important thing is that it has passed our testing

    That's a silly argument. If it passed testing, it is only because the testing is insufficient.

    But really: what I am proposing is a better hash without obvious collisions which won't be slower than the existing hash. Why would you be against that? Why should we "roll our own" hash instead of using a known good algorithm?

    @jdemeyer jdemeyer reopened this Sep 20, 2018
    @rhettinger
    Contributor

    ISTM, you're being somewhat aggressive and condescending. I'll hand you off to Tim who is far better at wrestling over the details :-)

    From my point of view, you've made a bold and unsubstantiated claim that Python's existing tuple hash has collision issues with real code. Only a contrived and trivial integer example was shown.

    On my end, I made an honest effort to evaluate the suggestion. I read your post,
    looked up Bernstein hash, revisited the existing C code and found that everything it is doing makes sense to me. There was a comment indicating that empirical tests were run to get the best possible constants. Also, I disassembled the code to verify that it is somewhat efficiently compiled.

    You responded that I'm being "silly" and then you reversed my decision to close the issue (because there is no evidence that the hash is broken). Accordingly, I'm going to hand it over to Tim who I'm sure will know the history of the existing code, can compare and contrast the various options, evaluate them in the context of typical Python use cases, and assess whether some change is warranted.

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    @jdemeyer, please define exactly what you mean by "Bernstein hash". Bernstein has authored many hashes, and none on his current hash page could possibly be called "simple":

    https://cr.yp.to/hash.html
    

    If you're talking about the teensy

        hash = hash * 33 + c;

    thing from now-ancient C folklore, Bernstein is claimed to have later replaced the addition with xor:

    http://www.cse.yorku.ca/~oz/hash.html
    

    In any case, Python _used_ to do plain

        x = (1000003 * x) ^ y;
        
    but changed to the heart of the current scheme 14 years ago to address Source Forge bug python/cpython#40190:
        https://github.com/python/cpython/commit/41bd02256f5a2348d2de3d6e5fdfcaeb2fcaaebc#diff-cde4fb3109c243b2c2badb10eeab7fcd

    The Source Forge bug reports no longer appear to exist. If you care enough, dig through the messages with subject:

    [ python-Bugs-942952 ] Weakness in tuple hash
    

    here:

    https://mail.python.org/pipermail/python-bugs-list/2004-May/subject.html
    

    The thrust was that the simpler approach caused, in real life, hashes of nested tuples of the form

    (X, (X, Y))
    

    to return hash(Y) - the hash(X) parts across calls magically xor'ed out.

    It was also systematically the case that

    (X, (Y, Z))

    and

    (Y, (X, Z))

    hashed to the same thing.
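    Both cancellations are easy to verify with a toy model of that loop (a sketch, not the exact historical code; small ints stand in for arbitrary item hashes):

        MASK = (1 << 64) - 1   # model C's wrapping unsigned arithmetic

        def old_style(t):
            """Model of the pre-2004 loop: x = (1000003*x) ^ hash(item)."""
            x = 0x345678
            for obj in t:
                y = old_style(obj) if isinstance(obj, tuple) else hash(obj)
                x = ((1000003 * x) ^ y) & MASK
            return x

        X, Y, Z = 3, 5, 7
        print(old_style((X, (X, Y))) == hash(Y))                  # hash(X) xors itself out
        print(old_style((X, (Y, Z))) == old_style((Y, (X, Z))))   # systematic collision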

    So the complications were added for a reason, to address the showed-up-in-real-life pathological cases discussed in the bug report already linked to. Since I don't believe we've had other reports of real-life pathologies since, nobody is going to change this again without truly compelling reasons. Which I haven't seen here, so I'm closing this report again.

    BTW, as also explained in that report, the goal of Python's hash functions is to minimize collisions in dicts in real life. We generally couldn't care less whether they "act random", or are hard to out-guess. For example,

     assert all(hash(i) == i for i in range(1000000))
    

    has always succeeded.

    @tim-one tim-one closed this as completed Sep 21, 2018
    @ned-deily
    Member

    (FWIW, I believe the Source Forge bug reports were forward-ported to bugs.python.org. This looks like the issue in question: https://bugs.python.org/issue942952 )

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    Ah! I see that the original SourceForge bug report got duplicated on this tracker, as bpo-942952. So clicking on that is a lot easier than digging thru the mail archive.

    One message there noted that replacing xor with addition made collision statistics much worse for the cases at issue.

    @jdemeyer
    Contributor Author

    ISTM, you're being somewhat aggressive and condescending.

    Interestingly, I felt the same with your reply, which explains my rapid reversion of your closing of the ticket. The way I see it: I pointed out a bug in tuple hashing and you just closed that, basically claiming that it's not a bug.

    I still don't understand why you are closing this so quickly. To put it differently: if I can improve the hash function, removing that obvious collision, extending the tests to catch this case, while not making it slower, why shouldn't this be done?

    @jdemeyer
    Contributor Author

    Since I don't believe we've had other reports of real-life pathologies since

    You just did in this bug report. Why doesn't that count?

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    @jdemeyer, you didn't submit a patch, or give any hint that you _might_. It _looked_ like you wanted other people to do all the work, based on a contrived example and a vague suggestion.

    And we already knew from history that "a simple Bernstein hash suffices" is false for tuple hashing (for one possible meaning of "a simple Bernstein hash" - we had one of those 14 years ago).

    Neither Raymond nor I want to spend time on this ourselves, so in the absence of somebody else volunteering to do the work, there's no reason to keep the report open uselessly.

    If you're volunteering to do the work ... then do the work ;-)

    @jdemeyer
    Contributor Author

    To come back to the topic of hashing, with "Bernstein hash" for tuples I did indeed mean the simple (in Python pseudocode):

    # t is a tuple
    h = INITIAL_VALUE
    for x in t:
        z = hash(x)
        h = (h * MULTIPLIER) + z
    return h

    I actually started working on the code and there is indeed an issue with the hashes of (X, (Y, Z)) and (Y, (X, Z)) being equal that way. But that can be solved by replacing "hash(x)" in the loop by something slightly more complicated. After thinking about it, I decided to go for shifting higher to lower bits, so replacing z by z^(z >> 32) or something like that.

    This also has the benefit of mixing the high-order bits into the low-order bits, which is an additional improvement over the current code.

    Since I already wrote the code anyway, I'll create a PR for it.
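    For concreteness, the shape of that proposal is roughly the following (a sketch with illustrative constants and an assumed 64-bit build; the PR has the real values):

        MASK = (1 << 64) - 1
        MULTIPLIER = 1_000_003        # stand-in; the PR uses a larger multiplier

        def tuple_hash_sketch(t, h=0x345678):
            for x in t:
                z = hash(x) & MASK
                z ^= z >> 32          # fold the high bits into the low bits
                h = (h * MULTIPLIER + z) & MASK
            return h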

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    You said it yourself: "It's not hard to come up with ...". That's not what "real life" means. Here:

    >>> len(set(hash(1 << i) for i in range(100_000)))
    61

    Wow! Only 61 hash codes across 100 thousand distinct integers?!

    Yup - and I don't care. I contrived the case, just like you apparently contrived yours.

    @jdemeyer
    Contributor Author

    For the record: my collision is not contrived. It came up in actual code where I was hashing some custom class using tuples and found an unexpectedly high number of collisions, which I eventually traced back to the collision I reported here.

    By the way, I think the worst real-life hash collision is

    >>> hash(-1) == hash(-2)
    True

    @jdemeyer jdemeyer reopened this Sep 21, 2018
    @ericvsmith
    Member

    I agree with Raymond and Tim here: while it's inevitably true that there are many possible hash collisions, we'd need to see where the current algorithm caused real problems in real code. Pointing out a few examples where there was a collision isn't very helpful.

    So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes. What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

    @jdemeyer
    Contributor Author

    So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes. What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

    In all honesty, I don't remember. This was a while ago and at that time I didn't care enough to put up a CPython bug report. Still, this is a collision for tuples of short length (3) containing small integers (0 and -2). Why do you find that contrived?

    What prompted me to report this bug now anyway is that I discovered bad hashing practices for other classes too. For example, meth_hash in Objects/methodobject.c simply XORs two hashes and XOR tends to suffer from catastrophic cancellation. So I was hoping to fix tuple hashing as an example for other hash functions to follow.
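    (As a generic illustration of that kind of cancellation, not the actual meth_hash code: a bare XOR combiner sends every pair of equal hashes to 0 and is symmetric in its two arguments, so structured inputs collide systematically.)

        def xor_combine(h1, h2):
            return h1 ^ h2

        print(xor_combine(12345, 12345))                            # 0: equal parts cancel
        print(xor_combine(12345, 678) == xor_combine(678, 12345))   # True: order is lost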

    @jdemeyer
    Contributor Author

    I should also make it clear that the collision hash((1,0,0)) == hash((1,-2,-2)) that I reported is due to the algorithm, it's not due to some bad luck that 2 numbers happen to be the same. There are also many more similar hash collisions for tuples (all involving negative numbers) due to the same underlying mathematical reason.

    I still don't understand why you don't consider it a problem. It may be a tiny problem, not really affecting the correct functioning of CPython. But, if we can fix this tiny problem, why shouldn't we?

    @ericvsmith
    Member

    I'm not one to judge the effectiveness of a new hash algorithm. But my personal concern here is making something else worse: an unintended consequence. IMO the bar for changing this would be very high, and at the least it would need to be addressing a known, not theoretical, problem.

    That said, I'll let the experts debate your proposed change.

    @rhettinger
    Contributor

    ISTM the collision of "hash((1,0,0)) == hash((1,-2,-2))" also stems from integers hashing to themselves and the twos-complement arithmetic relation -x == ~x+1. Further, that example specifically exploits that "hash(-1) == hash(-2)" due to how error codes are handled. In a way, tuple hashing is incidental.

    Occasional collisions aren't really a big deal -- it is normal for dicts to have some collisions and to resolve them. I would be more concerned if non-contrived datasets had many elements that gave exactly the same hash value, resulting in a pile-up in one place.

    FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

        >>> from itertools import product
        >>> len(set(map(hash, product(range(100), repeat=4))))
        100000000

    The current hash is in the pretty-darned-good category and so we should be disinclined to change battle-tested code that is known to work well across a broad range of use cases.

    @jdemeyer
    Contributor Author

    Further, that example specifically exploits that "hash(-1) == hash(-2)"

    No, it does not. There is no hashing of -1 involved in the example hash((1,0,0)) == hash((1,-2,-2)).

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    For me, it's largely because you make raw assertions with extreme confidence that the first thing you think of off the top of your head can't possibly make anything else worse. When it turns out it does make some things worse, you're equally confident that the second thing you think of is (also) perfect.

    So far we don't even have a real-world test case, let alone a coherent characterization of "the problem":

    """
    all involving negative numbers) due to the same underlying mathematical reason.
    """

    is a raw assertion, not a characterization. The only "mathematical reason" you gave before is that "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m". Which is true, and a good observation, but doesn't generalize as-is beyond -2.

    Stuff like this from the PR doesn't inspire confidence either:

    """
    MULTIPLIER = (1000003)**3 + 2 = 1000009000027000029: the multiplier should be big enough and the original 20-bit number is too small for a 64-bit hash. So I took the third power. Unfortunately, this caused a sporadic failure of the testsuite on 32-bit systems. So I added 2 which solved that problem.
    """

    Why do you claim the original was "too small"? Too small for what purpose? As before, we don't care whether Python hashes "act random". Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat? What was _the cause_ of the "sporadic failure" (whatever that means), and why did adding 2 fix it? Why isn't there a single word in _the code_ about where the mystery numbers came from?

    You're creating at least as many mysteries as you're claiming to solve.

    We're not going to change such heavily used code on a whim.

    That said, you could have easily enough _demonstrated_ that there's potentially a real problem with a mix of small integers of both signs:

    >>> from itertools import product
    >>> cands = list(range(-10, -1)) + list(range(9))
    >>> len(cands)
    18
    >>> _ ** 4
    104976
    >>> len(set(hash(t) for t in product(cands, repeat=4)))
    35380

    And that this isn't limited to -2 being in the mix (and noting that -1 wasn't in the mix to begin with):

    >>> cands.remove(-2)
    >>> len(cands) ** 4
    83521
    >>> len(set(hash(t) for t in product(cands, repeat=4)))
    33323

    If that's "the problem", then - sure - it _may_ be worth addressing. Which we would normally do by looking for a minimal change to code that's been working well for over a decade, not by replacing the whole thing "just because".

    BTW, continuing the last example:

    >>> from collections import Counter
    >>> c1 = Counter(hash(t) for t in product(cands, repeat=4))
    >>> Counter(c1.values())
    Counter({1: 11539, 2: 10964, 4: 5332, 3: 2370, 8: 1576, 6: 1298, 5: 244})

    So despite that there were many collisions, the max number of times any single hash code appeared was 8. That's unfortunate, but not catastrophic.

    Still, if a small change could repair that, fine by me.

    @tim-one
    Member

    tim-one commented Sep 21, 2018

    Oops!

    """
    "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m"
    """

    The tail end should say "m*(j^(-2)) == -m*j" instead.

    @jdemeyer
    Contributor Author

    FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

        >>> from itertools import product
        >>> len(set(map(hash, product(range(100), repeat=4))))
        100000000

    FWIW: the same is true for my new hash function

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 3, 2018

    A (simplified and slightly modified version of) xxHash seems to work very well, much better than SeaHash. Just like SeaHash, xxHash also works in parallel. But I'm not doing that and just using this for the loop:

        for y in t:
            y ^= y * (PRIME32_2 - 1)
            acc += y
            acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
            acc *= MULTIPLIER

    Plain xxHash does "y *= PRIME32_2" or equivalently "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps slightly with my new tuple test.
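    Spelled out as runnable Python with explicit 32-bit wrapping (a sketch; PRIME32_2 is the xxHash constant given further down, and the multiplier here is just one example of a value that is 3 mod 8):

        MASK32 = (1 << 32) - 1
        PRIME32_2 = 2246822519
        MULTIPLIER = 0x9E3779B3       # example only; low byte 0xB3 is 3 mod 8

        def combine32(acc, y):
            """One iteration of the modified loop above, on 32-bit words."""
            y &= MASK32
            y = (y ^ (y * (PRIME32_2 - 1))) & MASK32
            acc = (acc + y) & MASK32
            acc = ((acc << 13) | (acc >> 19)) & MASK32   # rotate left by 13 bits
            return (acc * MULTIPLIER) & MASK32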

    @tim-one
    Member

    tim-one commented Oct 3, 2018

    Thanks for looking at xxHash! An advantage is that it already comes in 32- and 64-bit versions.

    A (simplified and slightly modified version of) xxHash
    seems to work very well, much better than SeaHash.

    ? I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test. Which it also passed, most recently with 7 collisions. That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

    Just like SeaHash, xxHash also works in parallel. But I'm not
    doing that and just using this for the loop:

    for y in t:
    y ^= y * (PRIME32_2 - 1)
    acc += y
    acc = ((acc << 13) + (acc >> 19)) # rotate left by 13 bits
    acc *= MULTIPLIER

    Plain xxHash does "y *= PRIME32_2" or equivalently
    "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps
    slightly with my new tuple test.

    Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me. Does the variant also score well under SMHasher? "I don't see why it wouldn't" is not an answer to that ;-)

    Note that the change also slows the loop a bit - "acc += y" can't begin until the multiply finishes, and the following "acc *=" can't begin until the addition finishes. Lengthening the critical path needs to buy something real to justify the extra expense. I don't see it here.

    For posterity: the 64-bit version of xxHash uses different primes, and rotates left by 31 instead of by 13.

    Note: I'm assuming that by "PRIME32_2" you mean 2246822519U, and that "MULTIPLIER" means 2654435761U. Which appear to be the actual multipliers used in, e.g.,

    https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

    As always, I'm not a fan of making gratuitous changes. In any case, without knowing the actual values you're using, nobody can replicate what you're doing.

    That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a real advantage in that it apparently didn't _need_ the "t ^= t << 1" permutation to clear the new tuple test's masses of replicated sign bits.

    But the more we make changes to either, the more work _we_ have to do to ensure we haven't introduced subtle weaknesses. Which the Python test suite will never be substantial enough to verify - we're testing almost nothing about hash behavior. Nor should we: people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests. Python's test suite is more to ensure that known problems _in Python_ don't recur over the decades.

    @tim-one
    Member

    tim-one commented Oct 3, 2018

    Here's a complete xxHash-based implementation via throwing away C++-isms from

    https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

    In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

    The 32-bit build fares worse:

    • 3 collisions in the old tuple test.
    • 51 collisions in the new tuple test.
    • 173 collisions across product([-3, 3], repeat=20)
    • 141 collisions across product([0.5, 0.25], repeat=20)
    • no other collisions

    The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3. So 141 is fine, but 173 is highly suspect. 51 for the new tuple test is way out of bounds.
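    (The quoted numbers are the usual balls-in-bins approximation; counting a "collision" as a key whose hash value was already seen, the mean is about n**2/(2*m) with roughly Poisson spread:)

        from math import sqrt

        def expected_collisions(n_keys, n_buckets):
            mean = n_keys * (n_keys - 1) / (2 * n_buckets)
            return mean, sqrt(mean)

        print(expected_collisions(2**20, 2**32))    # ~ (128.0, 11.3)
        print(expected_collisions(345_130, 2**32))  # ~ (13.9, 3.7), the key count of the new tuple test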

    But I don't much care. None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases. All you have to do is keep trying.

    Much as with SeaHash, adding

    t ^= t << 1;
    

    at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build. It also cuts the product([-3, 3], repeat=20) collisions to 109. But boosts the old tuple test's collisions to 12. I wouldn't do it: it adds cycles for no realistic gain. We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash.

    BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit).

    #define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
    #if Py_NHASHBITS >= 64
    #    define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
    #    define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
    #    define Py_HASH_LSHIFT 31
    #elif Py_NHASHBITS >= 32
    #    define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
    #    define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
    #    define Py_HASH_LSHIFT 13
    #else
    #    error "can't make sense of Py_uhash_t size"
    #endif
    #define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)
    
    static Py_hash_t
    tuplehash(PyTupleObject *v)
    {
        Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
        Py_hash_t y;
        Py_ssize_t len = Py_SIZE(v);
        PyObject **p;
        x = 0x345678UL;
        p = v->ob_item;
        while (--len >= 0) {
            y = PyObject_Hash(*p++);
            if (y == -1)
                return -1;
            t = (Py_uhash_t)y;
            x += t * Py_HASH_MULT2;
            x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
            x *= Py_HASH_MULT1;
        }
        x += 97531UL;
        if (x == (Py_uhash_t)-1)
            x = -2;
        return x;
    }
    #undef Py_NHASHBITS
    #undef Py_HASH_MULT1
    #undef Py_HASH_MULT2
    #undef Py_HASH_LSHIFT
    #undef Py_HASH_RSHIFT

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test. Which it also passed, most recently with 7 collisions. That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

    Note that I'm always considering parametrized versions of the hash functions that I'm testing. I'm replacing the fixed multiplier (all algorithms mentioned here have such a thing) by a random multiplier which is 3 mod 8. I keep the other constants. This allows me to look at the *probability* of passing the testsuite. That says a lot more than a simple yes/no answer for a single test. You don't want to pass the testsuite by random chance, you want to pass the testsuite because you have a high probability of passing it.

    And I'm testing 32-bit hashes indeed since we need to support that anyway and the probability of collisions is high enough to get interesting statistical data.

    For SeaHash, the probability of passing my new tuple test was only around 55%. For xxHash, this was about 85%. Adding some input mangling improved both scores, but the xxHash variant was still better than SeaHash.
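    One way to draw such a parametrized multiplier (my sketch of the setup described, not the actual test harness):

        import random

        def random_multiplier(bits=32):
            """Random full-width multiplier congruent to 3 modulo 8."""
            m = random.getrandbits(bits) | (1 << (bits - 1))   # keep the top bit set
            return (m & ~0x7) | 0x3                            # force the low bits to 0b011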

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    Note: I'm assuming that by "PRIME32_2" you mean 2246822519U

    Yes indeed.

    and that "MULTIPLIER" means 2654435761U.

    No, I mean a randomly chosen multiplier which is 3 mod 8.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests.

    There are hash functions that pass those tests which are still bad in practice when used as tuple hash function. That's really unfortunate, but it's a fact that we need to live with. It means that we may need to do some adjustments to the hash functions.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.

    What I'm doing is the most innocent change: just applying a fixed permutation on the *input* of the hash function. I'm not changing the actual hashing algorithm.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

    That's pretty bad actually. With 64 bits, you statistically expect something on the order of 10**-8 collisions. So what you're seeing is 9 orders of magnitude too many collisions.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 4, 2018

    Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.

    OK, I won't do that. The difference is not that much anyway (it increases the test success rate from about 85% to 90%)

    @tim-one
    Member

    tim-one commented Oct 4, 2018

    Note that I'm always considering parametrized
    versions of the hash functions that I'm testing.
    I'm replacing the fixed multiplier (all algorithms
    mentioned here have such a thing) by a random
    multiplier which is 3 mod 8.

    I've explained before in some detail that this makes NO SENSE for SeaHash: its multiplier was specially constructed to guarantee certain dispersion properties, to maximize the effectiveness of its "propagate right" step.

    You already have my explanation about that, and a link to the original Reddit thread in which it was proposed (and eagerly accepted by SeaHash's primary author). Also already explained that its multiplier appears to fail its design criteria at two specific bit positions. Which I can repair, but I would do so not by changing Python's version, but by bringing it to SeaHash's author's attention.

    OTOH, I haven't been able to find anything explaining why the xxHash authors picked what they picked. But they certainly didn't have "random" in mind - nobody does. You discovered for yourself by trying things that various properties make for bad multipliers, and people who work on these things "for real" knew that a long time ago. They almost certainly know a great deal more that we haven't thought of at all.

    @tim-one
    Member

    tim-one commented Oct 4, 2018

    > In the 64-bit build there are no collisions across my
    > tests except for 11 in the new tuple test.

    That's pretty bad actually. With 64 bits, you statistically
    expect something in the order of 10**-8 collisions. So
    what you're seeing is 9 orders of magnitude too many collisions.

    You wrote the test, so you should remember that you throw away the top 32 bits of the 64-bit hash code :-) Tossing 345130 balls into 2**32 bins has an expected mean of 13.9 collisions and sdev 3.7. 11 collisions is fine.

    If I change the test to retain all the hash bits in the 64-bit build, there are no collisions in the new tuple test.

    If I change it again to retain only the high 32 hash bits, 27 collisions. Which is statistically suspect, but still universes away from "disaster" territory.

    @tim-one tim-one added topic-XML interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-XML labels Oct 4, 2018
    @tim-one
    Member

    tim-one commented Oct 4, 2018

    > people already wrote substantial test suites dedicated
    > to that sole purpose, and we should aim to be "mere
    > consumers" of functions that pass _those_ tests.

    There are hash functions that pass those tests which
    are still bad in practice when used as tuple hash function.

    Really? Which one(s)? If you're talking about that some fail badly when you _replace_ the constants they picked, I doubt they'd accept that as proof of anything. Things like FNV and DJB score poorly on SMHasher to begin with.

    That's really unfortunate, but it's a fact that we need
    to live with.

    I'm surprised they do as well as they do using less than a handful of invertible transformations per input, using state of the same bit width as the inputs.

    I don't expect them to be immune to all easily-provoked "worse than random" cases, though. Over time, these hashes change while keeping the same name. As noted before, this is at least the second version of SeaHash. The best of the original algorithms of this kind - murmurhash - is on its 3rd version now.

    The motivation is the same: someone bumps into real-life cases where they do _way way way_ worse than "random", and so they try to repair that as cheaply as possible.

    Most of these are designed for data centers to do cheap-as-possible reasonably robust fingerprinting of giant data blobs. They could already have used, e.g., SHA-2 for highly robust fingerprinting, but they don't want to pay the very much higher runtime cost.

    If you can provoke any deviation from randomness in that kind of hash, it's considered "broken" and is never used again. But in these far cheaper hashes, it's considered part of the tradeoffs. If you read the Reddit thread about SeaHash I cited before, the person who suggested the current transformations noted that there were still weaknesses in some areas of the input space he was able to find just by thinking about how it worked, but far milder than the weaknesses he found by thinking about how the original version worked.

    That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think about. So it goes.

    It means that we may need to do some adjustments to
    the hash functions.

    I'm fine with the faithful (barring errors on my part) xxHash I posted here before. And I don't care whether it works "better" or "worse" on Python's tiny test suite if its constants are replaced, or its algorithm is "tweaked". When xxHash version 2 is released, I want it to be as mindless as possible to replace the guts of Python's tuple-hash loop with the guts of xxHash version 2.

    It's easy to believe that SMHasher has nothing testing the bit patterns produced by mixing two's-complement integers of similar magnitude but opposite signs. That's not the kind of data giant data centers are likely to have much of ;-) If Appleby can be talked into _adding_ that kind of keyset data to SMHasher, then the next generations of these hashes will deal with it for us. But an objectively small number of collisions more than expected in some such cases now doesn't annoy me enough to want to bother.

    @tim-one
    Member

    tim-one commented Oct 6, 2018

    As a sanity check, I tried the 32-bit version of MurmurHash2 (version 3 doesn't have a 32-bit version). All of my Python tests had collisions, and while the new tuple test dropped to 15, product([0.5, 0.25], repeat=20) skyrocketed from 141 (32-bit xxHash) to 3848.

    I could live with that too, but xxHash does better overall on our tiny tests, and requires less code and fewer cycles.

    Here's what I used:

    static Py_hash_t
    tuplehash(PyTupleObject *v)
    {
        const Py_uhash_t m = 0x5bd1e995;
        const int r = 24;
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
        Py_hash_t y;
        Py_ssize_t len = Py_SIZE(v);
        PyObject **p;
        x = 0x345678UL ^ (Py_uhash_t)len;
        p = v->ob_item;
        while (--len >= 0) {
            y = PyObject_Hash(*p++);
            if (y == -1)
                return -1;
            Py_uhash_t t = (Py_uhash_t)y;
            t *= m;
            t ^= t >> r;
            t *= m;
            x *= m;
            x ^= t;
        }
        x ^= x >> 13;
        x *= m;
        x ^= x >> 15;
        if (x == (Py_uhash_t)-1)
            x = -2;
        return x;
    }

    @tim-one
    Member

    tim-one commented Oct 6, 2018

    Thinking about the way xxHash works prompted me to try this initialization change:

        x = 0x345678UL + (Py_uhash_t)len; // adding in the length is new

    That was a pure win in my tests, slashing the number of collisions in the new tuple test, 32-bit build, from 51 to 17. In the 64-bit build, cut from 11 to 10, but when looking at the high 32 hash bits instead from 27 to 15. There were also small improvements in other 32-bit build collision stats.

    Full-blown xxHash (& SeaHash, and murmurhash, and ...) also fold in the length, but not inside their core loops. It's useful info!

    But they fold it in _after_ their core loops, not before. They're apparently worried about this: if their internal state ever reaches 0, that's a fixed point so long as all remaining incoming data is zeroes (in Python, e.g., picture adding one or more trailing integer or float zeroes or ..., which hash to 0). Folding in the length at the end snaps it slightly out of zeroland.

    I don't really care about that. Folding in the length at the start quickly leads to large differences across iterations for tuples of different lengths (thanks to repeated multiplication and "propagate right" steps), which is especially helpful in things like the nested tuple tests where there's almost as much variation in tuple lengths as there is in the few little integers bottom-level tuples contain.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 7, 2018

    I updated PR 9471 with a tuple hash function based on xxHash. The only change w.r.t. the official xxHash specification is that I'm not using parallelism and just using 1 accumulator. Please have a look.

    @tim-one
    Member

    tim-one commented Oct 7, 2018

    Attaching htest.py so we have a common way to compare what various things do on the tests that have been suggested. unittest sucks for that. doctest too. Here's current code output from a 32-bit build; "ideally" we want "got" values not much larger than "mean" (these are counting collisions):

    range(100) by 3; 32-bit hash codes; mean 116.42 got 0
    -10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 69,596
    -50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 708,066
    0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 875,000
    [-3, 3] by 20; 32-bit hash codes; mean 128.00 got 1,047,552
    [0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 1,048,568
    old tuple test; 32-bit hash codes; mean 7.43 got 6
    new tuple test; 32-bit hash codes; mean 13.87 got 102,922

    And under a 64-bit build, where the full hash code is considered, and also its lowest and highest 32 bits. Note, e.g., that the old tuple test is an utter disaster if we only look at the high 32 bits. Which is actually fine by me - the point of this is to show what happens. Judging is a different (albeit related) issue ;-)

    range(100) by 3; 64-bit hash codes; mean 0.00 got 0
    range(100) by 3; 32-bit lower hash codes; mean 116.42 got 0
    range(100) by 3; 32-bit upper hash codes; mean 116.42 got 989,670
    -10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 69,596
    -10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 69,596
    -10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 101,438
    -50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 708,066
    -50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 708,066
    -50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 994,287
    0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 500,000
    0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 875,000
    0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 989,824
    [-3, 3] by 20; 64-bit hash codes; mean 0.00 got 1,047,552
    [-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 1,047,552
    [-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 1,047,552
    [0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 1,048,544
    [0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 1,048,575
    [0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 1,048,544
    old tuple test; 64-bit hash codes; mean 0.00 got 0
    old tuple test; 32-bit lower hash codes; mean 7.43 got 6
    old tuple test; 32-bit upper hash codes; mean 7.43 got 128,494
    new tuple test; 64-bit hash codes; mean 0.00 got 102,920
    new tuple test; 32-bit lower hash codes; mean 13.87 got 102,922
    new tuple test; 32-bit upper hash codes; mean 13.87 got 178,211

    @tim-one
    Member

    tim-one commented Oct 8, 2018

    Here's htest.py output from

    git pull https://github.com/jdemeyer/cpython.git bpo-34751

    and a variation. The number of collisions in the variation appear in square brackets at the end of each line.

    32-bit build:

    range(100) by 3; 32-bit hash codes; mean 116.42 got 0 [0]
    -10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 0 [0]
    -50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
    0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
    [-3, 3] by 20; 32-bit hash codes; mean 128.00 got 140 [108]
    [0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 99 [154]
    old tuple test; 32-bit hash codes; mean 7.43 got 4 [2]
    new tuple test; 32-bit hash codes; mean 13.87 got 19 [21]

    64-bit build:

    range(100) by 3; 64-bit hash codes; mean 0.00 got 0 [0]
    range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
    range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]
    -10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 0 [0]
    -10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 1 [0]
    -10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 1 [0]
    -50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
    -50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 128 [0]
    -50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 116 [8]
    0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
    0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 123 [258]
    0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 121 [0]
    [-3, 3] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
    [-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 129 [117]
    [-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 137 [115]
    [0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
    [0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 126 [131]
    [0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 137 [130]
    old tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
    old tuple test; 32-bit lower hash codes; mean 7.43 got 12 [5]
    old tuple test; 32-bit upper hash codes; mean 7.43 got 54 [52]
    new tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
    new tuple test; 32-bit lower hash codes; mean 13.87 got 10 [6]
    new tuple test; 32-bit upper hash codes; mean 13.87 got 20 [30]

    They're all fine by me. This is what the variation does:

    1. Removes all "final mix" code after the loop ends.
    2. Changes initialization to add in the length:
        Py_uhash_t acc = _PyHASH_XXPRIME_5 + (Py_uhash_t)len;

    The vast bulk of the "final mix" code applies a chain of tuple-agnostic permutations. The only exception is adding in the length, which #2 moves to the start instead.

    Applying permutations after the loop ends changes nothing about the number of full-width hash collisions, which is Python's _primary_ concern. Two full-width hash codes are the same after the final mix if and only if they're the same before the final mix starts.

    In xxHash they're concerned about getting close to "avalanche" perfection (each input bit affects all final output bits with probability about 0.5 for each). There aren't enough permutations _inside_ the loop to achieve that for the last input or two, so they pile up more permutations after the loop.

    While we do care about "propagate left" and "propagate right" to mix up very regular and/or sparse hash codes, "avalanche perfection" is of no particular value to us. To the contrary, in typical cases like product(range(100), repeat=3) it's better for us not to destroy all the regularity:

    range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
    range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]

    "Avalanche perfection" is what drives the number of collisions up with the final mix code intact. Without it, we get "waaaaaaay better than random" numbers of collisions.

    The other reason it's attractive to drop it: the final mix code is one long critical path (no instruction-level parallelism), adding 8 serialized machine instructions including two multiplies. That's a real expense for typically-short tuples used as dict keys or set elements.

    Nothing is anywhere near disaster territory either way, although "worse than random" remains quite possible, especially when cutting a 64-bit hash in half (do note that, either way, there are no collisions in any test when retaining the full 64-bit hash code).

    @tim-one
    Member

    tim-one commented Oct 9, 2018

    We need to worry about timing too :-( I'm using this as a test. It's very heavy on using 3-tuples of little ints as dict keys. Getting hash codes for ints is relatively cheap, and there's not much else going on, so this is intended to be very sensitive to changes in the speed of tuple hashing:

    def doit():
        from time import perf_counter as now
        from itertools import product
    
        s = now()
        d = dict.fromkeys(product(range(150), repeat=3), 0)
        for k in d:
            d[k] += 1
        for k in d:
            d[k] *= 2
        f = now()
        return f - s

    I run that five times in a loop on a mostly quiet box, and pick the smallest time as "the result".

    Compared to current master, a 64-bit release build under Visual Studio takes 20.7% longer. Ouch! That's a real hit.

    Fiddling the code a bit (see the PR) to convince Visual Studio to emit a rotate instruction instead of some shifts and an add reduces that to 19.3% longer. A little better.

    The variant I discussed last time (add in the length at the start, and get rid of all the post-loop avalanche code) reduces it to 8.88% longer. The avalanche code is fixed overhead independent of tuple length, so losing it is more valuable (for relative speed) the shorter the tuple.

    I can't speed it more. These high-quality hashes have unforgiving long critical paths, and every operation appears crucial to their good results in _some_ test we already have. But "long" is relative to our current tuple hash, which does relatively little to scramble bits, and never "propagates right" at all. In its loop, it does one multiply nothing waits on, and increments the multiplier for the next iteration while the multiply is going on.

    Note: "the real" xxHash runs 4 independent streams, but it only has to read 8 bytes to get the next value to fold in. That can go on - in a single thread - while other streams are doing arithmetic (ILP). We pretty much have to "stop the world" to call PyObject_Hash() instead.

    We could call that 4 times in a row and _then_ start arithmetic. But most tuples that get hashed are probably less than 4 long, and the code to mix stream results together at the end is another bottleneck. My first stab at trying to split it into 2 streams ran substantially slower on realistic-length (i.e., small) tuples - so that was also my last stab ;-)

    I can live with the variant. When I realized we never "propagate right" now, I agreed with Jeroen that the tuple hash is fundamentally "broken", despite that people hadn't reported it as such yet, and despite that that flaw had approximately nothing to do with the issue this bug report was opened about. Switching to "a real" hash will avoid a universe of bad cases, and xxHash appears to be as cheap as a hash without glaringly obvious weaknesses gets: two multiplies, an add, and a rotate.

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 9, 2018

    Changes initialization to add in the length:

    What's the rationale for that change? You always asked me to stay as close as possible to the "official" hash function which adds in the length at the end. Is there an actual benefit from doing it in the beginning?

    @jdemeyer
    Contributor Author

    jdemeyer commented Oct 9, 2018

    I pushed an update at PR 9471. I think I took into account all your comments, except for moving the length addition from the end to the beginning of the function.

    @tim-one
    Member

    tim-one commented Oct 9, 2018

    > Changes initialization to add in the length:

    What's the rationale for that change? You always asked
    me to stay as close as possible to the "official" hash function
    which adds in the length at the end. Is there an actual benefit
    from doing it in the beginning?

    The heart of xxHash is the state-updating code, neither initialization nor finalization. Likewise for SeaHash or any of the others.

    Without the post-loop avalanche code, adding the length at the end has very little effect - it will typically change only the last two bits of the final result. _With_ the avalanche code, the length can affect every bit in the result. But adding it in at the start also achieves that - same as changing any bit in the initial accumulator value.

    Adding it in at the start instead also takes the addition off the critical path. Which may or may not save a cycle or two (depending on processor and compiler), but can't hurt speed.

    I noted before that adding the length at the end _can_ break out of a zero fixed-point (accumulator becomes 0 and all remaining values hash to 0). Adding it at the start loses that.

    So there is a theoretical danger there ... OK, trying it both ways I don't see any significant differences in my test results or a timing difference outside the noise range. So I'm happy either way.

    @jdemeyer
    Contributor Author

    it will typically change only the last two bits of the final result

    Which is great if all that you care about is avoiding collisions.

    @rhettinger
    Contributor

    New changeset aeb1be5 by Raymond Hettinger (jdemeyer) in branch 'master':
    bpo-34751: improved hash function for tuples (GH-9471)

    @jdemeyer
    Contributor Author

    jdemeyer commented Nov 5, 2018

    Many thanks!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022