This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Hash collisions for tuples
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: tim.peters Nosy List: eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Priority: low Keywords: patch

Created on 2018-09-20 13:27 by jdemeyer, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description
htest.py tim.peters, 2018-10-07 21:13
Pull Requests
URL Status Linked
PR 9471 merged jdemeyer, 2018-09-21 12:50
PR 9534 closed jdemeyer, 2018-09-24 13:41
Messages (122)
msg325870 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-20 13:27
It's not hard to come up with a hash collision for tuples:

>>> hash( (1,0,0) )
2528505496374819208
>>> hash( (1,-2,-2) )
2528505496374819208

The underlying reason is that the hashing code mixes ^ and *. This can create collisions because, for odd x, we have x ^ (-2) == -x and this minus operation commutes with multiplication.
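
A minimal check of that identity (the constant here is just the old multiplier, used purely for illustration):

    # For odd x, x ^ -2 == -x, and the sign flip survives multiplication
    # by an odd constant, so it can cancel out on a later xor step.
    mult = 1000003
    for x in (1, 3, 17, -5, 0x345679):
        assert x % 2 == 1                        # x is odd
        assert (x ^ -2) == -x
        assert ((x ^ -2) * mult) == -(x * mult)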

This can be fixed by replacing ^ with +. On top of that, the hashing code for tuples looks needlessly complicated. A simple Bernstein hash suffices.
msg325883 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-09-20 15:06
> It's not hard to come up with a hash collision for tuples:

Nor is it hard to come up with collisions for most simple hash functions.  For typical use cases of tuples, what we have works out fine (it just combines the effects of the underlying hashes).

> The underlying reason is that the hashing code mixes ^ and *. 

Addition also has a relationship to multiplication.   I think the XOR is fine.

> A simple Bernstein hash suffices.

That is more suited to character data and small hash ranges (as opposed to 64-bit).  

>  On top of that, the hashing code for tuples looks 
> needlessly complicated.

The important thing is that it has passed our testing (see the comment above) and that it compiles nicely (see the tight disassembly below).


----------------------

L82:
    xorq    %r14, %rax
    imulq   %rbp, %rax
    leaq    82518(%rbp,%r12,2), %rbp
    movq    %r13, %r12
    movq    %rax, %r14
    leaq    -1(%r13), %rax
    cmpq    $-1, %rax
    je  L81
    movq    %rax, %r13
L73:
    addq    $8, %rbx
    movq    -8(%rbx), %rdi
    call    _PyObject_Hash
    cmpq    $-1, %rax
    jne L82
msg325891 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-20 15:45
> Nor is it hard to come up with collisions for most simple hash functions.

The Bernstein hash algorithm is a simple algorithm for which it can be proven mathematically that collisions cannot be generated if the multiplier is unknown. That is an objective improvement over the current tuple hash. Of course, in practice the multiplier is not secret, but it suffices that it is random enough.

> Addition also has a relationship to multiplication.

Of course, but not in a way that can be used to generate hash collisions. In fact, the simple relation between addition and multiplication makes this an actual provable mathematical statement.

> That is more suited to character data and small hash ranges (as opposed to 64-bit).

Maybe you are thinking of the multiplier 33 which is classically used for character data? If you replace the multiplier 33 by a larger number like _PyHASH_MULTIPLIER == 1000003 (ideally it would be a truly random number), it works fine for arbitrary sequences and arbitrary bit lengths.

> The important thing is that it has passed our testing

That's a silly argument. If it passed testing, it is only because the testing is insufficient.

But really: what I am proposing is a better hash without obvious collisions which won't be slower than the existing hash. Why would you be against that? Why should we "roll our own" hash instead of using a known good algorithm?
msg325936 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-09-21 01:21
ISTM, you're being somewhat aggressive and condescending.  I'll hand you off to Tim who is far better at wrestling over the details :-)

From my point of view, you've made a bold and unsubstantiated claim that Python's existing tuple hash has collision issues with real code.  Only a contrived and trivial integer example was shown.

On my end, I made an honest effort to evaluate the suggestion.  I read your post, looked up Bernstein hash, revisited the existing C code and found that everything it is doing makes sense to me.  There was a comment indicating that empirical tests were run to get the best possible constants.  Also, I disassembled the code to verify that it is somewhat efficiently compiled.

You responded that I'm being "silly" and then you reversed my decision to close the issue (because there is no evidence that the hash is broken).  Accordingly, I'm going to hand it over to Tim who I'm sure will know the history of the existing code, can compare and contrast the various options, evaluate them in the context of typical Python use cases, and assess whether some change is warranted.
msg325949 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 04:46
@jdemeyer, please define exactly what you mean by "Bernstein hash".  Bernstein has authored many hashes, and none on his current hash page could possibly be called "simple":

    https://cr.yp.to/hash.html

If you're talking about the teensy 

    hash = hash * 33 + c;

thing from now-ancient C folklore, Bernstein is claimed to have replaced addition with xor in that later:

    http://www.cse.yorku.ca/~oz/hash.html

In any case, Python _used_ to do plain

    x = (1000003 * x) ^ y;
    
but changed to the heart of the current scheme 14 years ago to address Source Forge bug #942952:
    https://github.com/python/cpython/commit/41bd02256f5a2348d2de3d6e5fdfcaeb2fcaaebc#diff-cde4fb3109c243b2c2badb10eeab7fcd

The Source Forge bug reports no longer appear to exist.  If you care enough, dig through the messages with subject:

    [ python-Bugs-942952 ] Weakness in tuple hash

here:

    https://mail.python.org/pipermail/python-bugs-list/2004-May/subject.html

The thrust was that the simpler approach caused, in real life, hashes of nested tuples of the form

    (X, (X, Y))

to return hash(Y) - the hash(X) parts across calls magically xor'ed out.

It was also systematically the case that

   (X, (Y, Z))

and

   (Y, (X, Z))

hashed to the same thing.
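
These cancellations are easy to reproduce with a toy model of that old scheme (x = (1000003*x) ^ y, ignoring the final length mixing; the element hashes below are arbitrary stand-ins):

    M, SEED, MASK = 1000003, 0x345678, (1 << 64) - 1

    def old_tuple_hash(item_hashes):
        x = SEED
        for h in item_hashes:
            x = ((M * x) ^ h) & MASK
        return x

    hX, hY, hZ = 1111, 2222, 3333
    # (X, (X, Y)) reduces to hash(Y):
    assert old_tuple_hash([hX, old_tuple_hash([hX, hY])]) == hY
    # (X, (Y, Z)) and (Y, (X, Z)) collide:
    assert (old_tuple_hash([hX, old_tuple_hash([hY, hZ])]) ==
            old_tuple_hash([hY, old_tuple_hash([hX, hZ])]))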

So the complications were added for a reason, to address the showed-up-in-real-life pathological cases discussed in the bug report already linked to.  Since I don't believe we've had other reports of real-life pathologies since, nobody is going to change this again without truly compelling reasons.  Which I haven't seen here, so I'm closing this report again.

BTW, as also explained in that report, the goal of Python's hash functions is to minimize collisions in dicts in real life.  We generally couldn't care less whether they "act random", or are hard to out-guess.  For example,

     assert all(hash(i) == i for i in range(1000000))

has always succeeded.
msg325950 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-09-21 04:51
(FWIW, I believe the Source Forge bug reports were forward-ported to bugs.python.org.  This looks like the issue in question: https://bugs.python.org/issue942952 )
msg325953 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 04:55
Ah!  I see that the original SourceForge bug report got duplicated on this tracker, as issue #942952.  So clicking on that is a lot easier than digging thru the mail archive.

One message there noted that replacing xor with addition made collision statistics much worse for the cases at issue.
msg325955 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 05:54
> ISTM, you're being somewhat aggressive and condescending. 

Interestingly, I felt the same with your reply, which explains my rapid reversion of your closing of the ticket. The way I see it: I pointed out a bug in tuple hashing and you just closed that, basically claiming that it's not a bug.

I still don't understand why you are closing this so quickly. To put it differently: if I can improve the hash function, removing that obvious collision, extending the tests to catch this case, while not making it slower, why shouldn't this be done?
msg325956 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 05:57
> Since I don't believe we've had other reports of real-life pathologies since

You just did in this bug report. Why doesn't that count?
msg325957 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 06:06
@jdemeyer, you didn't submit a patch, or give any hint that you _might_.  It _looked_ like you wanted other people to do all the work, based on a contrived example and a vague suggestion.

And we already knew from history that "a simple Bernstein hash suffices" is false for tuple hashing (for one possible meaning of "a simple Bernstein hash" - we had one of those 14 years ago).

Neither Raymond nor I want to spend time on this ourselves, so in the absence of somebody else volunteering to do the work, there's no reason to keep the report open uselessly.

If you're volunteering to do the work ... then do the work ;-)
msg325959 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 06:10
To come back to the topic of hashing, with "Bernstein hash" for tuples I did indeed mean the simple (in Python pseudocode):

# t is a tuple
h = INITIAL_VALUE
for x in t:
    z = hash(x)
    h = (h * MULTIPLIER) + z
return h

I actually started working on the code and there is indeed an issue with the hashes of (X, (Y, Z)) and (Y, (X, Z)) being equal that way. But that can be solved by replacing "hash(x)" in the loop by something slightly more complicated. After thinking about it, I decided to go for shifting higher to lower bits, so replacing z by z^(z >> 32) or something like that.

This also has the benefit of mixing the high-order bits into the low-order bits, which is an additional improvement over the current code.

Since I already wrote the code anyway, I'll create a PR for it.
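
In Python, the variant sketched above would look roughly like this (placeholder constants; the actual values are in the PR):

    MASK = (1 << 64) - 1
    INITIAL_VALUE = 0x345678      # placeholder, not the PR's constant
    MULTIPLIER = 1000003          # placeholder, not the PR's constant

    def tuple_hash_sketch(t):
        h = INITIAL_VALUE
        for x in t:
            z = hash(x) & MASK
            z ^= z >> 32          # fold high-order bits into the low-order bits
            h = (h * MULTIPLIER + z) & MASK
        return h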
msg325960 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 06:10
You said it yourself:  "It's not hard to come up with ...".  That's not what "real life" means.  Here:

>>> len(set(hash(1 << i) for i in range(100_000)))
61

Wow!  Only 61 hash codes across 100 thousand distinct integers?!

Yup - and I don't care.  I contrived the case, just like you apparently contrived yours.
msg325964 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 06:18
For the record: my collision is not contrived. It came up in actual code where I was hashing some custom class using tuples and found an unexpected high number of collisions, which I eventually traced back to the collision I reported here.

By the way, I think the worst real-life hash collision is

>>> hash(-1) == hash(-2)
True
msg325981 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2018-09-21 10:17
I agree with Raymond and Tim here: while it's inevitably true that there are many possible hash collisions, we'd need to see where the current algorithm caused real problems in real code. Pointing out a few examples where there was a collision isn't very helpful.

So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes.  What were the inputs to hash() that caused a problem, and how did that problem manifest itself?
msg325982 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 10:34
> So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes.  What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

In all honesty, I don't remember. This was a while ago and at that time I didn't care enough to put up a CPython bug report. Still, this is a collision for tuples of short length (3) containing small integers (0 and -2). Why do you find that contrived?

What prompted me to report this bug now anyway is that I discovered bad hashing practices for other classes too. For example, meth_hash in Objects/methodobject.c simply XORs two hashes and XOR tends to suffer from catastrophic cancellation. So I was hoping to fix tuple hashing as an example for other hash functions to follow.
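
A tiny illustration of that XOR weakness (just the combining step, not the meth_hash code itself):

    # Bare XOR collapses whole families of inputs: equal halves always
    # give 0, and swapping the halves never changes the result.
    combine = lambda h1, h2: h1 ^ h2
    assert combine(hash("a"), hash("a")) == 0
    assert combine(hash("a"), hash("b")) == combine(hash("b"), hash("a"))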
msg326016 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 18:46
I should also make it clear that the collision hash((1,0,0)) == hash((1,-2,-2)) that I reported is due to the algorithm, it's not due to some bad luck that 2 numbers happen to be the same. There are also many more similar hash collisions for tuples (all involving negative numbers) due to the same underlying mathematical reason.

I still don't understand why you don't consider it a problem. It may be a tiny problem, not really affecting the correct functioning of CPython. But, if we can fix this tiny problem, why shouldn't we?
msg326018 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2018-09-21 19:01
I'm not one to judge the effectiveness of a new hash algorithm. But my personal concern here is making something else worse: an unintended consequence. IMO the bar for changing this would be very high, and at the least it would need to be addressing a known, not theoretical, problem.

That said, I'll let the experts debate your proposed change.
msg326021 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-09-21 19:26
ISTM the collision of "hash((1,0,0)) == hash((1,-2,-2))" also stems from integers hashing to themselves and that twos-complement arithmetic relation, -x == ~x+1.  Further, that example specifically exploits that "hash(-1) == hash(-2)" due to how error codes are handled.  In a way, tuple hashing is incidental.

Occasional collisions aren't really a big deal -- it is normal for dicts to have some collisions and to resolve them.  I would be more concerned if non-contrived datasets had many elements that gave exactly the same hash value resulting in a pile-up in one place.

FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

    >>> from itertools import product
    >>> len(set(map(hash, product(range(100), repeat=4))))
    100000000

The current hash is in the pretty-darned-good category and so we should be disinclined to change battle-tested code that is known to work well across a broad range of use cases.
msg326023 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 19:38
> Further, that example specifically exploits that "hash(-1) == hash(-2)"

No, it does not. There is no hashing of -1 involved in the example hash((1,0,0)) == hash((1,-2,-2)).
msg326024 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 19:44
For me, it's largely because you make raw assertions with extreme confidence that the first thing you think of off the top of your head can't possibly make anything else worse.  When it turns out it does make some things worse, you're equally confident that the second thing you think of is (also) perfect.

So far we don't even have a real-world test case, let alone a coherent characterization of "the problem":

"""
all involving negative numbers) due to the same underlying mathematical reason.
"""

is a raw assertion, not a characterization.  The only "mathematical reason" you gave before is that "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m".  Which is true, and a good observation, but doesn't generalize as-is beyond -2.

Stuff like this from the PR doesn't inspire confidence either:

"""
MULTIPLIER = (1000003)**3 + 2 = 1000009000027000029: the multiplier should be big enough and the original 20-bit number is too small for a 64-bit hash. So I took the third power. Unfortunately, this caused a sporadic failure of the testsuite on 32-bit systems. So I added 2 which solved that problem.
"""

Why do you claim the original was "too small"?  Too small for what purpose?  As before, we don't care whether Python hashes "act random".  Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat?  What was _the cause_ of the "sporadic failure" (whatever that means), and why did adding 2 fix it?  Why isn't there a single word in _the code_ about where the mystery numbers came from?:

You're creating at least as many mysteries as you're claiming to solve.

We're not going to change such heavily used code on a whim.

That said, you could have easily enough _demonstrated_ that there's potentially a real problem with a mix of small integers of both signs:

>>> from itertools import product
>>> cands = list(range(-10, -1)) + list(range(9))
>>> len(cands)
18
>>> _ ** 4
104976
>>> len(set(hash(t) for t in product(cands, repeat=4)))
35380

And that this isn't limited to -2 being in the mix (and noting that -1 wasn't in the mix to begin with):

>>> cands.remove(-2)
>>> len(cands) ** 4
83521
>>> len(set(hash(t) for t in product(cands, repeat=4)))
33323

If that's "the problem", then - sure - it _may_ be worth addressing.  Which we would normally do by looking for a minimal change to code that's been working well for over a decade, not by replacing the whole thing "just because".

BTW, continuing the last example:

>>> c1 = Counter(hash(t) for t in product(cands, repeat=4))
>>> Counter(c1.values())
Counter({1: 11539, 2: 10964, 4: 5332, 3: 2370, 8: 1576, 6: 1298, 5: 244})

So despite that there were many collisions, the max number of times any single hash code appeared was 8.  That's unfortunate, but not catastrophic.

Still, if a small change could repair that, fine by me.
msg326026 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 19:48
Oops!

"""
"j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m"
"""

The tail end should say "m*(j^(-2)) == -m*j" instead.
msg326029 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 20:07
> FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

    >>> from itertools import product
    >>> len(set(map(hash, product(range(100), repeat=4))))
    100000000

FWIW: the same is true for my new hash function
msg326032 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 20:27
Concerning the MULTIPLIER:

> Why do you claim the original was "too small"?  Too small for what purpose?

If the multiplier is too small, then the resulting hash values are small too. This causes collisions to appear for smaller numbers:

BEFORE:
>>> L = [(a,b) for a in range(2) for b in range(2**21)]
>>> len(L), len(set(hash(x) for x in L))
(4194304, 2097152)

AFTER:
>>> L = [(a,b) for a in range(2) for b in range(2**21)]
>>> len(L), len(set(hash(x) for x in L))
(4194304, 4194304)

Again, not a huge problem, but at least there is room for improvement.

> Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat?  What was _the cause_ of the "sporadic failure" (whatever that means)

With "sporadic failure", I mean a hash collision not due to the algorithm but due to two numbers which happen to be the same. This might not even affect actual usage, but it did cause the testsuite to fail on 32 bit machines (48 collisions if I recall correctly while only 20 are allowed).

> and why did adding 2 fix it?

Because it's really just random chance. Some constants just happen to produce more collisions on the specific testsuite input. It should also be said that, due to the structure of that input, collisions are not independent. For example, if hash((a,b,c)) == hash((d,e,f)), then also hash((a,b,c,x)) == hash((d,e,f,x)).

> Why isn't there a single word in _the code_ about where the mystery numbers came from?

Ideally, it should be just a random number. I can fix that by using a "Nothing up my sleeve number".
msg326047 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-21 21:50
I pushed a minor update to the PR, changing the MULTIPLIER and explaining the chosen constants.
msg326067 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-22 03:47
>> Why do you claim the original was "too small"?  Too small for
>> what purpose?

> If the multiplier is too small, then the resulting hash values are
> small too. This causes collisions to appear for smaller numbers:

All right!  An actual reason ;-)  So there are two things so far I clearly agree with, but they're independent of "the problem" this report was opened about:

1. On a 64-bit box life would be better if we picked a bigger multiplier.  This should be done via preprocessor #if selection, though, not via the inscrutable rules C uses to cut back an integer literal out of range for the type of the variable it's assigned to.

2. A raw integer hash of -1 should be changed to something other than -2.  It was always a Poor Idea to pick an integer of small magnitude for the substitute.  Some newer types didn't repeat this mistake.  For example, frozensets replace a raw hash of -1 with 590923713.  Alas, this is hard to change now, because hashable objects of any type that compare equal to -1 also need to return the same substitute value (e.g., hash(-1.0) == hash(decimal.Decimal("-100e-2")) == hash(-1+0j) == -2).  So I'm afraid it's been hard-coded into user-defined numeric classes too :-(  There would be no obvious problem in changing the _tuple_ hash to use a different substitute value, although it's not clear that would buy anything either.  hash(-1) == -2 was the original sin.
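
The cross-type constraint is easy to check interactively (on CPython):

    >>> import decimal
    >>> hash(-1) == hash(-1.0) == hash(-1+0j) == hash(decimal.Decimal("-100e-2")) == -2
    True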


[about multipliers]
> Because it's really just random chance.
> ...
> Ideally, it should be just a random number.

How do you know this?  I'm not aware of any research papers about picking multipliers in this context, but would love to see one.

I am aware of much research about multipliers in the somewhat related contexts of linear congruential pseudo-random number generators, and of "multiplicative hashing", and "any random multiplier is fine" couldn't be further from the truth _in_ those contexts.

Guido picked a prime to begin with (actually, at first it was 3(!)), and comments in the current code sure say that whoever added this part was keen on primes too:

"""
/* The addend 82520, was selected from the range(0, 1000000) for
   generating the greatest number of prime multipliers for tuples
   up to length eight:
"""

I don't know that primes are important here, but neither do I know that they're _not_ important here.  Widely used production code isn't the place to conduct experiments, so the status quo is favored in the absence of truly compelling reasons.  Just repeating the raw assertion "random is fine" isn't compelling.  If it were true, we could, e.g., multiply by a large power of 2 via shifting instead, and save a few cycles.

For my best guess at what "the problem" here actually is, in the

>>> cands = list(range(-10, -1)) + list(range(9))
>>> len(set(hash(t) for t in product(cands, repeat=4)))

example, which currently (on my 64-bit box) generates 35,380 distinct hash codes from the 104,976 inputs, ALL the collisions go away merely by changing the character "^" to "+" in the current code.

Which was your original suggestion.  Which you appear to be opposed to now?  I'm unclear about why, if so.

That's a change I could probably live with, although I have no strong reason to believe it couldn't cause problems for applications that currently work fine.  I don't have more time to think about that now, though.
msg326081 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-22 07:36
> I don't know that primes are important here, but neither do I know that they're _not_ important here.

Hashes are effectively computed modulo 2**N. "Primes" are meaningless in that setting (except for the prime 2 which does have a meaning). For example, 1000003 is prime but 1000003 + 2**64 is not. But these represent the same number modulo 2**64. Also, multiplication by any odd number is a permutation modulo 2**N, so every odd number is invertible.
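
That permutation claim is cheap to verify at a small width (a check, not a proof):

    >>> N = 16
    >>> all(len({(i * m) % 2**N for i in range(2**N)}) == 2**N for m in (3, 33, 1000003))
    True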
msg326083 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-22 08:03
> Which was your original suggestion.  Which you appear to be opposed to now?  I'm unclear about why, if so.

I'm not strictly opposed to that. It's just that I have less confidence in the current ad-hoc hash compared to something following the DJBX33A scheme.
msg326086 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-22 09:12
> I'm not aware of any research papers about picking multipliers in this context, but would love to see one.

The only real condition that I can think of is that the order should be large: we do not want MULTIPLIER**n = 1 (mod 2**N) for a small number n.

Other than that, we could pick the multiplier to guarantee no hash collisions on some chosen subset of inputs. A bit like your product(range(100), repeat=4) example but then for more inputs.

If you agree with this approach, I could try to find a good multiplier this way.
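
A sketch of that order check: the order of any odd m modulo 2**N is a power of two, so repeated squaring finds it quickly.

    def order_mod_2k(m, k):
        # m must be odd; the largest possible order modulo 2**k is 2**(k-2)
        acc, e = m % 2**k, 0
        while acc != 1:
            acc = (acc * acc) % 2**k
            e += 1
        return 2**e

    print(order_mod_2k(1000003, 64))   # 1000003 == 3 (mod 8), so this is the maximum, 2**62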
msg326110 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-22 19:26
So you don't know of any directly relevant research either.  "Offhand I can't see anything wrong" is better than nothing, but very far from "and we know it will be OK because [see references 1 and 2]".

That Bernstein's DJBX33A has been widely used inspires no confidence whatsoever.  As already explained, Python _did_ use DJBX33X (with a different multiplier), and it systematically screwed up catastrophically in some nested tuple cases already spelled out.  Bernstein himself moved to DJBX33X (using xor instead of addition), and that would have screwed up too on a mix of smallish integers of both signs.

What Python does now, except for changing the multiplier, is essentially version 1a of the _also_ very widely used - but more recent - Fowler/Noll/Vo hash family[1]:

    # Version 1a, recommended over version 1 (which does * before ^).
    hash = offset_basis
    for each octet_of_data to be hashed
        hash = hash xor octet_of_data
        hash = hash * FNV_prime
    return hash

They did extensive studies, and very deliberately use a prime multiplier subject to a number of other constraints.  Not based on "offhand I can't see why not", but based on analysis and running extensive tests.

But, same as with Bernstein's variants (which predate FNV's work on picking "good" multipliers), they were developed in the context of hashing sequences of unsigned 8-bit integers.  There's no a priori reason I can see to expect them to work well in contexts other than that.  Indeed, FNV puts special constraints on the last 8 bits of the primes they pick, presumably because they're only concerned with hashing sequences of 8-bit values.

FYI, for 32-bit hashes they use multiplier 16777619, and for 64-bit 1099511628211.

In the absence of coherent analysis directly relevant to what Python actually needs here, we're all flying blind.  What we do have is over a decade of positive real-world experience with the made-up hacks Python used to worm around a class of gross flaws its prior DJBX33X approach suffered, taking DJBX33X out of its original context and applying it in an area it wasn't designed for.  Which made it close to FNV-1a, but also in a context _that_ wasn't designed for.

However, it _is_ structurally close to FNV-1a, and using prime multipliers was a big deal to them.  "But offhand I don't see why" isn't a good enough reason to abandon that.  To the contrary, in the absence of _proof_ that it doesn't actually matter, I'd be happiest using the same multipliers (given above) and initial values "the standard" FNV-1a algorithms use.

[1] http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a
msg326115 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-22 21:20
Thanks for the reference, I never heard of the FNV hash. Unfortunately, I haven't seen a reference for the rationale of how they pick their multiplier.

It's not clear what you are suggesting though. Keep using the FNV-ish hash somehow? Or using a DJBX33A hash with the FNV multiplier?
msg326117 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-22 21:38
> the made-up hacks Python used to worm around a class of gross flaws its prior DJBX33X approach suffered, taking DJBX33X out of its original context and applying it in an area it wasn't designed for.

But we know why the DJBX33A hash didn't work (nested tuples), so we can think of the best way to solve that. Python messes with the multiplier, which makes it quite a different hash. Surely, if you believe that the precise choice of multiplier matters a lot, then you should also agree that arbitrarily changing the multiplier in the loop is a bad idea.

My proposal instead is to keep the structure of the DJBX33A hash but change the hash of the individual items to be hashed. That's a much less invasive change to the known algorithm.

Finally, something that I haven't mentioned here: an additional advantage of my approach is that high-order bits become more important:

BEFORE:
>>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
500000

AFTER:
>>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
1000000

Again, I'm not claiming that this is a major issue. Just additional evidence that maybe my new hash might actually be slightly better than the existing hash.
msg326118 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-09-23 00:01
I don't think any change should be made unless we agree that there is a real problem to be solved rather than because the OP is brazenly insistent on forcing through some change.  We shouldn't feel shoved into altering something that we don't agree is broken and into replacing it with something that we're less sure about.  Also, I'm concerned that that patch drops features like the post-addition of a constant and incorporating length signature -- it is as if we're starting from scratch rather than keeping the incremental improvements made over decades.  AFAICT, the only motivation for change is that the OP is relentless; otherwise, none of us would be giving this further thought.  I echo Eric's concern that it is easy to inadvertently make Python worse.
msg326119 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-23 00:16
I strive not to believe anything in the absence of evidence ;-)

FNV-1a supplanted Bernstein's scheme in many projects because it works better.  Indeed, Python itself used FNV for string hashing before the security wonks got exercised over collision attacks.  It worked great.  But "better" depends on what you value.  For example, FNV-1a has far better "avalanche" behavior than Bernstein[1].

Please note that I don't _object_ to your code!  It may work fine.  Or it may not in some cases.  The problem is that we have no way to tell.  There's no theory here, just misleading appeals to experience with the algorithms in contexts they were _intended_ to be used in.  Nobody expected some patterns of nested tuples to fail catastrophically before, and nobody expected mixed-sign integers to lead to poor (but not catastrophic) behavior after the former was "repaired".  Nobody now expects the next 10 catastrophes either.

We can only rely on contrived tests and bug reports.  Python's current scheme is unbeatable on that count, because it's the only scheme for which over a decade of experience _in context_ exists.

Which experience says there are no known catastrophically bad real cases.  Which is worth a whole lot in a project that's so widely used now.

That said, the "repair" before was at least as much pure hackery as your proposed hackery, and - I agree - completely undercut the _theory_ FNV was based on (although, to be fair, I doubt anyone else _knew_ they were re-inventing a damaged FNV-1a at the time).

So ... since FNV-1a is in fact better in its intended context than the Bernstein one-liners in the same context, how about adding your

    t += (t >> (4 * sizeof(t)));

hack to the _current_ code (& delete the code changing the multiplier)?  Then we could switch to the "official" FNV 32-bit and 64-bit multipliers too, and know at least that "almost everything" about the resulting algorithm is known to work exceedingly well in its original context.

[1] https://sites.google.com/site/murmurhash/avalanche
msg326120 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-23 00:27
Raymond, I share your concerns.  There's no reason at all to make gratuitous changes (like dropping the "post-addition of a constant and incorporating length signature"), apart from that there's no apparent reason for them existing to begin with ;-)

Unintended consequences can abound.

Still, the last repair was pretty hacky, and a very well-known and highly respected algorithm (FNV-1a) _is_ hiding under it.  I would like to pursue seeing whether a more direct form of FNV-1a can be rehabilitated to worm around the known problematic cases.  That would also, if successful, give us a principled way to pick a better multiplier for 64-bit boxes.
msg326147 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-23 08:29
> We shouldn't feel shoved into altering something that we don't agree is broken

It *is* broken. You are just denying the reality.

That's also the reason that I'm insisting so much: I don't want to push my personal fix (despite that it may seem so), I want to fix a bug.
msg326173 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-23 19:13
Oh, I don't agree that it's "broken" either.  There's still no real-world test case here demonstrating catastrophic behavior, neither even a contrived test case demonstrating that, nor a coherent characterization of what "the problem" is.

I'm nevertheless open to making improvements, but keeping in mind foremost that _all_ changes are potentially damaging to patterns of data we know nothing about precisely because they've been working fine for many years.  So, e.g., there's no chance that gratuitous changes will be accepted.  For example, don't _like_ adding 97531UL at the end?  Tough luck - it's staying, and it's not worth one word of argument.

To my eyes, the _strongest_ case in all these messages is for boosting the multiplier size on 64-bit boxes.  That's principled and well-motivated.  Python itself changed the multiplier from 3 to 1000003 long ago for the same reasons.  But that apparently has nothing to do with "the problem" this report was opened about ;-)
msg326193 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-09-24 00:15
> To my eyes, the _strongest_ case in all these messages is 
> for boosting the multiplier size on 64-bit boxes.  That's
> principled and well-motivated.

Yes, that would be perfectly reasonable (though to some extent the objects in the tuple also share some of the responsibility for getting all bits into play).
msg326194 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-24 00:33
Has anyone figured out the real source of the degeneration when mixing in negative integers?  I have not.  XOR always permutes the hash range - it's one-to-one.  No possible outputs are lost, and XOR with a negative int isn't "obviously degenerate" in general:

>>> for i in range(-16, 17):
...    print(i, i ^ -5)
-16 11
-15 10
-14 9
-13 8
-12 15
-11 14
-10 13
-9 12
-8 3
-7 2
-6 1
-5 0
-4 7
-3 6
-2 5
-1 4
0 -5
1 -6
2 -7
3 -8
4 -1
5 -2
6 -3
7 -4
8 -13
9 -14
10 -15
11 -16
12 -9
13 -10
14 -11
15 -12
16 -21

Clear enough? "Distinct integers in, distinct integers out", but with the twist that the sign bit flips.  That's _seemingly_ minor to me.  Yet it has massive effects on tuple hash distribution.  Here's a function to show that:

    def doit(cands, repeat=4):
        from collections import Counter
        from itertools import product

        print(len(cands), "cands **", repeat, "=", len(cands)**repeat)
        c1 = Counter(map(hash, product(cands, repeat=repeat)))
        print(len(c1), "distinct hash codes")
        c2 = Counter(c1.values())
        for dups in sorted(c2):
            print(dups, c2[dups])

Then an example we're proud of:

>>> doit(range(100))
100 cands ** 4 = 100000000
100000000 distinct hash codes
1 100000000

Much the same, mixing in negative ints, but _excluding_ the problematic -1 and -2 inputs:

>>> cands = list(range(-50, 51))
>>> cands.remove(-1) # hashes to -2
>>> cands.remove(-2) # for j odd, j ^ -2 == -j
>>> doit(cands)
99 cands ** 4 = 96059601
15736247 distinct hash codes
1 70827
2 1005882
3 228578
4 5000424
5 19728
6 1047762
8 8363046

What accounts for such a massive difference?  It's not merely that we're using negative ints:

>>> doit(range(-101, -1)) # leave -2 in, for a change
100 cands ** 4 = 100000000
100000000 distinct hash codes
1 100000000

So, on their own, negative ints are as spectacularly well-behaved in this range as positive ints, and including -2 creates no problems.

I suspect it's related to that x ^ -(x + 1) == -1, but not in a way obvious enough to be ... well, obvious ;-)
msg326198 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-24 00:41
[Raymond, on boosting the multiplier on 64-bit boxes]
> Yes, that would be perfectly reasonable (though to some
> extent the objects in the tuple also share some of the
> responsibility for getting all bits into play).

It's of value independent of that.  Tuples of ints used as keys and set elements are very important, and I doubt we'll ever give up on that `hash(i) == i` for the typically "not huge" ints used in such contexts.  Jeroen gave a reasonable example of how boosting the multiplier can help in a case of that above:

https://bugs.python.org/msg326032
msg326200 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-24 04:16
FYI, using this for the guts of the tuple hash works well on everything we've discussed.  In particular, no collisions in the current test_tuple hash test, and none either in the cases mixing negative and positive little ints.  This all remains so using the current multiplier, or "the standard" FNV-1a multipliers 16777619UL (32 bit) and 1099511628211UL (64 bit).  I've done no other tests:

    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        Py_uhash_t t = (Py_uhash_t)y;
        t ^= t << 7;
        x = (x ^ t) * mult;
    }

Note that the multiplier doesn't change anymore.  The twist is adding

		t ^= t << 7;

to _permute_ the native hash space (stuff like "t += t >> 16" is a many-to-one function, not a permutation).  Other than that, it's exactly FNV-1a.  I don't know that 7 is especially magical - it's just the first thing I tried.

In the absence of a real analysis, the intuition is simply that "t ^= t << 7" will clear masses of leading sign bits when hashing "small" negative integers.  For whatever reason(s), they appear to be the cause of the troubles.

However, since that line of code permutes the space, exactly the same statistics can be provoked by some other inputs.
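
The permutation-vs-many-to-one distinction is easy to see in a small-width model of the unsigned arithmetic:

    w, s = 16, 7
    mask = (1 << w) - 1
    xor_shift = {(x ^ (x << s)) & mask for x in range(1 << w)}
    add_shift = {(x + (x >> s)) & mask for x in range(1 << w)}
    assert len(xor_shift) == 1 << w    # t ^= t << s permutes the space
    assert len(add_shift) < 1 << w     # t += t >> s loses some outputs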
msg326203 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-24 05:50
BTW, those tests were all done under a 64-bit build.  Some differences in a 32-bit build:

1. The test_tuple hash test started with 6 collisions.  With the change, it went down to 4.  Also changing to the FNV-1a 32-bit multiplier boosted it to 8.  The test passes in all those cases.

2. For 4-tuples taken from range(-50, 51) omitting -1 and -2:  massive number of collisions under the current code.  Died with MemoryError with the change - ran out of memory to build the Counter recording all the distinct hash codes.

Changing it to 3-tuples instead:  somewhere around 140 collisions with the change, and also changing to the FNV-1a 32-bit multiplier eliminated all collisions.

All of this was done under Win 10, using Visual Studio 2017.
msg326212 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-24 09:45
> Has anyone figured out the real source of the degeneration when mixing in negative integers?

The underlying reason for the collisions is the following mathematical relation:

x ^ -x = -2 << i

where i is the index of the smallest set bit of x. In particular, x ^ -x = -2 for odd x.

Now consider two consecutive hash iterations:

y = (x ^ a) * m1
z = (y ^ b) * m2

and suppose that x ^ a is odd. Now if we replace a by a ^ -2, then x ^ a is replaced by -(x ^ a) and y is replaced by -y. Since x ^ a and m1 are odd, y is odd, so -y = y ^ -2. If we also replace b by b ^ -2, then y ^ b is unchanged: (-y) ^ (b ^ -2) = (y ^ -2) ^ (b ^ -2) = y ^ b. In other words, we have a collision.

This kind of bad interaction between ^ and * only happens with the FNV-style hashing. The Bernstein hash using + instead of ^ does not suffer from this problem. That makes it a better choice in my opinion.
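
A concrete instance of that cancellation, with plain Python ints standing in for the hash words (arbitrary odd multipliers):

    m1, m2 = 1000003, 1082523
    def fold2(x, a, b):
        y = (x ^ a) * m1
        return (y ^ b) * m2

    x, a, b = 0x345679, 4, 10          # x ^ a is odd
    assert fold2(x, a, b) == fold2(x, a ^ -2, b ^ -2)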
msg326214 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-24 09:52
While writing up the analysis above, it occurred to me that collisions already happen for 2-tuples:

>>> hash((3, -2)) == hash((-3, 0))
True

These kinds of 2-tuples of small integers don't look contrived at all. I can easily see them appearing, in mathematical applications for example.

As for real-world usage: the only thing that I can say is that I discovered these hash collisions a while ago, while working on SageMath. I was testing the hash for a custom class and I found collisions, which I traced back to collisions for tuples.

In any case, it is hard to find real-world problems where a bad hash really matters, since Python works fine with a broken hash too.
msg326217 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-24 10:01
> stuff like "t += t >> 16" is a many-to-one function, not a permutation

Yes, I am aware of that. However, the number of collisions here is really quite small. It's very unlikely to hit one by accident.

I also chose >> over << for two reasons:

1. It brings the high-order in play: https://bugs.python.org/msg326117

2. It avoids collisions on the low-order bits: when you do t ^= t << 7, then you are not changing the lower 7 bits at all. So applications using hash(x) % 128 will still see all the problems that we are trying to fix.
msg326218 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-24 10:33
> For example, FNV-1a has far better "avalanche" behavior than Bernstein

We don't care about that though. We want to have no collisions, not a random output.
msg326219 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-24 10:39
> In the absence of a real analysis, the intuition is simply that "t ^= t << 7" will clear masses of leading sign bits when hashing "small" negative integers.

That's a clever solution. If you want to go that route, I would rather suggest t ^= t << 1 which is essentially the same fix but which has probably less collisions on the low-order bits.
msg326236 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-24 13:48
I created a new PR based on Tim's t ^= t << 7 idea, except that I'm using << 1 instead, to have more mixing in the lower bits.

With the standard FNV multiplier on 64 bits, I did get collisions while testing. I haven't figured out exactly why these occurred, but it's probably due to the high number of 0 bits. Instead, I chose 3**41 as multiplier. But of course, there are still plenty of bikeshedding opportunities for the multiplier...
msg326278 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-24 18:05
> when you do t ^= t << 7, then you are not changing
> the lower 7 bits at all.

I want to leave low-order hash bits alone.  That's deliberate.

The most important tuple component types, for tuples that are hashable, are strings and contiguous ranges of "not huge" ints.  The current string hash works hard to "act randomly" in every bit position - there's no point in trying to "improve" anything about the hashes it already produces.

In contrast, the hashes of any contiguous range of N not-huge integers (excluding -1) are already, by construction, guaranteed to have no collisions at all in their low-order (roughly) log2(N) bits.  They can't be improved in this respect, because they're already optimal in this respect.

So, if anything, I'd look at increasing the left shift rather than reducing it.  The low bits are already un-improvable in the most important cases.


> So applications using hash(x) % 128 will still see
> all the problems that we are trying to fix.

?  The only "problem" I've seen identified was in mixing positive and negative integers.  Here on a 32-bit build with the FNV-1a 32-bit multiplier:

>>> from itertools import product
>>> from collections import Counter
>>> cands = list(range(-50, 51))
>>> cands.remove(-1)
>>> c = Counter()
>>> for t in product(cands, repeat=4):
...     c[hash(t) & 0x7f] += 1
>>> len(c)
128

So all 128 lower 7-bit patterns showed up.  And they're quite evenly distributed:

>>> c2 = Counter(c.values())
>>> for k in sorted(c2):
...     print(k, c2[k])
...
781202 1
781207 1
781209 2
781212 1
781213 2
781214 1
781215 4
781221 3
781222 1
781225 3
781226 1
781227 3
781228 2
781229 2
781230 1
781231 1
781232 2
781233 3
781234 1
781235 4
781236 2
781237 1
781238 2
781240 5
781241 6
781242 1
781243 1
781244 1
781245 1
781247 1
781248 2
781251 2
781252 4
781253 3
781254 5
781255 2
781256 2
781257 3
781259 2
781260 1
781261 1
781262 1
781263 2
781264 4
781265 2
781266 1
781267 1
781268 4
781269 1
781270 1
781271 2
781272 1
781274 2
781275 1
781276 1
781278 1
781280 1
781281 2
781282 2
781285 1
781286 2
781288 1
781295 1
781297 2
781301 1
781302 1
781304 1
781307 1

> With the standard FNV multiplier on 64 bits, I did
> get collisions while testing.

When testing what, specifically?  And the standard 32-bit FNV multiplier, or the standard 64-bit FNV multiplier?

> Instead, I chose 3**41 as multiplier. But of course,
> there are still plenty of bikeshedding opportunities
> for the multiplier...

Not for me.  If the universally used 64-bit FNV multiplier can't be used in the context of Python's tuple hash fiddled to use an almost-pure form of FNV-1a, then that approach is dead to me.  Needing to pick multipliers out of thin air instead implies the theory underlying FNV-1a doesn't transfer to this context.

About which I have no current opinion.  It may or may not.  Since we're flying blind, I'm just willing to _assume_ it does until proven otherwise by testing.
msg326290 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-24 21:37
Jeroen, I understood the part about -2 from your initial report ;-)  That's why the last code I posted didn't use -2 at all (neither -1, which hashes to -2).  None of the very many colliding tuples contained -2 in any form.  For example, these 8 tuples all have the same hash now:

(-4,  -4,  -4,  40)    (-4,  -4,  4, -48)
( 4,   4,  -4,  40)    ( 4,   4,  4, -48)
(-4,  28, -28, -48)    (-4,  28, 28,  40)
( 4, -28, -28, -48)    ( 4, -28, 28,  40)

Your last example (with (3, -2) and (-3, 0)) also implicitly relies on that:

    j is even implies (j ^ -3) == -(j ^ 3)

There are apparently piles of similar identities :-(

I appreciate that

    a*M + C = b*M + C (mod N)

implies

    a = b (mod N)

when M is coprime to N, and also that the theory of linear combinations modulo 2**K is far better known than the well-hidden theory FNV developed.
msg326302 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-25 00:13
> advantage of my approach is that high-order bits become more
> important:

I don't much care about high-order bits, beyond that we don't systematically _lose_ them.  The dict and set lookup routines have their own strategies for incorporating high-order bits into the probe sequence, and those routines are the primary reason hash() exists in Python.

So there's scant reason to try to let high bits influence low bits (but good reason _not_ to, because - as explained before - the low-order bits are already in great shape for the important tuple component types).

Something else appears to be going on in this specific example, which has a very touchy set of hash codes:

> BEFORE:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
> 500000

Which is what I get too on (at least) a 64-bit build.

> AFTER:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
> 1000000

Ditto.  However, I get exactly the same from the code I posted before, which has no right shifts at all.

More, if I take the code exactly as it exists today, and merely comment out the

    mult += (Py_hash_t)(82520UL + len + len);

line, I also get a million distinct hash codes.

Half of the hash codes have bit 2**60 set, and apart from those all the hash codes have no bits set higher than 2**6.  Any number of accidents could cause the 2**60 bits to wipe each other out, and merely changing the sequence of multipliers was demonstrably enough to stop that.

Indeed, merely changing 82520 to 82524 in the current code is enough to get a million unique hash codes in this case.
msg326306 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-25 00:42
Just noting that this Bernstein-like variant appears to work as well as the FNV-1a version in all the goofy ;-) endcase tests I've accumulated:

    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        Py_uhash_t t = (Py_uhash_t)y;
        t ^= t << 7;
        x = x * mult + t;
    }

They're identical except for the last line.  FNV-1a uses

         x = (x ^ t) * mult;

instead.
msg326309 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-25 01:51
And one more:

        x = (x * mult) ^ t;

also appears to work equally well.  So, way back when, it appears we _could_ have wormed around the disaster du jour just by applying a shift-xor permutation to the raw hash results.

Note the implication:  if we believed our tuple hash worked OK before then (& we did), adding a shift-xor step would have changed nothing about it _except_ to permute the input space.  That alone was enough to repair the nested tuple problems.

Note that permutations are a bog standard way to improve algorithms with bad behaviors in some cases.  For example, at the end of the Mersenne Twister they do 4 permutations to improve otherwise-marginal results from "the real" pseudo-random generator:

    y ^= (y >> 11);
    y ^= (y << 7) & 0x9d2c5680U;
    y ^= (y << 15) & 0xefc60000U;
    y ^= (y >> 18);
    return y;

At this point I see no "good reason" to prefer any of

    x = (x * mult) ^ t; // like Bernstein 33X & FNV-1
    x = (x * mult) + t; // like Bernstein 33A
    x = (x ^ t) * mult; // like FNV-1a

to any other.  In their _original_ contexts, they were applied to longish sequences of unsigned bytes.  But tuples used as keys and set elements are typically much shorter than strings, so results about how they worked in their original contexts are largely irrelevant for that reason too.

While lots of non-endcase testing is also needed, at this point I don't have any real fear about any of them.

As a sanity check,

        x = (x * mult) | t;

is disastrous for all the endcase tests.  So I believe I really am compiling and running the code ;-)
msg326331 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-25 08:34
> The low bits are already un-improvable in the most important cases.

Maybe you misunderstood me. Suppose that there is a hash collision, say hash((3, 3)) == hash((-3, -3)) and you change the hashing algorithm to fix this collision. If that change does NOT affect the lower N bits, then you still have a collision hash((3, 3)) % 2**N == hash((-3, -3)) % 2**N. This is relevant because the dict implementation starts by taking the lower bits first. So ideally we want to avoid hash collisions in the lower bits too.

This may also be a reason to avoid the standard FNV multipliers: the 64-bit FNV multiplier was chosen with the full 64-bit hash range in mind. It was never meant to work well when truncated to some lower N bits.
msg326332 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-25 08:42
> When testing what, specifically?  And the standard 32-bit FNV multiplier, or the standard 64-bit FNV multiplier?

FNV-1a with the t ^= 2 * t mangling running my new testsuite on either PR 9471 or PR 9534 using the 64-bit FNV multiplier to produce 64-bit hashes. In other words, the code from PR 9534 but just changing the multiplier.

On the full 64-bit range, I got 2 collisions (where statistically 0 would be expected). When truncated to 32-bits, I got about 1700 collisions (where statistically about 15 would be expected).
msg326333 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-25 08:46
> j is even implies (j ^ -3) == -(j ^ 3)

This follows from what I posted before: if j is even, then j ^ 3 is odd, so we can apply the rule x ^ -2 = -x to x = j ^ 3:

(j ^ 3) ^ -2 = -(j ^ 3)

which implies

j ^ (3 ^ -2) = -(j ^ 3)

or equivalently

j ^ -3 = -(j ^ 3)
msg326334 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-25 08:51
> And one more:
>
>         x = (x * mult) ^ t;
>
> also appears to work equally well.

The order of operations does not really matter: you can write the loop as

x *= mult   # Appears only in FNV-1
x ^= t[0]
x *= mult
x ^= t[1]
x *= mult
x ^= t[2]
x *= mult
x ^= t[3]
x *= mult  # Appears only in FNV-1a

The initial multiplication is equivalent to changing the initial value and the final multiplication is just a permutation of the outputs. None of those should really affect the quality of the hash.
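
That equivalence is easy to check with toy versions of both loops (plain Python ints, arbitrary constants): FNV-1 over some items, followed by one extra multiplication, equals FNV-1a over the same items with the seed pre-multiplied.

    M, SEED = 1000003, 0x345678

    def fnv1(item_hashes, seed):
        x = seed
        for t in item_hashes:
            x = (x * M) ^ t     # multiply, then xor
        return x

    def fnv1a(item_hashes, seed):
        x = seed
        for t in item_hashes:
            x = (x ^ t) * M     # xor, then multiply
        return x

    items = [11, 22, 33, 44]
    assert fnv1(items, SEED) * M == fnv1a(items, SEED * M)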
msg326337 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-25 10:00
> I want to leave low-order hash bits alone.  That's deliberate.

Since I didn't understand the rationale for this and since shifting << 1 also seems to work well, I edited PR 9471 to use DJBX33A with t ^= t << 1.

Since you insisted on adding 97531 at the end, I put that back.
msg326352 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-25 13:15
Regarding t ^= t << 7: I tested PR 9471 with all shift values from 1 to 20. The new testsuite passed for all shifts from 1 to 13, except for 6. It failed for 6 and for 14 to 20. This indicates that smaller shift values are better (even when looking at 32-bit outputs).
msg326375 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-25 16:54
> Suppose that there is a hash collision, say hash((3, 3)) ==
> hash((-3, -3)) and you change the hashing algorithm to fix
> this collision.

There are _two_ hash functions at play in that collision:  the tuple hash function, and the integer hash function.  Regardless of whether the _tuple_ hash function does

    t ^= t << shift;
or
    t += t >> shift;

or anything else involving just `t`, that only directly affects the result of the _int_ hash function.  Whose low order bits cannot be improved in general - they're already optimally spread out in the most important cases.

> If that change does NOT affect the lower N bits,
> then you still have a collision hash((3, 3)) % 2**N ==
> hash((-3, -3)) % 2**N. This is relevant because the dict
> implementation starts by taking the lower bits first. So
> ideally we want to avoid hash collisions in the lower
> bits too.

Which the int hash on its own already does a superb job of doing in the most important cases.  If you feel you just have to mess with low-order bits, do it instead on the _tuple_ hash intermediate results, not on its inputs.  Like

    x += x >> 16;

after t is folded in to x.  It's x you care about directly, not t.  Damaging desirable properties in t to _indirectly_ gain something in x is too clever by half.

Mucking with t to avoid the nested-tuple catastrophes is worth it, but I prefer to do that with as little damage to t's desirable properties as reasonably possible.

> This may also be a reason to avoid the standard FNV
> multipliers: the 64-bit FNV multiplier was chosen with
> the full 64-bit hash range in mind. It was never meant
> to work well when truncated to some lower N bits.

Do you believe any other multiplier would work better toward that end?  For any odd multiplier M, the last k bits of i*M are determined solely by the last k bits of i, and

    [(last k bits of i*M) for i in range(anything, anything + 2**k)]

is a permutation of range(2**k).  The high order bits of i have nothing to do with it, and any odd multiplier permutes the lower k bits over any stretch of 2**k consecutive multiplicands.
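
That is easy to check directly (a small sketch; the multipliers are arbitrary odd examples):

def low_bits_are_permuted(mult, k, start):
    low = [(i * mult) % 2**k for i in range(start, start + 2**k)]
    return sorted(low) == list(range(2**k))

assert all(low_bits_are_permuted(m, k, start)
           for m in (33, 1000003, 1099511628211)
           for k in range(1, 13)
           for start in (0, 12345))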

Extracting low-order bits is intended to be done by "xor folding" in FNV, and I _expect_ the same would be prudent for any other multiplier:

https://tools.ietf.org/html/draft-eastlake-fnv-15#page-7
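
For reference, folding a 64-bit result down to 32 bits that way is just (a sketch):

def xor_fold_64_to_32(h):
    return (h ^ (h >> 32)) & 0xFFFFFFFF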

The dict and set lookup functions already do something stronger than xor folding to bring high bits into play, but do so incrementally as the probe sequence unfolds.
msg326396 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-25 19:15
>> j is even implies (j ^ -3) == -(j ^ 3)

> This follows from what I posted before: if j is even, then
> j ^ 3 is odd, so we can apply the rule x ^ -2 = -x to x = j ^ 3
> ...

Thanks!  That helps a lot.  I had a blind spot there.  This kind of thing would have been best spelled out at the start:  it's not just that -2 has some bad effects, but that there's a whole universe of potentially damaging identities:

j odd  implies   j      = -(j ^ -2)
j even implies  (j ^ 1) = -(j ^ -1)
j odd  implies  (j ^ 2) = -(j ^ -4)
j even implies  (j ^ 3) = -(j ^ -3)
j odd  implies  (j ^ 4) = -(j ^ -6)
j even implies  (j ^ 5) = -(j ^ -5)
j odd  implies  (j ^ 6) = -(j ^ -8)
j even implies  (j ^ 7) = -(j ^ -7)
j odd  implies  (j ^ 8) = -(j ^ -10)
j even implies  (j ^ 9) = -(j ^ -9)
j odd  implies  (j ^ 10) = -(j ^ -12)
j even implies  (j ^ 11) = -(j ^ -11)
j odd  implies  (j ^ 12) = -(j ^ -14)
j even implies  (j ^ 13) = -(j ^ -13)
j odd  implies  (j ^ 14) = -(j ^ -16)
j even implies  (j ^ 15) = -(j ^ -15)
...

Every integer pairs up with another of the opposite sign in one of these - but with only one.  That goes a long way toward explaining why collisions are frequent when mixing signs on ints of similar magnitude, but also why they don't "pile up" catastrophically.

But it also suggests a way to construct cases that _are_ catastrophic.  For example:

>>> len(set(map(hash, product([-3, 3], repeat=20))))
1024

Ouch!

>>> c = Counter(map(hash, product([-3, 3], repeat=20)))
>>> len(c)
1024
>>> max(c.values())
1024
>>> min(c.values())
1024

So each of the 1024 hash codes showed up 1024 times.
msg326435 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-26 09:06
> There are _two_ hash functions at play in that collision:  the tuple hash function, and the integer hash function.  Regardless of whether the _tuple_ hash function does [anything involving just `t`], that only directly affects the result of the _int_ hash function

...which is what you want here. The collision hash((3,3)) == hash((-3,-3)) is due to a specific structure in the input to the FNV algorithm. The operation t ^= t << 7 that you suggested (and which I approve of, except for the shift amount) is meant precisely to destroy that structure.

> If you feel you just have to mess with low-order bits, do it instead on the _tuple_ hash intermediate results, not on its inputs. It's x you care about directly, not t.

That would not be a good solution, because that destroys the structure of the hash algorithm. For Bernstein, for example, each loop iteration should do something like

    x = (x * m) + t

for *some* value t depending on the input. If you mess with x, then it really becomes a different hash algorithm. That is far more dangerous than simply applying a permutation on t.

> Mucking with t to avoid the nested-tuple catastrophes is worth it, but I prefer to do that with as little damage to t's desirable properties as reasonably possible.

Which "desirable properties" of t does the operation t ^= (t << 1) damage?
msg326505 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-26 20:43
High-order bit:  please restore the original tuple hash test.  You have the worst case of "but I didn't write it" I've ever encountered ;-)  Your new test is valuable, but I've seen several cases now where it fails to detect any problems where the original test fails catastrophically.  Tests are written to ensure known problems don't recur, so there's _never_ "a reason" to throw one away unless the test itself is in error.  The test suite only grows over time, and that's how it should be.

Example:  when I replace the computational guts of the tuple hash with the single line (64-bit build, and this is the only code remaining in the loop apart from setting y to the next tuple component's hash):

    x = (x * mult) + y;

both tests fail spectacularly.

But if I add one more line:

    x = (x * mult) + y;
    x += x >> 16;
    
your new test falls to 0 collisions, while the original test still suffers 83447 collisions (an improvement - down from 120050 - but still a spectacular failure).

We're flying blind enough here without deliberately plucking out one of our eyes ;-)  Indeed, the original tuple test is the only one in my collection that showed even a single collision after the two-line replacement just above.  "My collection" means your test, the original tuple hash test, and every example given in all the messages in this bug report.

Another "surprise":  replace addition with xor in the second line above:

    x = (x * mult) + y;
    x ^= x >> 16;

and then the original tuple test drops from 83447 to 0(!) collisions, while your new one jumps from 0 to a measly 12.


> The collision hash((3,3)) == hash((-3,-3)) is due a specific
> structure in the input to the FNV algorithm.

This is all so hypothetical it's impossible to know.  Perhaps the tuple hash function simply subtracts the hash of the first component from the hash of the second.  Then it's impossible to make any deterministic change to the component hashes that can stop either of the tuples from hashing to 0.  The weakness is entirely in the hash combining function then.

> The operation t ^= t << 7 that you suggested (and which I approve, except
> for the shift amount) is meant precisely to destroy that structure.

Except I have no reason to believe that destroying the input structure is of unique value.  As shown above, it's easy to change the tuple hash in a simple way that passes all known tests without touching `y` (aka `t`) at all.

>> It's x you care about directly, not t.

> That would not be a good solution, because that destroys the structure
> of the hash algorithm.   For Bernstein for example, each loop iteration
> should do something like
>
>    x = (x * m) + t
>
> for *some* value t depending on the input. If you mess with x, then it
> really becomes a different hash algorithm.

The two-liner above with the xor in the second line is exactly Bernstein 33A, followed by a permutation of 33A's _output_ space.  It works fine in tests.  Why should I care about merely that it's "different"?  33A on its own fails spectacularly - it damned well _better_ be "different" ;-)

There is no "Bernstein theory" to appeal to here - just raw assertions.  I'm not attached to any of these approaches.  The strongest I can say from trying many things is that "chaining permutations works best, but no detail appears to matter much, except that multiplication is the most valuable of all the permutations I've tried".

x*M is a permutation of the word-size space for any odd M, and any "big enough" odd M I tried appears to work about as well as any other.  Adding t is another.  The shift-xor in the second line above is a third (but the "+=" version instead was not a permutation, and it failed).

> That is far more dangerous than simply applying a permutation on t.

Why?

> Which "desirable properties" of t does the operation t ^= (t << 1) damage?

On second thought, none!  Across any contiguous range of integers, that preserves that the last k bits (for all k >= 1) remain as evenly distributed as in the inputs.  That wasn't obvious to me at first.  It's not true if the left shift is replaced by a right shift, though.
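
A quick check of both halves of that (a sketch):

from collections import Counter

def low_k_counts(f, start, k):
    return Counter(f(t) % 2**k for t in range(start, start + 2**k))

# t ^ (t << 1) is a bijection on the low k bits, so any 2**k consecutive
# inputs hit every low-k-bit pattern exactly once:
for start in (0, 7, 10**9 + 3):
    for k in (1, 4, 8):
        assert set(low_k_counts(lambda t: t ^ (t << 1), start, k).values()) == {1}

# Not so for the right shift; e.g. k = 2 starting at 2 repeats residues:
assert set(low_k_counts(lambda t: t ^ (t >> 1), 2, 2).values()) != {1}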

However, with a shift count of 1 I've generally found that the _original_ tuple hash test fails (not spectacularly, but fails all the same).  "Generally" means across a substantial number of varying the permutations used in other places.
msg326507 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-26 20:50
> Do you believe any other multiplier would work better toward that end?

Absolutely. Ideally, the multiplier should just be a random 64-bit number.
msg326508 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-26 20:52
> please restore the original tuple hash test.

Sure. I wasn't sure what to do and I was afraid that having 2 tests for tuple hashes would be too much. If that's OK for you, then I will certainly restore the test.
msg326510 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-26 21:06
> The two-liner above with the xor in the second line is exactly Bernstein 33A, followed by a permutation of 33A's _output_ space.

Not output space, but internal state (I assume that you do that operation inside the loop). It's replacing DJBX33A by a different algorithm which is not DJBX33A. It may or may not work, that's not my point. It's just that I would avoid changing the structure of the algorithm if there is no reason to.
msg326512 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-26 21:23
>> The two-liner above with the xor in the second line is
>> exactly Bernstein 33A, followed by a permutation
>> of 33A's _output_ space.

> Not output space, but internal state

?  33A's output _is_ its internal state at the end.  This is a distinction that makes no difference.  I do distinguish here between 33A and the output of Python's `tuplehash()`, but the output space of both on a 64-bit box is a 64-bit int, so that's another pointless distinction to me.

> (I assume that you do that operation inside the loop).

Yes, as I said at the start, "this is the only code remaining in the loop apart from setting y to the next tuple component's hash".

> It's replacing DJBX33A by a different algorithm which is not
> DJBX33A.

Replacing DJBX33A's multiplier of 33 is also a different algorithm.  So is working with inputs other than unsigned bytes.  So is mucking with the inputs before adding them in.

> It may or may not work, that's not my point. It's just that
> I would avoid changing the structure of the algorithm if
> there is no reason to.

Which is also fine by me, except I see no actual reason to care.  All variations of "chain permutations" I've tried appear to work fine, except for those (which I haven't mentioned at all) that tried replacing multiplication with "weaker" permutations.

A minor exception is that, as already mentioned, applying the "leftshift-xor" permutation to the inputs with a shift count of 1 didn't pass the original tuple hash test in several cases.
msg326583 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-27 18:55
I should have spelled this out before:  these are all permutations, so in general permuting the result space of `x * mult + y` (or any other permutation involving x and y) is exactly the same as not permuting it but applying a different permutation to y instead.

Specifically, the sequence:

    x = x * mult + y  (mod 2**N)
    x = P(x)  # P is any permutation of N-bit integers

is the same as (and this isn't "deep" - it's trivial):

    x = x * mult + Q(y, x)  (mod 2**N)
    
where Q(y, x) = P(x * mult + y) - x * mult  (mod 2**N)

Q is a "richer" class of permutation though, because it's a different permutation for each fixed value of `x`, and uses multiplication to help disperse bits.
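
A small-width sketch of that last point: for each fixed x, Q(., x) is itself a permutation of the whole space (N = 16 bits here so the check is exhaustive; P and the multiplier are arbitrary examples, not constants used anywhere in CPython):

N = 16
M = 2**N
mult = 0x9e37               # arbitrary odd multiplier

def P(z):                   # some fixed permutation of N-bit ints (shift-xor)
    return z ^ (z >> 7)

def Q(y, x):
    return (P((x * mult + y) % M) - x * mult) % M

for x in (0, 1, 12345, M - 1):
    assert len({Q(y, x) for y in range(M)}) == M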

While it would take a lot of work to quantify this, in my experiments the tuple hash is significantly less touchy when permuting x after than when permuting y before, presumably because Q is richer.

The two tuple hash tests have many inputs whose tuple component hashes have very similar (even identical) bit patterns.  There's only so much dirt-simple permutations (like "y ^= y << 1") can do to break the patterns.  So, e.g., change a shift count, change the multiplier, ..., and at least one of those two tests fails depressingly often.  Not spectacularly, but over the limit they allow.  Q(y, x) does a far better job of magnifying small differences.

Something I haven't tried:  use a richer permutation of y that _doesn't_ depend on x.  For example, the frozenset hash has much the same underlying challenge, and uses this permutation to "blow up" small input differences:

static Py_uhash_t
_shuffle_bits(Py_uhash_t h)
{
    return ((h ^ 89869747UL) ^ (h << 16)) * 3644798167UL;
}

But that's a lot more expensive than a single shift-xor, and the speed of tuple hashing is more important than of frozenset hashing.
msg326602 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-28 04:20
Perhaps worth noting that FNV-1a works great if used as _intended_:  on a stream of unsigned bytes.  All tests except the new tuple hash test suffer no collisions; the new test suffers 14.  Nothing is needed to try to worm around nested tuple catastrophes, or to worm around mixing integers of similar magnitude but different signs.  The obvious downside:  on a 64-bit box, it's 8 times as many multiplies :-(  Here's a 64-bit version:

    Py_uhash_t t = (Py_uhash_t)y;
    for (int i = 0; i < sizeof(t); ++i) {
        x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL;
        t >>= 8;
    }
msg326603 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-28 04:43
Also worth noting:  other projects need to combine hashes too.  Here's a 64-bit version of the highly regarded C++ Boost[1] library's hash_combine() function (I replaced its 32-bit int literal with "a random" 64-bit one):

    x ^= (Py_uhash_t)y + 0x94ae1875aa5647f1UL + (x << 6) + (x >> 2);

It's very cheap.  It also sucks horribly if used as the guts of Python's tuple hash loop ;-)

This isn't a paradox.  Best I can tell, Python may be unique in trying _not_ to make all hashes "look random".  If you have only randomish hashes to combine, the Boost approach works fine.
msg326615 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-28 08:48
> Replacing DJBX33A's multiplier of 33 is also a different algorithm.  So is working with inputs other than unsigned bytes.

I would argue that this is just extending the parameters of the algorithm. If the algorithm is general enough, then that shouldn't be a problem.

> So is mucking with the inputs before adding them in.

Not really. If you consider it an algorithm for hashing a sequence, that is just changing the input sequence.
msg326640 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-09-28 15:38
I spent about 2 days doing an extensive study of the FNV and DJB algorithms. I share my conclusions below.

To be very clear what I mean, I am talking about the following algorithms (t is a tuple and m is the multiplier which is always assumed to be odd).

def FNV(t, m):
    h = 1
    for x in t:
        h = ((h * m) ^ x) % 2**32
    return h

def DJB(t, m):
    h = 1
    for x in t:
        h = ((h * m) + x) % 2**32
    return h

I am restricting to 32 bits because we need to support that anyway and because it's computationally easier than 64 bits.

I took the point of view of fixing two different tuples t and u and then seeing for which multipliers m a collision hash(t, m) == hash(u, m) occurs. Since the lower k bits of the hash depend only on the lower k bits of the multiplier, it is actually feasible to list all m's for which there is a collision; I wrote a C program to do that. Note that there are 2**31 possible odd multipliers and 2**32 different possible hash values: so you expect about 0.5 collisions on average.

For the entries of the tuples, I took random integers in the range from 0 to 63 (so the issue with negative numbers in FNV does not occur). I mostly considered tuples of length 4, 5 and 6.

It turns out that both algorithms have pairs (t, u) for which a massive number of collisions occur:

- For FNV, the tuples (12, 50, 52, 24, 3), (28, 18, 52, 56, 19) have 2**26 collisions (all multipliers with lower byte 0x01, 0x7f, 0x81 or 0xff give collisions)
- For DJB, the tuples (22, 10, 12, 22, 29), (23, 14, 18, 26, 30) have 2**24 collisions (all multipliers with lower byte 0xff give collisions)

However, when you then look at the multipliers for these bad cases, they all have a special structure in the lower bits of which there are 2 cases:

- A special bit pattern like 0x001, 0x003, 0x555, 0xccd, ...
- A zero of a low-degree polynomial like 0x9fa5 which is a zero of x^2 + x + 2 modulo 2**16 (this may only be relevant for DJB)

In order to eliminate these 2 cases, I decided to fix the lower byte of the multiplier. I wanted it to be 3 or 5 mod 8 to have maximal multiplicative order and then I checked which byte had the worst algebraic relations and no obvious bit pattern. I decided to take 0xc5 as lower byte.

When considering only multipliers with 0xc5 as lower byte, I found no very bad cases. I checked lots of pairs of tuples of length <= 6 and entries <= 64 and the worst I could find was 8192 collisions for FNV and 1024 collisions for DJB. This maximum number was found for tuples of length 5. Also, the maximum number of collisions seems bounded as the bitsize increases. For a 25-bit hash, I got the same maximum numbers of collisions as for a 32-bit hash. So, the probability of getting a collision decreases by a factor of 2 for every bit added, which is what you want.

When testing various bounds on the entries and tuple lengths, DJB is generally a little bit better than FNV (by 2 metrics: the maximum number of collisions and the root-mean-square of the number of collisions).

So my conclusion would be to use DJB as core algorithm, taking care to pick a good multiplier (or at least not an obviously bad multiplier).
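
For the DJB pair above, the special structure is easy to see directly: the difference of the two hash polynomials in m works out to (m + 1)**4, so the hashes collide mod 2**32 exactly when the multiplier's lower byte is 0xff, which matches the 2**24 count. A quick spot-check (DJB here is the same function defined above):

import random

def DJB(t, m, bits=32):
    h = 1
    for x in t:
        h = (h * m + x) % 2**bits
    return h

t = (22, 10, 12, 22, 29)
u = (23, 14, 18, 26, 30)
for _ in range(1000):
    m = random.randrange(2**24) * 256 + 0xff    # any multiplier ending in 0xff
    assert DJB(t, m) == DJB(u, m)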
msg326653 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-28 19:15
[Tim]
> Perhaps worth noting that FNV-1a works great if used as
> _intended_:  on a stream of unsigned bytes.
> ...
>
>    Py_uhash_t t = (Py_uhash_t)y;
>    for (int i = 0; i < sizeof(t); ++i) {
>        x = (x ^ (t & 0xff)) * (Py_uhash_t)1099511628211ULL;
>        t >>= 8;
>    }

And that's all - no hacks for nested tuples, no hacks for mixed-sign ints, just 100% pure FNV-1a.

So, just for fun, how about 100% pure Bernstein 33A or 33X?  Those replace the 3rd line above with, respectively,

        x = (x * 33) + (t & 0xff); // 33A

or

        x = (x * 33) ^ (t & 0xff); // 33X


And those _also_ work great:  no collisions at all across my collection, except for the new tuple test, where they suffer 10 and 8 collisions respectively (note that the new test retains only the 32 least-significant hash bits).

Those are remarkable partly because multiplying by 33 is a weak permutation on its own:  it's just a left shift (by 5) and an add.  And while you can find odd multipliers with lower order (mod 2**N) than 33, you have to work to find one - it's exceptionally poor in this respect.  For example, pow(33, 8, 256) == 1, but half of the odd multipliers in range(256) have order 64 (mod 256).  Only 16 of 'em have order <= 8.

Then again, we're applying that weak permutation 8 times per 64-bit input here, and folding in a fresh byte each time.  It seems that, overall, that's "stronger" than applying a stronger - but still easily spelled - permutation just one or two times to a 64-bit input.

The only thing I've tried using 64-bit ops that was comparably robust across these tests used the frozenset hash's permutation on the inputs:

    Py_uhash_t t = (Py_uhash_t)y;
    t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;
    x = (x * mult) + t;

That was quite insensitive to the choice of `mult`.  For example, in this instead:

    Py_uhash_t t = (Py_uhash_t)y;
    t ^= t << 1;
    x = (x * mult) + t;

the original tuple test had 24 collisions and the new test 6 using the current multiplier; changing to the FNV-1a 32-bit multiplier, those zoomed to 857 and 44; then to the FNV-1a 64-bit multiplier, zoomed again to 477 and 2027; then to the "random" 1484268081, way down to 0 and 25; and finally to the "random" 5517301966916670289, down to 0 and 4.  Which didn't inspire confidence ;-)
msg326667 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-29 06:03
Jeroen, thanks for helping us fly slightly less blind! ;-) It's a lot of work.

I'd say you may as well pick a prime.  It's folklore, but "a reason" is that you've discovered that regular bit patterns in multipliers can hurt, and sticking to primes eliminates worlds of simple & subtle bit patterns.

Another bit of folklore:  pick a multiplier such that for each byte, neither it nor its complement are close in value to any other byte of the multiplier.  Which may seem more valuable for byte-oriented hashes, yet the byte-oriented FNV violates it spectacularly:  all FNV multipliers have _no_ bits set higher than 2**8 except for their leading bit.  So the longer the FNV multiplier, the more all-0 bytes it contains.

Which appears to be a deliberate choice to limit how quickly each new input byte propagates to other lower-order state bits across iterations.  The DJB33 algorithms accomplish that by using a tiny multiplier (relative to the output hash width) instead.

As I explained in other msgs here, treating tuple component hashes as sequences of bytes seems to deliver excellent results with the unaltered byte-oriented versions of FNV-1[a] and DJBX33[AX] - too good for anything in my test collection to detect anything even suspicious, let alone "a problem" (although it would be easy to contrive problem cases for 33[AX] - 33 is way "too small" to avoid collisions even across tuples with a single component).

So part of our challenge appears to me to be that we're trying instead to produce a hash from inputs of the _same_ bit width.  DJB/FNV were designed to produce hashes of width at least 4 times their inputs' bit widths.  Each input has a relatively tiny effect on the internal state for them.  For us, each input can change the entire internal state.
msg326753 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-30 21:46
An "aha!" moment for me:  I noted before that replacing the tuple hash loop with

    Py_uhash_t t = (Py_uhash_t)y;
    t ^= t << 1;
    x = (x * mult) + t;

does OK (64-bit build):  most tests had no collisions, but the original tuple test had 24 (a modest failure) and the new one 6.

Horrible results with a different scheme prompted me to add one line before the shift-xor:

    t = (t >> 3) | (t << 61);

That is, merely rotate the input right by 3 bits first.  Disaster!  Almost all tests suffered major collisions; the old test had 235424 and the new one 344919.

What's going on?  With hindsight, it's obvious:  multiplication is a horrible "mixer" _in this context_.  Because nearly all the tests are slinging little integers, almost all the input variation is in the last handful of bits.  Multiplication is great for spreading low-order bits around.  But rotate them to the high end, and multiplication is next to useless:  it only propagates them "to the left", and they're already at the left end then.  This part has virtually nothing to do with whether + or ^ is used, or with whether multiplication is done before or after.  It's actually the multiplication that's the weakness, and has nothing to do with which multiplier is used.
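
A stripped-down sketch of that weakness (the multiplier and starting value here are arbitrary stand-ins): if all the input variation sits in the top few bits, an `x = x*mult + t` style combining step leaves the low output bits completely unchanged, so any truncation of the hash collides en masse.

M = 2**64
mult = 0x1000003                    # any odd multiplier behaves the same here

def combine(values):
    x = 0x345678
    for t in values:
        x = (x * mult + t) % M
    return x

tops = [b << 61 for b in range(8)]              # inputs varying only in bits 61..63
hashes = [combine((a, b)) for a in tops for b in tops]
assert len(set(hashes)) == 8                    # only 8 distinct hashes for 64 pairs
assert len({h % 2**32 for h in hashes}) == 1    # and the low 32 bits never vary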

This isn't a problem with FNV or DJB when used _as intended_, because their output is much wider than their inputs.  The high-order bit of an input for them is still a lower-order bit to their much wider multiplication, and eventually propagates.  It isn't a problem for "multiplicative hashing" algorithms either, because those use a double-width multiply and only retain the _high_ half of a double-width product.  We're only retaining the _lower_ half.

I also noted before that replacing the shift-xor with the frozenset hash's input scrambler:

    t = ((t ^ 89869747UL) ^ (t << 16)) * 3644798167UL;

worked great.  But it's equally a disaster if the inputs are rotated right by 3 first.  Same reason:  it too only propagates "to the left".

So almost all tuple hash schemes on the table, both current and various experiments here, are systematic disasters if input hashes vary mostly in the high-order bits.  We haven't noticed because the current string hashes vary about the same in all bit positions, while contiguous ranges of not-huge ints have hashes that vary in the low-order bits.

The only schemes in all these messages that are "obviously immune" are those that used a (any) flavor of FNV or DJB as intended, treating each input hash as a sequence of unsigned bytes.
msg326765 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-01 06:13
Noting that the failure of high-order bits to propagate is one form of "avalanche" failure.  A recent 64-bit hash, SeaHash, takes this approach which is said to have provably "perfect" avalanche behavior:

    Py_uhash_t t = (Py_uhash_t)y;
    t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
    t ^= (t >> 32) >> (t >> 60);
    t *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The giant constant is just a 63-bit "patternless" prime (for whatever reason, the author wants this transformation to be easily invertible, so a prime is necessary - this is NOT a "crypto" hash).  The first multiply propagates low-order bits left.  Then the next line uses high-order bits to change low-order bits.  Extracting a variable shift count from the data itself is a clever idea taken from the PCG family of PRNGs - you have to work to contrive data where this doesn't "act kinda random".  The last line then propagates the - now altered by the high-order bits - lower-order bits left again.

Followed by

    x = (x * mult) + t;

this yields a tuple hash that passes all the tests I have.  The only collisions are in the new tuple test, which suffers 14.

Better, add the "catastrophic" right-rotate

    t = (t >> 3) | (t << 61);

at the start and it's still only the new tuple test that has a collision - it rises to 19 then, close to (but still under) its failure cutoff.

What I haven't tried:  in context it would be nice to eliminate the second multiply by the giant constant.  We're doing a multiply anyway to fold `t` into `x`, which will propagate the low-order bits left on the next iteration's `x * mult`.  That would ruin SeaHash's provable guarantees, but I'm more concerned about saving some cycles than purity ;-)  If that added enough collisions to push the second tuple test "not much" over the limit, I'd raise its limit.

Gonzo:  "the real" SeaHash duplicates the code above and works on 4 inputs simultaneously, designed to keep a modern processor's instruction pipelines as busy as possible.  I'd rather not bother (most tuples are short).

So this is principled and, if the SeaHash theory is right, immune to any simple kind of catastrophic failure.  Is it worth it?  I'd sleep better, yes ;-)  Note that the first multiply by the giant constant can be active at the same time as the `x * mult` at the end, so I'm guessing the speed hit would be bearable.

There's no truly _cheap_ way to get good propagation from all bit positions.  SeaHash has the fastest way to do that I've encountered.
msg326820 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-01 18:47
If SeaHash is interesting to us (it is to me), I suggest dropping our DJB/FNV combining part entirely and using a purer form, like so:

    Py_uhash_t t = (Py_uhash_t)y;
    x ^= t ^ (t << 1);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
    x ^= (x >> 32) >> (x >> 60);
    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

The only collisions with that were 14 in the new tuple test.  Rotate t right by 3 first, and that drops to 13.  Much better than utter catastrophe ;-)

100% pure SeaHash does

    x ^= t;
    
at the start first, instead of `t ^ (t << 1)` on the RHS.  That fails the second tuple test, with 98 collisions.  Not catastrophic, just "too many".  But there's only so much we can expect from a cheap non-crypto-strength hash.  `t ^ (t << 1)` is a pragmatic tweak to get rid of masses of uselessly repeated sign bits, which do occur in realistic tuples (mixing positive and negative ints).  Adding that tweak shouldn't harm any of SeaHash's provable global properties, since `t ^ (t << 1)` is just a permutation of the input space.

Different constants would be needed for a 32-bit version (best guesses, untried:  a 31-bit random prime; s/32/16/; s/60/29/).

I don't yet know about potential SeaHash licensing issues.  It's part of Rust:

https://docs.rs/seahash/3.0.5/seahash/
msg326835 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-01 22:33
>    Py_uhash_t t = (Py_uhash_t)y;
>    x ^= t ^ (t << 1);
>    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;
>    x ^= (x >> 32) >> (x >> 60);
>    x *= (Py_uhash_t)0x6eed0e9da4d94a4fULL;

Comment out either one of the multiplies, and it still passes all the tests.  Commenting out the first one is "more valuable", because on all but the last loop iteration the latency of the last-line multiply will overlap with the next call to hash a tuple component.  It even reduces the number of collisions in the new tuple test (down to 7 - and still no collisions at all in other tests).

I'm not clear on why there are two multiplies.  With both, or either one alone, low and high bits still get their chances to affect the other end.

Presumably it boils down to some detail in "the proofs" - but, as for FNV, while various things are _claimed_, I can't find a writeup of "the proofs" anywhere :-(
msg326872 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 10:03
> the author wants this transformation to be easily invertible, so a prime is necessary

A multiplication by any odd number modulo 2**64 is invertible. As I argued before, the concept of primes is meaningless (except for the prime 2) when computing modulo 2**64.
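
For the record, the inverse is also cheap to compute; e.g. for the SeaHash constant (a sketch; pow(x, -1, m) needs Python 3.8+):

M = 2**64
mult = 0x6eed0e9da4d94a4f
inv = pow(mult, -1, M)
assert (mult * inv) % M == 1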
msg326874 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 10:41
This weekend I realized something important which I didn't realize before: some hash functions which I assumed to be good (i.e. small chance of collisions between any given two tuples) turned out to often fail the tests. This is because you don't just want to minimize collisions, you also want to minimize *correlations* between collisions.

More precisely: for a given hash function (considering the multiplier as parameter), it can happen that there are 4 tuples t, u, v, w such that whether or not hash(t) == hash(u) is correlated to whether or not hash(v) == hash(w). Such correlations increase the standard deviation of the number of collisions in the tests a lot (even if the average is unaffected), which leads to significant chances of failing the tests.

So with this in mind I stopped testing pairs of tuples but I ran the actual testsuites. The metric I'm using is now the probability that the testsuite passes for randomly chosen multipliers (3 mod 8). For example, the original tuple hash has a probability of around 97% of passing the original testsuite.

None of the hash functions that I tried (DJB or FNV with input mangling like t ^= t << 7) achieved such a high probability of passing the original test. The *only* variation that I found which passes the original test and my new test (and a third "random" test which I haven't mentioned before) with a high enough probability was FNV with input mangling with a second multiplier:

h = 1
for y in INPUT:
    t = hash(y)
    t ^= t * SOME_LARGE_EVEN_NUMBER   # instead of t ^= t << SHIFT
    h = (h ^ t) * MULTIPLIER
msg326877 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 12:38
SeaHash seems to be designed for 64 bits. I'm guessing that replacing the shifts by

x ^= ((x >> 16) >> (x >> 29))

would be what you'd do for a 32-bit hash. Alternatively, we could always compute the hash with 64 bits (using uint64_t) and then truncate at the end if needed.

However, when testing the hash function

    for t in INPUT:
        x ^= hash(t)
        x *= MULTIPLIER
        x ^= ((x >> 16) >> (x >> 29))
        x *= MULTIPLIER

It fails horribly on the original and my new testsuite. I'm guessing that the problem is that the line x ^= ((x >> 16) >> (x >> 29)) ignores low-order bits of x, so it's too close to pure FNV which is known to have problems. When replacing the first line of the loop above by x += hash(t) (DJB-style), it becomes too close to pure DJB and it also fails horribly because of nested tuples.

So it doesn't seem that the line x ^= ((x >> 16) >> (x >> 29)) (which is what makes SeaHash special) really helps much to solve the known problems with DJB or FNV.
msg326878 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 12:41
Correction: the FNV variant of SeaHash only fails the new testsuite, not the old one. The DJB variant of SeaHash fails both.
msg326883 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 14:42
> 100% pure SeaHash does x ^= t at the start first, instead of `t ^ (t << 1)` on the RHS.

Indeed. Some initial testing shows that this kind of "input mangling" (applying such a permutation on the inputs) actually plays a much more important role to avoid collisions than the SeaHash operation x ^= ((x >> 16) >> (x >> 29)).

So my suggestion remains

for y in INPUT:
    t = hash(y)
    t ^= t * SOME_LARGE_EVEN_NUMBER
    h ^= t
    h *= MULTIPLIER

Adding in the additional SeaHash operations

    h ^= ((h >> 16) >> (h >> 29))
    h *= MULTIPLIER

does not increase the probability of the tests passing.
msg326892 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 18:32
A meta-comment:  a great deal of work has been done on non-crypto hashes in recent years.  I'm not interested in rolling our own if one of the "winners" from that process can be adapted.  For that reason, I've only been looking at those that scored 10 (best possible) on Appleby's SMHasher[1] test suite, which is used by everyone who does recognized work in this field.  SeaHash appears to be the fastest of those, with short & simple code.

I'm concerned that I've been putting way too much weight on "the new" tuple test.  That uses a grand total of 10 tiny integers in -5 .. 5 (omits -1).  _All_ base-component variation is in the last 3 bits, while the top 61 bits are all 0 or all 1.  Great for testing nested tuples with tiny mixed-sign integers, but that's a minuscule region of the problem space.

[1] https://github.com/aappleby/smhasher
msg326893 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 18:52
>> the author wants this transformation to be easily
>> invertible, so a prime is necessary

> A multiplication by any odd number modulo 2**64 is
> invertible.

Right!  My mistake.


> As I argued before, the concept of primes is
> meaningless (except for the prime 2) when computing
> modulo 2**64.

I don't know why they're using a prime.  But that doesn't mean they don't have "a reason".  You seem quick to dismiss things if they don't make instant sense to you at first glance.  I've noted before, e.g., that sticking to a prime eliminates a world of regular bit patterns in the multiplier.

The original SeaHash used a different prime, and a fixed right shift of 32 (but twice in different places).

Switching to the current prime, and the variable shift, can be traced back to this long comment on Reddit:

https://www.reddit.com/r/rust/comments/5fdf8z/seahash_a_blazingly_fast_portable_hash_function/dakdii1/

The prime isn't "random":  it was constructed so that flipping a low-order bit to 1 would directly affect at least one of the topmost 4 bits, which in turn are used to select the shift count.  While that doesn't seem to matter for our tiny test suite, as the message shows it made huge improvements in other regions of the input space.

I haven't reported this, but doing "x ^= x >> 32" instead worked fine for our test suite too.  Well, big deal - we're testing almost nothing beyond two kinds of "oops!" cases (nested tuples, and mixed-sign tiny ints).
msg326894 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 19:15
> So my suggestion remains
>
> for y in INPUT:
>    t = hash(y)
>    t ^= t * SOME_LARGE_EVEN_NUMBER
>    h ^= t
>    h *= MULTIPLIER

On the face of it, I'd be amazed if that passed SMHasher, because it only propagates bits "to the left".  All hashes that score well on that test work to propagate "to the right" too.  SeaHash does that with its variable-count right-shift-xor.  Another popular choice in other hashes is a rotate.

As noted several times before, our tuple tests only use small integers, where there's virtually no variation in the high-order hash bits beyond "all 0" or "all 1".  Since all the variation is in a handful of low-order bits, "propagate right" _appears_ to be a waste of time.

If we, e.g., tested tuples of little floats instead, we may have "concluded" that it's "propagate left" that's a waste of time:

>>> for i in range(5):
...     x = 1 / 2**i
...     print(x, hex(hash(x)
...
1.0                   0x1
0.5    0x1000000000000000
0.25    0x800000000000000
0.125   0x400000000000000
0.0625  0x200000000000000
msg326898 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 19:37
> If we, e.g., tested tuples of little floats instead ...

Speaking of which:

>>> from itertools import product
>>> len(set(map(hash, product([0.5, 0.25], repeat=20))))
32

32 hash codes out of 1048576 distinct two-tuples isn't exactly confidence-inspiring either ;-)  No scheme that only "propagates to the left" is going to help that much, because the variation in those hashes is all in the high-order bits.
msg326903 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 20:57
> I've noted before, e.g., that sticking to a prime eliminates a world of regular bit patterns in the multiplier.

Why do you think this? 0x1fffffffffffffff is prime :-)

Having regular bit patterns and being prime are independent properties.

To be clear: I don't have anything against picking a prime but we just shouldn't pretend that primes are important when they are not. That's all...
msg326905 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 21:01
> For that reason, I've only been looking at those that scored 10 (best possible) on Appleby's SMHasher[1] test suite, which is used by everyone who does recognized work in this field.

So it seems that this SMHasher test suite doesn't catch the problem that we're seeing with negative integers.

> I'm concerned that I've been putting way too much weight on "the new" tuple test. [...] that's a minuscule region of the problem space.

I'll admit that it's a minuscule region of the problem space. However, we ideally want a hash that works well for all kinds of inputs. If the hash function is good, it shouldn't be possible to write a hash collision test function which has a significantly higher chance of failing than random chance.
msg326906 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 21:05
>>> from itertools import product
>>> len(set(map(hash, product([0.5, 0.25], repeat=20))))
32

Good catch! Would you like me to add this to the testsuite?
msg326907 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-02 21:09
> For that reason, I've only been looking at those that scored 10 (best possible) on Appleby's SMHasher[1] test suite

Do you have a list of such hash functions?
msg326908 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 21:12
>> I've noted before, e.g., that sticking to a prime
>> eliminates a world of regular bit patterns in the
>> multiplier.

> Why do you think this? 0x1fffffffffffffff is prime :-)

Point taken ;-)  But "a world of" is not the same as "the universe".  For example, sticking to a prime you'll never get 8 bytes the same.  Etc - "a world of" extremely regular patterns are eliminated.


> Having regular bit patterns and being prime are independent
> properties.
>
> To be clear: I don't have anything against picking a prime
> but we just shouldn't pretend that primes are important
> when they are not. That's all...

I don't like arguments from ignorance.  As I've said, I don't know why SeaHash uses a prime.  Neither do you.  Our collective ignorance doesn't imply the designer didn't have a good reason.

I can't think of a "good reason" for it either, but that's where we differ:  I think "so don't make gratuitous changes" while you think "therefore it can't possibly matter" ;-)
msg326910 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 21:21
>> For that reason, I've only been looking at those that
>> scored 10 (best possible) on Appleby's SMHasher[1] test suite

> Do you have a list of such hash functions?

A world of searching.  The ones rated "Excellent" here (previously cited) for a start:

https://docs.rs/seahash/3.0.5/seahash/
msg326917 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 21:41
> >>> from itertools import product
> >>> len(set(map(hash, product([0.5, 0.25], repeat=20))))
> 32

> Good catch! Would you like me to add this to the testsuite?

It's in mine already ;-)  I've added all the "bad examples" in all the messages here.  Sooner or later they'll get folded into Python's test suite.

BTW, there were no collisions in that under whatever 64-bit Python I last compiled.  That was a SeaHash variant.  I'm not certain, but I believe it had "t ^= t << 1" at the start, with the first multiply commented out.

Having learned _something_ about why SeaHash does what it does, I'm not convinced the first multiply is of much value.  As a standalone bit-scrambler for a single 64-bit input, it does matter.  But in the tuple hash context, we're running it in a loop.  Strictly alternating "propagate left" and "propagate right" seems to me almost as good - although that's just intuition.
msg326918 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 21:57
> So it seems that this SMHasher test suite doesn't
> catch the problem that we're seeing with negative integers.

Seems to be so, but I've never run SMHasher myself.  I believe it's focused on statistical properties, like avalanche and bit independence.


> However, we ideally want a hash that works well for
> all kinds of inputs. If the hash function is good,
> it shouldn't be possible to write a hash collision
> test function which has a significantly higher chance
> of failing than random chance.

I know of no such hash functions short of crypto-strength ones.  Nobody uses those for hash tables, though, because they're much slower and usually much more involved.
msg326927 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-02 23:50
FYI, this appears to be a good overview of what SMHasher is looking for:

https://github.com/aappleby/smhasher/wiki/SMHasher

Some of everything: performance, correctness, statistical measures, and specific kinds of keysets that have proved problematic for other non-crypto hash functions.
msg326929 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-03 01:15
> SeaHash seems to be designed for 64 bits.

Absolutely.

> I'm guessing that replacing the shifts by
>
> x ^= ((x >> 16) >> (x >> 29))
>
> would be what you'd do for a 32-bit hash.

My guess too.  But "the prime" is a puzzle.  As noted before, the 64-bit prime is specially constructed to satisfy specific dispersion properties.  While I haven't triple-checked this, I believe it fails at two specific bit positions:

- Bit 2**16 doesn't propagate into the leading 4 bits at all.

- Bit 2**17 does propagate into bit 2**60, but _only_ into that bit among the highest 4 bits.  One of the criteria is that the prime shouldn't propagate a lower-order bit into _only_ the least-significant of the 4 leading bits.

Repairing that would probably require adding a bit or two.  But the prime already has 35 bits set.  The greater the imbalance between 0 and 1 bits, the more multiply acts like a simpler sequence of shift-and-adds or shift-and-subtracts (for example, at an extreme, consider the multiplier (1 << 63) - 1).

Anyway, it seems the density of 1 bits would have to be even higher for a 32-bit multiplier to ensure all low-order bits directly propagated to at least one of the topmost 3 bits, but not to the third-most significant alone.

Or I could search for a 31-bit prime - or just an odd integer - that got as close as possible with 16 or 17 bits set.  Or ...
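
A sketch of that dispersion check, under the assumption that "directly affects" means via a set bit of the multiplier (carries ignored): for each low input bit i, list which of the top 4 product bits it reaches.

P = 0x6eed0e9da4d94a4f
for i in range(20):                                 # low-order input bit positions
    reached = [i + j for j in range(64)
               if (P >> j) & 1 and 60 <= i + j <= 63]
    print(i, reached)
# Bit 16 reaches none of bits 60..63, and bit 17 reaches only bit 60 --
# the two failures described above.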

> Alternatively, we could always compute the hash with
> 64 bits (using uint64_t) and then truncate at the end
> if needed.

Provided we continue to like it at all ;-)  In which case I'd probably end it with

    32_bit_result = (32_bit_result_type)(x ^ (x >> 32));

instead.
msg326947 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-03 08:21
> I know of no such hash functions short of crypto-strength ones.

Being crypto-strength and having few collisions statistically are different properties.

For non-crypto hash functions it's typically very easy to generate collisions once you know the parameters (which are called the "key" for crypto hash functions). But I think that you shouldn't be able to produce a large number of collisions if you don't know the parameters in advance.
msg326949 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-03 09:02
I'm having a look at xxHash, the second-fastest hash mentioned on https://docs.rs/seahash/3.0.5/seahash/
msg326961 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-03 10:34
A (simplified and slightly modified version of) xxHash seems to work very well, much better than SeaHash. Just like SeaHash, xxHash also works in parallel. But I'm not doing that and just using this for the loop:

    for y in t:
        y ^= y * (PRIME32_2 - 1)
        acc += y
        acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
        acc *= MULTIPLIER

Plain xxHash does "y *= PRIME32_2" or equivalently "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps slightly with my new tuple test.
msg327005 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-03 19:40
Thanks for looking at xxHash!  An advantage is that it already comes in 32- and 64-bit versions.

> A (simplified and slightly modified version of) xxHash
> seems to work very well, much better than SeaHash.

?  I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test.  Which it also passed, most recently with 7 collisions.  That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

>  Just like SeaHash, xxHash also works in parallel. But I'm not
> doing that and just using this for the loop:
>
>    for y in t:
>        y ^= y * (PRIME32_2 - 1)
>        acc += y
>        acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
>        acc *= MULTIPLIER
>
> Plain xxHash does "y *= PRIME32_2" or equivalently
> "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps
> slightly with my new tuple test.

Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.  Does the variant also score well under SMHasher?  "I don't see why it wouldn't" is not an answer to that ;-)

Note that the change also slows the loop a bit - "acc += y" can't begin until the multiply finishes, and the following "acc *=" can't begin until the addition finishes.  Lengthening the critical path needs to buy something real to justify the extra expense.  I don't see it here.

For posterity:  the 64-bit version of xxHash uses different primes, and rotates left by 31 instead of by 13.

Note:  I'm assuming that by "PRIME32_2" you mean 2246822519U, and that "MULTIPLIER" means 2654435761U.  Which appear to be the actual multipliers used in, e.g.,

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

As always, I'm not a fan of making gratuitous changes.  In any case, without knowing the actual values you're using, nobody can replicate what you're doing.

That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a real advantage in that it apparently didn't _need_ the "t ^= t << 1" permutation to clear the new tuple test's masses of replicated sign bits.

But the more we make changes to either, the more work _we_ have to do to ensure we haven't introduced subtle weaknesses.  Which the Python test suite will never be substantial enough to verify - we're testing almost nothing about hash behavior.  Nor should we:  people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests.  Python's test suite is more to ensure that known problems _in Python_ don't recur over the decades.
msg327014 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-03 21:41
Here's a complete xxHash-based implementation via throwing away C++-isms from

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

The 32-bit build fares worse:

- 3 collisions in the old tuple test.
- 51 collisions in the new tuple test.
- 173 collisions across product([-3, 3], repeat=20)
- 141 collisions across product([0.5, 0.25], repeat=20)
- no other collisions

The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3.  So 141 is fine, but 173 is highly suspect.  51 for the new tuple test is way out of bounds.

But I don't much care.  None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases.  All you have to do is keep trying.

Much as with SeaHash, adding

    t ^= t << 1;

at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build.  It also cuts the product([-3, 3], repeat=20) collisions to 109.  But boosts the old tuple test's collisions to 12.  I wouldn't do it:  it adds cycles for no realistic gain.  We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash.

BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit).

#define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
#if Py_NHASHBITS >= 64
#    define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
#    define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
#    define Py_HASH_LSHIFT 31
#elif Py_NHASHBITS >= 32
#    define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
#    define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
#    define Py_HASH_LSHIFT 13
#else
#    error "can't make sense of Py_uhash_t size"
#endif
#define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        t = (Py_uhash_t)y;
        x += t * Py_HASH_MULT2;
        x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
        x *= Py_HASH_MULT1;
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}
#undef Py_NHASHBITS
#undef Py_HASH_MULT1
#undef Py_HASH_MULT2
#undef Py_HASH_LSHIFT
#undef Py_HASH_RSHIFT
msg327038 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-04 06:06
> I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test.  Which it also passed, most recently with 7 collisions.  That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

Note that I'm always considering parametrized versions of the hash functions that I'm testing. I'm replacing the fixed multiplier (all algorithms mentioned here have such a thing) by a random multiplier which is 3 mod 8. I keep the other constants. This allows me to look at the *probability* of passing the testsuite. That says a lot more than a simple yes/no answer for a single test. You don't want to pass the testsuite by random chance, you want to pass the testsuite because you have a high probability of passing it.

And I'm testing 32-bit hashes indeed since we need to support that anyway and the probability of collisions is high enough to get interesting statistical data.

For SeaHash, the probability of passing my new tuple test was only around 55%. For xxHash, this was about 85%. Adding some input mangling improved both scores, but the xxHash variant was still better than SeaHash.
msg327042 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-04 08:21
> Note:  I'm assuming that by "PRIME32_2" you mean 2246822519U

Yes indeed.

> and that "MULTIPLIER" means 2654435761U.

No, I mean a randomly chosen multiplier which is 3 mod 8.
msg327043 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-04 08:24
> people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests.

There are hash functions that pass those tests which are still bad in practice when used as tuple hash function. That's really unfortunate, but it's a fact that we need to live with. It means that we may need to do some adjustments to the hash functions.
msg327044 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-04 08:27
> Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.

What I'm doing is the most innocent change: just applying a fixed permutation on the *input* of the hash function. I'm not changing the actual hashing algorithm.
msg327045 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-04 08:30
> In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

That's pretty bad actually. With 64 bits, you statistically expect something in the order of 10**-8 collisions. So what you're seeing is 9 orders of magnitude too many collisions.
msg327055 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-04 15:30
> Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.

OK, I won't do that. The difference is not that much anyway (it increases the test success rate from about 85% to 90%)
msg327061 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-04 16:02
> Note that I'm always considering parametrized
> versions of the hash functions that I'm testing.
> I'm replacing the fixed multiplier (all algorithms
> mentioned here have such a thing) by a random
> multiplier which is 3 mod 8.

I've explained before in some detail that this makes NO SENSE for SeaHash:  its multiplier was specially constructed to guarantee certain dispersion properties, to maximize the effectiveness of its "propagate right" step.

You already have my explanation about that, and a link to the original Reddit thread in which it was proposed (and eagerly accepted by SeaHash's primary author).  Also already explained that its multiplier appears to fail its design criteria at two specific bit positions.  Which I can repair, but I would do so not by changing Python's version, but by bringing it to SeaHash's author's attention.

OTOH, I haven't been able to find anything explaining why the xxHash authors picked what they picked.  But they certainly didn't have "random" in mind - nobody does.  You discovered for yourself by trying things that various properties make for bad multipliers, and people who work on these things "for real" knew that a long time ago.  They almost certainly know a great deal more that we haven't thought of at all.
msg327063 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-04 16:16
>> In the 64-bit build there are no collisions across my
>> tests except for 11 in the new tuple test.

> That's pretty bad actually. With 64 bits, you statistically
> expect something in the order of 10**-8 collisions. So
> what you're seeing is 9 orders of magnitude too many collisions.

You wrote the test, so you should remember that you throw away the top 32 bits of the 64-bit hash code :-)  Tossing 345130 balls into 2**32 bins has an expected mean of 13.9 collisions and sdev 3.7.  11 collisions is fine.

If I change the test to retain all the hash bits in the 64-bit build, there are no collisions in the new tuple test.

If I change it again to retain only the high 32 hash bits, 27 collisions.  Which is statistically suspect, but still universes away from "disaster" territory.
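
A quick way to sanity-check those balls-into-bins numbers (a sketch using the exact occupancy formula, with the Poisson approximation for the spread):

from math import sqrt

def expected_collisions(n_keys, n_bins):
    # Expected number of colliding insertions = keys minus expected occupied bins.
    m = float(n_bins)
    occupied = m * (1.0 - (1.0 - 1.0 / m) ** n_keys)
    return n_keys - occupied

mean = expected_collisions(345130, 2 ** 32)
print("mean ~= %.1f, sdev ~= %.1f" % (mean, sqrt(mean)))   # about 13.9 and 3.7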
msg327078 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-04 19:45
>> people already wrote substantial test suites dedicated
>> to that sole purpose, and we should aim to be "mere
>> consumers" of functions that pass _those_ tests.

> There are hash functions that pass those tests which
> are still bad in practice when used as a tuple hash function.

Really?  Which one(s)?  If you mean that some fail badly when you _replace_ the constants they picked, I doubt they'd accept that as proof of anything.  Things like FNV and DJB score poorly on SMHasher to begin with.


> That's really unfortunate, but it's a fact that we need
> to live with.

I'm surprised they do as well as they do using less than a handful of invertible transformations per input, with state of the same bit width as the inputs.

I don't expect them to be immune to all easily-provoked "worse than random" cases, though.  Over time, these hashes change while keeping the same name.  As noted before, this is at least the second version of SeaHash.  The best of the original algorithms of this kind - murmurhash - is on its 3rd version now.

The motivation is the same:  someone bumps into real-life cases where they do _way way way_ worse than "random", and so they try to repair that as cheaply as possible.

Most of these are designed for data centers to do cheap-as-possible reasonably robust fingerprinting of giant data blobs.  They could already have used, e.g., SHA-2 for highly robust fingerprinting, but they don't want to pay the very much higher runtime cost.

If you can provoke any deviation from randomness in that kind of hash, it's considered "broken" and is never used again.  But in these far cheaper hashes, it's considered part of the tradeoffs.  If you read the Reddit thread about SeaHash I cited before, the person who suggested the current transformations noted that there were still weaknesses in some areas of the input space he was able to find just by thinking about how it worked, but far milder than the weaknesses he found by thinking about how the original version worked.

That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think about.  So it goes.

> It means that we may need to do some adjustments to
> the hash functions.

I'm fine with the faithful (barring errors on my part) xxHash I posted here before.  And I don't care whether it works "better" or "worse" on Python's tiny test suite if its constants are replaced, or its algorithm is "tweaked".  When xxHash version 2 is released, I want it to be as mindless as possible to replace the guts of Python's tuple-hash loop with the guts of xxHash version 2.

It's easy to believe that SMHasher has nothing testing the bit patterns produced by mixing two's-complement integers of similar magnitude but opposite signs.  That's not the kind of data giant data centers are likely to have much of ;-)  If Appleby can be talked into _adding_ that kind of keyset data to SMHasher, then the next generations of these hashes will deal with it for us.  But an objectively small number of collisions more than expected in some such cases now doesn't annoy me enough to want to bother.
msg327258 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-06 18:18
As a sanity check, I tried the 32-bit version of MurmurHash2 (version 3 doesn't have a 32-bit version).  All of my Python tests had collisions, and while the new tuple test dropped to 15, product([0.5, 0.25], repeat=20) skyrocketed from 141 (32-bit xxHash) to 3848.

I could live with that too, but xxHash does better overall on our tiny tests, and requires less code and fewer cycles.

Here's what I used:

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    const Py_uhash_t m = 0x5bd1e995;  /* MurmurHash2 multiplier */
    const int r = 24;
    Py_uhash_t x;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    x = 0x345678UL ^ (Py_uhash_t)len;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        /* MurmurHash2-style mixing of one item's hash into the accumulator. */
        Py_uhash_t t = (Py_uhash_t)y;
        t *= m;
        t ^= t >> r;
        t *= m;
        x *= m;
        x ^= t;
    }
    /* Final avalanche. */
    x ^= x >> 13;
    x *= m;
    x ^= x >> 15;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}
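
For anyone who wants to reproduce the collision counts without building CPython, here is a rough Python re-expression of the routine above, assuming a 64-bit build (so all arithmetic is reduced mod 2**64) and ignoring the C code's -1/-2 reserved-value details:

MASK = (1 << 64) - 1

def murmur2_tuplehash(t):
    m, r = 0x5bd1e995, 24
    x = 0x345678 ^ len(t)
    for item in t:
        h = hash(item) & MASK
        h = (h * m) & MASK
        h ^= h >> r
        h = (h * m) & MASK
        x = (x * m) & MASK
        x ^= h
    # Final avalanche, as in the C code.
    x ^= x >> 13
    x = (x * m) & MASK
    x ^= x >> 15
    return x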
msg327260 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-06 22:24
Thinking about the way xxHash works prompted me to try this initialization change:

    x = 0x345678UL + (Py_uhash_t)len; // adding in the length is new

That was a pure win in my tests, slashing the number of collisions in the new tuple test, 32-bit build, from 51 to 17.  In the 64-bit build, cut from 11 to 10, but when looking at the high 32 hash bits instead from 27 to 15.  There were also small improvements in other 32-bit build collision stats.

Full-blown xxHash (& SeaHash, and murmurhash, and ...) also fold in the length, but not inside their core loops.  It's useful info!

But they fold it in _after_ their core loops, not before.  They're apparently worried about this:  if their internal state ever reaches 0, that's a fixed point so long as all remaining incoming data is zeroes (in Python, e.g., picture adding one or more trailing integer or float zeroes or ..., which hash to 0).  Folding in the length at the end snaps it slightly out of zeroland.

I don't really care about that.  Folding in the length at the start quickly leads to large differences across iterations for tuples of different lengths (thanks to repeated multiplication and "propagate right" steps), which is especially helpful in things like the nested tuple tests where there's almost as much variation in tuple lengths as there is in the few little integers bottom-level tuples contain.
msg327298 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-07 18:05
I updated PR 9471 with a tuple hash function based on xxHash. The only change w.r.t. the official xxHash specification is that I'm not using parallelism and just using 1 accumulator. Please have a look.
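
For readers who don't want to dig through the PR, here is a rough Python model of what a single-accumulator XXH64-style tuple hash looks like. The prime constants are the published XXH64 primes; the seeding, length handling and finalization shown here follow plain XXH64 and may differ in detail from what the PR actually does.

MASK = (1 << 64) - 1
# Published XXH64 prime constants.
P1 = 11400714785074694791
P2 = 14029467366897019727
P3 = 1609587929392839161
P5 = 2870177450012600261

def rotl(x, r):
    return ((x << r) | (x >> (64 - r))) & MASK

def xxhash_tuple_model(t):
    acc = P5
    for item in t:
        lane = hash(item) & MASK
        # XXH64's "round" step: mix one 64-bit lane into the accumulator.
        acc = (acc + lane * P2) & MASK
        acc = rotl(acc, 31)
        acc = (acc * P1) & MASK
    acc = (acc + len(t)) & MASK   # fold in the length at the end
    # XXH64's final avalanche.
    acc ^= acc >> 33
    acc = (acc * P2) & MASK
    acc ^= acc >> 29
    acc = (acc * P3) & MASK
    acc ^= acc >> 32
    return acc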
msg327311 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-07 21:13
Attaching htest.py so we have a common way to compare what various things do on the tests that have been suggested.  unittest sucks for that.  doctest too.  Here's current code output from a 32-bit build; "ideally" we want "got" values not much larger than "mean" (these are counting collisions):

range(100) by 3; 32-bit hash codes; mean 116.42 got 0
-10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 69,596
-50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 708,066
0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 875,000
[-3, 3] by 20; 32-bit hash codes; mean 128.00 got 1,047,552
[0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 1,048,568
old tuple test; 32-bit hash codes; mean 7.43 got 6
new tuple test; 32-bit hash codes; mean 13.87 got 102,922

And under a 64-bit build, where the full hash code is considered, and also its lowest and highest 32 bits.  Note, e.g., that the old tuple test is an utter disaster if we only look at the high 32 bits.  Which is actually fine by me - the point of this is to show what happens.  Judging is a different (albeit related) issue ;-)

range(100) by 3; 64-bit hash codes; mean 0.00 got 0
range(100) by 3; 32-bit lower hash codes; mean 116.42 got 0
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 989,670
-10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 69,596
-10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 69,596
-10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 101,438
-50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 708,066
-50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 708,066
-50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 994,287
0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 500,000
0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 875,000
0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 989,824
[-3, 3] by 20; 64-bit hash codes; mean 0.00 got 1,047,552
[-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 1,047,552
[-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 1,047,552
[0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 1,048,544
[0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 1,048,575
[0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 1,048,544
old tuple test; 64-bit hash codes; mean 0.00 got 0
old tuple test; 32-bit lower hash codes; mean 7.43 got 6
old tuple test; 32-bit upper hash codes; mean 7.43 got 128,494
new tuple test; 64-bit hash codes; mean 0.00 got 102,920
new tuple test; 32-bit lower hash codes; mean 13.87 got 102,922
new tuple test; 32-bit upper hash codes; mean 13.87 got 178,211
msg327319 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-08 03:39
Here's htest.py output from

git pull https://github.com/jdemeyer/cpython.git bpo34751

and a variation.  The number of collisions in the variation appear in square brackets at the end of each line.

32-bit build:

range(100) by 3; 32-bit hash codes; mean 116.42 got 0 [0]
-10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
[-3, 3] by 20; 32-bit hash codes; mean 128.00 got 140 [108]
[0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 99 [154]
old tuple test; 32-bit hash codes; mean 7.43 got 4 [2]
new tuple test; 32-bit hash codes; mean 13.87 got 19 [21]

64-bit build:

range(100) by 3; 64-bit hash codes; mean 0.00 got 0 [0]
range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]
-10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 0 [0]
-10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 1 [0]
-10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 1 [0]
-50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 128 [0]
-50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 116 [8]
0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 123 [258]
0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 121 [0]
[-3, 3] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 129 [117]
[-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 137 [115]
[0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 126 [131]
[0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 137 [130]
old tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
old tuple test; 32-bit lower hash codes; mean 7.43 got 12 [5]
old tuple test; 32-bit upper hash codes; mean 7.43 got 54 [52]
new tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
new tuple test; 32-bit lower hash codes; mean 13.87 got 10 [6]
new tuple test; 32-bit upper hash codes; mean 13.87 got 20 [30]

They're all fine by me.  This is what the variation does:

1. Removes all "final mix" code after the loop ends.
2. Changes initialization to add in the length:

    Py_uhash_t acc = _PyHASH_XXPRIME_5 + (Py_uhash_t)len;

The vast bulk of the "final mix" code applies a chain of tuple-agnostic permutations.  The only exception is adding in the length, which #2 moves to the start instead.

Applying permutations after the loop ends changes nothing about the number of full-width hash collisions, which is Python's _primary_ concern.  Two full-width hash codes are the same after the final mix if and only if they're the same before the final mix starts.

In xxHash they're concerned about getting close to "avalanche" perfection (each input bit affects all final output bits with probability about 0.5 for each).  There aren't enough permutations _inside_ the loop to achieve that for the last input or two, so they pile up more permutations after the loop.

While we do care about "propagate left" and "propagate right" to mix up very regular and/or sparse hash codes, "avalanche perfection" is of no particular value to us.  To the contrary, in _typical_ cases like `product(range(100), repeat=3)` it's better for us _not_ to destroy all the regularity:

range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]

"Avalanche perfection" is what drives the number of collisions up with the final mix code intact.  Without it, we get "waaaaaaay better than random" numbers of collisions.

The other reason it's attractive to drop it:  the final mix code is one long critical path (no instruction-level parallelism), adding 8 serialized machine instructions including two multiplies.  That's a real expense for typically-short tuples used as dict keys or set elements.

Nothing is anywhere near disaster territory either way, although "worse than random" remains quite possible, especially when cutting a 64-bit hash in half (do note that, either way, there are no collisions in any test when retaining the full 64-bit hash code).
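
In model form, the variation amounts to this (a sketch built on the same XXH64-style loop, with the standard XXH64 primes assumed rather than taken from the PR):

MASK = (1 << 64) - 1
P1 = 11400714785074694791
P2 = 14029467366897019727
P5 = 2870177450012600261

def variant_tuplehash_model(t):
    # 2. Fold the length into the initial accumulator.
    acc = (P5 + len(t)) & MASK
    for item in t:
        lane = hash(item) & MASK
        acc = (acc + lane * P2) & MASK
        acc = ((acc << 31) | (acc >> 33)) & MASK   # rotate left by 31
        acc = (acc * P1) & MASK
    # 1. No post-loop "final mix" at all.
    return acc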
msg327380 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-09 04:31
We need to worry about timing too :-(  I'm using this as a test.  It's very heavy on using 3-tuples of little ints as dict keys.  Getting hash codes for ints is relatively cheap, and there's not much else going on, so this is intended to be very sensitive to changes in the speed of tuple hashing:

def doit():
    from time import perf_counter as now
    from itertools import product

    s = now()
    d = dict.fromkeys(product(range(150), repeat=3), 0)
    for k in d:
        d[k] += 1
    for k in d:
        d[k] *= 2
    f = now()
    return f - s

I run that five times in a loop on a mostly quiet box, and pick the smallest time as "the result".
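
In other words, a driver along these lines (assuming doit from the snippet above):

if __name__ == "__main__":
    times = [doit() for _ in range(5)]
    print("best of 5: %.3f seconds" % min(times))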

Compared to current master, a 64-bit release build under Visual Studio takes 20.7% longer.  Ouch!  That's a real hit.

Fiddling the code a bit (see the PR) to convince Visual Studio to emit a rotate instruction instead of some shifts and an add reduces that to 19.3% longer.  A little better.

The variant I discussed last time (add in the length at the start, and get rid of all the post-loop avalanche code) reduces it to 8.88% longer.  The avalanche code is fixed overhead independent of tuple length, so losing it is more valuable (for relative speed) the shorter the tuple.

I can't speed it more.  These high-quality hashes have unforgiving long critical paths, and every operation appears crucial to their good results in _some_ test we already have.  But "long" is relative to our current tuple hash, which does relatively little to scramble bits, and never "propagates right" at all.  In its loop, it does one multiply that nothing waits on, and increments the multiplier for the next iteration while the multiply is going on.

Note:  "the real" xxHash runs 4 independent streams, but it only has to read 8 bytes to get the next value to fold in.  That can go on - in a single thread - while other streams are doing arithmetic (ILP).  We pretty much have to "stop the world" to call PyObject_Hash() instead.

We could call that 4 times in a row and _then_ start arithmetic.  But most tuples that get hashed are probably less than 4 long, and the code to mix stream results together at the end is another bottleneck.  My first stab at trying to split it into 2 streams ran substantially slower on realistic-length (i.e., small) tuples - so that was also my last stab ;-)

I can live with the variant.  When I realized we never "propagate right" now, I agreed with Jeroen that the tuple hash is fundamentally "broken", even though people hadn't reported it as such yet, and even though that flaw had approximately nothing to do with the issue this bug report was opened about.  Switching to "a real" hash will avoid a universe of bad cases, and xxHash appears to be as cheap as a hash without glaringly obvious weaknesses gets:  two multiplies, an add, and a rotate.
msg327391 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-09 10:23
> Changes initialization to add in the length:

What's the rationale for that change? You always asked me to stay as close as possible to the "official" hash function which adds in the length at the end. Is there an actual benefit from doing it in the beginning?
msg327394 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-09 11:38
I pushed an update at PR 9471. I think I took into account all your comments, except for moving the length addition from the end to the beginning of the function.
msg327423 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-09 16:06
>> Changes initialization to add in the length:

> What's the rationale for that change? You always asked
> me to stay as close as possible to the "official" hash function
> which adds in the length at the end. Is there an actual benefit
> from doing it in the beginning?

The heart of xxHash is the state-updating code, neither initialization nor finalization.  Likewise for SeaHash or any of the others.

Without the post-loop avalanche code, adding the length at the end has very little effect - it will typically change only the last two bits of the final result.  _With_ the avalanche code, the length can affect every bit in the result.  But adding it in at the start also achieves that - same as changing any bit in the initial accumulator value.

Adding it in at the start instead also takes the addition off the critical path.  Which may or may not save a cycle or two (depending on processor and compiler), but can't hurt speed.

I noted before that adding the length at the end _can_ break out of a zero fixed-point (accumulator becomes 0 and all remaining values hash to 0).  Adding it at the start loses that.

So there is a theoretical danger there ... OK, trying it both ways I don't see any significant differences in my test results or a timing difference outside the noise range.  So I'm happy either way.
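
A tiny numeric illustration of that point, using the same XXH64-style loop step as a stand-in (constants assumed, not the exact PR code): with no post-loop mixing, adding the length at the end perturbs only a few low bits of the result, while adding it into the initial accumulator gets multiplied and rotated on every iteration and so spreads across the whole word.

MASK = (1 << 64) - 1
P1, P2, P5 = 11400714785074694791, 14029467366897019727, 2870177450012600261

def rotl(x, r):
    return ((x << r) | (x >> (64 - r))) & MASK

def loop(acc, hashes):
    for h in hashes:
        acc = (acc + h * P2) & MASK
        acc = rotl(acc, 31)
        acc = (acc * P1) & MASK
    return acc

hashes = [hash(x) & MASK for x in (3, -7, 12)]

# Length (3) added at the end, no avalanche: typically only a couple of low bits change.
a = loop(P5, hashes)
print(bin(a ^ ((a + 3) & MASK)))

# Length folded into the initial accumulator: typically dozens of bits change.
print(bin(loop(P5, hashes) ^ loop((P5 + 3) & MASK, hashes)).count('1'))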
msg327460 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-10-10 08:04
> it will typically change only the last two bits of the final result

Which is great if all that you care about is avoiding collisions.
msg328665 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-28 00:06
New changeset aeb1be5868623c9cd9cf6d7de3015a43fb005815 by Raymond Hettinger (jdemeyer) in branch 'master':
bpo-34751: improved hash function for tuples (GH-9471)
https://github.com/python/cpython/commit/aeb1be5868623c9cd9cf6d7de3015a43fb005815
msg329287 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2018-11-05 08:30
Many thanks!