Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-10-09.04:31:43
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1539059505.34.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
We need to worry about timing too :-(  I'm using this as a test.  It's very heavy on using 3-tuples of little ints as dict keys.  Getting hash codes for ints is relatively cheap, and there's not much else going on, so this is intended to be very sensitive to changes in the speed of tuple hashing:

def doit():
    from time import perf_counter as now
    from itertools import product

    s = now()
    d = dict.fromkeys(product(range(150), repeat=3), 0)
    for k in d:
         d[k] += 1
    for k in d:
         d[k] *= 2
    f = now()
    return f - s

I run that five times in a loop on a mostly quiet box, and pick the smallest time as "the result".

Compared to current master, a 64-bit release build under Visual Studio takes 20.7% longer.  Ouch!  That's a real hit.

Fiddling the code a bit (see the PR) to convince Visual Studio to emit a rotate instruction instead of some shifts and an add reduces that to 19.3% longer.  A little better.

The variant I discussed last time (add in the length at the start, and get rid of all the post-loop avalanche code) reduces it to 8.88% longer.  The avalanche code is fixed overhead independent of tuple length, so losing it is more valuable (for relative speed) the shorter the tuple.

I can't speed it more.  These high-quality hashes have unforgiving long critical paths, and every operation appears crucial to their good results in _some_ test we already have.  But "long" is relative to our current tuple hash, which does relatively little to scramble bits, and never "propagates right" at all.  In its loop, it does one multiply nothing waits on, and increments the multplier for the next iteration while the multiply is going on.

Note:  "the real" xxHash runs 4 independent streams, but it only has to read 8 bytes to get the next value to fold in.  That can go on - in a single thread - while other streams are doing arithmetic (ILP).  We pretty much have to "stop the world" to call PyObject_Hash() instead.

We could call that 4 times in a row and _then_ start arithmetic.  But most tuples that get hashed are probably less than 4 long, and the code to mix stream results together at the end is another bottleneck.  My first stab at trying to split it into 2 streams ran substantially slower on realistic-length (i.e., small) tuples - so that was also my last stab ;-)

I can live with the variant.  When I realized we never "propagate right" now, I agreed with Jeroen that the tuple hash is fundamentally "broken", despite that people hadn't reported it as such yet, and despite that that flaw had approximately nothing to do with the issue this bug report was opened about.  Switching to "a real" hash will avoid a universe of bad cases, and xxHash appears to be as cheap as a hash without glaringly obvious weaknesses gets:  two multiplies, an add, and a rotate.
History
Date User Action Args
2018-10-09 04:31:45tim.peterssetrecipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-09 04:31:45tim.peterssetmessageid: <1539059505.34.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-09 04:31:45tim.peterslinkissue34751 messages
2018-10-09 04:31:43tim.peterscreate