Message 327005 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-10-03.19:40:21
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1538595621.85.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
Thanks for looking at xxHash! An advantage is that it already comes in 32- and 64-bit versions. > A (simplified and slightly modified version of) xxHash > seems to work very well, much better than SeaHash. ? I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test. Which it also passed, most recently with 7 collisions. That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now. > Just like SeaHash, xxHash also works in parallel. But I'm not > doing that and just using this for the loop: > > for y in t: > y ^= y * (PRIME32_2 - 1) > acc += y > acc = ((acc << 13) + (acc >> 19)) # rotate left by 13 bits > acc = MULTIPLIER > > Plain xxHash does "y = PRIME32_2" or equivalently > "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps > slightly with my new tuple test. Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me. Does the variant also score well under SMHasher? "I don't see why it wouldn't" is not an answer to that ;-) Note that the change also slows the loop a bit - "acc += y" can't begin until the multiply finishes, and the following "acc *=" can't begin until the addition finishes. Lengthening the critical path needs to buy something real to justify the extra expense. I don't see it here. For posterity: the 64-bit version of xxHash uses different primes, and rotates left by 31 instead of by 13. Note: I'm assuming that by "PRIME32_2" you mean 2246822519U, and that "MULTIPLIER" means 2654435761U. Which appear to be the actual multipliers used in, e.g., https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp As always, I'm not a fan of making gratuitous changes. In any case, without knowing the actual values you're using, nobody can replicate what you're doing. That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a real advantage in that it apparently didn't _need_ the "t ^= t << 1" permutation to clear the new tuple test's masses of replicated sign bits. But the more we make changes to either, the more work _we_ have to do to ensure we haven't introduced subtle weaknesses. Which the Python test suite will never be substantial enough to verify - we're testing almost nothing about hash behavior. Nor should we: people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests. Python's test suite is more to ensure that known problems _in Python_ don't recur over the decades.

Thanks for looking at xxHash!  An advantage is that it already comes in 32- and 64-bit versions.

> A (simplified and slightly modified version of) xxHash
> seems to work very well, much better than SeaHash.

?  I've posted several SeaHash cores that suffer no collisions at all in any of our tests (including across every "bad example" in these 100+ messages), except for "the new" tuple test.  Which it also passed, most recently with 7 collisions.  That was under 64-bit builds, though, and from what follows I figure you're only looking at 32-bit builds for now.

>  Just like SeaHash, xxHash also works in parallel. But I'm not
> doing that and just using this for the loop:
>
>    for y in t:
>        y ^= y * (PRIME32_2 - 1)
>        acc += y
>        acc = ((acc << 13) + (acc >> 19))  # rotate left by 13 bits
>        acc *= MULTIPLIER
>
> Plain xxHash does "y *= PRIME32_2" or equivalently
> "y += y * (PRIME32_2 - 1)". Replacing that += by ^= helps
> slightly with my new tuple test.

Taking an algorithm in wide use that's already known to get a top score on SMHasher and fiddling it to make a "slight" improvement in one tiny Python test doesn't make sense to me.  Does the variant also score well under SMHasher?  "I don't see why it wouldn't" is not an answer to that ;-)

Note that the change also slows the loop a bit - "acc += y" can't begin until the multiply finishes, and the following "acc *=" can't begin until the addition finishes.  Lengthening the critical path needs to buy something real to justify the extra expense.  I don't see it here.

For posterity:  the 64-bit version of xxHash uses different primes, and rotates left by 31 instead of by 13.

Note:  I'm assuming that by "PRIME32_2" you mean 2246822519U, and that "MULTIPLIER" means 2654435761U.  Which appear to be the actual multipliers used in, e.g.,

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

As always, I'm not a fan of making gratuitous changes.  In any case, without knowing the actual values you're using, nobody can replicate what you're doing.

That said, I have no reason to favor SeaHash over xxHash, and xxHash also has a real advantage in that it apparently didn't _need_ the "t ^= t << 1" permutation to clear the new tuple test's masses of replicated sign bits.

But the more we make changes to either, the more work _we_ have to do to ensure we haven't introduced subtle weaknesses.  Which the Python test suite will never be substantial enough to verify - we're testing almost nothing about hash behavior.  Nor should we:  people already wrote substantial test suites dedicated to that sole purpose, and we should aim to be "mere consumers" of functions that pass _those_ tests.  Python's test suite is more to ensure that known problems _in Python_ don't recur over the decades.

History
Date	User	Action	Args
2018-10-03 19:40:21	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-03 19:40:21	tim.peters	set	messageid: <1538595621.85.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-03 19:40:21	tim.peters	link	issue34751 messages
2018-10-03 19:40:21	tim.peters	create