Message 327078 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-10-04.19:45:42
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1538682342.49.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
>> people already wrote substantial test suites dedicated >> to that sole purpose, and we should aim to be "mere >> consumers" of functions that pass _those_ tests. > There are hash functions that pass those tests which > are still bad in practice when used as tuple hash function. Really? Which one(s)? If you're talking about that some fail badly when you _replace_ the constants they picked, I doubt they'd accept that as proof of anything. Things like FNV and DJB score poorly on SMHasher to begin with. > That's really unfortunate, but it's a fact that we need > to live with. I'm surprised they do as well as they do using less than a handful of invertible transformations per input, using state of the same bit width as the inputs. I don't expect them to be immune to all easily-provoked "worse than random" cases, though. Over time, these hashes change while keeping the same name. As noted before, this is at least the second version of SeaHash. The best of the original algorithms of this kind - murmurhash - is on its 3rd version now. The motivation is the same: someone bumps into real-life cases where they do _way way way_ worse than "random", and so they try to repair that as cheaply as possible. Most of these are designed for data centers to do cheap-as-possible reasonably robust fingerprinting of giant data blobs. They could already have used, e.g., SHA-2 for highly robust fingerprinting, but they don't want to pay the very much higher runtime cost. If you can provoke any deviation from randomness in that kind of hash, it's considered "broken" and is never used again. But in these far cheaper hashes, it's considered part of the tradeoffs. If you read the Reddit thread about SeaHash I cited before, the person who suggested the current transformations noted that there were still weaknesses in some areas of the input space he was able to find just by thinking about how it worked, but far milder than the weaknesses he found by thinking about how the original version worked. That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think about. So it goes. > It means that we may need to do some adjustments to > the hash functions. I'm fine with the faithful (barring errors on my part) xxHash I posted here before. And I don't care whether it works "better" or "worse" on Python's tiny test suite if its constants are replaced, or its algorithm is "tweaked". When xxHash version 2 is released, I want it to be as mindless as possible to replace the guts of Python's tuple-hash loop with the guts of xxHash version 2. It's easy to believe that SMHasher has nothing testing the bit patterns produced by mixing two's-complement integers of similar magnitude but opposite signs. That's not the kind of data giant data centers are likely to have much of ;-) If Appleby can be talked into _adding_ that kind of keyset data to SMHasher, then the next generations of these hashes will deal with it for us. But an objectively small number of collisions more than expected in some such cases now doesn't annoy me enough to want to bother.

>> people already wrote substantial test suites dedicated
>> to that sole purpose, and we should aim to be "mere
>> consumers" of functions that pass _those_ tests.

> There are hash functions that pass those tests which
> are still bad in practice when used as tuple hash function.

Really?  Which one(s)?  If you're talking about that some fail badly when you _replace_ the constants they picked, I doubt they'd accept that as proof of anything.  Things like FNV and DJB score poorly on SMHasher to begin with.


> That's really unfortunate, but it's a fact that we need
> to live with.

I'm surprised they do as well as they do using less than a handful of invertible transformations per input, using state of the same bit width as the inputs.

I don't expect them to be immune to all easily-provoked "worse than random" cases, though.  Over time, these hashes change while keeping the same name.  As noted before, this is at least the second version of SeaHash.  The best of the original algorithms of this kind - murmurhash - is on its 3rd version now.

The motivation is the same:  someone bumps into real-life cases where they do _way way way_ worse than "random", and so they try to repair that as cheaply as possible.

Most of these are designed for data centers to do cheap-as-possible reasonably robust fingerprinting of giant data blobs.  They could already have used, e.g., SHA-2 for highly robust fingerprinting, but they don't want to pay the very much higher runtime cost.

If you can provoke any deviation from randomness in that kind of hash, it's considered "broken" and is never used again.  But in these far cheaper hashes, it's considered part of the tradeoffs.  If you read the Reddit thread about SeaHash I cited before, the person who suggested the current transformations noted that there were still weaknesses in some areas of the input space he was able to find just by thinking about how it worked, but far milder than the weaknesses he found by thinking about how the original version worked.

That doesn't imply there aren't far worse weaknesses in areas he _didn't_ think about.  So it goes.

> It means that we may need to do some adjustments to
> the hash functions.

I'm fine with the faithful (barring errors on my part) xxHash I posted here before.  And I don't care whether it works "better" or "worse" on Python's tiny test suite if its constants are replaced, or its algorithm is "tweaked".  When xxHash version 2 is released, I want it to be as mindless as possible to replace the guts of Python's tuple-hash loop with the guts of xxHash version 2.

It's easy to believe that SMHasher has nothing testing the bit patterns produced by mixing two's-complement integers of similar magnitude but opposite signs.  That's not the kind of data giant data centers are likely to have much of ;-)  If Appleby can be talked into _adding_ that kind of keyset data to SMHasher, then the next generations of these hashes will deal with it for us.  But an objectively small number of collisions more than expected in some such cases now doesn't annoy me enough to want to bother.

History
Date	User	Action	Args
2018-10-04 19:45:42	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-04 19:45:42	tim.peters	set	messageid: <1538682342.49.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-04 19:45:42	tim.peters	link	issue34751 messages
2018-10-04 19:45:42	tim.peters	create