Message327014
Here's a complete xxHash-based implementation via throwing away C++-isms from
https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp
In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.
The 32-bit build fares worse:
- 3 collisions in the old tuple test.
- 51 collisions in the new tuple test.
- 173 collisions across product([-3, 3], repeat=20)
- 141 collisions across product([0.5, 0.25], repeat=20)
- no other collisions
The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3. So 141 is fine, but 173 is highly suspect. 51 for the new tuple test is way out of bounds.
But I don't much care. None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases. All you have to do is keep trying.
Much as with SeaHash, adding
t ^= t << 1;
at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build. It also cuts the product([-3, 3], repeat=20) collisions to 109. But boosts the old tuple test's collisions to 12. I wouldn't do it: it adds cycles for no realistic gain. We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash.
BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit).
#define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
#if Py_NHASHBITS >= 64
# define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
# define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
# define Py_HASH_LSHIFT 31
#elif Py_NHASHBITS >= 32
# define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
# define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
# define Py_HASH_LSHIFT 13
#else
# error "can't make sense of Py_uhash_t size"
#endif
#define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)
static Py_hash_t
tuplehash(PyTupleObject *v)
{
Py_uhash_t x, t; /* Unsigned for defined overflow behavior. */
Py_hash_t y;
Py_ssize_t len = Py_SIZE(v);
PyObject **p;
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
t = (Py_uhash_t)y;
x += t * Py_HASH_MULT2;
x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
x *= Py_HASH_MULT1;
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
x = -2;
return x;
}
#undef Py_NHASHBITS
#undef Py_HASH_MULT1
#undef Py_HASH_MULT2
#undef Py_HASH_LSHIFT
#undef Py_HASH_RSHIFT |
|
Date |
User |
Action |
Args |
2018-10-03 21:41:59 | tim.peters | set | recipients:
+ tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd |
2018-10-03 21:41:59 | tim.peters | set | messageid: <1538602919.46.0.545547206417.issue34751@psf.upfronthosting.co.za> |
2018-10-03 21:41:59 | tim.peters | link | issue34751 messages |
2018-10-03 21:41:59 | tim.peters | create | |
|