Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-10-03.21:41:59
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1538602919.46.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
Here's a complete xxHash-based implementation via throwing away C++-isms from

https://github.com/RedSpah/xxhash_cpp/blob/master/xxhash/xxhash.hpp

In the 64-bit build there are no collisions across my tests except for 11 in the new tuple test.

The 32-bit build fares worse:

- 3 collisions in the old tuple test.
- 51 collisions in the new tuple test.
- 173 collisions across product([-3, 3], repeat=20)
- 141 collisions across product([0.5, 0.25], repeat=20)
- no other collisions

The expected number of collisions from tossing 2**20 balls into 2**32 buckets is about 128, with standard deviation 11.3.  So 141 is fine, but 173 is highly suspect.  51 for the new tuple test is way out of bounds.

But I don't much care.  None of these are anywhere near disasters, and - as I've already said - I know of no non-crypto strength hash that doesn't suffer "worse than random" behaviors in some easily-provoked cases.  All you have to do is keep trying.

Much as with SeaHash, adding

    t ^= t << 1;

at the top helps a whole lot in the "bad cases", cutting the new test's collisions to 7 in the 32-bit build.  It also cuts the product([-3, 3], repeat=20) collisions to 109.  But boosts the old tuple test's collisions to 12.  I wouldn't do it:  it adds cycles for no realistic gain.  We don't really care whether the hash "is random" - we do care about avoiding catastrophic pile-ups, and there are none with an unmodified xxHash.

BTW, making that change also _boosts_ the number of "new test" collisions to 23 in the 64-bit build (a little over its passing limit).

#define Py_NHASHBITS (SIZEOF_PY_HASH_T * CHAR_BIT)
#if Py_NHASHBITS >= 64
#    define Py_HASH_MULT1 (Py_uhash_t)11400714785074694791ULL
#    define Py_HASH_MULT2 (Py_uhash_t)14029467366897019727ULL
#    define Py_HASH_LSHIFT 31
#elif Py_NHASHBITS >= 32
#    define Py_HASH_MULT1 (Py_uhash_t)2654435761ULL
#    define Py_HASH_MULT2 (Py_uhash_t)2246822519ULL
#    define Py_HASH_LSHIFT 13
#else
#    error "can't make sense of Py_uhash_t size"
#endif
#define Py_HASH_RSHIFT (Py_NHASHBITS - Py_HASH_LSHIFT)

static Py_hash_t
tuplehash(PyTupleObject *v)
{
    Py_uhash_t x, t;  /* Unsigned for defined overflow behavior. */
    Py_hash_t y;
    Py_ssize_t len = Py_SIZE(v);
    PyObject **p;
    x = 0x345678UL;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        t = (Py_uhash_t)y;
        x += t * Py_HASH_MULT2;
        x = (x << Py_HASH_LSHIFT) | (x >> Py_HASH_RSHIFT);
        x *= Py_HASH_MULT1;
    }
    x += 97531UL;
    if (x == (Py_uhash_t)-1)
        x = -2;
    return x;
}
#undef Py_NHASHBITS
#undef Py_HASH_MULT1
#undef Py_HASH_MULT2
#undef Py_HASH_LSHIFT
#undef Py_HASH_RSHIFT
History
Date User Action Args
2018-10-03 21:41:59tim.peterssetrecipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-03 21:41:59tim.peterssetmessageid: <1538602919.46.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-03 21:41:59tim.peterslinkissue34751 messages
2018-10-03 21:41:59tim.peterscreate