
Author serhiy.storchaka
Recipients Eric Appelt, berker.peksag, christian.heimes, martin.panter, python-dev, rhettinger, serhiy.storchaka, tim.peters, vstinner
Date 2018-01-15.10:09:04
Content
As far as I understand, the problem is that the value obtained by XORing the hashes of the elements has a nonuniform distribution. Further transformations, whether linear or nonlinear, as well as adding in the length, don't change the fact that the result of the XOR contains less information than the number of bits in the hash word. The mathematically strong way of computing the hash of a frozenset is:

    hash(tuple(sorted(hash(x) for x in frozenset)))

Store all hash values in an array, sort them to get rid of ordering, and compute a hash of all the concatenated hashes.
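
For example, as a standalone function (the name strong_frozenset_hash is only for this illustration; it is the same formula as above):

    def strong_frozenset_hash(fs):
        # Sort the element hashes to discard ordering, then hash the tuple.
        return hash(tuple(sorted(hash(x) for x in fs)))

    print(strong_frozenset_hash(frozenset({1, 2, 'spam'})))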

But this algorithm requires linear memory and has superlinear time complexity. For practical purposes it is enough to make the deviation of the resulting distribution from the uniform distribution small enough. Maybe the following algorithm could help:

    buckets = [0] * N
    for x in frozenset:
        h = hash(x)
        i = shuffle_bits1(h) % N
        buckets[i] ^= shuffle_bits2(h)
    return hash(tuple(buckets))

where shuffle_bits1() and shuffle_bits2() are functions that shuffle bits in different ways.
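
A minimal runnable sketch of the above in Python, assuming 64-bit hash words; the two multiply-and-shift mixers and the bucket count N=8 below are arbitrary placeholders for illustration, not a concrete proposal:

    MASK = (1 << 64) - 1  # assume 64-bit hash words
    N = 8                 # bucket count; an arbitrary choice for this sketch

    def shuffle_bits1(h):
        # Placeholder mixer: multiply by an odd constant, fold in the high half.
        h = (h * 0x9E3779B97F4A7C15) & MASK
        return (h ^ (h >> 32)) & MASK

    def shuffle_bits2(h):
        # A different placeholder mixer so the two shuffles are independent.
        h ^= h >> 33
        h = (h * 0xC2B2AE3D27D4EB4F) & MASK
        return (h ^ (h >> 29)) & MASK

    def bucketed_frozenset_hash(fs):
        buckets = [0] * N
        for x in fs:
            h = hash(x) & MASK                   # reduce to a nonnegative 64-bit word
            i = shuffle_bits1(h) % N             # one shuffle picks the bucket
            buckets[i] ^= shuffle_bits2(h)       # the other shuffle is XORed into it
        return hash(tuple(buckets))

    print(bucketed_frozenset_hash(frozenset({1, 2, 'spam'})))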