Author Eric Appelt
Recipients Eric Appelt, berker.peksag, martin.panter, rhettinger, serhiy.storchaka, tim.peters, vstinner
Date 2016-10-30.19:23:01
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1477855382.67.0.129764493897.issue26163@psf.upfronthosting.co.za>
In-reply-to
Content
I spent some time looking at the Objects/setobject.c code where the hashing scheme for frozensets, which essentially works by XOR-ing together the hashes from all its entries. Significant care is taken to shuffle bits at various stages, and the algorithm seems to be well thought out. My own attempts to change constants or add in reshufflings either didn't help the collision statistics or just made things worse. I think that there is something of a fundamental limitation here due to information loss in XOR, and other fast, commutative bitwise operations (i.e. AND, OR) are well known to be much worse.

I did stumble on a 2006 paper by Boneh and Boyen [1] which looked at a potentially related problem of trying to combine two different hashing functions to improve collision resistance and found that there was no way to do this with fewer bits than just concatenating the hashes. The present ticket is more a problem of combining hashes from the same function in an order-independent way and may be completely unrelated. However, I imagine that concatenation followed by rehashing the result would remove the loss due to XORing unlucky choices of hashes.

Concatenating everything seemed obviously too slow and/or unacceptable in terms of memory use, but I thought of a compromise where I would construct an array of 4 hash values, initialized to zero, and for each entry I would select an array index based on a reshuffling of the bits, and XOR that particular entry into the chosen index. This results in a hash that is 4x wider than the standard size that reduces the information loss incurred from XOR. This wide hash can then be hashed down to a normal sized hash.

I implemented this algorithm and tested it using the same procedure as before. Specifically, all frozensets for every possible combination (128) of the letters abcdefg as single character strings are hashed, and the last 7 bits of each of their hashes are compared to see how many unique 7-bit patterns are produced. This is done for PYTHONHASHSEED values from 1 to 10000 to build a distribution. The distribution is compared to a similar distribution constructed from pseudorandom numbers for comparison.

Unlike the current hashing algorithm for frozensets, and much like the result from hashes of strings, the result of this new "wide4" algorithm appears to be nearly ideal. The results are plotted in "frozenset_string_n7_10k_wide4.png" as attached.

I will follow this up with a patch for the algorithm as soon as I run the complete test suite and clean up.

Another option if the current algorithm is considered good enough is to alter the current test to retry on failure if the power set of letters 'abcdefg...' fails by shifting one letter and trying again, perhaps ensuring that 4/5 tests pass. This ought to greatly reduce the sporadic build failures.

[1] http://ai.stanford.edu/~xb/crypto06b/blackboxhash.pdf
History
Date User Action Args
2016-10-30 19:23:02Eric Appeltsetrecipients: + Eric Appelt, tim.peters, rhettinger, vstinner, berker.peksag, martin.panter, serhiy.storchaka
2016-10-30 19:23:02Eric Appeltsetmessageid: <1477855382.67.0.129764493897.issue26163@psf.upfronthosting.co.za>
2016-10-30 19:23:02Eric Appeltlinkissue26163 messages
2016-10-30 19:23:01Eric Appeltcreate