Author rhettinger
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-09-21.19:26:57
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1537558017.64.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
ISTM the collision of "hash((1,0,0)) == hash((1,-2,-2))" also stems from integers hashing to themselves and that twos-complement arithmetic relation, -x == ~x+1.  Further, that example specifically exploits that "hash(-1) == hash(-2)" due to how error codes.  In a way, tuple hashing is incidental.

Occasional collisions aren't really a big deal -- it is normal for dicts to have some collisions and to resolve them.  I would be more concerned is non-contrived datasets had many elements that gave exactly the same hash value resulting in a pile-up in one place.

FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

    >>> from itertools import product
    >>> len(set(map(hash, product(range(100), repeat=4))))
    100000000

The current hash is in the pretty-darned-good category and so we should be disinclined to change battle-tested code that is known to work well across a broad range of use cases.
History
Date User Action Args
2018-09-21 19:26:57rhettingersetrecipients: + rhettinger, tim.peters, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-21 19:26:57rhettingersetmessageid: <1537558017.64.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-21 19:26:57rhettingerlinkissue34751 messages
2018-09-21 19:26:57rhettingercreate