classification
Title: Experiment: Assume that exact unicode hashes are perfect discriminators
Type: performance Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: dstufft, ezio.melotti, josh.r, lemburg, pitrou, rhettinger, scoder, serhiy.storchaka, tim.peters
Priority: low Keywords: patch

Created on 2015-03-19 19:57 by rhettinger, last changed 2015-05-23 01:26 by rhettinger. This issue is now closed.

Files
File name Uploaded Description
assume_perf_uni_hash1.diff rhettinger, 2015-03-19 20:04 Assume that unicode hashes are perfect discriminators
Messages (9)
msg238552 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-03-19 19:57
This tracker item is for a thought experiment I'm running where I can collect the thoughts and discussions in one place.  It is not an active proposal for inclusion in Python.

The idea is to greatly speed up set/dict lookups of unicode values by skipping the exact string comparison when the unicode type is exact and the 64-bit hash values are known to match.

Given siphash and hash randomization, we get a 1 in 2**64 chance of a false positive (which is better than the error rate of non-ECC DRAM itself).

However, since siphash isn't cryptographically secure, presumably a malicious chooser of keys could generate a false positive on purpose.

This technique is currently used by git and mercurial, which use hash values for file and version graphs without checking for an exact match (because the chance of a false positive is vanishingly small).

The Python test suite passes, as do the test suites of a number of packages I have installed.
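The actual patch modifies the C lookup routine in the interpreter; the proposed fast path can be sketched as a rough Python model (the `table`/`lookup` names are hypothetical, purely for illustration):

```python
# Rough model of the proposed shortcut: for exact-unicode keys, treat a
# matching 64-bit hash as proof of equality and skip the full comparison.

def lookup(table, key):
    """table maps hash(key) -> (stored_key, value); a stand-in for a
    dict's internal probe sequence, which already compared hashes."""
    h = hash(key)
    entry = table.get(h)
    if entry is None:
        raise KeyError(key)
    stored_key, value = entry
    if type(key) is str and type(stored_key) is str:
        return value          # proposed: hashes matched, assume keys equal
    if stored_key == key:     # current behavior: confirm with a full compare
        return value
    raise KeyError(key)

table = {hash("spam"): ("spam", 42)}
print(lookup(table, "spam"))  # 42
```

The fast path fires only when both keys are exactly `str`; subclasses with custom `__eq__` still take the full comparison.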
msg238577 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-19 22:57
I'm not sure what you mean by "siphash isn't cryptographically secure". One of the key points of SipHash is that it *is* cryptographically secure. It has a smaller output space than your typical hash function (MD5, SHA1, SHA2, etc.), which means that collisions themselves are more likely, but the hash itself is cryptographically secure against pre-image attacks and it has collision resistance (collision resistance is different from just happening to get a collision).
msg238578 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-19 23:01
To be clear, I have no opinion on your specific proposal and I don't know if the difference between "cryptographically secure" and "not cryptographically secure" matters for it. I just wanted to be clear that with SipHash an attacker should *not* be able to choose keys that will collide on purpose, but given the small output space (64 bits, IIRC) collisions are more likely to occur than with MD5, SHA1, SHA2, etc.
msg238593 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-03-20 00:47
Assuming SipHash is in fact cryptographically secure (in the sense that you can't choose a desired result hash with better odds than brute force), and it appears to be so, with a keyspace of 64 bits and even distribution (which a cryptographically secure hash should have), that implies that, to have even a 1% chance of any two keys in a set colliding, you'd need over 2**29 keys (you can plug in your own numbers for the conditional calculation here: https://lazycackle.com/Probability_of_repeated_event_online_calculator__birthday_problem_.html ). Even at a one ten-thousandth of a percent collision threshold, you're talking about a set of 6 million strings before even one pair matches.

I'd still be leery of using such an approach for general-purpose sets and dicts, where they could conceivably contain enough entries to pose a risk (vanishingly small, but not "heat death of the universe" small). But for Python implementation dictionaries (module, nested scope, class, and instance dictionaries), where we're talking about maybe a thousand attributes in extreme cases, which are almost never under the control of an "attacker" in any event, I could definitely see a low-risk win. If you're assuming a dictionary with fewer than 10,000 keys, the chance of a hit would be literally one in a trillion; under a hundred keys and you're below a one-in-a-quadrillion chance, which I think is safe enough. If you wanted to make it "safe", you could conceivably switch algorithms up front depending on the size of the dictionary: fewer than a hundred entries, use hash-only lookup; above a hundred, use "safe" lookup.
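The numbers above can be checked against the standard birthday-bound approximation; the helper below is hypothetical, written just to reproduce the figures quoted in this message:

```python
# Probability that n uniformly random b-bit hashes contain at least one
# collision, using the birthday-problem approximation
# P(n) ~= 1 - exp(-n*(n-1) / (2 * 2**b)).
import math

def collision_probability(n, bits=64):
    # expm1 keeps precision for the tiny probabilities involved here
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))

print(collision_probability(2**29))      # ~0.008 -- near the 1% mark
print(collision_probability(6_000_000))  # ~1e-6, a ten-thousandth of a percent
print(collision_probability(10_000))     # ~3e-12, roughly one in a trillion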
msg238609 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-03-20 07:15
The problem is, even if the chance is vanishingly small, it's not zero. Meaning, someone's data set somewhere will break somehow. And with hash randomisation, the problem spotted in some live system will be entirely unreproducible without knowing the exact hash seed value that triggered it (i.e. not at all in a debug setup, and definitely not in a test suite).

Unequal keys accidentally comparing equal in a dict is something that wouldn't necessarily crash loudly. It may just lead to very subtly incorrect results, e.g. one lost item in an intermediate step of a process, and half a cent up or down in the final result.

Don't get me wrong, I think this is a very interesting idea and worth exploring. But I'd be very reluctant to introduce it into the core language general purpose data types.

As for Python implementation dicts, the keys in there will normally be interned, so the lookup will not compare the strings but only their pointers. Nothing to win on that front, really.
msg239242 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-03-25 09:46
Given that interned strings can already be compared by pointer, I'd recommend people call sys.intern() on their strings if they want really fast equality comparisons.
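A small demonstration of the interning approach suggested here: interned strings are the same object, so equality checks (and dict lookups) can short-circuit on identity without comparing characters.

```python
import sys

# Build the key at runtime so it isn't automatically interned as a literal.
a = sys.intern("some long lookup key " * 10)
b = sys.intern("some long lookup key " * 10)
print(a is b)  # True: both names refer to the single interned object

d = {a: 1}
print(d[b])    # 1: the lookup hits the identity fast path
```

CPython's dict lookup already checks pointer identity before falling back to a full comparison, which is why interning pays off here.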
msg239252 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-25 13:22
In the case of class or module dicts, attribute names are usually interned, so no string comparison is needed when the key is found. It is needed only when the key is not present but a key with the same hash is found (with chance 5e-28). The benefit of the optimization is 5e-28 times the cost of a string comparison per key lookup.
msg239253 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-03-25 13:40
I wouldn't make this the default, since the probability is small, but not zero.

However, it may be worth exploring this as a dict sub-type for applications which could benefit from it.

For internal strings, we already use interning, so there is no win there. The probabilistic approach would only make sense for dictionaries which store non-interned Unicode keys and see lots of lookups.
msg243877 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-23 01:26
Closing this because it was never an active proposal, just more of a thought experiment that I didn't want to lose track of.
History
Date User Action Args
2015-05-23 01:26:45rhettingersetstatus: open -> closed
resolution: not a bug
messages: + msg243877
2015-03-25 13:40:07lemburgsetnosy: + lemburg
messages: + msg239253
2015-03-25 13:22:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg239252
2015-03-25 09:46:08pitrousetnosy: + pitrou, tim.peters
messages: + msg239242
2015-03-24 22:40:46rhettingersetpriority: normal -> low
versions: - Python 3.5
2015-03-24 19:41:47ezio.melottisetnosy: + ezio.melotti
2015-03-20 07:15:14scodersetnosy: + scoder
messages: + msg238609
2015-03-20 00:47:47josh.rsetnosy: + josh.r
messages: + msg238593
2015-03-19 23:01:40dstufftsetmessages: + msg238578
2015-03-19 22:57:28dstufftsetnosy: + dstufft
messages: + msg238577
2015-03-19 20:05:19rhettingersetfiles: - assume_perf_uni_hash.diff
2015-03-19 20:04:59rhettingersetfiles: + assume_perf_uni_hash1.diff
2015-03-19 19:57:42rhettingercreate