classification
Title: Experiment: Assume that exact unicode hashes are perfect discriminators
Type: performance Stage:
Components: Interpreter Core Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: dstufft, ezio.melotti, josh.r, lemburg, pitrou, rhettinger, scoder, serhiy.storchaka, tim.peters
Priority: low Keywords: patch

Created on 2015-03-19 19:57 by rhettinger, last changed 2015-05-23 01:26 by rhettinger. This issue is now closed.

Files
File name Uploaded Description
assume_perf_uni_hash1.diff rhettinger, 2015-03-19 20:04 Assume that unicode hashes are perfect discriminators
Messages (9)
msg238552 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-03-19 19:57
This tracker item is for a thought experiment I'm running where I can collect the thoughts and discussions in one place.  It is not an active proposal for inclusion in Python.

The idea is to greatly speed up set/dict lookups of unicode values by skipping the exact string comparison when the unicode type is exact and the 64-bit hash values are known to match.

Given siphash and hash randomization, we get a 1 in 2**64 chance of a false positive (which is better than the error rate of non-ECC DRAM itself).

However, since siphash isn't cryptographically secure, presumably a malicious chooser of keys could generate a false positive on purpose.

This technique is currently used by git and mercurial, which use hash values for file and version graphs without checking for an exact match (because the chance of a false positive is vanishingly small).

The Python test suite passes, as do the test suites of a number of packages I have installed.
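The actual patch modifies the C lookup routine in the interpreter; the proposed fast path can be sketched as a rough Python model (the `table`/`lookup` names are hypothetical, purely for illustration):

```python
# Rough model of the proposed shortcut: for exact-unicode keys, treat a
# matching 64-bit hash as proof of equality and skip the full comparison.

def lookup(table, key):
    """table maps hash(key) -> (stored_key, value); a stand-in for a
    dict's internal probe sequence, which already compared hashes."""
    h = hash(key)
    entry = table.get(h)
    if entry is None:
        raise KeyError(key)
    stored_key, value = entry
    if type(key) is str and type(stored_key) is str:
        return value          # proposed: hashes matched, assume keys equal
    if stored_key == key:     # current behavior: confirm with a full compare
        return value
    raise KeyError(key)

table = {hash("spam"): ("spam", 42)}
print(lookup(table, "spam"))  # 42
```

The fast path fires only when both keys are exactly `str`; subclasses with custom `__eq__` still take the full comparison.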
msg238577 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-19 22:57
I'm not sure what you mean by "siphash isn't cryptographically secure". One of the key points of SipHash is that it *is* cryptographically secure. It has a smaller output space than your typical hash function (MD5, SHA1, SHA2, etc.), which means that collisions themselves are more likely, but the hash itself is cryptographically secure against pre-image attacks and it has collision resistance (collision resistance is different from just happening to get a collision).
msg238578 - (view) Author: Donald Stufft (dstufft) * (Python committer) Date: 2015-03-19 23:01
To be clear, I have no opinion on your specific proposal and I don't know if the difference between "cryptographically secure" and "not cryptographically secure" matters for it. I just wanted to be clear that with SipHash an attacker should *not* be able to choose keys that will collide on purpose, but given the small output space (64 bits, IIRC) collisions are more likely to occur than with MD5, SHA1, SHA2, etc.
msg238593 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-03-20 00:47
Assuming SipHash is in fact cryptographically secure (in the sense that you can't choose a desired result hash with better odds than brute force), and it appears to be so, with a keyspace of 64 bits and even distribution (which a cryptographically secure hash should have), that implies that, to have even a 1% chance of any two keys in a set colliding, you'd need over 2**29 keys (you can plug in your own numbers for the conditional calculation here: https://lazycackle.com/Probability_of_repeated_event_online_calculator__birthday_problem_.html ). Even at a one ten-thousandth of a percent collision threshold, you're talking about a set of 6 million strings before even one pair matches.

I'd still be leery of using such an approach for general-purpose sets and dicts, where they could conceivably contain enough entries to pose a risk (vanishingly small, but not "heat death of the universe" small). But for Python implementation dictionaries (module, nested scope, class, and instance dictionaries), where we're talking about maybe a thousand attributes in extreme cases, which are almost never under the control of an "attacker" in any event, I could definitely see a low-risk win. If you're assuming a dictionary with fewer than 10,000 keys, the chance of a hit would be literally one in a trillion; under a hundred keys and you're below a one-in-a-quadrillion chance, which I think is safe enough. If you wanted to make it "safe", you could conceivably switch algorithms up front depending on the size of the dictionary: fewer than a hundred entries, use hash-only lookup; above a hundred, use "safe" lookup.
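The numbers above can be checked against the standard birthday-bound approximation; the helper below is hypothetical, written just to reproduce the figures quoted in this message:

```python
# Probability that n uniformly random b-bit hashes contain at least one
# collision, using the birthday-problem approximation
# P(n) ~= 1 - exp(-n*(n-1) / (2 * 2**b)).
import math

def collision_probability(n, bits=64):
    # expm1 keeps precision for the tiny probabilities involved here
    return -math.expm1(-n * (n - 1) / (2 * 2**bits))

print(collision_probability(2**29))      # ~0.008 -- near the 1% mark
print(collision_probability(6_000_000))  # ~1e-6, a ten-thousandth of a percent
print(collision_probability(10_000))     # ~3e-12, roughly one in a trillion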
msg238609 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2015-03-20 07:15
The problem is, even if the chance is vanishingly small, it's not zero. Meaning, someone's data set somewhere will break somehow. And with hash randomisation, the problem spotted in some live system will be entirely unreproducible without knowing the exact hash seed value that triggered it (i.e. not at all in a debug setup, and definitely not in a test suite).

Unequal keys accidentally comparing equal in a dict is something that wouldn't necessarily crash loudly. It may just lead to very subtly incorrect results, e.g. one lost item in an intermediate step of a process, and half a cent up or down in the final result.

Don't get me wrong, I think this is a very interesting idea and worth exploring. But I'd be very reluctant to introduce it into the core language general purpose data types.

As for Python implementation dicts, the keys in there will normally be interned, so the lookup will not compare the strings but only their pointers. Nothing to win on that front, really.
msg239242 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-03-25 09:46
Given that interned strings can already be compared by pointer, I'd recommend people call sys.intern() on their strings if they want really fast equality comparisons.
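A small demonstration of the interning approach suggested here: interned strings are the same object, so equality checks (and dict lookups) can short-circuit on identity without comparing characters.

```python
import sys

# Build the key at runtime so it isn't automatically interned as a literal.
a = sys.intern("some long lookup key " * 10)
b = sys.intern("some long lookup key " * 10)
print(a is b)  # True: both names refer to the single interned object

d = {a: 1}
print(d[b])    # 1: the lookup hits the identity fast path
```

CPython's dict lookup already checks pointer identity before falling back to a full comparison, which is why interning pays off here.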
msg239252 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-03-25 13:22
In the case of class or module dicts, attribute names are usually interned, so no string comparison is needed when the key is found. It is needed only when the key is not present but a key with the same hash is found (with chance 5e-28). The benefit of the optimization is 5e-28 times the cost of a string comparison per key lookup.
msg239253 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-03-25 13:40
I wouldn't make this the default, since the probability is small, but not zero.

However, it may be worth exploring this as a dict sub-type for applications which could benefit from it.

For internal strings, we already use interning, so there is no win there. The probabilistic approach would only make sense for dictionaries which store non-interned Unicode keys and see lots of lookups.
msg243877 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-23 01:26
Closing this because it was never an active proposal, just more of a thought experiment that I didn't want to lose track of.
History
Date User Action Args
2015-05-23 01:26:45rhettingersetstatus: open -> closed
resolution: not a bug
messages: + msg243877
2015-03-25 13:40:07lemburgsetnosy: + lemburg
messages: + msg239253
2015-03-25 13:22:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg239252
2015-03-25 09:46:08pitrousetnosy: + pitrou, tim.peters
messages: + msg239242
2015-03-24 22:40:46rhettingersetpriority: normal -> low
versions: - Python 3.5
2015-03-24 19:41:47ezio.melottisetnosy: + ezio.melotti
2015-03-20 07:15:14scodersetnosy: + scoder
messages: + msg238609
2015-03-20 00:47:47josh.rsetnosy: + josh.r
messages: + msg238593
2015-03-19 23:01:40dstufftsetmessages: + msg238578
2015-03-19 22:57:28dstufftsetnosy: + dstufft
messages: + msg238577
2015-03-19 20:05:19rhettingersetfiles: - assume_perf_uni_hash.diff
2015-03-19 20:04:59rhettingersetfiles: + assume_perf_uni_hash1.diff
2015-03-19 19:57:42rhettingercreate