Message 238593 - Python tracker

Author	josh.r
Recipients	dstufft, josh.r, rhettinger
Date	2015-03-20.00:47:47
Message-id	<1426812467.32.0.315256734462.issue23712@psf.upfronthosting.co.za>

Content
Assuming SipHash is in fact cryptographically secure (in the sense that you can't choose a desired result hash with better odds than brute force), and it appears to be so, then with a 64-bit output space that is evenly distributed (which a cryptographically secure hash should be), you'd need over 2**29 keys in a set to have even a 1% chance of any two of them colliding (you can plug in your own numbers for the calculation here: https://lazycackle.com/Probability_of_repeated_event_online_calculator__birthday_problem_.html ). Even at the one ten-thousandth of a percent collision threshold, you're talking about a set of 6 million strings to find even one pair that match.
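For anyone who wants to check those figures without the online calculator, here is a minimal sketch using the standard birthday-problem approximation. It assumes hashes are uniformly distributed over 64 bits; the function names and the `HASH_BITS` constant are made up for illustration and are not part of CPython.

```python
import math

HASH_BITS = 64  # assumed hash width; hypothetical constant for this sketch


def collision_probability(n_keys, bits=HASH_BITS):
    """Approximate chance that at least two of n_keys uniform hashes collide."""
    space = 2.0 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


def keys_for_probability(p, bits=HASH_BITS):
    """Approximate number of keys needed to reach collision probability p."""
    space = 2.0 ** bits
    # Invert the approximation above (treating n(n-1) as n**2).
    return math.sqrt(2.0 * space * -math.log1p(-p))


if __name__ == "__main__":
    print(keys_for_probability(0.01))     # keys needed for a 1% chance
    print(keys_for_probability(1e-6))     # keys needed for 0.0001%
    print(collision_probability(10_000))  # chance with 10,000 keys
    print(collision_probability(100))     # chance with 100 keys
```

Under this approximation the 1% threshold lands at roughly 6.1e8 keys (a little over 2**29), and the 0.0001% threshold at roughly 6.1 million keys.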

I'd still be leery of using such an approach for general purpose sets and dicts, where they could conceivably contain enough entries to pose a risk (vanishingly small, but not "heat death of the universe" small). But for Python implementation dictionaries (module, nested scope, class, and instance dictionaries), where we're talking about maybe a thousand attributes in extreme cases, and where the keys are almost never under the control of an "attacker" in any event, I could definitely see a low risk win. If you're assuming a dictionary with fewer than 10,000 keys, the chance of a hit would be a few in a trillion at most; under a hundred keys and you're below one in a quadrillion, which I think is safe enough. If you wanted to make it "safe", you could conceivably use an approach that chose the algorithm up front, depending on the size of the dictionary: fewer than a hundred entries, use hash-only lookup; a hundred or more, use "safe" lookup.
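To make the size-switched idea concrete, here is a toy sketch in pure Python. It is not how CPython's dict works and not a proposed patch: the flat list of `(hash, key, value)` entries, the `lookup` function, and the `HASH_ONLY_LIMIT` name are all hypothetical, chosen only to show the strategy being picked once, up front, from the table size.

```python
HASH_ONLY_LIMIT = 100  # threshold suggested in the message; hypothetical name


def lookup(entries, key, default=None):
    """Look up key in a list of (hash, key, value) entries.

    Small tables trust a matching hash outright; larger tables also
    compare the keys themselves (the "safe" lookup).
    """
    h = hash(key)
    hash_only = len(entries) < HASH_ONLY_LIMIT  # strategy decided up front
    for entry_hash, entry_key, value in entries:
        if entry_hash != h:
            continue
        # In hash-only mode a matching hash is accepted without comparing keys.
        if hash_only or entry_key is key or entry_key == key:
            return value
    return default
```

The trade-off is visible in the condition on the last `if`: in a small table a colliding key would be treated as a match, which is exactly the one-in-a-quadrillion event being accepted, while a table at or above the limit pays for the extra equality check.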
