This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author fx5
Recipients Arach, Arfrever, Huzaifa.Sidhpurwala, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, fx5, georg.brandl, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date 2012-01-12.08:53:22
SpamBayes Score 0.0
Marked as misclassified No
Message-id <1326358404.42.0.384907026194.issue13703@psf.upfronthosting.co.za>
In-reply-to
Content
I'd like to add a few notes:

1. both 32-bit and 64-bit python are vulnerable
2. collision-counting will break other things
3. imho, randomization is the way to go, enabled by default.
4. do we need a steady hash-function later again?

I created ~500KB of colliding strings for both 32-bit and 64-bit python.
It works impressively good:

32bit: ~500KB payload keep django busy for >30 minutes.
64bit: ~500KB payload keep django busy for 5 minutes.

Django is more vulnerable than python-dict alone, because it
* converts the strings to unicode first, making the comparision more expensive
* does 5 dict-lookups per key.

So Python's dict of str alone is probably ~10x faster. Of course it's much harder to create the payload for 64-bit python than for 32-bit, but it works for both.

The collision-counting idea makes some sense in the web environment, but for other software types it can causes serious problems.

I don't want my software to stop working because someone managed to enter 1000 bad strings into it. Think of a software that handles names of customers or filenames. We don't want it to break completely just because someone entered a few clever names.

Randomization fixes most of these problems.

However, it breaks the steadiness of hash(X) between two runs of the same software. There's probably code out there that assumes that hash(X) always returns the same value: database- or serialization-modules, for example.

There might be good reasons to also have a steady hash-function available. The broken code is hard to fix if no such a function is available at all. Maybe it's possible to add a second steady hash-functions later again?

For the moment I think the best way is to turn on randomization of hash() by default, but having a way to turn it off.
History
Date User Action Args
2012-01-12 08:53:24fx5setrecipients: + fx5, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, jcea, mark.dickinson, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, Zhiping.Deng, Huzaifa.Sidhpurwala, PaulMcMillan
2012-01-12 08:53:24fx5setmessageid: <1326358404.42.0.384907026194.issue13703@psf.upfronthosting.co.za>
2012-01-12 08:53:23fx5linkissue13703 messages
2012-01-12 08:53:23fx5create