Message 151121 - Python tracker

Author	lemburg
Recipients	Arach, Arfrever, Huzaifa.Sidhpurwala, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, fx5, georg.brandl, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date	2012-01-12.09:27:35
Message-id	<4F0EA780.8020007@egenix.com>
In-reply-to	<1326358404.42.0.384907026194.issue13703@psf.upfronthosting.co.za>

Content
Frank Sievertsen wrote:
> 
> I don't want my software to stop working because someone managed to enter 1000 bad strings into it. Think of software that handles customer names or filenames. We don't want it to break completely just because someone entered a few clever names.

Collision counting is just a simple way to trigger an action. As I mentioned
in my proposal on this ticket, raising an exception is just one way to deal
with the problem in case excessive collisions are found. A better way is to
add a universal hash method, so that the dict can adapt to the data and
modify the hash functions for just that dict (without breaking other
dicts or changing the standard hash functions).
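
To illustrate what "collision counting as a trigger" could look like, here is a minimal sketch of a toy open-addressing table. It is not CPython's dict implementation and not the patch discussed on this ticket; the names CountingTable, TooManyCollisions, max_collisions and hash_func are assumptions made for this example only. The per-table hash_func parameter hints at the second idea: a table could switch its own hash function without touching other dicts or the built-in hash().

```python
class TooManyCollisions(Exception):
    """Raised when a single insert or lookup probes too many occupied slots."""


class CountingTable:
    """Toy fixed-size table with linear probing and a collision limit.

    Illustration only: no resizing, no deletion, hypothetical names.
    """

    def __init__(self, size=64, max_collisions=1000, hash_func=hash):
        self._slots = [None] * size
        self._max_collisions = max_collisions
        self._hash = hash_func  # per-table hash function, swappable

    def _find_slot(self, key):
        size = len(self._slots)
        index = self._hash(key) % size
        collisions = 0
        # Walk past slots occupied by other keys, counting each one.
        while self._slots[index] is not None and self._slots[index][0] != key:
            collisions += 1
            if collisions > self._max_collisions:
                # The trigger: here we raise, but a real design could
                # instead switch this table to a different hash function.
                raise TooManyCollisions(key)
            index = (index + 1) % size
        return index

    def __setitem__(self, key, value):
        self._slots[self._find_slot(key)] = (key, value)

    def __getitem__(self, key):
        entry = self._slots[self._find_slot(key)]
        if entry is None:
            raise KeyError(key)
        return entry[1]
```

With a deliberately bad hash function such as `lambda k: 0` and a small limit, the first few inserts succeed and a later one raises TooManyCollisions, while tables using other hash functions are unaffected.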

Note that raising an exception doesn't completely break your software.
It just signals a severe problem with the input data and a likely
attack on your software. As such, it's no different from turning on DoS
attack prevention in your router.

In case you do get an exception, a web server will simply return a 500 error
and continue working normally.
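
As a rough sketch of that behaviour, assuming a WSGI application (the message does not name a server or framework), a small middleware can turn an exception raised while handling one request into a 500 response, leaving the process free to serve the next request. The name guard and its exc_type parameter are hypothetical. The sketch ignores the case where the application has already started its response or fails lazily while its result is being iterated.

```python
def guard(app, exc_type=Exception):
    """Wrap a WSGI app so that an exception becomes a 500 response."""

    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except exc_type:
            # Only this request fails; the server keeps running.
            start_response("500 Internal Server Error",
                           [("Content-Type", "text/plain")])
            return [b"Internal Server Error\n"]

    return wrapped
```

A real deployment would also log the exception, which is where the "failure notice in your logs" would come from.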

For other applications, you may see a failure notice in your logs. If
you're sure that there are no possible ways to attack the application using
such data, then you can simply disable the feature to prevent such
exceptions.

> Randomization fixes most of these problems.

See my list of issues with this approach (further up on this ticket).

> However, it breaks the steadiness of hash(X) between two runs of the same software. There's probably code out there that assumes that hash(X) always returns the same value: database- or serialization-modules, for example.
> 
> There might be good reasons to also have a steady hash function available. The broken code is hard to fix if no such function is available at all. Maybe it's possible to add a second, steady hash function again later?

This is one of the issues I mentioned.
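
For code that needs a value that stays the same across runs (for example, keys written to a database), one option that does not depend on hash() at all is to derive the value from a cryptographic digest. The following is only a sketch of that idea, not something proposed on this ticket; the function name steady_hash and the choice of SHA-256 truncated to 64 bits are assumptions.

```python
import hashlib


def steady_hash(text):
    """Return a 64-bit integer that is identical in every process and run."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned big-endian integer.
    return int.from_bytes(digest[:8], "big")
```

This is much slower than the built-in string hash and is not meant for use inside dicts; it only addresses the persistence use case.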

> For the moment I think the best way is to turn on randomization of hash() by default, but having a way to turn it off.

History
Date	User	Action	Args
2012-01-12 09:27:36	lemburg	set	recipients: + lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, jcea, mark.dickinson, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, Zhiping.Deng, Huzaifa.Sidhpurwala, PaulMcMillan, fx5
2012-01-12 09:27:35	lemburg	link	issue13703 messages
2012-01-12 09:27:35	lemburg	create