Message 150694 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	PaulMcMillan
Recipients	Arfrever, Huzaifa.Sidhpurwala, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, georg.brandl, gvanrossum, jcea, lemburg, pitrou, terry.reedy, v+python, vstinner
Date	2012-01-05.21:40:02
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<1325799604.17.0.952753487848.issue13703@psf.upfronthosting.co.za>
In-reply-to

Content
Marc-Andre: Victor already pasted the relevant part of my code: http://bugs.python.org/issue13703#msg150568 The link to the fuller version, with revision history and a copy of the code before I modified it is here: https://gist.github.com/0a91e52efa74f61858b5 >Why? The attack doesn't work with short strings? What do you call a "short string"? Well, the demonstrated collision is for 16 character ascii strings. Worst case UTF-8, we're looking at 3 manipulable bytes per character, but they may be harder to collide since some of those bytes are fixed. > only be making it harder for script kiddies, but as soon as someone > crypt-analysis the used hash algorithm, you're lost again. Not true. What I propose is to make the amount of information necessary to analyze and generate collisions impractically large. My proposed hash function is certainly broken if you brute force the lookup table. There are undoubtedly other problems with it too. The point is that it's hard enough. We aren't going for perfect security - we're going for enough to make this attack impractical. What are the downsides to counting collisions? For one thing, it's something that needs to be kept track of on a per-dict basis, and can't be cached the way the hash results are. How do you choose a good value for the limit? If you set it to something conservative, you still pay the collision price every time a dict is created to discover that the keys collide. This means that it's possible to feed to bad data up to exactly the limit, and suddenly the python app is inexplicably slow. If you set the limit too aggressively, then sometimes valid data gets caught, and python randomly dies in hard to debug ways with an error the programmer has never seen in testing and cannot reproduce. It adds a new way to kill most python applications, and so programs are going to have to be re-written to cope with it. It also introduces a new place to cause errors - if the WSGI server dies, it's hard for my application to catch that and recover gracefully. >... not in Python itself, but if you consider all the types in Python > extensions and classes implementing __hash__ in user code, the number > of hash functions to fix quickly becomes unmanageable. When we looked at the Django project, we wouldn't have anything to fix since ours end up relying on the python internal values eventually. I suspect a lot of other code is similar. Mark said: >What is the mechanism by which the attacker can determine the seeds? The biggest information leak is probably the ordering in which dict entries are returned. This can be used to deduce the underlying hash values. This is much easier than trying to do it via timing. > But that's not the issue we are supposed to be dealing with. > A single (genuinely random) seed will deal with the attack described in > the talk and it is (almost) as fast as using 0 as a seed. This is not true. A single random seed shifts the hash table, but does not actually prevent an attacker from generating collisions. Please see my other posts on the topic here and on the mailing list.

Marc-Andre: Victor already pasted the relevant part of my code:
http://bugs.python.org/issue13703#msg150568
The link to the fuller version, with revision history and a copy of the code before I modified it is here:
https://gist.github.com/0a91e52efa74f61858b5

>Why? The attack doesn't work with short strings? What do you call a "short string"?

Well, the demonstrated collision is for 16 character ascii strings. Worst case UTF-8, we're looking at 3 manipulable bytes per character, but they may be harder to collide since some of those bytes are fixed.

> only be making it harder for script kiddies, but as soon as someone
> crypt-analysis the used hash algorithm, you're lost again.

Not true. What I propose is to make the amount of information necessary to analyze and generate collisions impractically large. My proposed hash function is certainly broken if you brute force the lookup table. There are undoubtedly other problems with it too. The point is that it's hard enough. We aren't going for perfect security - we're going for enough to make this attack impractical.

What are the downsides to counting collisions? For one thing, it's something that needs to be kept track of on a per-dict basis, and can't be cached the way the hash results are. How do you choose a good value for the limit? If you set it to something conservative, you still pay the collision price every time a dict  is created to discover that the keys collide. This means that it's possible to feed to bad data up to exactly the limit, and suddenly the python app is inexplicably slow. If you set the limit too aggressively, then sometimes valid data gets caught, and python randomly dies in hard to debug ways with an error the programmer has never seen in testing and cannot reproduce.

It adds a new way to kill most python applications, and so programs are going to have to be re-written to cope with it. It also introduces a new place to cause errors - if the WSGI server dies, it's hard for my application to catch that and recover gracefully.

>... not in Python itself, but if you consider all the types in Python
> extensions and classes implementing __hash__ in user code, the number
> of hash functions to fix quickly becomes unmanageable.

When we looked at the Django project, we wouldn't have anything to fix since ours end up relying on the python internal values eventually. I suspect a lot of other code is similar.

Mark said:
>What is the mechanism by which the attacker can determine the seeds?

The biggest information leak is probably the ordering in which dict entries are returned. This can be used to deduce the underlying hash values. This is much easier than trying to do it via timing.

> But that's not the issue we are supposed to be dealing with.
> A single (genuinely random) seed will deal with the attack described in 
> the talk and it is (almost) as fast as using 0 as a seed.

This is not true. A single random seed shifts the hash table, but does not actually prevent an attacker from generating collisions. Please see my other posts on the topic here and on the mailing list.

History
Date	User	Action	Args
2012-01-05 21:40:04	PaulMcMillan	set	recipients: + PaulMcMillan, lemburg, gvanrossum, barry, georg.brandl, terry.reedy, jcea, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, Arfrever, v+python, alex, dmalcolm, Mark.Shannon, Zhiping.Deng, Huzaifa.Sidhpurwala
2012-01-05 21:40:04	PaulMcMillan	set	messageid: <1325799604.17.0.952753487848.issue13703@psf.upfronthosting.co.za>
2012-01-05 21:40:03	PaulMcMillan	link	issue13703 messages
2012-01-05 21:40:02	PaulMcMillan	create