Message 151061 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	lemburg
Recipients	Arach, Arfrever, Huzaifa.Sidhpurwala, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, georg.brandl, gvanrossum, gz, jcea, lemburg, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date	2012-01-11.14:34:17
SpamBayes Score	1.6653345e-16
Marked as misclassified	No
Message-id	<4F0D9DE3.6010509@egenix.com>
In-reply-to	<CAMpsgwZraWfhT_bXn2F-BmYUWkPKf5Gg1evRrijffn2xmTXjzg@mail.gmail.com>

Content
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > >> * it is exceedingly complex > > Which part exactly? For hash(str), it just add two extra XOR. I'm not talking specifically about your patch, but the whole idea and the needed changes in general. >> * the method would need to be implemented for all hashable Python types > > It was already discussed, and it was said that only hash(str) need to > be modified. Really ? What about the much simpler attack on integer hash values ? You only have to send a specially crafted JSON dictionary with integer keys to a Python web server providing JSON interfaces in order to trigger the integer hash attack. The same goes for the other Python data types. >> * it causes startup time to increase (you need urandom data for >> every single hashable Python data type) > > My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do > you have a benchmark showing a difference? > > I didn't try my patch on Windows yet. Your patch only implements the simple idea of adding an init vector and a fixed suffix vector (which you don't need since it doesn't prevent hash collisions). I don't think that's good enough, since it doesn't change how the hash algorithm works on the actual data, but instead just shifts the algorithm to a different sequence. If you apply the same logic to the integer hash function, you'll see that more clearly. Paul's algorithm is much more secure in this respect, but it requires more random startup data. >> * it causes run-time to increase due to changes in the hash >> algorithm (more operations in the tight loop) > > I posted a micro-benchmark on hash(str) on python-dev: the overhead is > nul. Did you have numbers showing that the overhead is not nul? For the simple solution, that's an expected result, but if you want more safety, then you'll see a hit due to the random data getting XOR'ed in every single loop. >> * causes different processes in a multi-process setup to use different >> hashes for the same object > > Correct. If you need to get the same hash, you can disable the > randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g. > PYTHONHASHSEED=42). So you have the choice of being able to work in a multi-process environment and be vulnerable to the attack or not. I think we can do better :-) Note that web servers written in Python tend to be long running processes, so an attacker has lots of time to test various seeds. >> * doesn't appear to work well in embedded interpreters that >> regularly restarted interpreters (AFAIK, some objects persist across >> restarts and those will have wrong hash values in the newly started >> instances) > > test_capi runs _testembed which restarts a embedded interpreters 3 > times, and the test pass (with my patch version 5). Can you write a > script showing the problem if there is a real problem? > > In an older version of my patch, the hash secret was recreated at each > initiliazation. I changed my patch to only generate the secret once. Ok, that should fix the case. Two more issue that I forgot: * enabling randomized hashing can make debugging a lot harder, since it's rather difficult to reproduce the same state in a controlled way (unless you record the hash seed somewhere in the logs) and even though applications should not rely on the order of dict repr()s or str()s, they do often enough: * randomized hashing will result in repr() and str() of dictionaries to be random as well >> The most important issue, though, is that it doesn't really >> protect Python against the attack - it only makes it less >> likely that an adversary will find the init vector (or a way >> around having to find it via crypt analysis). > > I agree that the patch is not perfect. As written in the patch, it > just makes the attack more complex. I consider that it is enough. Wouldn't you rather see a fix that works for all hash functions and Python objects ? One that doesn't cause performance issues ? The collision counting idea has this potential. > Perl has a simpler protection than the one proposed in my patch. Is > Perl vulnerable to the hash collision vulnerability? I don't know what Perl did or how hashing works in Perl, so cannot comment on the effect of their fix. FWIW, I don't think that we should use Perl or Java as reference here.

STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>>  * it is exceedingly complex
> 
> Which part exactly? For hash(str), it just add two extra XOR.

I'm not talking specifically about your patch, but the whole idea
and the needed changes in general.

>>  * the method would need to be implemented for all hashable Python types
> 
> It was already discussed, and it was said that only hash(str) need to
> be modified.

Really ? What about the much simpler attack on integer hash values ?

You only have to send a specially crafted JSON dictionary with integer
keys to a Python web server providing JSON interfaces in order to
trigger the integer hash attack.

The same goes for the other Python data types.

>>  * it causes startup time to increase (you need urandom data for
>>   every single hashable Python data type)
> 
> My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do
> you have a benchmark showing a difference?
> 
> I didn't try my patch on Windows yet.

Your patch only implements the simple idea of adding an init
vector and a fixed suffix vector (which you don't need since
it doesn't prevent hash collisions).

I don't think that's good enough, since
it doesn't change how the hash algorithm works on the actual
data, but instead just shifts the algorithm to a different
sequence. If you apply the same logic to the integer hash
function, you'll see that more clearly.

Paul's algorithm is much more secure in this respect, but it
requires more random startup data.

>>  * it causes run-time to increase due to changes in the hash
>>   algorithm (more operations in the tight loop)
> 
> I posted a micro-benchmark on hash(str) on python-dev: the overhead is
> nul. Did you have numbers showing that the overhead is not nul?

For the simple solution, that's an expected result, but if you want
more safety, then you'll see a hit due to the random data getting
XOR'ed in every single loop.

>>  * causes different processes in a multi-process setup to use different
>>   hashes for the same object
> 
> Correct. If you need to get the same hash, you can disable the
> randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g.
> PYTHONHASHSEED=42).

So you have the choice of being able to work in a multi-process
environment and be vulnerable to the attack or not. I think we
can do better :-)

Note that web servers written in Python tend to be long running
processes, so an attacker has lots of time to test various
seeds.

>>  * doesn't appear to work well in embedded interpreters that
>>   regularly restarted interpreters (AFAIK, some objects persist across
>>   restarts and those will have wrong hash values in the newly started
>>   instances)
> 
> test_capi runs _testembed which restarts a embedded interpreters 3
> times, and the test pass (with my patch version 5). Can you write a
> script showing the problem if there is a real problem?
> 
> In an older version of my patch, the hash secret was recreated at each
> initiliazation. I changed my patch to only generate the secret once.

Ok, that should fix the case.

Two more issue that I forgot:

 * enabling randomized hashing can make debugging a lot harder, since
   it's rather difficult to reproduce the same state in a controlled
   way (unless you record the hash seed somewhere in the logs)

and even though applications should not rely on the order of dict
repr()s or str()s, they do often enough:

 * randomized hashing will result in repr() and str() of dictionaries
   to be random as well

>> The most important issue, though, is that it doesn't really
>> protect Python against the attack - it only makes it less
>> likely that an adversary will find the init vector (or a way
>> around having to find it via crypt analysis).
> 
> I agree that the patch is not perfect. As written in the patch, it
> just makes the attack more complex. I consider that it is enough.

Wouldn't you rather see a fix that works for all hash functions
and Python objects ? One that doesn't cause performance
issues ?

The collision counting idea has this potential.

> Perl has a simpler protection than the one proposed in my patch. Is
> Perl vulnerable to the hash collision vulnerability?

I don't know what Perl did or how hashing works in Perl, so cannot
comment on the effect of their fix. FWIW, I don't think that we
should use Perl or Java as reference here.

History
Date	User	Action	Args
2012-01-11 14:34:18	lemburg	set	recipients: + lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, jcea, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, Arach, Mark.Shannon, Zhiping.Deng, Huzaifa.Sidhpurwala, PaulMcMillan
2012-01-11 14:34:17	lemburg	link	issue13703 messages
2012-01-11 14:34:17	lemburg	create