New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash collision security issue #57912
Comments
This is already publicly known and in deep discussion on python-dev. The proper fix is still TBD. Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries. Start here: http://mail.python.org/pipermail/python-dev/2011-December/115116.html |
I had a short chat with Guido yesterday. I'll try to sum up the conversation. Guido, please correct me if I got something wrong or missed a point. Guido wants the fix as simple and less intrusive as possible as he wants to provide/apply a patch for Python 2.4 to 3.3. This means any new stuff is off the table unless it's really, really necessary. Say goodbye to my experimental MurmurHash3 patch. We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi. |
I think on the contrary it must be enabled by default. Leaving security |
We can't foresee the implications of the randomization and only a small number of deployments is affected by the problem. But I won't start a fight on the matter. ;) |
I'm with Antoine -- turn it on by default. Maybe there should be a release candidate to test the waters. |
On Jan 03, 2012, at 08:24 PM, Antoine Pitrou wrote:
Unless there's evidence of performance regressions or backward |
If hash() is modified, str(dict) and str(set) will change for example. It may break doctests. Can we consider that the application should not rely (indirectly) on hash and so fix (for example) their doctests? Or is it a backward incompatibility? hash() was already modified in major Python versions. For this specific issue, I consider that security is more important than str(dict). |
Barry, when this gets fixed, shall we coordinate release times? |
On Jan 03, 2012, at 09:43 PM, Benjamin Peterson wrote:
Yes! |
Randomized hashing destabilizes the unit tests of Python, too. Here are the outputs of four test runs: 11 tests failed: 9 tests failed: 10 tests failed: 9 tests failed: |
I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being an obvious candidate). I'll link my proposed algorithm change here, for the record: I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack. |
Christian Heimes proposes the following change in its randomhash branch (see issue bpo-13704): - x = (Py_uhash_t) *p << 7;
+ x = Py_RndHashSeed + ((Py_uhash_t) *p << 7);
for (i = 0; i < len; i++)
x = (1000003U * x) ^ (Py_uhash_t) *p++;
x ^= (Py_uhash_t) len; This change doesn't add any security if the attacker can inject any string and retreive the hash value. You can retreive directly Py_RndHashSeed using: Py_RndHashSeed = intmask((hash("a") ^ len("a") ^ ord("a")) * DIVIDE) - (ord("a") << 7) where intmask() truncates to a long (x mod 2^(long bits)) and DIVIDE = 1/1000003 mod 2^(long bits). For example, DIVIDE=2021759595 for 32 bits long. |
Victor, please ignore my code related to hash randomization for now. I've deliberately not linked my branch to this bug report. I'm well aware that it's not secure and that it's pretty easy to reverse engineer the seed from a hash of a short string. The code is a proof of concept to detect failing tests and other issues. I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list. Paul: 16kb of seed is still a lot. Most CPUs have about 16 to 32, maybe 64kb L1 cache for data. 1024 to 4096 bytes should increase cache locality and reduce speed impacts. PS: I'm going to reply to your last mail tomorrow. |
In bpo-13707 I suggest a change to the current hash() entry which is needed independently of this issue, because the default hash (for object()), being tied to id() is already limited to an object's lifetime. But this change will become more imperative if hash() is made run-dependent for numbers and strings. There does not seems to presently *be* a security hole for 64 bit builds, so if there is any noticeable slowdown on 64 bit builds and it is sensibly easy to tie the default to the bitness, I would think it should be off for such builds. |
Paul first proposition (on python-dev) was to replace:
by:
This change has a vulnerability similar than the one of Christian's suggested changed. The "r" array can be retreived directly with: r2 = []
for i in xrange(len(r)):
s = chr(intmask(i * UNSHIFT7) % len(r))
h = intmask(hash(s) ^ len(s) ^ ord(s) ^ ((ord(s) << 7) * MOD))
r2.append(chr(h))
r2 = ''.join(r2) where UNSHIFT7 = 1/2**7 mod 2^(long bits). By the way, this change always use r[0] to hash all string of one ASCII character (U+0000-U+007F). |
Can all this be discussed on this issue now that it's the official point (I don't think special-casing small strings makes sense, because then |
Please, attach directly a file to the issue, or copy/paste the code in your comment. Interesting part the code: #Proposed replacement import os, array
size_exponent = 14 #adjust as a memory/security tradeoff
r = array.array('l', os.urandom(2**size_exponent))
len_r = len(r)
def _hash_string2(s):
"""The algorithm behind compute_hash() for a string or a unicode."""
length = len(s)
#print s
if length == 0:
return -1
x = (ord(s[0]) << 7) ^ r[length % len_r]
i = 0
while i < length:
x = intmask((1000003*x) ^ ord(s[i]))
x ^= r[x % len_r]
i += 1
x ^= length
return intmask(x)
r size should not depend on the size of a long. You should write something like: sizeof_long = ctypes.sizeof(ctypes.c_long)
r_bits = 8
r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
r_mask = 2**r_bits-1 and then replace "% len_r" by "& r_mask". What is the minimum value of r_bits? For example, would it be safe to use a single long integer? (r_bits=1) |
The final code will be in C and will use neither ctypes nor array.array. |
For the record, here is what "man urandom" says about random seed size: “[...] no cryptographic primitive available today can hope to promise In that light, reading a 64 bytes seed from /dev/urandom is already a lot, and 4096 bytes is simply insane. |
I read that the attack cannot be computed with actual computers (it's too expensive) against Python 64 bits. I tried to change str.__hash__ in Python 32 bits to compute the hash in 64 bits and than truncate the hash to 32 bits: it doesn't change anything, the hash values are the same, so it doesn't improve the security. |
Yet another random hash function, simplified version of Paul's function. It always use exactly 256 bits of entropy and so 32 bytes of memory, and doesn't keep the loop. I don't expect my function to be secure, but just give more work to the attacker to compute the data for an attack against our dict implementation. --- import os, array, struct
sizeof_long = struct.calcsize("l")
r_bits = 256
r_len = r_bits // (sizeof_long * 8)
r_mask = r_len - 1
r = array.array('l', os.urandom(r_len * sizeof_long))
def randomhash(s):
length = len(s)
if length == 0:
return -2
x = ord(s[0])
x ^= r[x & r_mask]
x <<= 7
for ch in s:
x = intmask(1000003 * x)
x ^= ord(ch)
x ^= length
x ^= r[x & r_mask]
return intmask(x) The first "x ^= r[x & r_mask]" may be replaced by "x ^= r[(x ^ length) & r_mask]". The binary shift is done after the first xor with r, because 2**7 and r_len are not coprime (r_len is a multipler of 2**7), and so (ord(s[0] << 7) & r_mask is always zero. randomhash(s)==hash(s) if we used twice the same index in the r array. I don't know if this case gives useful information. |
A couple of things here: First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be). I favor using the same algorithm between 32 and 64 bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64 bit build but varies on 32 bit builds. Additionally, the impracticality of attacking of 64 bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant point requiring us to go fix the problem all over again. I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult to debug edge-case conditions. It also means that the order of the hashes of long strings will vary while the order of short strings will not - another inconsistency which will encourage bugs. Thank you Victor for the improvements to the python demonstration code. As Antoine said, it's only demo code, but better demo code is always good. Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key. Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then. Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution. When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r. |
Keep in mind the average L1 data cache size is between 16KB and 64KB. 4KB is already a significant chunk of that. Given a hash function's typical loop is to feed back the current result into the next computation, I don't see why a small value (e.g. 256 bytes) would be detrimental. |
If test_packaging fails because it relies on dict order / hash details, that’s a bug. Can you copy the full tb (possibly in another report, I can fix it independently of this issue)? |
On Jan 04, 2012, at 06:00 AM, Paul McMillan wrote:
Well, one positive outcome of this issue is that users will finally viscerally |
Some comments:
There are many ways you can do a DoS attack on web servers. It's the It's a good idea to provide some way to protect against hash There are other ways you can generate lots of CPU overhead with In order to protect against such attacks in general, we'd have to The easiest way to protect against the hash collision attack is by The second best way would be to limit the number of parameters that a
If randomization of the hash start vector or some other method is The hash values of Python objects are not only used by the Python Hence, if changed, the hash change should be disabled per default
Hash values of other types can easily be guessed as well, e.g. We'd have to adapt all hash functions of the basic types in Python
An attacker only needs to create many hash collisions, not
It's one of the most used operations in Python. Please get experts into
Another idea would be counting the collisions and raising an Such a change would work for all hashable Python objects and Thanks,Marc-Andre Lemburg ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 |
Marc-Andre Lemburg wrote:
Here's an example for integers on a 64-bit machine:
This takes ages to complete and only uses very little memory.
32397634 |
The bug report is the easiest thing to search for and follow when checking when something is resolved so it is nice to have a link to the relevant patch(es) for each branch. I just wanted to note the major commit here so that all planned branches had a note recorded. I don't care that it wasn't automatic. :) For observers: There have been several more commits related to fixing this (test dict/set order fixes, bug/typo/merge oops fixes for the linked to patches, etc). Anyone interested in seeing the full list of diffs should look at their specific branch on our around the time of the linked to changelists. Too many to list here. |
Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now. Saying yes "working as intended" is fine by me. sys.flags.hash_randomization seems to simply indicate that doing something with the hash seed was explicitly specified as opposed to defaulting to off, not that the hash seed was actually chosen randomly. What this implies for 3.3 after we make hash randomization default to on is that sys.flags.hash_randomization will always be 1. |
That is a good question. I don't really care either way, but let's say +0 for turning it off when seed == 0. -R still needs to be made default in 3.3 - that's one reason this issue is still open. |
It is documented that PYTHONHASHSEED=0 disables the randomization, so |
Gregory P. Smith wrote:
The flag should probably be removed - simply because Exposing the seed value as sys.hashseed would be better and more useful |
STINNER Victor wrote:
PYTHONHASHSEED=1 will disable randomization as well :-) Only setting PYTHONHASHSEED=random actually enables randomization. |
+1 |
On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:
That makes the most sense to me. |
On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:
Okay, after chatting with __ap__ on irc, here's what I think the behavior sys.flags.hash_randomization should contain just the value given by the -R sys.hash_seed contains the hash seed, set by virtue of the flag or envar. It If you really need the envar value, getenv('PYTHONHASHSEED') is good enough |
+1 to what barry and __ap__ discussed and settled on. |
I have to amend my suggestion about sys.flags.hash_randomization. It needs to be non-zero even if $PYTHONHASHSEED is given instead of -R. Many other flags that also have envars work the same way, e.g. -O and $PYTHONOPTIMIZE. So hash_randomization has to work the same way. I'll still work on a patch for exposing the seed in sys. |
Never mind about sys.hash_seed. See my follow up in python-dev. I consider this issue is closed wrt the 2.6 branch. |
After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7. |
Can you paste the error you're getting? 2012/2/26 Roger Serwy <report@bugs.python.org>:
|
It was a false alarm. I didn't recompile python before running it with the latest /Lib files. My apologies. |
The Design and History FAQ (will) need a minor corresponding update: |
I have assigned CVE-2012-1150 for this issue as per http://www.openwall.com/lists/oss-security/2012/03/10/3 |
FWIW I upgraded to ubuntu pangolin beta over the weekend, which includes 2.7.3rc1, and I'm also experiencing a problem with urandom. File "/usr/lib/python2.7/email/utils.py", line 27, in <module> Given Roger Serwy's comment it sounds like a beta ubuntu problem, but thought it worth mentioning. |
It looks like you are using random.py of Python 2.7.3 with the Python program 2.7.2, because os.urandom() is now always available in Python 2.7.3. |
Victor - yes that was it; a mixture of a 2.7.2 virtual env and 2.7.3. Apologies for any nuisance caused. |
Can we close this issue? |
I believe so. This is in all of the release candidates. The expat/xmlparse.c hash collision DoS issue is being handled on its own via http://bugs.python.org/issue14234. |
Hey Erlend, why did you add so many people to the nosy list of this old On Thu, Nov 4, 2021 at 07:33 Erlend E. Aasland <report@bugs.python.org>
-- |
Because today's spammer, whose message was removed, deleted us all. Restoring the version to 3.3 is not possible. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: