Hash collision security issue #57912

warsaw · 2012-01-03T19:36:50Z

BPO	13703
Nosy	@malemburg, @gvanrossum, @tim-one, @loewis, @warsaw, @birkenfeld, @terryjreedy, @gpshead, @jcea, @mdickinson, @pitrou, @tiran, @benjaminp, @serwy, @merwok, @alex, @skrah, @davidmalcolm, @bz2, @markshannon, @ericsnowcurrently, @JimJJewett, @PaulMcMillan
Dependencies	bpo-13704: Random number generator in Python core
Files	hash-attack.patch SafeDict.py: SafeDict implementation bench_startup.py random-8.patch hash-collision-counting-dmalcolm-2012-01-20-001.patch amortized-probe-counting-dmalcolm-2012-01-20-002.patch amortized-probe-counting-dmalcolm-2012-01-21-003.patch hash-attack-2.patch hash-attack-3.patch integercollision.py backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch: Backport of haypo's random-8.patch to 2.7 hybrid-approach-dmalcolm-2012-01-25-001.patch: Hybrid approach to solving dict DoS attack hybrid-approach-dmalcolm-2012-01-25-002.patch optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch unnamed optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch results-16.txt add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch hash-patch-3.1-gb-03.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2012-03-13.22:25:45.919>
created_at = <Date 2012-01-03.19:36:49.855>
labels = ['type-security', 'interpreter-core', 'release-blocker', '3.11']
title = 'Hash collision security issue'
updated_at = <Date 2021-11-08.16:57:10.080>
user = 'https://github.com/warsaw'

bugs.python.org fields:

activity = <Date 2021-11-08.16:57:10.080>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2012-03-13.22:25:45.919>
closer = 'gregory.p.smith'
components = ['Interpreter Core']
creation = <Date 2012-01-03.19:36:49.855>
creator = 'barry'
dependencies = ['13704']
files = ['24151', '24169', '24223', '24259', '24286', '24288', '24289', '24295', '24299', '24300', '24304', '24320', '24324', '24343', '24353', '24365', '24366', '24370', '24371', '24375', '24385', '24391', '24392', '24393', '24394', '24434', '24435', '24436', '24437', '24490', '24491', '24514', '24515', '24563']
hgrepos = []
issue_num = 13703
keywords = ['patch']
message_count = 328.0
messages = ['150522', '150525', '150526', '150529', '150531', '150532', '150533', '150534', '150541', '150543', '150558', '150559', '150560', '150562', '150563', '150565', '150568', '150569', '150570', '150577', '150589', '150592', '150601', '150609', '150613', '150616', '150619', '150620', '150621', '150622', '150625', '150634', '150635', '150636', '150637', '150638', '150639', '150641', '150642', '150643', '150644', '150645', '150646', '150647', '150648', '150649', '150650', '150651', '150652', '150655', '150656', '150659', '150662', '150665', '150668', '150694', '150699', '150702', '150706', '150707', '150708', '150712', '150713', '150718', '150719', '150724', '150725', '150726', '150727', '150738', '150748', '150756', '150766', '150768', '150769', '150771', '150795', '150829', '150832', '150835', '150836', '150840', '150847', '150856', '150857', '150859', '150865', '150866', '150934', '151012', '151017', '151031', '151033', '151047', '151048', '151061', '151062', '151063', '151064', '151065', '151069', '151070', '151071', '151073', '151074', '151078', '151092', '151120', '151121', '151122', '151157', '151158', '151159', '151167', '151353', '151401', '151402', '151419', '151422', '151448', '151449', '151468', '151472', '151474', '151484', '151519', '151528', '151560', '151561', '151565', '151566', '151567', '151574', '151582', '151583', '151584', '151585', '151586', '151589', '151590', '151596', '151604', '151617', '151620', '151625', '151626', '151628', '151629', '151632', '151633', '151647', '151662', '151664', '151677', '151679', '151680', '151681', '151682', '151684', '151685', '151689', '151691', '151699', '151700', '151701', '151703', '151707', '151714', '151731', '151734', '151735', '151737', '151739', '151744', '151745', '151747', '151748', '151753', '151754', '151756', '151758', '151794', '151796', '151798', '151812', '151813', '151814', '151815', '151825', '151826', '151847', '151850', '151867', '151869', '151870', '151939', '151941', '151942', '151944', '151956', '151959', '151960', '151961', '151965', '151966', '151967', '151970', '151973', '151977', '151984', '152030', '152033', '152037', '152039', '152040', '152041', '152043', '152046', '152051', '152057', '152060', '152066', '152070', '152104', '152112', '152117', '152118', '152125', '152146', '152149', '152183', '152186', '152199', '152200', '152203', '152204', '152270', '152271', '152275', '152276', '152299', '152300', '152309', '152311', '152315', '152335', '152344', '152352', '152362', '152364', '152422', '152452', '152453', '152723', '152730', '152731', '152732', '152734', '152740', '152747', '152753', '152754', '152755', '152758', '152760', '152763', '152764', '152767', '152768', '152769', '152777', '152780', '152781', '152784', '152787', '152789', '152797', '152811', '152855', '153055', '153074', '153081', '153082', '153140', '153141', '153143', '153144', '153297', '153301', '153369', '153395', '153682', '153683', '153690', '153695', '153750', '153753', '153798', '153802', '153817', '153833', '153848', '153849', '153850', '153852', '153853', '153854', '153860', '153861', '153862', '153868', '153872', '153873', '153877', '153975', '153980', '154428', '154430', '154432', '154853', '155293', '155472', '155527', '155680', '155681', '155682', '405727', '405745']
nosy_count = 36.0
nosy_names = ['lemburg', 'gvanrossum', 'tim.peters', 'loewis', 'barry', 'georg.brandl', 'terry.reedy', 'gregory.p.smith', 'jcea', 'mark.dickinson', 'pitrou', 'christian.heimes', 'benjamin.peterson', 'roger.serwy', 'eric.araujo', 'grahamd', 'Arfrever', 'v+python', 'alex', 'cvrebert', 'zbysz', 'skrah', 'dmalcolm', 'gz', 'neologix', 'Arach', 'Mark.Shannon', 'python-dev', 'eric.snow', 'Zhiping.Deng', 'Huzaifa.Sidhpurwala', 'Jim.Jewett', 'PaulMcMillan', 'fx5', 'skorgu', 'jsvaughan']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = 'needs patch'
status = 'closed'
superseder = None
type = 'security'
url = 'https://bugs.python.org/issue13703'
versions = ['Python 3.11']

warsaw · 2012-01-03T19:36:50Z

This is already publicly known and in deep discussion on python-dev. The proper fix is still TBD. Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries.

Start here:

http://mail.python.org/pipermail/python-dev/2011-December/115116.html

tiran · 2012-01-03T20:19:25Z

I had a short chat with Guido yesterday. I'll try to sum up the conversation. Guido, please correct me if I got something wrong or missed a point.

Guido wants the fix as simple and less intrusive as possible as he wants to provide/apply a patch for Python 2.4 to 3.3. This means any new stuff is off the table unless it's really, really necessary. Say goodbye to my experimental MurmurHash3 patch.

We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.

pitrou · 2012-01-03T20:24:32Z

We haven't agreed whether the randomization should be enabled by
default or disabled by default. IMHO it should be disabled for all
releases except for the upcoming 3.3 release.

I think on the contrary it must be enabled by default. Leaving security
holes open is wrong.

tiran · 2012-01-03T20:31:16Z

I think on the contrary it must be enabled by default. Leaving security
holes open is wrong.

We can't foresee the implications of the randomization and only a small number of deployments is affected by the problem. But I won't start a fight on the matter. ;)

gvanrossum · 2012-01-03T20:47:45Z

I'm with Antoine -- turn it on by default. Maybe there should be a release candidate to test the waters.

warsaw · 2012-01-03T20:49:39Z

On Jan 03, 2012, at 08:24 PM, Antoine Pitrou wrote:

I think on the contrary it must be enabled by default. Leaving security
holes open is wrong.

Unless there's evidence of performance regressions or backward
incompatibilities, I agree.

vstinner · 2012-01-03T21:20:59Z

Unless there's evidence of performance regressions
or backward incompatibilities, I agree.

If hash() is modified, str(dict) and str(set) will change for example. It may break doctests. Can we consider that the application should not rely (indirectly) on hash and so fix (for example) their doctests? Or is it a backward incompatibility?

hash() was already modified in major Python versions.

For this specific issue, I consider that security is more important than str(dict).

benjaminp · 2012-01-03T21:43:35Z

Barry, when this gets fixed, shall we coordinate release times?

warsaw · 2012-01-03T22:02:46Z

On Jan 03, 2012, at 09:43 PM, Benjamin Peterson wrote:

Barry, when this gets fixed, shall we coordinate release times?

Yes!

tiran · 2012-01-03T22:08:19Z

Randomized hashing destabilizes the unit tests of Python, too. Here are the outputs of four test runs:

11 tests failed:
test_collections test_dbm test_dis test_gdb test_inspect
test_packaging test_set test_symtable test_ttk_textonly
test_urllib test_urlparse

9 tests failed:
test_dbm test_dis test_gdb test_json test_packaging test_set
test_symtable test_urllib test_urlparse

10 tests failed:
test_dbm test_dis test_gdb test_inspect test_packaging test_set
test_symtable test_ttk_textonly test_urllib test_urlparse

9 tests failed:
test_collections test_dbm test_dict test_dis test_gdb
test_packaging test_symtable test_urllib test_urlparse

PaulMcMillan · 2012-01-03T23:52:47Z

I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being an obvious candidate).

I'll link my proposed algorithm change here, for the record:
https://gist.github.com/0a91e52efa74f61858b5

I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack.

vstinner · 2012-01-04T00:22:37Z

Christian Heimes proposes the following change in its randomhash branch (see issue bpo-13704):

-    x = (Py_uhash_t) *p << 7;
+    x = Py_RndHashSeed + ((Py_uhash_t) *p << 7);
     for (i = 0; i < len; i++)
         x = (1000003U * x) ^ (Py_uhash_t) *p++;
     x ^= (Py_uhash_t) len;

This change doesn't add any security if the attacker can inject any string and retreive the hash value. You can retreive directly Py_RndHashSeed using:

Py_RndHashSeed = intmask((hash("a") ^ len("a") ^ ord("a")) * DIVIDE) - (ord("a") << 7)

where intmask() truncates to a long (x mod 2^(long bits)) and DIVIDE = 1/1000003 mod 2^(long bits). For example, DIVIDE=2021759595 for 32 bits long.

tiran · 2012-01-04T00:38:29Z

Victor, please ignore my code related to hash randomization for now. I've deliberately not linked my branch to this bug report. I'm well aware that it's not secure and that it's pretty easy to reverse engineer the seed from a hash of a short string. The code is a proof of concept to detect failing tests and other issues.

I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list.

Paul:
I still think that you should special case short strings (five or few chars sound good). An attacker can't do much harm with one to five char strings but such short strings may make it too easy to calculate the seed.

16kb of seed is still a lot. Most CPUs have about 16 to 32, maybe 64kb L1 cache for data. 1024 to 4096 bytes should increase cache locality and reduce speed impacts.

PS: I'm going to reply to your last mail tomorrow.

terryjreedy · 2012-01-04T00:55:05Z

In bpo-13707 I suggest a change to the current hash() entry which is needed independently of this issue, because the default hash (for object()), being tied to id() is already limited to an object's lifetime. But this change will become more imperative if hash() is made run-dependent for numbers and strings.

There does not seems to presently *be* a security hole for 64 bit builds, so if there is any noticeable slowdown on 64 bit builds and it is sensibly easy to tie the default to the bitness, I would think it should be off for such builds.

vstinner · 2012-01-04T01:00:55Z

Paul first proposition (on python-dev) was to replace:

...
x = (ord(s[0]) << 7)
while i < length:
    x = intmask((1000003*x) ^ ord(s[i]))
    ...

by:

...
x = (ord(s[0]) << 7)
while i < length:
    x = intmask((1000003*x) ^ ord(s[i])) ^ r[x % len_r]
    ...

This change has a vulnerability similar than the one of Christian's suggested changed. The "r" array can be retreived directly with:

r2 = []
for i in xrange(len(r)):
    s = chr(intmask(i * UNSHIFT7) % len(r))
    h = intmask(hash(s) ^ len(s) ^ ord(s) ^ ((ord(s) << 7) * MOD))
    r2.append(chr(h))
r2 = ''.join(r2)

where UNSHIFT7 = 1/2**7 mod 2^(long bits).

By the way, this change always use r[0] to hash all string of one ASCII character (U+0000-U+007F).

pitrou · 2012-01-04T01:30:00Z

I'm in private contact with Paul and we are working together. He has
done extended research and I'll gladly follow his expertise. I've
already discussed the issue with small strings, but I can't recall if
it was a private mail to Paul or a public one to the dev list.

Can all this be discussed on this issue now that it's the official point
of reference? It will avoid the repetition of arguments we see here and
there.

(I don't think special-casing small strings makes sense, because then
you have two algorithms to audit rather than one)

vstinner · 2012-01-04T01:54:51Z

https://gist.github.com/0a91e52efa74f61858b5

Please, attach directly a file to the issue, or copy/paste the code in your comment. Interesting part the code:
---

#Proposed replacement
#--------------------------------------

import os, array
size_exponent = 14 #adjust as a memory/security tradeoff
r = array.array('l', os.urandom(2**size_exponent))
len_r = len(r)

def _hash_string2(s):
    """The algorithm behind compute_hash() for a string or a unicode."""
    length = len(s)
    #print s
    if length == 0:
        return -1
    x = (ord(s[0]) << 7) ^ r[length % len_r]
    i = 0
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        x ^= r[x % len_r]
        i += 1
    x ^= length
    return intmask(x)

r = array.array('l', os.urandom(2**size_exponent))
len_r = len(r)

r size should not depend on the size of a long. You should write something like:

sizeof_long = ctypes.sizeof(ctypes.c_long)
r_bits = 8
r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
r_mask = 2**r_bits-1

and then replace "% len_r" by "& r_mask".

What is the minimum value of r_bits? For example, would it be safe to use a single long integer? (r_bits=1)

pitrou · 2012-01-04T01:58:04Z

> r = array.array('l', os.urandom(2**size_exponent))
> len_r = len(r)

r size should not depend on the size of a long. You should write something like:

sizeof_long = ctypes.sizeof(ctypes.c_long)
r_bits = 8
r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
r_mask = 2**r_bits-1

The final code will be in C and will use neither ctypes nor array.array.
Arguing about this looks quite pointless IMO.

pitrou · 2012-01-04T02:14:54Z

For the record, here is what "man urandom" says about random seed size:

“[...] no cryptographic primitive available today can hope to promise
more than 256 bits of security, so if any program reads more than
256 bits (32 bytes) from the kernel random pool per invocation, or per
reasonable reseed interval (not less than one minute), that should be
taken as a sign that its cryptography is not skilfully implemented.”

In that light, reading a 64 bytes seed from /dev/urandom is already a lot, and 4096 bytes is simply insane.

vstinner · 2012-01-04T03:08:15Z

I read that the attack cannot be computed with actual computers (it's too expensive) against Python 64 bits. I tried to change str.__hash__ in Python 32 bits to compute the hash in 64 bits and than truncate the hash to 32 bits: it doesn't change anything, the hash values are the same, so it doesn't improve the security.

vstinner · 2012-01-04T05:09:59Z

Yet another random hash function, simplified version of Paul's function. It always use exactly 256 bits of entropy and so 32 bytes of memory, and doesn't keep the loop. I don't expect my function to be secure, but just give more work to the attacker to compute the data for an attack against our dict implementation.

---

import os, array, struct
sizeof_long = struct.calcsize("l")
r_bits = 256
r_len = r_bits // (sizeof_long * 8)
r_mask = r_len - 1
r = array.array('l', os.urandom(r_len * sizeof_long))

def randomhash(s):
    length = len(s)
    if length == 0:
        return -2
    x = ord(s[0])
    x ^= r[x & r_mask]
    x <<= 7
    for ch in s:
        x = intmask(1000003 * x)
        x ^= ord(ch)
    x ^= length
    x ^= r[x & r_mask]
    return intmask(x)

The first "x ^= r[x & r_mask]" may be replaced by "x ^= r[(x ^ length) & r_mask]".

The binary shift is done after the first xor with r, because 2**7 and r_len are not coprime (r_len is a multipler of 2**7), and so (ord(s[0] << 7) & r_mask is always zero.

randomhash(s)==hash(s) if we used twice the same index in the r array. I don't know if this case gives useful information.

PaulMcMillan · 2012-01-04T06:00:37Z

A couple of things here:

First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be).

I favor using the same algorithm between 32 and 64 bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64 bit build but varies on 32 bit builds. Additionally, the impracticality of attacking of 64 bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant point requiring us to go fix the problem all over again.

I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult to debug edge-case conditions. It also means that the order of the hashes of long strings will vary while the order of short strings will not - another inconsistency which will encourage bugs.

Thank you Victor for the improvements to the python demonstration code. As Antoine said, it's only demo code, but better demo code is always good.

Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key.

Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then.

Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution.

When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r.

pitrou · 2012-01-04T11:03:00Z

Using a fairly small value (4k) should not make the results much worse
from a security perspective, but might be problematic from a
collision/distribution standpoint.

Keep in mind the average L1 data cache size is between 16KB and 64KB. 4KB is already a significant chunk of that.

Given a hash function's typical loop is to feed back the current result into the next computation, I don't see why a small value (e.g. 256 bytes) would be detrimental.

merwok · 2012-01-04T14:52:26Z

If test_packaging fails because it relies on dict order / hash details, that’s a bug. Can you copy the full tb (possibly in another report, I can fix it independently of this issue)?

warsaw · 2012-01-04T15:08:36Z

On Jan 04, 2012, at 06:00 AM, Paul McMillan wrote:

Developers would be startled to find that ordering stays consistent on a 64
bit build but varies on 32 bit builds.

Well, one positive outcome of this issue is that users will finally viscerally
understand that dictionary (and set) order should never be relied upon, even
between successive runs of the same Python executable.

malemburg · 2012-01-04T16:42:05Z

Some comments:

The security implications in all this is being somewhat overemphasized.

There are many ways you can do a DoS attack on web servers. It's the
responsibility of the used web frameworks and servers to deal with
the possible cases.

It's a good idea to provide some way to protect against hash
collision attacks, but that will only solve one possible way of
causing a resource attack on a server.

There are other ways you can generate lots of CPU overhead with
little data input (e.g. think of targeting the search feature on
many Zope/Plone sites).

In order to protect against such attacks in general, we'd have to
provide a way to control CPU time and e.g. raise an exception if too
much time is being spent on a simple operation such as a key insertion.
This can be done using timers, signals or even under OS control.

The easiest way to protect against the hash collision attack is by
limiting the POST/GET/HEAD request size.

The second best way would be to limit the number of parameters that a
web framework accepts for POST/GET/HEAD request.

Changing the semantics of hashing in a dot release is not allowed.

If randomization of the hash start vector or some other method is
enabled by default in a dot release, this will change the semantics
of any application switching to that dot release.

The hash values of Python objects are not only used by the Python
dictionary implementation, but also by other storage mechanisms
such as on-disk dictionaries, inter-process object exchange via
share memory, memcache, etc.

Hence, if changed, the hash change should be disabled per default
for dot releases and enabled for 3.3.

Changing the way strings are hashed doesn't solve the problem.

Hash values of other types can easily be guessed as well, e.g.
take integers which use a trivial hash function.

We'd have to adapt all hash functions of the basic types in Python
or come up with a generic solution using e.g. double-hashing
in the dictionary/set implementations.

By just using a random start vector you change the absolute
hash values for specific objects, but not the overall hash sequence
or its period.

An attacker only needs to create many hash collisions, not
specific ones. It's the period of the hash function that's
important in such attacks and that doesn't change when moving to
a different start vector.

Hashing needs to be fast.

It's one of the most used operations in Python. Please get experts into
the boat like Tim Peters and Christian Tismer, who both have worked
on the dict implementation and the hash functions, before experimenting
with ad-hoc fixes.

Counting collisions could solve the issue without having to
change hashing.

Another idea would be counting the collisions and raising an
exception if the number of collisions exceed a certain
threshold.

Such a change would work for all hashable Python objects and
protect against the attack without changing any hash function.

Thanks,

Marc-Andre Lemburg
eGenix.com

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

malemburg · 2012-01-04T17:18:30Z

Marc-Andre Lemburg wrote:

Changing the way strings are hashed doesn't solve the problem.

Hash values of other types can easily be guessed as well, e.g.
take integers which use a trivial hash function.

Here's an example for integers on a 64-bit machine:

>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>> d = dict(g)

This takes ages to complete and only uses very little memory.
The input data has some 32MB if written down in decimal numbers

not all that much data either.

32397634

gpshead · 2012-02-21T06:35:47Z

The bug report is the easiest thing to search for and follow when checking when something is resolved so it is nice to have a link to the relevant patch(es) for each branch. I just wanted to note the major commit here so that all planned branches had a note recorded. I don't care that it wasn't automatic. :)

For observers: There have been several more commits related to fixing this (test dict/set order fixes, bug/typo/merge oops fixes for the linked to patches, etc). Anyone interested in seeing the full list of diffs should look at their specific branch on our around the time of the linked to changelists. Too many to list here.

gpshead · 2012-02-21T06:40:31Z

Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.

Saying yes "working as intended" is fine by me.

sys.flags.hash_randomization seems to simply indicate that doing something with the hash seed was explicitly specified as opposed to defaulting to off, not that the hash seed was actually chosen randomly.

What this implies for 3.3 after we make hash randomization default to on is that sys.flags.hash_randomization will always be 1.

birkenfeld · 2012-02-21T06:47:35Z

That is a good question. I don't really care either way, but let's say +0 for turning it off when seed == 0.

-R still needs to be made default in 3.3 - that's one reason this issue is still open.

vstinner · 2012-02-21T09:47:32Z

Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.

Saying yes "working as intended" is fine by me.

It is documented that PYTHONHASHSEED=0 disables the randomization, so
sys.flags.hash_randomization must be False (0).

malemburg · 2012-02-21T09:48:43Z

Gregory P. Smith wrote:

Gregory P. Smith <greg@krypto.org> added the comment:

Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.

The flag should probably be removed - simply because
the env var is not a flag, it's a configuration parameter.

Exposing the seed value as sys.hashseed would be better and more useful
to applications.

malemburg · 2012-02-21T09:50:25Z

STINNER Victor wrote:

STINNER Victor <victor.stinner@gmail.com> added the comment:

> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.
>
> Saying yes "working as intended" is fine by me.

It is documented that PYTHONHASHSEED=0 disables the randomization, so
sys.flags.hash_randomization must be False (0).

PYTHONHASHSEED=1 will disable randomization as well :-)

Only setting PYTHONHASHSEED=random actually enables randomization.

pitrou · 2012-02-21T11:37:41Z

That is a good question. I don't really care either way, but let's
say +0 for turning it off when seed == 0.

+1

warsaw · 2012-02-21T15:33:37Z

On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

Exposing the seed value as sys.hashseed would be better and more useful
to applications.

That makes the most sense to me.

warsaw · 2012-02-21T15:42:34Z

On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

The flag should probably be removed - simply because
the env var is not a flag, it's a configuration parameter.

Exposing the seed value as sys.hashseed would be better and more useful
to applications.

Okay, after chatting with __ap__ on irc, here's what I think the behavior
should be:

sys.flags.hash_randomization should contain just the value given by the -R
flag. It should only be True if the flag is present, False otherwise.

sys.hash_seed contains the hash seed, set by virtue of the flag or envar. It
should contain the *actual* seed value used. E.g. it might be zero, the
explicitly set integer, or the randomly selected seed value in use during this
Python execution if a random seed was requested.

If you really need the envar value, getenv('PYTHONHASHSEED') is good enough
for that.

gpshead · 2012-02-21T16:28:11Z

+1 to what barry and __ap__ discussed and settled on.

warsaw · 2012-02-22T17:46:49Z

I have to amend my suggestion about sys.flags.hash_randomization. It needs to be non-zero even if $PYTHONHASHSEED is given instead of -R. Many other flags that also have envars work the same way, e.g. -O and $PYTHONOPTIMIZE. So hash_randomization has to work the same way.

I'll still work on a patch for exposing the seed in sys.

warsaw · 2012-02-22T18:12:07Z

Never mind about sys.hash_seed. See my follow up in python-dev. I consider this issue is closed wrt the 2.6 branch.

serwy · 2012-02-27T04:34:41Z

After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.

benjaminp · 2012-02-27T05:01:39Z

Can you paste the error you're getting?

2012/2/26 Roger Serwy <report@bugs.python.org>:

Roger Serwy <roger.serwy@gmail.com> added the comment:

After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.

----------
nosy: +serwy

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue13703\>

serwy · 2012-02-27T05:22:27Z

It was a false alarm. I didn't recompile python before running it with the latest /Lib files. My apologies.

cvrebert · 2012-03-03T20:36:37Z

The Design and History FAQ (will) need a minor corresponding update:
http://docs.python.org/dev/faq/design.html#how-are-dictionaries-implemented

kseifriedredhatcom · 2012-03-10T05:59:11Z

I have assigned CVE-2012-1150 for this issue as per http://www.openwall.com/lists/oss-security/2012/03/10/3

jsvaughan · 2012-03-12T20:37:34Z

FWIW I upgraded to ubuntu pangolin beta over the weekend, which includes 2.7.3rc1, and I'm also experiencing a problem with urandom.

File "/usr/lib/python2.7/email/utils.py", line 27, in <module>
import random
File "/usr/lib/python2.7/random.py", line 47, in <module>
from os import urandom as _urandom
ImportError: cannot import name urandom

Given Roger Serwy's comment it sounds like a beta ubuntu problem, but thought it worth mentioning.

vstinner · 2012-03-12T23:51:17Z

FWIW I upgraded to ubuntu pangolin beta over the weekend,
which includes 2.7.3rc1, ...

File "/usr/lib/python2.7/random.py", line 47, in <module>
from os import urandom as _urandom
ImportError: cannot import name urandom

It looks like you are using random.py of Python 2.7.3 with the Python program 2.7.2, because os.urandom() is now always available in Python 2.7.3.

jsvaughan · 2012-03-13T22:15:44Z

Victor - yes that was it; a mixture of a 2.7.2 virtual env and 2.7.3. Apologies for any nuisance caused.

vstinner · 2012-03-13T22:18:58Z

Can we close this issue?

gpshead · 2012-03-13T22:25:46Z

I believe so. This is in all of the release candidates.

The expat/xmlparse.c hash collision DoS issue is being handled on its own via http://bugs.python.org/issue14234.

gvanrossum · 2021-11-04T15:35:57Z

Hey Erlend, why did you add so many people to the nosy list of this old
issue?

On Thu, Nov 4, 2021 at 07:33 Erlend E. Aasland <report@bugs.python.org>
wrote:

Change by Erlend E. Aasland <erlend.aasland@innova.no>:

----------
components: +Interpreter Core -Argument Clinic
nosy: +Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon,
PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson,
christian.heimes, cvrebert, dmalcolm, eric.araujo, eric.snow, fx5,
georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, jsvaughan,
lemburg, loewis, mark.dickinson, neologix, pitrou, python-dev, roger.serwy,
skorgu, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
-ahmedsayeed1982, larry

Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue13703\>

--
--Guido (mobile)

terryjreedy · 2021-11-04T19:28:37Z

Because today's spammer, whose message was removed, deleted us all. Restoring the version to 3.3 is not possible.

warsaw added interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-security A security issue labels Jan 3, 2012

gpshead closed this as completed Mar 13, 2012

ahmedsayeed1982 mannequin added 3.11 only security fixes and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Nov 4, 2021

erlend-aasland added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 4, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

gpshead mentioned this issue Sep 22, 2023

account for dict randomization in Design & History FAQ #58506

Closed

Hash collision security issue #57912

Hash collision security issue #57912

Comments

warsaw commented Jan 3, 2012

warsaw commented Jan 3, 2012

tiran commented Jan 3, 2012

pitrou commented Jan 3, 2012

tiran commented Jan 3, 2012

gvanrossum commented Jan 3, 2012

warsaw commented Jan 3, 2012

vstinner commented Jan 3, 2012

benjaminp commented Jan 3, 2012

warsaw commented Jan 3, 2012

tiran commented Jan 3, 2012

PaulMcMillan mannequin commented Jan 3, 2012

vstinner commented Jan 4, 2012

tiran commented Jan 4, 2012

terryjreedy commented Jan 4, 2012

vstinner commented Jan 4, 2012

pitrou commented Jan 4, 2012

vstinner commented Jan 4, 2012

pitrou commented Jan 4, 2012

pitrou commented Jan 4, 2012

vstinner commented Jan 4, 2012

vstinner commented Jan 4, 2012

PaulMcMillan mannequin commented Jan 4, 2012

pitrou commented Jan 4, 2012

merwok commented Jan 4, 2012

warsaw commented Jan 4, 2012

malemburg commented Jan 4, 2012

Thanks,

malemburg commented Jan 4, 2012

gpshead commented Feb 21, 2012

gpshead commented Feb 21, 2012

birkenfeld commented Feb 21, 2012

vstinner commented Feb 21, 2012

malemburg commented Feb 21, 2012

malemburg commented Feb 21, 2012

pitrou commented Feb 21, 2012

warsaw commented Feb 21, 2012

warsaw commented Feb 21, 2012

gpshead commented Feb 21, 2012

warsaw commented Feb 22, 2012

warsaw commented Feb 22, 2012

serwy mannequin commented Feb 27, 2012

benjaminp commented Feb 27, 2012

serwy mannequin commented Feb 27, 2012

cvrebert mannequin commented Mar 3, 2012

kseifriedredhatcom mannequin commented Mar 10, 2012

jsvaughan mannequin commented Mar 12, 2012

vstinner commented Mar 12, 2012

jsvaughan mannequin commented Mar 13, 2012

vstinner commented Mar 13, 2012

gpshead commented Mar 13, 2012

gvanrossum commented Nov 4, 2021

terryjreedy commented Nov 4, 2021