classification
Title: Hash collision security issue
Type: security Stage: needs patch
Components: Interpreter Core Versions: Python 3.3, Python 3.2, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: 13704 Superseder:
Assigned To: Nosy List: Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, cvrebert, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, haypo, jcea, jsvaughan, lemburg, loewis, mark.dickinson, neologix, pitrou, python-dev, roger.serwy, skorgu, skrah, terry.reedy, tim.peters, v+python, zbysz
Priority: release blocker Keywords: patch

Created on 2012-01-03 19:36 by barry, last changed 2012-03-13 22:25 by gregory.p.smith. This issue is now closed.

Files
File name Uploaded Description Edit
hash-attack.patch lemburg, 2012-01-06 12:52
SafeDict.py v+python, 2012-01-08 00:19 SafeDict implementation
bench_startup.py haypo, 2012-01-13 00:36
random-8.patch haypo, 2012-01-17 12:21 review
hash-collision-counting-dmalcolm-2012-01-20-001.patch dmalcolm, 2012-01-20 22:55 review
amortized-probe-counting-dmalcolm-2012-01-20-002.patch dmalcolm, 2012-01-21 03:16 review
amortized-probe-counting-dmalcolm-2012-01-21-003.patch dmalcolm, 2012-01-21 17:02 review
hash-attack-2.patch lemburg, 2012-01-23 13:07
hash-attack-3.patch lemburg, 2012-01-23 16:43
integercollision.py lemburg, 2012-01-23 16:43
backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch dmalcolm, 2012-01-23 21:31 Backport of haypo's random-8.patch to 2.7 review
hybrid-approach-dmalcolm-2012-01-25-001.patch dmalcolm, 2012-01-25 11:05 Hybrid approach to solving dict DoS attack review
hybrid-approach-dmalcolm-2012-01-25-002.patch dmalcolm, 2012-01-25 17:49 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch dmalcolm, 2012-01-28 05:13 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch dmalcolm, 2012-01-28 23:14 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch dmalcolm, 2012-01-30 01:39 review
unnamed dmalcolm, 2012-01-30 01:44 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch dmalcolm, 2012-01-30 17:31 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch dmalcolm, 2012-01-30 22:22 review
optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch dmalcolm, 2012-01-31 01:34 review
results-16.txt dmalcolm, 2012-02-01 03:29
add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch dmalcolm, 2012-02-11 23:06 review
add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch dmalcolm, 2012-02-11 23:06 review
add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch dmalcolm, 2012-02-13 20:37 review
add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch dmalcolm, 2012-02-13 20:37 review
hash-patch-3.1-gb-03.patch georg.brandl, 2012-02-19 10:00 review
Messages (326)
msg150522 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-03 19:36
This is already publicly known and in deep discussion on python-dev.  The proper fix is still TBD.  Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries.

Start here:

http://mail.python.org/pipermail/python-dev/2011-December/115116.html
msg150525 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-03 20:19
I had a short chat with Guido yesterday. I'll try to sum up the conversation. Guido, please correct me if I got something wrong or missed a point.

Guido wants the fix to be as simple and unintrusive as possible, since he wants to provide/apply a patch for Python 2.4 through 3.3. This means any new stuff is off the table unless it's really, really necessary. Say goodbye to my experimental MurmurHash3 patch.

We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.
msg150526 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-03 20:24
> We haven't agreed whether the randomization should be enabled by
> default or disabled by default. IMHO it should be disabled for all
> releases except for the upcoming 3.3 release.

I think on the contrary it must be enabled by default. Leaving security
holes open is wrong.
msg150529 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-03 20:31
> I think on the contrary it must be enabled by default. Leaving security
> holes open is wrong.

We can't foresee the implications of the randomization, and only a small number of deployments are affected by the problem. But I won't start a fight on the matter. ;)
msg150531 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-03 20:47
I'm with Antoine -- turn it on by default.  Maybe there should be a release candidate to test the waters.
msg150532 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-03 20:49
On Jan 03, 2012, at 08:24 PM, Antoine Pitrou wrote:

>I think on the contrary it must be enabled by default. Leaving security
>holes open is wrong.

Unless there's evidence of performance regressions or backward
incompatibilities, I agree.
msg150533 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-03 21:20
> Unless there's evidence of performance regressions
> or backward incompatibilities, I agree.

If hash() is modified, str(dict) and str(set) will change, for example. That may break doctests. Can we consider that applications should not rely (indirectly) on hash values and should just fix (for example) their doctests? Or is it a backward incompatibility?

hash() was already modified in major Python versions.

For this specific issue, I consider that security is more important than str(dict).
msg150534 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-03 21:43
Barry, when this gets fixed, shall we coordinate release times?
msg150541 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-03 22:02
On Jan 03, 2012, at 09:43 PM, Benjamin Peterson wrote:

>Barry, when this gets fixed, shall we coordinate release times?

Yes!
msg150543 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-03 22:08
Randomized hashing destabilizes the unit tests of Python, too. Here are the outputs of four test runs:

11 tests failed:
    test_collections test_dbm test_dis test_gdb test_inspect
    test_packaging test_set test_symtable test_ttk_textonly
    test_urllib test_urlparse

9 tests failed:
    test_dbm test_dis test_gdb test_json test_packaging test_set
    test_symtable test_urllib test_urlparse

10 tests failed:
    test_dbm test_dis test_gdb test_inspect test_packaging test_set
    test_symtable test_ttk_textonly test_urllib test_urlparse

9 tests failed:
    test_collections test_dbm test_dict test_dis test_gdb
    test_packaging test_symtable test_urllib test_urlparse
msg150558 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-03 23:52
I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being obvious candidates).

I'll link my proposed algorithm change here, for the record:
https://gist.github.com/0a91e52efa74f61858b5

I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack.
msg150559 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 00:22
Christian Heimes proposes the following change in his randomhash branch (see issue #13704):

-    x = (Py_uhash_t) *p << 7;
+    x = Py_RndHashSeed + ((Py_uhash_t) *p << 7);
     for (i = 0; i < len; i++)
         x = (1000003U * x) ^ (Py_uhash_t) *p++;
     x ^= (Py_uhash_t) len;

This change doesn't add any security if the attacker can inject any string and retrieve the hash value. You can retrieve Py_RndHashSeed directly using:

Py_RndHashSeed = intmask((hash("a") ^ len("a") ^ ord("a")) * DIVIDE) - (ord("a") << 7)

where intmask() truncates to a long (x mod 2^(long bits)) and DIVIDE = 1/1000003 mod 2^(long bits). For example, DIVIDE=2021759595 for 32 bits long.
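For the record, the DIVIDE constant can be computed directly as a modular inverse. A sketch, assuming a 32-bit C long (`pow(x, -1, m)` requires Python 3.8+; on 2012-era Pythons you would use an extended-Euclid helper instead):

```python
# DIVIDE is the multiplicative inverse of 1000003 modulo 2**LONG_BITS,
# which undoes the "x = 1000003 * x" step of the string hash.
LONG_BITS = 32                            # assumption: 32-bit long
DIVIDE = pow(1000003, -1, 2 ** LONG_BITS)
print(DIVIDE)                             # 2021759595, matching the value above
```

Multiplying a hash intermediate by DIVIDE (mod 2**32) recovers the value before the multiply, which is what makes the seed-recovery formula above work.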
msg150560 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-04 00:38
Victor, please ignore my code related to hash randomization for now. I've deliberately not linked my branch to this bug report. I'm well aware that it's not secure and that it's pretty easy to reverse engineer the seed from a hash of a short string. The code is a proof of concept to detect failing tests and other issues.

I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list.

Paul:
I still think that you should special-case short strings (five or fewer chars sounds good). An attacker can't do much harm with one- to five-char strings, but such short strings may make it too easy to calculate the seed.

16 KB of seed is still a lot. Most CPUs have about 16 to 32, maybe 64 KB of L1 data cache. 1024 to 4096 bytes should increase cache locality and reduce speed impacts.

PS: I'm going to reply to your last mail tomorrow.
msg150562 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-04 00:55
In #13707 I suggest a change to the current hash() entry which is needed independently of this issue, because the default hash (for object()), being tied to id() is already limited to an object's lifetime. But this change will become more imperative if hash() is made run-dependent for numbers and strings.

There does not presently seem to *be* a security hole for 64-bit builds, so if there is any noticeable slowdown on 64-bit builds and it is reasonably easy to tie the default to the bitness, I would think it should be off for such builds.
msg150563 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 01:00
Paul's first proposition (on python-dev) was to replace:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        ...

by:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i])) ^ r[x % len_r]
        ...

This change has a vulnerability similar to the one in Christian's suggested change. The "r" array can be retrieved directly with:

r2 = []
for i in xrange(len(r)):
    s = chr(intmask(i * UNSHIFT7) % len(r))
    h = intmask(hash(s) ^ len(s) ^ ord(s) ^ ((ord(s) << 7) * MOD))
    r2.append(chr(h))
r2 = ''.join(r2)

where UNSHIFT7 = 1/2**7 mod 2^(long bits).

By the way, this change always uses r[0] to hash all strings of one ASCII character (U+0000-U+007F).
msg150565 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 01:30
> I'm in private contact with Paul and we are working together. He has
> done extended research and I'll gladly follow his expertise. I've
> already discussed the issue with small strings, but I can't recall if
> it was a private mail to Paul or a public one to the dev list.

Can all this be discussed on this issue now that it's the official point
of reference? It will avoid the repetition of arguments we see here and
there.

(I don't think special-casing small strings makes sense, because then
you have two algorithms to audit rather than one)
msg150568 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 01:54
> https://gist.github.com/0a91e52efa74f61858b5

Please attach a file directly to the issue, or copy/paste the code into your comment. The interesting part of the code:
---

#Proposed replacement
#--------------------------------------
import os, array
size_exponent = 14 #adjust as a memory/security tradeoff
r = array.array('l', os.urandom(2**size_exponent))
len_r = len(r)

def _hash_string2(s):
    """The algorithm behind compute_hash() for a string or a unicode."""
    length = len(s)
    #print s
    if length == 0:
        return -1
    x = (ord(s[0]) << 7) ^ r[length % len_r]
    i = 0
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        x ^= r[x % len_r]
        i += 1
    x ^= length
    return intmask(x)
---

> r = array.array('l', os.urandom(2**size_exponent))
> len_r = len(r)

r size should not depend on the size of a long. You should write something like:

sizeof_long = ctypes.sizeof(ctypes.c_long)
r_bits = 8
r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
r_mask = 2**r_bits-1

and then replace "% len_r" by "& r_mask".

What is the minimum value of r_bits? For example, would it be safe to use a single long integer? (r_bits=1)
msg150569 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 01:58
> > r = array.array('l', os.urandom(2**size_exponent))
> > len_r = len(r)
> 
> r size should not depend on the size of a long. You should write something like:
> 
> sizeof_long = ctypes.sizeof(ctypes.c_long)
> r_bits = 8
> r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
> r_mask = 2**r_bits-1

The final code will be in C and will use neither ctypes nor array.array.
Arguing about this looks quite pointless IMO.
msg150570 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 02:14
For the record, here is what "man urandom" says about random seed size:

“[...] no cryptographic primitive available today can hope to promise 
more than 256  bits of  security,  so  if  any  program  reads more than 
256 bits (32 bytes) from the kernel random pool per invocation, or per 
reasonable  reseed  interval (not less than one minute), that should be
taken as a sign that its cryptography  is  not  skilfully  implemented.”

In that light, reading a 64 bytes seed from /dev/urandom is already a lot, and 4096 bytes is simply insane.
msg150577 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 03:08
I read that the attack cannot be carried out with current computers (it's too expensive) against 64-bit Python. I tried to change str.__hash__ in 32-bit Python to compute the hash in 64 bits and then truncate the hash to 32 bits: it doesn't change anything, the hash values are the same, so it doesn't improve the security.
msg150589 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 05:09
Yet another random hash function, a simplified version of Paul's function. It always uses exactly 256 bits of entropy, and so 32 bytes of memory, and keeps the r lookup out of the inner loop. I don't expect my function to be secure, but just to give the attacker more work to compute the data for an attack against our dict implementation.

---
import os, array, struct
sizeof_long = struct.calcsize("l")
r_bits = 256
r_len = r_bits // (sizeof_long * 8)
r_mask = r_len - 1
r = array.array('l', os.urandom(r_len * sizeof_long))

def randomhash(s):
    length = len(s)
    if length == 0:
        return -2
    x = ord(s[0])
    x ^= r[x & r_mask]
    x <<= 7
    for ch in s:
        x = intmask(1000003 * x)
        x ^= ord(ch)
    x ^= length
    x ^= r[x & r_mask]
    return intmask(x)
---

The first "x ^= r[x & r_mask]" may be replaced by "x ^= r[(x ^ length) & r_mask]".

The binary shift is done after the first xor with r because 2**7 and r_len are not coprime (2**7 is a multiple of r_len), and so (ord(s[0]) << 7) & r_mask is always zero.

randomhash(s) == hash(s) if we use the same index in the r array twice. I don't know if this case gives useful information.
msg150592 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-04 06:00
A couple of things here:

First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be).

I favor using the same algorithm between 32 and 64 bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64 bit build but varies on 32 bit builds. Additionally, the impracticality of attacking 64 bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant point requiring us to go fix the problem all over again. 

I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult to debug edge-case conditions. It also means that the order of the hashes of long strings will vary while the order of short strings will not - another inconsistency which will encourage bugs.

Thank you Victor for the improvements to the python demonstration code. As Antoine said, it's only demo code, but better demo code is always good.

Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key.

Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then.

Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution.

When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r.
msg150601 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 11:02
> Using a fairly small value (4k) should not make the results much worse 
> from a security perspective, but might be problematic from a
> collision/distribution standpoint.

Keep in mind the average L1 data cache size is between 16KB and 64KB. 4KB is already a significant chunk of that.

Given a hash function's typical loop is to feed back the current result into the next computation, I don't see why a small value (e.g. 256 bytes) would be detrimental.
msg150609 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-04 14:52
If test_packaging fails because it relies on dict order / hash details, that’s a bug.  Can you copy the full tb (possibly in another report, I can fix it independently of this issue)?
msg150613 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-04 15:08
On Jan 04, 2012, at 06:00 AM, Paul McMillan wrote:

>Developers would be startled to find that ordering stays consistent on a 64
>bit build but varies on 32 bit builds.

Well, one positive outcome of this issue is that users will finally viscerally
understand that dictionary (and set) order should never be relied upon, even
between successive runs of the same Python executable.
msg150616 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 16:42
Some comments:

1. The security implications in all this are being somewhat overemphasized.

There are many ways you can do a DoS attack on web servers. It's the
responsibility of the web frameworks and servers in use to deal with
the possible cases.

It's a good idea to provide some way to protect against hash
collision attacks, but that will only solve one possible way of
causing a resource attack on a server.

There are other ways you can generate lots of CPU overhead with
little data input (e.g. think of targeting the search feature on
many Zope/Plone sites).

In order to protect against such attacks in general, we'd have to
provide a way to control CPU time and e.g. raise an exception if too
much time is being spent on a simple operation such as a key insertion.
This can be done using timers, signals or even under OS control.

The easiest way to protect against the hash collision attack is by
limiting the POST/GET/HEAD request size.

The second best way would be to limit the number of parameters that a
web framework accepts for POST/GET/HEAD request.

2. Changing the semantics of hashing in a dot release is not allowed.

If randomization of the hash start vector or some other method is
enabled by default in a dot release, this will change the semantics
of any application switching to that dot release.

The hash values of Python objects are not only used by the Python
dictionary implementation, but also by other storage mechanisms
such as on-disk dictionaries, inter-process object exchange via
share memory, memcache, etc.

Hence, if changed, the hash change should be disabled per default
for dot releases and enabled for 3.3.

3. Changing the way strings are hashed doesn't solve the problem.

Hash values of other types can easily be guessed as well, e.g.
take integers which use a trivial hash function.

We'd have to adapt all hash functions of the basic types in Python
or come up with a generic solution using e.g. double-hashing
in the dictionary/set implementations.

4. By just using a random start vector you change the absolute
hash values for specific objects, but not the overall hash sequence
or its period.

An attacker only needs to create many hash collisions, not
specific ones. It's the period of the hash function that's
important in such attacks and that doesn't change when moving to
a different start vector.

5. Hashing needs to be fast.

It's one of the most used operations in Python. Please get experts into
the boat like Tim Peters and Christian Tismer, who both have worked
on the dict implementation and the hash functions, before experimenting
with ad-hoc fixes.

6. Counting collisions could solve the issue without having to
change hashing.

Another idea would be counting the collisions and raising an
exception if the number of collisions exceed a certain
threshold.

Such a change would work for all hashable Python objects and
protect against the attack without changing any hash function.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg150619 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 17:18
Marc-Andre Lemburg wrote:
> 
> 3. Changing the way strings are hashed doesn't solve the problem.
> 
> Hash values of other types can easily be guessed as well, e.g.
> take integers which use a trivial hash function.

Here's an example for integers on a 64-bit machine:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>>> d = dict(g)

This takes ages to complete and only uses very little memory.
The input data is some 32 MB if written down in decimal numbers
- not all that much data either.

32397634
msg150620 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 17:22
The email interface ate part of my reply:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>>> s = ''.join(str(x) for x in g)
>>> len(s)
32397634
>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>>> d = dict(g)
... lots of time for coffee, pizza, taking a walk, etc. :-)
msg150621 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-04 17:41
To expand on Marc-Andre's point 1: the DOS attack on web servers is possible because servers are generally dumb at the first stage. Upon receiving a post request, all key=value pairs are mindlessly packaged into a hash table that is then passed on to a page handler that typically ignores the invalid keys.

However, most pages do not need any key,value pairs, and the forms that do need them have a pre-defined set of expected and recognized keys. If there were a possibly empty set of keys associated with each page, and the set were checked against posted keys, then a DOS post with thousands of effectively random keys could quickly (in O(1) time) be rejected as erroneous.

In Python, the same effect could be accomplished by associating a class with slots with each page and having the server create an instance of the class. Attempts to create an undefined attribute would then raise an exception. Either way, checking input data for face validity before processing it in a time-consuming way is one possible solution for nearly all web pages and at least some other applications.
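The slots idea can be sketched directly (the page class and helper names here are hypothetical, for illustration only):

```python
class LoginForm:
    """Hypothetical per-page declaration of the accepted form keys."""
    __slots__ = ('username', 'password')

def load_form(form_class, pairs):
    """Bind posted key/value pairs to a fixed-slot page object."""
    form = form_class()
    for key, value in pairs:
        # Any key outside __slots__ raises AttributeError immediately,
        # so a flood of bogus keys is rejected without ever building a
        # dict from attacker-controlled keys.
        setattr(form, key, value)
    return form
```
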
msg150622 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-04 17:44
Except, it's a totally non-scalable approach.  People have vulnerabilities all over their sites which they don't realize.  Some examples:

django-taggit (an application I wrote for handling tags) parses tags out of its input and stores them in a set to check for duplicates. It's vulnerable.

Another site I'm writing accepts JSON POSTs, you can put arbitrary keys in the JSON.  It's vulnerable.
msg150625 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 17:58
Marc-Andre Lemburg wrote:
> 
> 1. The security implications in all this are being somewhat overemphasized.
> 
> There are many ways you can do a DoS attack on web servers. It's the
> responsibility of the used web frameworks and servers to deal with
> the possible cases.
> 
> It's a good idea to provide some way to protect against hash
> collision attacks, but that will only solve one possible way of
> causing a resource attack on a server.
> 
> There are other ways you can generate lots of CPU overhead with
> little data input (e.g. think of targeting the search feature on
> many Zope/Plone sites).
> 
> In order to protect against such attacks in general, we'd have to
> provide a way to control CPU time and e.g. raise an exception if too
> much time is being spent on a simple operation such as a key insertion.
> This can be done using timers, signals or even under OS control.
> 
> The easiest way to protect against the hash collision attack is by
> limiting the POST/GET/HEAD request size.

For GET and HEAD, web servers normally already apply such limitations
at rather low levels:

http://stackoverflow.com/questions/686217/maximum-on-http-header-values

So only HTTP methods which carry data in the body part of the HTTP
request are affected, e.g. POST and various WebDAV methods.

> The second best way would be to limit the number of parameters that a
> web framework accepts for POST/GET/HEAD request.

Depending on how parsers are implemented, applications taking
XML/JSON/XML-RPC/etc. as data input may also be vulnerable, e.g.
non-validating XML parsers which place element attributes into
a dictionary or a JSON parser that has to read the JSON version of
the dict I generated earlier on.
msg150634 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 23:42
Work-in-progress patch implementing my randomized hash function (random.patch):
 - add PyOS_URandom() using CryptoGen, SSL (only on VMS!!) or /dev/urandom, with a fallback on a dummy LCG if the OS urandom fails
 - posix.urandom() is always defined and reuses PyOS_URandom()
 - hash(str) is now randomized using two random Py_hash_t values: don't touch the critical loop, only add a prefix and a suffix

Notes:
 - PyOS_URandom() reuses mostly code from Modules/posixmodule.c, except dev_urandom() and fallback_urandom() which are new
 - I removed memset(PyBytes_AS_STRING(result), 0, howMany); from win32_urandom() because it doesn't really change anything: the LCG is used anyway if win32_urandom() fails
 - Python refuses to start if the OS urandom is missing.
 - Python/random.c code may be moved into Python/pythonrun.c if it is an issue to add a new file in old Python versions.
 - If the OS urandom fails to generate the unicode hash secret, no warning is emitted (because the LCG is used). I don't know if a warning is needed in this case.
 - os.urandom() argument is now a Py_ssize_t instead of an int

TODO:
 - add an environment option to ignore the OS urandom and only uses the LCG
 - fix all tests broken because of the randomized hash(str)
 - PyOS_URandom() raises exceptions whereas it is called before creating the interpreter state. I suppose that it cannot work like this.
 - review and test PyOS_URandom()
 - review and test the new randomized hash(str)
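The prefix/suffix scheme from the first bullet can be sketched in pure Python (illustrative only: the real patch works on Py_hash_t in C, and the 64-bit mask and use of os.urandom/struct here are assumptions of this sketch):

```python
import os
import struct

# Two random Py_hash_t-like values, drawn once at process startup.
_PREFIX, _SUFFIX = struct.unpack('QQ', os.urandom(16))
_MASK = 2 ** 64 - 1

def randomized_str_hash(s):
    """String hash with a random prefix and suffix; the loop is untouched."""
    if not s:
        return 0
    x = (_PREFIX ^ (ord(s[0]) << 7)) & _MASK    # mix in the random prefix
    for ch in s:                                # critical loop: unchanged
        x = ((1000003 * x) ^ ord(ch)) & _MASK
    x ^= len(s)
    x ^= _SUFFIX                                # mix in the random suffix
    return x
```

Because only the prefix and suffix are new, the per-character cost is identical to the original hash; the secret never interacts with the loop itself.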
msg150635 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 23:54
> add PyOS_URandom() using CryptoGen, SSL (only on VMS!!)
> or /dev/urandom

Oh, OpenSSL (RAND_pseudo_bytes) should be used on Windows, Linux, Mac OS X, etc. if OpenSSL is available. I was just too lazy to add a define or pyconfig.h option to indicate if OpenSSL is available or not. FYI RAND_pseudo_bytes() is now exposed in the ssl module of Python 3.3.

> with a fallback on a dummy LCG

It's the linear congruential generator (LCG) used by Microsoft Visual C++ and PHP:

x(n+1) = (x(n) * 214013 + 2531011) % 2^32

I only use bits 23..16 (bits 15..0 are not really random).
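That generator can be sketched as follows (the function name is made up; this mirrors the recurrence and the bits-23..16 extraction described above):

```python
def msvc_lcg_bytes(seed, n):
    """Return n pseudo-random bytes from the MSVC/PHP-style LCG above."""
    x = seed
    out = bytearray()
    for _ in range(n):
        x = (x * 214013 + 2531011) % 2 ** 32
        out.append((x >> 16) & 0xFF)   # keep bits 16..23; low bits are weak
    return bytes(out)
```

Discarding the low 16 bits matters: the low-order bits of a power-of-two-modulus LCG have very short periods (bit 0 merely alternates), so using all bits, as Victor says PHP does, leaks far more structure.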
msg150636 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 00:01
> > add PyOS_URandom() using CryptoGen, SSL (only on VMS!!)
> > or /dev/urandom
> 
> Oh, OpenSSL (RAND_pseudo_bytes) should be used on Windows, Linux, Mac
> OS X, etc. if OpenSSL is available.

Apart from the large dependency, the OpenSSL license is not
GPL-compatible which may be a problem for some Python-embedding
applications:
http://en.wikipedia.org/wiki/OpenSSL#Licensing

> > with a fallback on a dummy LCG
> 
> It's the Linear congruent generator (LCG) used by Microsoft Visual C++
> and PHP:
> 
> x(n+1) = (x(n) * 214013 + 2531011) % 2^32
> 
> I only use bits 23..16 (bits 15..0 are not really random).

If PHP uses it, I'm confident it is secure.
msg150637 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:02
+            printf("read %i bytes\n", size);

Oops, I forgot a debug message.
msg150638 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:11
> If PHP uses it, I'm confident it is secure.

If I remember correctly, it is only used for the Windows version of PHP, but PHP doesn't implement it correctly because it uses all bits.
msg150639 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 00:31
This is not something that can be fixed by limiting the size of POST/GET. 

Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format-shifting tool) can't convert, but are otherwise perfectly valid for non-python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere. Any application that accepts data from users needs to be considered.

Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.

Any non-trivial Python application which parses data needs the fix. The entire standard library needs the fix if it is to be relied upon by applications which accept data. It makes sense to fix Python.

Of course we must fix all the basic hashing functions in python, not just the string hash. There aren't that many. 

Marc-Andre:
If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create colliding hash functions without knowing the key.

Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea. Additionally, randomizing only some of the characters in longer strings will allow us to improve security without compromising speed significantly.

I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.
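
A toy model of that proposal (the 6-character threshold and 5-character head/tail are from the message above; the mixing itself is a stand-in, not CPython's actual hash):

```python
MASK32 = 0xFFFFFFFF

def partial_random_hash(s, key):
    """Toy hash: strings shorter than 6 chars are hashed without the
    random key; for longer strings, only the first and last 5
    characters are mixed with the key, capping the extra work at a
    maximum of 10 randomized rounds."""
    n = len(s)
    h = 0
    for i, c in enumerate(s):
        v = ord(c)
        if n >= 6 and (i < 5 or i >= n - 5):
            v ^= key[i % len(key)]  # randomize head and tail only
        h = ((h * 1000003) & MASK32) ^ v
    return h ^ n
```

Note the trade-off this makes explicit: short strings hash identically under any key, which is the surprising behavior Christian objects to below.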
msg150641 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:36
"Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea."

My proposal only adds two XORs to hash(str) (outside the loop over Unicode characters), so I expect a negligible overhead. I don't know yet how hard it is to guess the secret from hash(str) output.
msg150642 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 00:36
Thanks Victor!

> - hash(str) is now randomized using two random Py_hash_t values: 
> don't touch the critical loop, only add a prefix and a suffix

At least for Python 2.x hash(str) and hash(unicode) have to yield the same result for ASCII-only strings. 

>  - PyOS_URandom() raises exceptions whereas it is called before
> creating the interpreter state. I suppose that it cannot work like this.

My patch compensates for the issue and calls Py_FatalError() when the random seed hasn't been initialized yet.

You aren't special-casing small strings. I fear that an attacker may guess the seed from several small strings. How about using another initial seed for strings shorter than 4 code points?
msg150643 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 00:39
> You aren't special casing small strings. I fear that an attacker may
> guess the seed from several small strings.

How would (s)he do?
msg150644 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 00:44
> My proposition only adds two XOR to hash(str) (outside the loop on Unicode characters), so I expect a ridiculous overhead. I don't know yet how hard it is to guess the secret from hash(str) output.

It doesn't work much better than a single random seed. Calculating the
hash of a null byte gives you the xor of your two seeds. An attacker
can still cause collisions inside the vulnerable hash function, your
change doesn't negate those internal collisions. Also, strings of all
null bytes collide trivially.
msg150645 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:49
> I fear that an attacker may guess the seed from several small strings

hash(a) ^ hash(b) "removes" the suffix, but I don't see how to guess the prefix from this new value. It doesn't mean that it is not possible, just that I don't have a strong background in cryptography :-)

I don't expect that adding 2 XORs would change our dummy (fast but unsafe) hash function into a cryptographic hash function. We cannot have security for free. If we want a strong cryptographic hash function, it would be much slower (Paul wrote that it would be 4x slower). But we prefer speed over security, so we have to compromise.

I don't know if you can retrieve hash values in practice. I suppose that you can only get hash(str) & (size - 1) with size=size of the dict internal array, so only the lower bits. Using a large dict, you may be able to retrieve more bits of the hash value.
msg150646 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 00:53
Given that a user has an application with an oracle function that returns the hash of a unicode string, an attacker can probe tens of thousands of one- and two-character unicode strings. That should give him/her enough data to calculate both seeds. hash("") already gives away lots of information about the seeds, too.

- hash("") should always return 0

- for small strings we could use a different seed than for larger strings

- for larger strings we could use Paul's algorithm but limit the XOR op to the first and last 16 elements instead of all elements.
msg150647 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 00:57
> - for small strings we could use a different seed than for larger strings

Or just leave them unseeded with our existing algorithm. Shifting them
into a different part of the hash space doesn't really gain us much.

> - for larger strings we could use Paul's algorithm but limit the XOR op to the first and last 16 elements instead of all elements.

Agreed. It does have to be both the first and the last though. We
can't just do one or the other.
msg150648 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 00:58
Paul wrote:
> I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.

It's too surprising for developers when only the strings with 6 or more chars are randomized. Barry made a good point http://bugs.python.org/issue13703#msg150613
msg150649 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:58
"Calculating the hash of a null byte gives you the xor of your two seeds."

Not directly, because the prefix is first multiplied by 1000003. So hash("\0") gives you ((prefix * 1000003) % 2^32) ^ suffix.

Example:

$ ./python 
secret={b7abfbbf, db6cbb4d}
Python 3.3.0a0 (default:547e918d7bf5+, Jan  5 2012, 01:36:39) 
>>> hash("")
1824997618
>>> hash("\0")
-227042383
>>> hash("\0"*2)
1946249080
>>> 0xb7abfbbf ^ 0xdb6cbb4d
1824997618
>>> (0xb7abfbbf * 1000003) & 0xffffffff ^ 0xdb6cbb4d
4067924912
>>> hash("\0") & 0xffffffff
4067924913
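
The session above is consistent with a simple model of the patch (a sketch over unsigned 32-bit values; the real patch works at the C level and returns signed hashes):

```python
MASK32 = 0xFFFFFFFF

def randomized_hash(s, prefix, suffix):
    """Model of the patched hash(str): seed the classic 1000003-based
    string hash with a random prefix, XOR in the length, then XOR a
    random suffix. (The hash('') == 0 special case added in patch
    version 2 is deliberately omitted here.)"""
    h = prefix
    for c in s:
        h = ((h * 1000003) & MASK32) ^ ord(c)
    h ^= len(s)
    return (h ^ suffix) & MASK32

prefix, suffix = 0xB7ABFBBF, 0xDB6CBB4D  # the secret printed above

assert randomized_hash("", prefix, suffix) == 1824997618    # = prefix ^ suffix
assert randomized_hash("\0", prefix, suffix) == 4067924913  # hash("\0") & 0xffffffff
```

This reproduces every number in the session, including Antoine's observation that hash("") leaks prefix ^ suffix directly.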
msg150650 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 01:05
> At least for Python 2.x hash(str) and hash(unicode) have to yield
> the same result for ASCII only strings. 

Ah yes, I forgot Python 2: I wrote my patch for Python 3.3. The two hash functions should be modified to be randomized.

> hash("") should always return 0

Ok, I can add a special case. Antoine told me that hash("") gives prefix ^ suffix, which is too much information for the attacker :-)

> for small strings we could use a different seed
> than for larger strings

Why? The attack doesn't work with short strings? What do you call a "short string"?
msg150651 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 01:09
Patch version 2:
 - hash("") is always 0
 - Remove a debug message
msg150652 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 01:17
In reply to MAL's message http://bugs.python.org/issue13703#msg150616

> 2. Changing the semantics of hashing in a dot release is not allowed.

I concur with Marc. The change is too intrusive and may cause too much trouble relative to the issue. Also it seems to be unnecessary for platforms with a 64-bit hash.

Marc: Fred told me that ZODB isn't affected. One thing less to worry. ;)


> 5. Hashing needs to be fast.

Good point; we should include Tim and Christian Tismer once we have a solution we can agree upon

PS: I'm missing "Reply to message" and a threaded view for lengthy topics
msg150655 - (view) Author: Huzaifa Sidhpurwala (Huzaifa.Sidhpurwala) Date: 2012-01-05 06:25
I am wondering if a CVE id has been assigned to this security issue yet?
msg150656 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-05 09:01
Paul McMillan wrote:
> 
> This is not something that can be fixed by limiting the size of POST/GET. 
> 
> Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format shifting tool) can't convert, but are otherwise perfectly valid for non-python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere. Any application that accepts data from users needs to be considered.
> 
> Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.
> 
> Any non-trivial python application which parses data needs the fix. The entire standard library needs the fix if is to be relied upon by applications which accept data. It makes sense to fix Python.

Agreed: Limiting the size of POST requests only applies to *web* applications.
Other applications will need other fixes.

Trying to fix the problem in general by tweaking the hash function to
(apparently) make it hard for an attacker to guess a good set of
colliding strings/integers/etc. is not really a good solution. You'd
only be making it harder for script kiddies, but as soon as someone
cryptanalyzes the hash algorithm used, you're lost again.

You'd need to use crypto hash functions or universal hash functions
if you want to achieve good security, but that's not an option for
Python objects, since the hash functions need to be as fast as possible
(which rules out crypto hash functions) and cannot easily drop the invariant
"a=b => hash(a)=hash(b)" (which rules out universal hash functions, AFAICT).

IMO, the strategy to simply cap the number of allowed collisions is
a better way to achieve protection against this particular resource
attack. The probability of having valid data reach such a limit is
low and, if configurable, can be made 0.

> Of course we must fix all the basic hashing functions in python, not just the string hash. There aren't that many. 

... not in Python itself, but if you consider all the types in Python
extensions and classes implementing __hash__ in user code, the number
of hash functions to fix quickly becomes unmanageable.

> Marc-Andre:
> If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create colliding hash functions without knowing the key.

Could you post it on the ticket ?

BTW: I wonder how long it's going to take before someone figures out
that our merge sort based list.sort() is vulnerable as well... its
worst-case performance is O(n log n), making attacks somewhat harder.
The popular quicksort which Python used for a long time has O(n²),
making it much easier to attack, but fortunately, we replaced it
with merge sort in Python 2.3, before anyone noticed ;-)
msg150659 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-05 09:43
What is the mechanism by which the attacker can determine the seeds?
The actual hash value is not directly observable externally.
The attacker can only determine the timing effects of multiple 
insertions into a dict, or have I missed something?

> - hash("") should always return 0

Why should hash("") always return 0?
I can't find it in the docs anywhere.
msg150662 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 10:20
It's quite possible that a user has created a function (by mistake or deliberately) that gives away the hash of an arbitrary string. We haven't taught developers that they shouldn't disclose the hash of a string.

> Why should hash("") always return 0?
> I can't find it in the docs anywhere.

hash("") should return something constant that doesn't reveal information about the random seeds. 0 is an arbitrary choice that is as good as anything else. hash("") already returns 0, hence my suggestion for 0.
msg150665 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-05 10:41
But that's not the issue we are supposed to be dealing with.
A single (genuinely random) seed will deal with the attack described in 
the talk and it is (almost) as fast as using 0 as a seed.
Why make things complicated dealing with a hypothetical problem?

>> Why should hash("") always return 0?
>> I can't find it in the docs anywhere.
> 
> hash("") should return something constant that doesn't reveal information about the random seeds. 0 is an arbitrary choice that is as good as anything else. hash("") already returns 0, hence my suggestion for 0.

Is special casing arbitrary values really any more secure?
If we special case "", the attacker will just start using "\0" and so on...

> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
msg150668 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 12:41
> I concur with Marc. The change is too intrusive and may cause too much
> trouble for the issue.

Do you know if mod_wsgi et al. are tackling the issue on their side?

> Also it seems to be unnecessary for platforms with 64bit hash.

We still support Python on 32-bit platforms, so this can't be a serious
argument.
If you think that no-one runs a server on a 32-bit kernel nowadays, I
would point out that "no-one" apparently doesn't include ourselves ;-)
msg150694 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 21:40
Marc-Andre: Victor already pasted the relevant part of my code:
http://bugs.python.org/issue13703#msg150568
The link to the fuller version, with revision history and a copy of the code before I modified it is here:
https://gist.github.com/0a91e52efa74f61858b5

>Why? The attack doesn't work with short strings? What do you call a "short string"?

Well, the demonstrated collision is for 16-character ASCII strings. Worst case UTF-8, we're looking at 3 manipulable bytes per character, but they may be harder to collide since some of those bytes are fixed.

> only be making it harder for script kiddies, but as soon as someone
> crypt-analysis the used hash algorithm, you're lost again.

Not true. What I propose is to make the amount of information necessary to analyze and generate collisions impractically large. My proposed hash function is certainly broken if you brute force the lookup table. There are undoubtedly other problems with it too. The point is that it's hard enough. We aren't going for perfect security - we're going for enough to make this attack impractical.

What are the downsides to counting collisions? For one thing, it's something that needs to be kept track of on a per-dict basis, and can't be cached the way the hash results are. How do you choose a good value for the limit? If you set it to something conservative, you still pay the collision price every time a dict is created to discover that the keys collide. This means that it's possible to feed in bad data up to exactly the limit, and suddenly the Python app is inexplicably slow. If you set the limit too aggressively, then sometimes valid data gets caught, and Python randomly dies in hard-to-debug ways with an error the programmer has never seen in testing and cannot reproduce.

It adds a new way to kill most python applications, and so programs are going to have to be re-written to cope with it. It also introduces a new place to cause errors - if the WSGI server dies, it's hard for my application to catch that and recover gracefully.

>... not in Python itself, but if you consider all the types in Python
> extensions and classes implementing __hash__ in user code, the number
> of hash functions to fix quickly becomes unmanageable.

When we looked at the Django project, we found we wouldn't have anything to fix, since our hashes end up relying on the Python-internal values eventually. I suspect a lot of other code is similar.

Mark said:
>What is the mechanism by which the attacker can determine the seeds?

The biggest information leak is probably the ordering in which dict entries are returned. This can be used to deduce the underlying hash values. This is much easier than trying to do it via timing.
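
A sketch of that leak. A 2012-era dict iterated its table in slot order, so iteration order was determined by hash(key) & (table_size - 1); the model below computes that order directly (it ignores collision probing, and today's insertion-ordered dicts no longer behave this way):

```python
def slot_order(keys, table_bits=3):
    """Return keys in the order a slot-ordered hash table of size
    2**table_bits would yield them: sorted by hash(key) & mask.
    Anyone who observes this order learns the low hash bits."""
    mask = (1 << table_bits) - 1
    return sorted(keys, key=lambda k: hash(k) & mask)
```

For small ints, hash(n) == n, so the leak is plain: slot_order([6, 1, 4]) comes out [1, 4, 6] regardless of insertion order.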

> But that's not the issue we are supposed to be dealing with.
> A single (genuinely random) seed will deal with the attack described in 
> the talk and it is (almost) as fast as using 0 as a seed.

This is not true. A single random seed shifts the hash table, but does not actually prevent an attacker from generating collisions. Please see my other posts on the topic here and on the mailing list.
msg150699 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 22:49
> What I propose is to make the amount of information necessary
> to analyze and generate collisions impractically large.

Not only: the attacker has to compute the collisions for the new seed. I don't know how long that takes; the code to generate collisions is not public yet. I suppose that generating collisions takes longer if we change the hash function to add more instructions (I don't know by how much).

If generating the collisions requires a farm of computers / GPUs / something else and 7 days, it doesn't matter if it's easy to retrieve the secret.

If the attacker wants to precompute collisions for all possible seeds, (s)he will also have to store them. With 64 bits of entropy, even if the data for one seed is just 1 byte, you have to store 2^64 bytes (16,777,216 TB).

It is a problem if it takes less than a day on a desktop PC to generate data for an attack. In that case, it should be difficult to compute the secret.
msg150702 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-06 00:23
Note for myself, random-2.patch: _PyRandom_Init() must generate a prefix and a suffix different than zero (call PyOS_URandom in a loop, and fail after 100 tries).
msg150706 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-06 01:09
"Given that a user has an application with an oracle function that returns the hash of a unicode string, an attacker can probe tenth of thousand one and two character unicode strings. That should give him/her enough data to calculate both seeds. hash("") already gives away lots of infomration about the seeds, too."

Sorry, but I don't see how you compute the secret using these data.

You are right, hash("\0") gives some information about the secret. With my patch, hash("\0")^1 gives: ((prefix * 1000003) & HASH_MASK) ^ suffix.

(hash("\0")^1) ^ (hash("\0\0")^2) gives ((prefix * 1000003) & HASH_MASK) ^ ((prefix * 1000003**2)  & HASH_MASK).
msg150707 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-06 01:44
Either we are really paranoid (I know that I am *g*) or Perl's and Ruby's randomized hashing functions suffer from the issues we are worried about. They don't compensate for hash(''), hash(n * '\0') or hash(shortstring).

Perl 5.12.4 hv.h:

#define PERL_HASH(hash,str,len) \
     STMT_START { \
        register const char * const s_PeRlHaSh_tmp = str; \
        register const unsigned char *s_PeRlHaSh = (const unsigned char *)s_PeRlHaSh_tmp; \
        register I32 i_PeRlHaSh = len; \
        register U32 hash_PeRlHaSh = PERL_HASH_SEED; \
        while (i_PeRlHaSh--) { \
            hash_PeRlHaSh += *s_PeRlHaSh++; \
            hash_PeRlHaSh += (hash_PeRlHaSh << 10); \
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6); \
        } \
        hash_PeRlHaSh += (hash_PeRlHaSh << 3); \
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11); \
        (hash) = (hash_PeRlHaSh + (hash_PeRlHaSh << 15)); \
    } STMT_END

Ruby 1.8.7-p357 st.c:strhash()

#define CHAR_BIT 8
hash_seed = rb_genrand_int32() # Mersenne Twister

    register unsigned long val = hash_seed;

    while ((c = *string++) != '\0') {
        val = val*997 + c;
        val = (val << 13) | (val >> (sizeof(st_data_t) * CHAR_BIT - 13));
    }

    return val + (val>>5);

I wasn't able to find Java's fix quickly. Anybody else?
msg150708 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-06 01:50
Perl is so paranoid they obscure their variable names!  In all seriousness, both Perl and Ruby are vulnerable to the timing attacks, and as far as I know the JVM maintainers are not patching this themselves, but are telling applications to fix it (I know JRuby switched to MurmurHash).
msg150712 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-06 02:50
As Alex said, Java has refused to fix the issue.

I believe that Ruby 1.9 (at least the master branch code that I looked
at) is using murmurhash2 with a random seed.

In either case, yes, these functions are vulnerable to a number of
attacks. We're solving the problem more completely than they did.
msg150713 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-06 02:57
Those who use or advocate a simple randomized starting hash (Perl, Ruby, perhaps MS, and the CCC presenters) are presuming that the randomized hash values are kept private. Indeed, they should be (and the docs could note this) unless an attacker has direct access to the interpreter. An attacker who does, as in a Python programming class, can much more easily freeze the interpreter by 'accidentally' writing code equivalent to "while True: pass".

I do not think we, as Python developers, should be concerned about esoteric timing attacks. They strike me as a site issue rather than a language issue. As I understand them, they require *large* numbers of probes coupled with responses based on the same hash function. So a site being so probed already has a bit of a problem. And if hashing were randomized per process, and probes were randomly distributed among processes, and processes were periodically killed and restarted with new seeds, could such an attack get anywhere (besides the DoS effect of the probing)? The point of the CCC talk was that with one constant known hash, one could lock up a server for a long while with just one upload.

So I think we should copy Perl and Ruby, do the easy thing, and add a random seed to 3.3 hashing, subject to keeping equality for equal numbers. Let whatever thereby fails, fail, and be updated. For prior versions, add an option for strings and perhaps numbers, and document that some tests will fail if enabled.

We could also consider, for 3.3, making the output of hash() be different from the internal values used for dicts, perhaps by switching random seeds in hash(). So even if someone does return hash(x) values to potential attackers, they are not the values used in dicts. (This would require a slight change in the doc.)
msg150718 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-06 09:08
I agree.

+1 for strings. -0 for numbers.

This might cause problems with dict subclasses and the like,
so I'm -1 on this.
msg150719 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-06 09:31
Without the context, that last message didn't make much sense.

I agree with Terry that we should copy Perl and Ruby (for strings).
I'm -1 on hash() returning a different value than dict uses internally.
msg150724 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:49
Before continuing down the road of adding randomness to hash
functions, please have a good read of the existing dictionary
implementation:

"""
Major subtleties ahead:  Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness.  Python doesn't:  its most
important hash functions (for strings and ints) are very regular in common
cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are "consecutive" strings.  So this
gives better-than-random behavior in common cases, and that's very desirable.
...
"""

There's also a file called dictnotes.txt which has more interesting
details about how the implementation is designed.

Please note that the term "collision" is used in a slightly different
way: it refers to trying to find an empty slot in the dictionary
table. Having a collision implies that the hash values of two distinct
objects are the same, but you also get collisions when two distinct
objects with different hash values get mapped to the same table entry.

An attack can be based on trying to find many objects with the same
hash value, or trying to find many objects that, as they get inserted
into a dictionary, very often cause collisions due to the collision
resolution algorithm not finding a free slot.

In both cases, the (slow) object comparisons needed to find an
empty slot are what makes the attack practical, if the application
puts too much trust into large blobs of input data - which is
the actual security issue we're trying to work around here...
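
The table-entry collisions being described come from the probe sequence; a sketch of CPython's perturbed probing (the recurrence and PERTURB_SHIFT follow dictobject.c, simplified to non-negative hashes):

```python
PERTURB_SHIFT = 5

def probe_sequence(h, mask, steps):
    """First few table slots examined for hash h: the initial slot is
    h & mask, and each subsequent slot mixes in higher hash bits via
    `perturb`, so keys sharing only their low bits still separate
    after a few probes."""
    i = h & mask
    perturb = h
    slots = [i]
    for _ in range(steps):
        i = (5 * i + perturb + 1) & mask
        perturb >>= PERTURB_SHIFT
        slots.append(i)
    return slots
```

An attacker who can force many keys onto the same probe chain forces one slow comparison per occupied slot visited, which is the quadratic blow-up at issue.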

Given the dictionary implementation notes, I'm even less certain
that the randomization change is a good idea. It will likely
introduce a performance hit due to both the added complexity in
calculating the hash as well as the reduced cache locality of
the data in the dict table.

I'll upload a patch that demonstrates the collision counting
strategy to show that detecting the problem is easy. Whether
just raising an exception is a good idea, is another issue.

It may be better to change the tp_hash slot in Python 3.3
to take an argument, so that the dict implementation can
use the hash function as universal hash family function
(see http://en.wikipedia.org/wiki/Universal_hash).

The dict implementation could then alter the hash parameter
and recreate the dict table in case the number of collisions
exceeds a certain limit, thereby actively taking action
instead of just relying on randomness solving the issue in
most cases.
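
A toy of that idea (every name, the parameterized hash, the collision limit, and linear probing are inventions of this sketch, standing in for a tp_hash slot that takes a parameter):

```python
MASK32 = 0xFFFFFFFF
COLLISION_LIMIT = 8  # arbitrary choice for the sketch

def param_hash(key, param):
    # stand-in for a hash function drawn from a universal family,
    # selected by `param`
    h = param
    for c in str(key):
        h = ((h * 1000003) & MASK32) ^ ord(c)
    return h

class RehashingTable:
    """Toy open-addressed table that picks a new hash parameter and
    rebuilds itself when one insertion sees too many collisions."""

    def __init__(self, nbits=3):
        self.param = 0
        self.slots = [None] * (1 << nbits)

    def _insert(self, key, value):
        mask = len(self.slots) - 1
        i = param_hash(key, self.param) & mask
        for n in range(COLLISION_LIMIT):
            j = (i + n) & mask  # linear probing for simplicity
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return True
        return False  # collision limit reached

    def put(self, key, value):
        while not self._insert(key, value):
            self.param += 1  # switch to another member of the family
            items = [s for s in self.slots if s is not None]
            self.slots = [None] * len(self.slots)
            for k, v in items:  # sketch: assume old items re-insert fine
                self._insert(k, v)

    def get(self, key):
        mask = len(self.slots) - 1
        i = param_hash(key, self.param) & mask
        for n in range(COLLISION_LIMIT):
            slot = self.slots[(i + n) & mask]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)
```

The design point is the one MAL makes: instead of failing on attack data, the table actively re-parameterizes, so crafted collisions only cost one rebuild.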
msg150725 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:52
Demo patch implementing the collision limit idea for Python 2.7.
msg150726 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:56
The hash-attack.patch solves the problem for the integer case
I posted earlier on and doesn't cause any problems with the
test suite.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many hash collisions'

It also doesn't change the hashing or dict repr in existing
applications.
msg150727 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:56
Stupid email interface again... here's the full text:

The hash-attack.patch solves the problem for the integer case
I posted earlier on and doesn't cause any problems with the
test suite.

>>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many hash collisions'

It also doesn't change the hashing or dict repr in existing
applications.
msg150738 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-06 16:35
hash-attack.patch never decrements the collision counter.
msg150748 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 17:03
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> hash-attack.patch never decrements the collision counter.

Why should it? It's only used as a local variable in the lookup function.

Note that the limit only triggers on a per-key basis. It's not
a limit on the total number of collisions in the table, so you don't
need to keep the number of collisions stored on the object.
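
So the counter can be modeled as a local of the lookup itself, reset on every call (a sketch: the limit, slot layout, and probing are simplified stand-ins for the patch's C code):

```python
MAX_COLLISIONS = 1000  # stand-in for the patch's compile-time limit

def lookup_slot(slots, key, h):
    """Find the slot for `key` in an open-addressed table, raising
    once a single lookup has probed past too many occupied slots.
    The counter is local, so the limit is per-key, not a total over
    the whole table."""
    mask = len(slots) - 1
    i = h & mask
    collisions = 0
    while slots[i] is not None and slots[i][0] != key:
        collisions += 1
        if collisions > MAX_COLLISIONS:
            raise KeyError("too many hash collisions")
        i = (i + 1) & mask  # simplified probing
    return i
```

Normal lookups pay only the cost of the increment; only a probe chain long enough to look like an attack trips the exception.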
msg150756 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 17:59
Here's an example of hash-attack.patch catching a deliberate
programming error (hashing all objects to the same value):

http://stackoverflow.com/questions/4865325/counting-collisions-in-a-python-dictionary
(see the second example on the page for @Winston Ewert's solution)

With the patch you get:

Traceback (most recent call last):
  File "testcollisons.py", line 20, in <module>
    d[o] = 1
KeyError: 'too many hash collisions'
msg150766 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-06 19:53
> Those who use or advocate a simple randomized starting hash (Perl, Ruby, perhaps MS, and the CCC presenters) are presuming that the randomized hash values are kept private. Indeed, they should be (and the docs could note this) unless an attacker has direct access to the interpreter.

Except that this is patently untrue. Anytime any programmer iterates
over a dictionary and returns the ordered result to the user in some
form, they're leaking information about the hash value. I hope you're
not suggesting that any programmer who is concerned about security
will make sure to sort the results of every iteration before making it
public in some fashion.

> I do not think we, as Python developers, should be concerned about esoteric timing attacks.

Timing attacks are less esoteric than you think they are. This issue
gets worse, not better, as the internet moves (for better or worse)
towards virtualized computing.

> And if hashing were randomized per process, and probes were randomly distributed among processes, and processes were periodically killed and restarted with new seeds, could such an attack get anywhere...

You're suggesting that in order for a Python application to be secure,
it's a requirement that we randomly kill and restart processes from
time to time? I thought we were going for a good solution here, not a
hacky workaround.

> We could also consider, for 3.3, making the output of hash() be different from the internal values used for dicts, perhaps by switching random seeds in hash(). So even if someone does return hash(x) values to potential attackers, they are not the values used in dicts. (This would require a slight change in the doc.)

This isn't a bad idea, but I'd be fine documenting that the output of
hash() shouldn't be made public.
msg150768 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-06 20:50
"You're suggesting that in order for a Python application to be secure,
it's a requirement that we randomly kill and restart processes from
time to time?"

No, that is not what I said.
msg150769 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-06 20:56
> An attack can be based on trying to find many objects with the same
> hash value, or trying to find many objects that, as they get inserted
> into a dictionary, very often cause collisions due to the collision
> resolution algorithm not finding a free slot.

Yep. Allowing an attacker to produce very large dictionaries is also bad.

> if the application
> puts too much trust into large blobs of input data - which is
> the actual security issues we're trying to work around here...

To be very clear the issue is ANY large blob of data anywhere in the
application, not just on input. An attack could happen after whatever
transform your application runs on the data before returning it.

> I'll upload a patch that demonstrates the collisions counting
> strategy to show that detecting the problem is easy. Whether
> just raising an exception is a good idea, is another issue.

I'm in cautious agreement that collision counting is a better
strategy. The dict implementation performance would suffer from
randomization.

> The dict implementation could then alter the hash parameter
> and recreate the dict table in case the number of collisions
> exceeds a certain limit, thereby actively taking action
> instead of just relying on randomness solving the issue in
> most cases.

This is clever. You basically neuter the attack as you notice it but
everything else is business as usual. I'm concerned that this may end
up being costly in some edge cases (e.g. look up how many collisions
it takes to force the recreation, and then aim for just that many
collisions many times). Unfortunately, each dict object has to
discover for itself that it's full of offending hashes. Another
approach would be to neuter the offending object by changing its hash,
but this would require either returning multiple values, or fixing up
existing dictionaries, neither of which seems feasible.
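
[Editor's note: the counting idea under discussion can be sketched in pure Python. This is a toy open-addressing table, not CPython's C implementation; the class name, the limit, and the resize-free design are all illustrative. The probe sequence mimics CPython's 5*i + perturb + 1 recurrence.]

```python
MAX_COLLISIONS = 1000  # illustrative cap, in the spirit of lemburg's patch

class CountingTable:
    """Toy open-addressing table: a single insert that has to step past
    more than MAX_COLLISIONS occupied slots aborts with an exception.
    The toy never resizes, which keeps the sketch short."""

    def __init__(self, size=8):
        # each slot holds (hash, key, value) or None
        self._slots = [None] * size

    def insert(self, key, value):
        mask = len(self._slots) - 1
        h = hash(key)
        perturb = h & ((1 << 64) - 1)  # treat the hash as unsigned, as CPython does
        i = h & mask
        collisions = 0
        while True:
            slot = self._slots[i]
            if slot is None:
                self._slots[i] = (h, key, value)
                return
            if slot[0] == h and slot[1] == key:
                self._slots[i] = (h, key, value)  # overwrite existing key
                return
            collisions += 1
            if collisions > MAX_COLLISIONS:
                raise KeyError("too many hash collisions")
            # CPython-style probing over the remaining slots
            i = (5 * i + perturb + 1) & mask
            perturb >>= 5
```

Against an attacker this turns the quadratic blow-up into an early, cheap failure; Paul's caveat applies, though, since an attacker can aim for just under the limit, many times over.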
msg150771 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-06 21:53
> I'm in cautious agreement that collision counting is a better
> strategy.

Disagreed. Raising randomly is unacceptable (false positives), especially in a bugfix release.

> The dict implementation performance would suffer from
> randomization.

Benchmarks please. http://hg.python.org/benchmarks/ for example.
msg150795 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-07 13:17
Paul McMillan wrote:
> 
>> I'll upload a patch that demonstrates the collisions counting
>> strategy to show that detecting the problem is easy. Whether
>> just raising an exception is a good idea, is another issue.
> 
> I'm in cautious agreement that collision counting is a better
> strategy. The dict implementation performance would suffer from
> randomization.
> 
>> The dict implementation could then alter the hash parameter
>> and recreate the dict table in case the number of collisions
>> exceeds a certain limit, thereby actively taking action
>> instead of just relying on randomness solving the issue in
>> most cases.
> 
> This is clever. You basically neuter the attack as you notice it but
> everything else is business as usual. I'm concerned that this may end
> up being costly in some edge cases (e.g. look up how many collisions
> it takes to force the recreation, and then aim for just that many
> collisions many times). Unfortunately, each dict object has to
> discover for itself that it's full of offending hashes. Another
> approach would be to neuter the offending object by changing its hash,
> but this would require either returning multiple values, or fixing up
> existing dictionaries, neither of which seems feasible.

I ran some experiments with the collision counting patch and
could not trigger it in normal applications, not even in cases
that are documented in the dict implementation to have a poor
collision resolution behavior (integers with zeros in the low bits).
The probability of having to deal with dictionaries that create
over a thousand collisions for one of the key objects in a
real life application appears to be very very low.

Still, it may cause problems with existing applications for the
Python dot releases, so it's probably safer to add it in a
disabled-per-default form there (using an environment variable
to adjust the setting). For 3.3 it could be enabled per default
and it would also make sense to allow customizing the limit
using a sys module setting.

The idea with adding a parameter to the hash method/slot in order
to have objects provide a hash family function instead of a fixed
unparametrized hash function would probably have to be implemented
as additional hash method, e.g. .__uhash__() and tp_uhash ("u"
for universal).

The builtin types should then grow such methods
in order to make hashing safe against such attacks. For objects
defined in 3rd party extensions, we would need to encourage
implementing the slot/method as well. If it's not implemented,
the dict implementation would have to fallback to raising an
exception.

Please note that I'm just sketching things here. I don't have
time to work on a full-blown patch, just wanted to show what
I meant with the collision counting idea and demonstrate that
it actually works as intended.
msg150829 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-01-07 23:24
[Marc-Andre]
> BTW: I wonder how long it's going to take before
> someone figures out that our merge sort based
> list.sort() is vulnerable as well... its worst-
> case performance is O(n log n), making attacks
> somewhat harder.

I wouldn't worry about that, because nobody could stir up anguish
about it by writing a paper ;-)

1. O(n log n) is enormously more forgiving than O(n**2).

2. An attacker need not be clever at all:  O(n log n) is not only
sort()'s worst case, it's also its _expected_ case when fed randomly
ordered data.

3. It's provable that no comparison-based sorting algorithm can have
better worst-case asymptotic behavior when fed randomly ordered data.

So if anyone whines about this, tell 'em to go do something useful instead :-)
msg150832 - (view) Author: Martin (gz) Date: 2012-01-07 23:53
I built random-2.patch on my windows xp box (updating the project and fixing some compile errors in random.c was required), and initialising crypto has a noticeable impact on startup time. The numbers vary a fair bit naturally; two representative runs are as follows:

changeset 52796:1ea8b7233fd7 on default branch:

    >timeit %PY3K% -c "import sys;print(sys.version)"
    3.3.0a0 (default, Jan  7 2012, 00:12:45) [MSC v.1500 32 bit (Intel)]

    Version Number:   Windows NT 5.1 (Build 2600)
    Exit Time:        0:16 am, Saturday, January 7 2012
    Elapsed Time:     0:00:00.218
    Process Time:     0:00:00.187
    System Calls:     4193
    Context Switches: 445
    Page Faults:      1886
    Bytes Read:       642542
    Bytes Written:    272
    Bytes Other:      31896

with random-2.patch and fixes applied:

    >timeit %PY3K% -c "import sys;print(sys.version)"
    3.3.0a0 (default, Jan  7 2012, 00:58:32) [MSC v.1500 32 bit (Intel)]

    Version Number:   Windows NT 5.1 (Build 2600)
    Exit Time:        0:59 am, Saturday, January 7 2012
    Elapsed Time:     0:00:00.296
    Process Time:     0:00:00.234
    System Calls:     4712
    Context Switches: 642
    Page Faults:      2049
    Bytes Read:       1059381
    Bytes Written:    272
    Bytes Other:      34544

This is with hot caches, cold will likely be worse, but a smaller percentage change. On a faster box, or with an SSD, or win 7, the delta will likely be smaller too.

A 50-100ms slow down is consistent with the difference on Python 2.7 between calling `os.urandom(1)` or not. However, the baseline is faster with Python 2, frequently dipping under 100ms, so there this change could double the runtime of trivial scripts.
msg150835 - (view) Author: Glenn Linderman (v+python) Date: 2012-01-08 00:19
Given Martin's comment (msg150832) I guess I should add my suggestion to this issue, at least for the record.

Rather than change hash functions, randomization could be added to those dicts that are subject to attack because they store user-supplied key values.  The list so far seems to be urllib.parse, cgi, and json.  Some have claimed there are many more, but without enumeration.  These three are clearly related to the documented issue.

The technique would be to wrap dict and add a short random prefix to each key value, preventing the attacker from supplying keys that are known to collide... and even if he successfully stumbles on a set that does collide on one request, it is unlikely to collide on a subsequent request with a different prefix string.

The technique is fully backward compatible with all applications except those that contain potential vulnerabilities as described by the researchers. The technique adds no startup or runtime overhead to any application that doesn't contain the potential vulnerabilities.  Due to the per-request randomization, the complexity of creating a sequence of sets of keys that may collide is enormous, and requires that such a set of keys happen to arrive on a request in the right sequence where the predicted prefix randomization would be used to cause the collisions to occur.  This might be possible on a lightly loaded system, but is less likely on a system with heavy load, which are more interesting to attack.

Serhiy Storchaka provided a sample implementation on python-dev, copied below and attached as a file (but it is not a patch).

# -*- coding: utf-8 -*-
from collections import MutableMapping
import random


class SafeDict(dict, MutableMapping):

    def __init__(self, *args, **kwds):
        dict.__init__(self)
        self._prefix = str(random.getrandbits(64))
        self.update(*args, **kwds)

    def clear(self):
        dict.clear(self)
        self._prefix = str(random.getrandbits(64))

    def _safe_key(self, key):
        return self._prefix + repr(key), key

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, self._safe_key(key))
        except KeyError as e:
            e.args = (key,)
            raise e

    def __setitem__(self, key, value):
        dict.__setitem__(self, self._safe_key(key), value)

    def __delitem__(self, key):
        try:
            dict.__delitem__(self, self._safe_key(key))
        except KeyError as e:
            e.args = (key,)
            raise e

    def __iter__(self):
        for skey, key in dict.__iter__(self):
            yield key

    def __contains__(self, key):
        return dict.__contains__(self, self._safe_key(key))

    setdefault = MutableMapping.setdefault
    update = MutableMapping.update
    pop = MutableMapping.pop
    popitem = MutableMapping.popitem
    keys = MutableMapping.keys
    values = MutableMapping.values
    items = MutableMapping.items

    def __repr__(self):
        return '{%s}' % ', '.join('%s: %s' % (repr(k), repr(v))
            for k, v in self.items())

    def copy(self):
        return self.__class__(self)

    @classmethod
    def fromkeys(cls, iterable, value=None):
        d = cls()
        for key in iterable:
            d[key] = value
        return d

    def __eq__(self, other):
        return all(k in other and other[k] == v for k, v in self.items()) and \
            all(k in self and self[k] == v for k, v in other.items())

    def __ne__(self, other):
        return not self == other
msg150836 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-08 00:21
You're seriously underestimating the number of vulnerable dicts.  It has nothing to do with the module, and everything to do with the origin of the data.  There's tons of user code that's vulnerable too.
msg150840 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-08 02:40
> Alex, I agree the issue has to do with the origin of the data, but the modules listed are the ones that deal with the data supplied by this particular attack.

They deal directly with the data. Do any of them pass the data
further, or does the data stop with them? A short and very incomplete
list of vulnerable standard lib modules includes: every single parsing
library (json, xml, html, plus all the third party libraries that do
that), all of numpy (because it processes data which probably came
from a user [yes, integers can trigger the vulnerability]), difflib,
the math module, most database adaptors, anything that parses metadata
(including commonly used third party libs like PIL), the tarfile lib
along with other compressed format handlers, the csv module,
robotparser, plistlib, argparse, pretty much everything under the
heading of "18. Internet Data Handling" (email, mailbox, mimetypes,
etc.), "19. Structured Markup Processing Tools", "20. Internet
Protocols and Support", "21. Multimedia Services", "22.
Internationalization", TKinter, and all the os calls that handle
filenames. The list is impossibly large, even if we completely ignore
user code. This MUST be fixed at a language level.

I challenge you to find me 15 standard lib components that are certain
to never handle user-controlled input.

> Note that changing the hash algorithm for a persistent process, even though each process may have a different seed or randomized source, allows attacks for the life of that process, if an attack vector can be created during its lifetime. This is not a problem for systems where each request is handled by a different process, but is a problem for systems where processes are long-running and handle many requests.

This point has been made many times now. I urge you to read the entire
thread on the mailing list. Your implementation is impractical because
your "safe" implementation completely ignores all hash caching (each
entry must be re-hashed for that dict). Your implementation is still
vulnerable in exactly the way you mentioned if you ever have any kind
of long-lived dict in your program thread.

> You have entered the class of people that claim lots of vulnerabilities, without enumeration.

I have enumerated. Stop making this argument.
msg150847 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-08 05:36
Glenn, you have reached a point where you stop bike-shedding and start to troll by attacking people. Please calm down. I'm sure that you are just worried about the future of Python and all the bad things that might be introduced by a fix for the issue.

Please trust us! Paul, Victor, Antoine and several more involved developers are professional Python devs and have been for years. Most of them do Python development for a living. We won't kill the snake that pays our bills. ;) Ultimately it's Guido's choice, too. 

Martin:
Ouch, the startup impact is large! Have we reached a point where "one size fits all" doesn't work any longer? It's getting harder to have just one executable for 500ms scripts and server processes that last for weeks.

Marc-Andre:
Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.
msg150856 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-08 10:20
> Christian Heimes added the comment:
> Ouch, the startup impact is large! Have we reached a point where "one size fits all" doesn't work any longer? It's getting harder to have just one executable for 500ms scripts and server processes that last for weeks.

This concerns me too, and is one reason I think the collision counting
code might be the winning solution. Randomness is hard to do correctly
and is expensive. If we can avoid it, we should try very hard to do
so...

> Christian Heimes said to Marc-Andre:
> Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.

Interesting point, though I think we might be able to work it out so
that we're only adding instructions when there's actually a detected
collision. I'll be interested to see what the benchmarks (and real
world) have to say about the impacts of randomization as compared to
the existing black-magic optimization of the hash function.
msg150857 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-08 11:33
Tim Peters wrote:
> 
> Tim Peters <tim.peters@gmail.com> added the comment:
> 
> [Marc-Andre]
>> BTW: I wonder how long it's going to take before
>> someone figures out that our merge sort based
>> list.sort() is vulnerable as well... its worst-
>> case performance is O(n log n), making attacks
>> somewhat harder.
> 
> I wouldn't worry about that, because nobody could stir up anguish
> about it by writing a paper ;-)
> 
> 1. O(n log n) is enormously more forgiving than O(n**2).
> 
> 2. An attacker need not be clever at all:  O(n log n) is not only
> sort()'s worst case, it's also its _expected_ case when fed randomly
> ordered data.
> 
> 3. It's provable that no comparison-based sorting algorithm can have
> better worst-case asymptotic behavior when fed randomly ordered data.
> 
> So if anyone whines about this, tell 'em to go do something useful instead :-)

Right on all accounts :-)
msg150859 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-08 11:47
Christian Heimes wrote:
> Marc-Andre:
> Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.

I haven't done any profiling on this yet, but will run some
tests.

The lookup functions in the dict implementation are optimized
to make the first non-collision case fast. The patch doesn't touch this
loop. The only change is in the collision case, where an increment
and comparison is added (and then only after the comparison which
is the real cost factor in the loop). I did add a printf() to
see how often this case occurs - it's a surprisingly rare case,
which suggests that Tim, Christian and all the others that have
invested considerable time into the implementation have done
a really good job here.

BTW: I noticed that a rather obvious optimization appears to be
missing from the Python dict initialization code: when passing in
a list of (key, value) pairs, the implementation doesn't make
use of the available length information and still starts with an
empty (small) dict table and then iterates over the pairs, increasing
the table size as necessary. It would be better to start with a
table that is presized to O(len(data)). The dict implementation
already provides such a function, but it's not being used
in the dict(pair_list) case. Anyway, just an aside.
msg150865 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-08 14:23
> Randomness is hard to do correctly
> and is expensive. If we can avoid it, we should try very hard to do
> so...

os.urandom() is actually cheaper on Windows 7 here:

1000000 loops, best of 3: 1.78 usec per loop

than on Linux:

$ ./python -m timeit -s "import os" "os.urandom(16)"
100000 loops, best of 3: 4.85 usec per loop
$ ./python -m timeit -s "import os; f=os.open('/dev/urandom', os.O_RDONLY)" "os.read(f, 16)"
100000 loops, best of 3: 2.35 usec per loop

(note that the os.read timing is optimistic since I'm not checking the
return value!)

I don't know if the patch's startup overhead has to do with initializing
the crypto context or simply with looking up the symbols in advapi32.dll.
Perhaps we should link explicitly against advapi32.dll as suggested by
Martin?
msg150866 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-08 14:26
Again, Roundup ate up some of the text:

>PCbuild\amd64\python.exe  -m timeit -s "import os" "os.urandom(16)"
1000000 loops, best of 3: 1.81 usec per loop

(for the record, the Roundup issue is at http://psf.upfronthosting.co.za/roundup/meta/issue264 )
msg150934 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-09 12:16
Marc-Andre Lemburg wrote:
> 
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> Christian Heimes wrote:
>> Marc-Andre:
>> Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.
> 
> I haven't done any profiling on this yet, but will run some
> tests.

I ran pybench and pystone: neither shows a significant change.

I wish we had a simple to run benchmark based on Django to allow
checking such changes against real world applications. Not that I
expect different results from such a benchmark...

To check the real world impact, I guess it would be best to
run a few websites with the patch for a week and see whether the
collision exception gets raised.
msg151012 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-10 11:37
Version 3 of my patch:
 - Add PYTHONHASHSEED environment variable to get a fixed seed or to
disable the randomized hash function (PYTHONHASHSEED=0)
 - Add tests on the randomized hash function
 - Add more tests on os.urandom()
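
[Editor's note: the seed control described above behaves like this. A sketch assuming a Python build carrying this patch, or any later release that shipped PYTHONHASHSEED, is installed as python3:]

```shell
# Two runs with the same fixed seed agree on hash('abc') ...
a=$(PYTHONHASHSEED=42 python3 -c "print(hash('abc'))")
b=$(PYTHONHASHSEED=42 python3 -c "print(hash('abc'))")
test "$a" = "$b" && echo "fixed seed: reproducible"

# ... while the default per-process random seed gives different values
c=$(python3 -c "print(hash('abc'))")
d=$(python3 -c "print(hash('abc'))")
test "$c" != "$d" && echo "random seed: differs per process"

# PYTHONHASHSEED=0 disables randomization entirely.
```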
msg151017 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-10 14:26
> Version 3 of my patch:
>  - Add PYTHONHASHSEED environment variable to get a fixed seed or to
> disable the randomized hash function (PYTHONHASHSEED=0)
>  - Add tests on the randomized hash function
>  - Add more tests on os.urandom()

You forgot random.c.

+        PyErr_SetString(PyExc_RuntimeError, "Fail to generate random
bytes");

I would use an OSError and preserve the errno.

+    def test_null_hash(self):
+        # PYTHONHASHSEED=0 disables the randomized hash
+        self.assertEqual(self.get_hash("abc", 0), -1600925533)
+
+    def test_fixed_hash(self):
+        # test a fixed seed for the randomized hash
+        self.assertEqual(self.get_hash("abc", 42), -206076799)

This is portable on both 32-bit and 64-bit builds?
msg151031 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-10 22:15
Patch version 4:
 - os.urandom() raises again exceptions on failure
 - drop support of VMS (which used RAND_pseudo_bytes from OpenSSL): I don't see how to link Python/random.c to libcrypto on VMS, I don't have VMS, and I don't see how it was working, because posixmodule.c was not linked to libcrypto either!?
 - fix test_dict, test_gdb, test_builtin
 - win32_urandom() handles size bigger than INT_MAX using a loop (it may be DWORD max instead?)
 - _PyRandom_Init() does nothing if it is called twice, to fix a _testembed failure (don't change the Unicode secret, because Python stores some strings somewhere and never destroys them)
msg151033 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-10 23:07
Patch version 5 fixes test_unicode for 64-bit system.
msg151047 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 09:28
STINNER Victor wrote:
> 
> Patch version 5 fixes test_unicode for 64-bit system.

Victor, I don't think the randomization idea is going anywhere. The
code has many issues:

 * it is exceedingly complex
 * the method would need to be implemented for all hashable
   Python types
 * it causes startup time to increase (you need urandom data for
   every single hashable Python data type)
 * it causes run-time to increase due to changes in the hash
   algorithm (more operations in the tight loop)
 * causes different processes in a multi-process setup to use different
   hashes for the same object
 * doesn't appear to work well with embedded interpreters that
   are regularly restarted (AFAIK, some objects persist across
   restarts and those will have wrong hash values in the newly started
   instances)

The most important issue, though, is that it doesn't really
protect Python against the attack - it only makes it less
likely that an adversary will find the init vector (or a way
around having to find it via cryptanalysis).

OTOH, the collision counting patch is very simple, doesn't have
the performance issues and provides real protection against the
attack. Even better still, it can detect programming errors in
hash method implementations.

IMO, it would be better to put efforts into refining the collision
detection patch (perhaps adding support for the universal hash
method slot I mentioned) and run some real life tests with it.
msg151048 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-11 09:56
>  * it is exceedingly complex

Which part exactly? For hash(str), it just adds two extra XORs.
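
[Editor's note: the two XORs in question can be modelled in pure Python. This is a sketch of the patch's scheme, not the C code: CPython's classic string hash with a random prefix folded into the initial state and a random suffix XORed into the result; the 64-bit mask is illustrative.]

```python
def randomized_hash(data: bytes, prefix: int, suffix: int) -> int:
    # Classic CPython string hash (x = x * 1000003 ^ c), plus the two
    # extra XORs from the patch: `prefix` seeds the initial state and
    # `suffix` is folded into the final value.
    mask = (1 << 64) - 1          # pretend we are on a 64-bit build
    if not data:
        return 0                  # the hash of an empty string stays 0
    x = (prefix ^ (data[0] << 7)) & mask
    for c in data:
        x = ((x * 1000003) ^ c) & mask
    x ^= len(data)
    return (x ^ suffix) & mask
```

Because the per-character loop is unchanged, the per-byte cost is identical to the unseeded hash; only two XORs are added outside the loop.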

>  * the method would need to be implemented for all hashable Python types

It was already discussed, and it was said that only hash(str) need to
be modified.

>  * it causes startup time to increase (you need urandom data for
>   every single hashable Python data type)

My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do
you have a benchmark showing a difference?

I didn't try my patch on Windows yet.

>  * it causes run-time to increase due to changes in the hash
>   algorithm (more operations in the tight loop)

I posted a micro-benchmark on hash(str) on python-dev: the overhead is
nil. Do you have numbers showing that the overhead is not nil?

>  * causes different processes in a multi-process setup to use different
>   hashes for the same object

Correct. If you need to get the same hash, you can disable the
randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g.
PYTHONHASHSEED=42).

>  * doesn't appear to work well with embedded interpreters that
>   are regularly restarted (AFAIK, some objects persist across
>   restarts and those will have wrong hash values in the newly started
>   instances)

test_capi runs _testembed, which restarts an embedded interpreter 3
times, and the test passes (with my patch version 5). Can you write a
script showing the problem if there is a real problem?

In an older version of my patch, the hash secret was recreated at each
initialization. I changed my patch to only generate the secret once.

> The most important issue, though, is that it doesn't really
> protect Python against the attack - it only makes it less
> likely that an adversary will find the init vector (or a way
> around having to find it via cryptanalysis).

I agree that the patch is not perfect. As written in the patch, it
just makes the attack more complex. I consider that it is enough.

Perl has a simpler protection than the one proposed in my patch. Is
Perl vulnerable to the hash collision vulnerability?
msg151061 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 14:34
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>>  * it is exceedingly complex
> 
> Which part exactly? For hash(str), it just adds two extra XORs.

I'm not talking specifically about your patch, but the whole idea
and the needed changes in general.

>>  * the method would need to be implemented for all hashable Python types
> 
> It was already discussed, and it was said that only hash(str) need to
> be modified.

Really ? What about the much simpler attack on integer hash values ?

You only have to send a specially crafted JSON dictionary with integer
keys to a Python web server providing JSON interfaces in order to
trigger the integer hash attack.

The same goes for the other Python data types.
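
[Editor's note: the integer case is easy to demonstrate. CPython reduces an int's hash modulo a fixed prime, exposed in later versions as sys.hash_info.modulus (2**61 - 1 on typical 64-bit builds), so colliding integer keys can be generated in bulk. A sketch:]

```python
import sys

# CPython hashes an int by reducing it modulo a fixed prime, so any two
# ints that differ by a multiple of that prime share a hash value.
M = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit builds

colliding_keys = [1 + i * M for i in range(1000)]
assert len({hash(k) for k in colliding_keys}) == 1

# Inserting them all forces the worst-case quadratic probing the attack
# exploits -- harmless at this size, painful at attack scale.
d = {k: None for k in colliding_keys}
assert len(d) == 1000
```

(As Mark Shannon notes below, the JSON vector specifically does not apply, since JSON object keys decode as strings.)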

>>  * it causes startup time to increase (you need urandom data for
>>   every single hashable Python data type)
> 
> My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do
> you have a benchmark showing a difference?
> 
> I didn't try my patch on Windows yet.

Your patch only implements the simple idea of adding an init
vector and a fixed suffix vector (which you don't need since
it doesn't prevent hash collisions).

I don't think that's good enough, since
it doesn't change how the hash algorithm works on the actual
data, but instead just shifts the algorithm to a different
sequence. If you apply the same logic to the integer hash
function, you'll see that more clearly.

Paul's algorithm is much more secure in this respect, but it
requires more random startup data.

>>  * it causes run-time to increase due to changes in the hash
>>   algorithm (more operations in the tight loop)
> 
> I posted a micro-benchmark on hash(str) on python-dev: the overhead is
> nil. Do you have numbers showing that the overhead is not nil?

For the simple solution, that's an expected result, but if you want
more safety, then you'll see a hit due to the random data getting
XOR'ed in every single loop.

>>  * causes different processes in a multi-process setup to use different
>>   hashes for the same object
> 
> Correct. If you need to get the same hash, you can disable the
> randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g.
> PYTHONHASHSEED=42).

So you have the choice of being able to work in a multi-process
environment and be vulnerable to the attack or not. I think we
can do better :-)

Note that web servers written in Python tend to be long running
processes, so an attacker has lots of time to test various
seeds.

>>  * doesn't appear to work well with embedded interpreters that
>>   are regularly restarted (AFAIK, some objects persist across
>>   restarts and those will have wrong hash values in the newly started
>>   instances)
> 
> test_capi runs _testembed, which restarts an embedded interpreter 3
> times, and the test passes (with my patch version 5). Can you write a
> script showing the problem if there is a real problem?
> 
> In an older version of my patch, the hash secret was recreated at each
> initialization. I changed my patch to only generate the secret once.

Ok, that should fix the case.

Two more issue that I forgot:

 * enabling randomized hashing can make debugging a lot harder, since
   it's rather difficult to reproduce the same state in a controlled
   way (unless you record the hash seed somewhere in the logs)

and even though applications should not rely on the order of dict
repr()s or str()s, they do often enough:

 * randomized hashing will result in repr() and str() of dictionaries
   to be random as well

>> The most important issue, though, is that it doesn't really
>> protect Python against the attack - it only makes it less
>> likely that an adversary will find the init vector (or a way
>> around having to find it via cryptanalysis).
> 
> I agree that the patch is not perfect. As written in the patch, it
> just makes the attack more complex. I consider that it is enough.

Wouldn't you rather see a fix that works for all hash functions
and Python objects? One that doesn't cause performance
issues?

The collision counting idea has this potential.

> Perl has a simpler protection than the one proposed in my patch. Is
> Perl vulnerable to the hash collision vulnerability?

I don't know what Perl did or how hashing works in Perl, so I cannot
comment on the effect of their fix. FWIW, I don't think that we
should use Perl or Java as reference here.
msg151062 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 14:45
> OTOH, the collision counting patch is very simple, doesn't have
> the performance issues and provides real protection against the
> attack.

I don't know about real protection: you can still slow down dict
construction by 1000x (the number of allowed collisions per lookup),
which can be enough when combined with a brute-force DoS.

Also, how about false positives? Having legitimate programs break
because of legitimate data would be a disaster.
msg151063 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-11 14:55
>>>  * the method would need to be implemented for all hashable Python types
>> It was already discussed, and it was said that only hash(str) need to
>> be modified.
> 
> Really ? What about the much simpler attack on integer hash values ?
> 
> You only have to send a specially crafted JSON dictionary with integer
> keys to a Python web server providing JSON interfaces in order to
> trigger the integer hash attack.

JSON objects are decoded as dicts with string keys, integer keys are
not possible.

 >>> json.loads(json.dumps({1:2}))
{'1': 2}
msg151064 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 15:41
Mark Shannon wrote:
> 
> Mark Shannon <mark@hotpy.org> added the comment:
> 
>>>>  * the method would need to be implemented for all hashable Python types
>>> It was already discussed, and it was said that only hash(str) need to
>>> be modified.
>>
>> Really ? What about the much simpler attack on integer hash values ?
>>
>> You only have to send a specially crafted JSON dictionary with integer
>> keys to a Python web server providing JSON interfaces in order to
>> trigger the integer hash attack.
> 
> JSON objects are decoded as dicts with string keys, integer keys are
> not possible.
> 
>  >>> json.loads(json.dumps({1:2}))
> {'1': 2}

Thanks for the correction. Looks like XML-RPC also doesn't accept
integers as dict keys. That's good :-)

However, as Paul already noted, such attacks can also occur in other
places or parsers in an application, e.g. when decoding FORM parameters
that use integers to signal a line or parameter position (example:
value_1=2&value_2=3...) which are then converted into a dictionary
mapping the position integer to the data.

marshal and pickle are vulnerable, but then you normally don't expose
those to untrusted data.
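A hypothetical parser of the shape Marc-Andre describes (the `value_N` naming and the `positions` helper are invented for illustration, not taken from any real framework) shows how attacker-controlled integer keys can arise:

```python
from urllib.parse import parse_qsl

def positions(query):
    """Map 'value_1=2&value_2=3' to {1: '2', 2: '3'} -- integer keys,
    so the attacker directly controls the values fed to hash(int)."""
    result = {}
    for name, value in parse_qsl(query):
        prefix, _, index = name.rpartition("_")
        if prefix == "value" and index.isdigit():
            result[int(index)] = value
    return result

assert positions("value_1=2&value_2=3") == {1: "2", 2: "3"}
```

Any such application-level conversion reopens the integer-hash attack surface even though JSON and XML-RPC themselves only produce string keys.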
msg151065 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 16:03
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> OTOH, the collision counting patch is very simple, doesn't have
>> the performance issues and provides real protection against the
>> attack.
> 
> I don't know about real protection: you can still slow down dict
> construction by 1000x (the number of allowed collisions per lookup),
> which can be enough combined with a brute-force DOS.

On my slow dev machine 1000 collisions run in around 22ms:

python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
100 loops, best of 3: 22.4 msec per loop

Using this for a DOS attack would be rather noisy, much unlike
sending a single POST.

Note that the choice of 1000 as the limit is rather arbitrary. I just
chose it because it's high enough that it's very unlikely to be
hit by an application that is not written to trigger it, and low
enough to still provide good run-time behavior. Perhaps an
even lower figure would be better.

> Also, how about false positives? Having legitimate programs break
> because of legitimate data would be a disaster.

Yes, which is why the patch should be disabled by default (using
an env var) in dot-releases. It's probably also a good idea to
make the limit configurable to adjust to ones needs.

Still, it is *very* unlikely that you run into real data causing
more than 1000 collisions for a single insert.

For full protection the universal hash method idea would have
to be implemented (adding a parameter to the hash methods, so
that they can be parametrized). This would then allow switching
the dict to an alternative hash implementation resolving the collision
problem, in case the implementation detects high number of
collisions.
msg151069 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 17:28
> On my slow dev machine 1000 collisions run in around 22ms:
> 
> python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
> 100 loops, best of 3: 22.4 msec per loop
> 
> Using this for a DOS attack would be rather noisy, much unlike
> sending a single POST.

Note that sending one POST is not enough, unless the attacker is content
with blocking *one* worker process for a couple of seconds or minutes
(which is a rather tiny attack if you ask me :-)). Also, you can combine
many dicts in a single JSON list, so that the 1000 limit isn't
exceeded for any of the dicts.

So in all cases the attacker would have to send many of these POST
requests in order to overwhelm the target machine. That's how DoS
attacks work AFAIK.

> Yes, which is why the patch should be disabled by default (using
> an env var) in dot-releases. It's probably also a good idea to
> make the limit configurable to adjust to ones needs.

Agreed, if it's disabled by default then it's not a problem, but then
Python is vulnerable by default...
msg151070 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-01-11 17:34
[Antoine]
> Also, how about false positives? Having legitimate programs break
> because of legitimate data would be a disaster.

This worries me, too.

[MAL]
> Yes, which is why the patch should be disabled by default (using
> an env var) in dot-releases.

Are you proposing having it enabled by default in Python 3.3?
msg151071 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 17:38
Mark Dickinson wrote:
> 
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> [Antoine]
>> Also, how about false positives? Having legitimate programs break
>> because of legitimate data would be a disaster.
> 
> This worries me, too.
> 
> [MAL]
>> Yes, which is why the patch should be disabled by default (using
>> an env var) in dot-releases.
> 
> Are you proposing having it enabled by default in Python 3.3?

Possibly, yes. Depends on whether anyone comes up with a problem in
the alpha, beta, RC release cycle.

It would be great to have the universal hash method approach for
Python 3.3. That way Python could heal itself in case it
finds too many collisions. My guess is that it's still better
to raise an exception, though, since it would uncover either
attacks or programming errors.
msg151073 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 18:05
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> On my slow dev machine 1000 collisions run in around 22ms:
>>
>> python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
>> 100 loops, best of 3: 22.4 msec per loop
>>
>> Using this for a DOS attack would be rather noisy, much unlike
>> sending a single POST.
> 
> Note that sending one POST is not enough, unless the attacker is content
> with blocking *one* worker process for a couple of seconds or minutes
> (which is a rather tiny attack if you ask me :-)). Also, you can combine
> many dicts in a single JSON list, so that the 1000 limit isn't
> overreached for any of the dicts.

Right, but such an approach only scales linearly and doesn't
exhibit the quadratic nature of the collision resolution.

The above with 10000 items takes 5 seconds on my machine.
The same with 100000 items is still running after 16 minutes.

> So in all cases the attacker would have to send many of these POST
> requests in order to overwhelm the target machine. That's how DOS
> attacks work AFAIK.

Depends :-) Hiding a few tens of such requests in the input stream
of a busy server is easy. Doing the same with thousands of requests
is a lot harder.

FWIW: The dict string for the above 100000-item case is just some
263kB, 114kB if gzip compressed.

>> Yes, which is why the patch should be disabled by default (using
>> an env var) in dot-releases. It's probably also a good idea to
>> make the limit configurable to adjust to ones needs.
> 
> Agreed if it's disabled by default then it's not a problem, but then
> Python is vulnerable by default...

Yes, but at least the user has an option to switch on the added
protection. We'd need some field data to come to a decision.
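The quadratic behaviour measured above can be reproduced without crafted string payloads by using a toy key type that forces every insert into the same bucket (a sketch of the effect only, not of the actual attack):

```python
import time

class Collider:
    """Toy key: every instance hashes to the same bucket, forcing the
    dict into its worst-case probe sequence."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return isinstance(other, Collider) and self.n == other.n

def build_time(n):
    # Time the construction of a dict with n all-colliding keys.
    start = time.perf_counter()
    {Collider(i): None for i in range(n)}
    return time.perf_counter() - start
```

Doubling the key count roughly quadruples the build time, which is why a payload of a few hundred kilobytes can keep a worker busy for minutes.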
msg151074 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 18:18
> [MAL]
> > Yes, which is why the patch should be disabled by default (using
> > an env var) in dot-releases.
> 
> Are you proposing having it enabled by default in Python 3.3?

I would personally prefer 3.3 and even 3.2 to have proper randomization
(either Paul's or Victor's or another proposal). Victor's proposal makes
fixing other hash functions very simple (there could even be helper
macros). The only serious concern IMO is startup time under Windows;
someone with Windows-fu should investigate that.

2.x maintainers might want to be more conservative, although disabling a
fix (the collision counter) by default doesn't sound very wise or
helpful to me.
(for completeness, the collision counter must also be added to sets,
btw)

It would be nice to hear from distro maintainers here.
msg151078 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 19:07
I've benchmarked Victor's patch and got the following results:

Report on Linux localhost.localdomain 2.6.38.8-desktop-9.mga #1 SMP Tue Dec 20 09:45:44 UTC 2011 x86_64 x86_64
Total CPU cores: 4

### call_simple ###
Min: 0.223778 -> 0.209204: 1.07x faster
Avg: 0.227634 -> 0.212437: 1.07x faster
Significant (t=15.40)
Stddev: 0.00291 -> 0.00248: 1.1768x smaller
Timeline: http://tinyurl.com/87vkdps

### fastpickle ###
Min: 0.484052 -> 0.499832: 1.03x slower
Avg: 0.487370 -> 0.507909: 1.04x slower
Significant (t=-8.40)
Stddev: 0.00261 -> 0.00481: 1.8446x larger
Timeline: http://tinyurl.com/7ntcudz

### float ###
Min: 0.052819 -> 0.051540: 1.02x faster
Avg: 0.054304 -> 0.052922: 1.03x faster
Significant (t=3.89)
Stddev: 0.00125 -> 0.00126: 1.0101x larger
Timeline: http://tinyurl.com/7rqfurw

### formatted_logging ###
Min: 0.252709 -> 0.257303: 1.02x slower
Avg: 0.254741 -> 0.259967: 1.02x slower
Significant (t=-4.90)
Stddev: 0.00155 -> 0.00181: 1.1733x larger
Timeline: http://tinyurl.com/8xu2zdt

### normal_startup ###
Min: 0.450661 -> 0.435943: 1.03x faster
Avg: 0.454536 -> 0.438212: 1.04x faster
Significant (t=9.41)
Stddev: 0.00327 -> 0.00209: 1.5661x smaller
Timeline: http://tinyurl.com/8ygw272

### nqueens ###
Min: 0.269426 -> 0.255306: 1.06x faster
Avg: 0.270105 -> 0.255844: 1.06x faster
Significant (t=28.63)
Stddev: 0.00071 -> 0.00086: 1.2219x larger
Timeline: http://tinyurl.com/823dwzo

### regex_compile ###
Min: 0.390307 -> 0.380736: 1.03x faster
Avg: 0.391959 -> 0.382025: 1.03x faster
Significant (t=8.93)
Stddev: 0.00194 -> 0.00156: 1.2395x smaller
Timeline: http://tinyurl.com/72shbzh

### silent_logging ###
Min: 0.060115 -> 0.057777: 1.04x faster
Avg: 0.060241 -> 0.058019: 1.04x faster
Significant (t=13.29)
Stddev: 0.00010 -> 0.00036: 3.4695x larger
Timeline: http://tinyurl.com/76bfguf

### unpack_sequence ###
Min: 0.000043 -> 0.000046: 1.07x slower
Avg: 0.000044 -> 0.000047: 1.06x slower
Significant (t=-107.47)
Stddev: 0.00000 -> 0.00000: 1.1231x larger
Timeline: http://tinyurl.com/6us6yys

The following not significant results are hidden, use -v to show them:
call_method, call_method_slots, call_method_unknown, fastunpickle, iterative_count, json_dump, json_load, nbody, pidigits, regex_effbot, regex_v8, richards, simple_logging, startup_nosite, threaded_count.


In short, any difference is in the noise.
msg151092 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-11 21:46
I must be missing something, but how is raising an exception when a collision threshold is reached a good thing?
Basically, we're just exchanging one DoS for another (just feed the server process ad-hoc data and it'll commit suicide). Sure, the caller can catch the exception to detect this, but what for? Restart the process, so that the attacker can just try again?
Also, there's the potential of perfectly legit applications breaking.
IMHO, randomization is the way to go, so that an attacker cannot generate a set of colliding values beforehand, which renders the attack impractical. The same idea is behind ASLR in modern kernels, and AFAICT it has been chosen by other implementations.
If such a patch has a negligible performance impact, then it should definitely be enabled by default. People who want deterministic hashing (maybe to bypass an application bug, or just because they want determinism) can disable it if they really want to.
msg151120 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-12 08:53
I'd like to add a few notes:

1. both 32-bit and 64-bit python are vulnerable
2. collision-counting will break other things
3. imho, randomization is the way to go, enabled by default.
4. do we need a steady hash-function later again?

I created ~500KB of colliding strings for both 32-bit and 64-bit python.
It works impressively well:

32bit: ~500KB payload keeps Django busy for >30 minutes.
64bit: ~500KB payload keeps Django busy for 5 minutes.

Django is more vulnerable than python-dict alone, because it
* converts the strings to unicode first, making the comparison more expensive
* does 5 dict-lookups per key.

So Python's dict of str alone is probably ~10x faster. Of course it's much harder to create the payload for 64-bit python than for 32-bit, but it works for both.

The collision-counting idea makes some sense in the web environment, but for other software types it can cause serious problems.

I don't want my software to stop working because someone managed to enter 1000 bad strings into it. Think of software that handles customer names or filenames. We don't want it to break completely just because someone entered a few clever names.

Randomization fixes most of these problems.

However, it breaks the steadiness of hash(X) between two runs of the same software. There's probably code out there that assumes that hash(X) always returns the same value: database- or serialization-modules, for example.

There might be good reasons to also have a steady hash-function available. The broken code is hard to fix if no such function is available at all. Maybe it's possible to add a second steady hash-function later again?

For the moment I think the best way is to turn on randomization of hash() by default, but having a way to turn it off.
msg151121 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-12 09:27
Frank Sievertsen wrote:
> 
> I don't want my software to stop working because someone managed to enter 1000 bad strings into it. Think of a software that handles names of customers or filenames. We don't want it to break completely just because someone entered a few clever names.

Collision counting is just a simple way to trigger an action. As I mentioned
in my proposal on this ticket, raising an exception is just one way to deal
with the problem in case excessive collisions are found. A better way is to
add a universal hash method, so that the dict can adapt to the data and
modify the hash functions for just that dict (without breaking other
dicts or changing the standard hash functions).
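A toy model of that universal-hash idea might look as follows (purely an illustrative sketch, not CPython's dict implementation; the class names and the per-hash-value counting strategy are invented):

```python
import random

class _Salted:
    """Wrapper mixing a per-dict salt into a key's hash."""
    __slots__ = ("salt", "key")
    def __init__(self, salt, key):
        self.salt, self.key = salt, key
    def __hash__(self):
        return hash((self.salt, self.key))
    def __eq__(self, other):
        return isinstance(other, _Salted) and self.key == other.key

class AdaptiveDict:
    """Count inserts that share a hash value and, past LIMIT, re-key
    every entry with a randomly salted hash instead of raising."""
    LIMIT = 1000

    def __init__(self):
        self._d = {}
        self._salt = None        # None => still using the plain hash()
        self._per_hash = {}

    def _wrap(self, key):
        return key if self._salt is None else _Salted(self._salt, key)

    def __setitem__(self, key, value):
        if self._salt is None:
            h = hash(key)
            self._per_hash[h] = self._per_hash.get(h, 0) + 1
            if self._per_hash[h] > self.LIMIT:
                self._rekey()
        self._d[self._wrap(key)] = value

    def __getitem__(self, key):
        return self._d[self._wrap(key)]

    def _rekey(self):
        # "Self-heal": switch this dict (and only this dict) to a
        # salted hash, leaving other dicts and hash() itself untouched.
        self._salt = random.getrandbits(64)
        self._d = {_Salted(self._salt, k): v for k, v in self._d.items()}
        self._per_hash.clear()
```

Only the dict that sees the pathological data pays for the re-keying; everything else keeps the standard, deterministic hash functions.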

Note that raising an exception doesn't completely break your software.
It just signals a severe problem with the input data and a likely
attack on your software. As such, it's no different than turning on DOS
attack prevention in your router.

In case you do get an exception, a web server will simply return a 500 error
and continue working normally.

For other applications, you may see a failure notice in your logs. If
you're sure that there are no possible ways to attack the application using
such data, then you can simply disable the feature to prevent such
exceptions.

> Randomization fixes most of these problems.

See my list of issues with this approach (further up on this ticket).

> However, it breaks the steadiness of hash(X) between two runs of the same software. There's probably code out there that assumes that hash(X) always returns the same value: database- or serialization-modules, for example.
> 
> There might be good reasons to also have a steady hash-function available. The broken code is hard to fix if no such a function is available at all. Maybe it's possible to add a second steady hash-functions later again?

This is one of the issues I mentioned.

> For the moment I think the best way is to turn on randomization of hash() by default, but having a way to turn it off.
msg151122 - (view) Author: Graham Dumpleton (grahamd) Date: 2012-01-12 10:02
Right back at the start it was said:

"""
We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.
"""

with a environment variable PYTHONHASHSEED still being mentioned towards the end.

Be aware that a user being able to set an environment variable which is used on Python interpreter initialisation when using mod_python or mod_wsgi is not as trivial as made out in the leading comment.

Setting an environment variable would have to be done in the Apache init.d scripts or, if the Apache distro still follows Apache Software Foundation conventions, in the 'envvars' file.

Having to do this requires root access and is inconvenient, especially since where it needs to be done differs between every distro.

Where there are other environment variables that are useful to set for interpreter initialisation, mod_wsgi has been changed in the past to add specific directives for the Apache configuration file to set them prior to interpreter initialisation. This at least makes it somewhat easier, but still only of help where you are the admin of the server.

If that approach is necessary, then although mod_wsgi could eventually add such a directive, as mod_python is dead it will never happen for it.

As to another question posed about whether mod_wsgi itself is doing anything to combat this, the answer is no, as I don't believe there is anything it can do. Values like the query string or POST data are simply passed through as-is and always pulled apart by the application.
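For concreteness, the 'envvars' route would look something like this (a hypothetical snippet; the file's location varies by distro, and the accepted values of PYTHONHASHSEED depend on which patch finally lands):

```shell
# Hypothetical addition to Apache's envvars file (e.g. /etc/apache2/envvars):
# every worker that apachectl starts inherits this before any Python
# interpreter is initialised, covering mod_wsgi daemon processes too.
export PYTHONHASHSEED=random   # or a fixed integer for multi-process setups
```

As noted, this requires root access and differs between distros, which is why a dedicated mod_wsgi directive would be more convenient.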
msg151157 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-13 00:08
Patch version 6:

 - remove debug code in dev_urandom() (it always raised an exception, for testing)
 - dev_urandom() raises an exception if open() fails
 - os.urandom() uses again the right exception type and message (instead of a generic exception)
 - os.urandom() is no longer linked to PYTHONHASHSEED
 - replace uint32_t by unsigned int in lcg_urandom() because Visual Studio 8 doesn't provide this type. "unsigned __int32" is available but I prefer to use a more common type. 32 and 64-bit types are supposed to generate the same number sequence (I didn't test).
 - fix more tests
 - regrtest.py restarts the process with PYTHONHASHSEED=randomseed if -r --randomseed=SEED is used
 - fix compilation on Windows (add random.c to the Visual Studio project file)
msg151158 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-13 00:36
I wrote bench_startup.py to measure the startup time on Windows. The precision of the script is quite bad because the Windows timer has a poor resolution (e.g. 15.6 ms on Windows 7) :-/

In release mode, the best startup time is 45.2 ms without random, 50.9 ms with random: an overhead of 5.6 ms (12%).

My script uses PYTHONHASHSEED=0 to disable the initialization of CryptoGen. You may modify the script to compare an unpatched Python with a patched Python for better numbers.

An overhead of 12% is significant. random-6.patch contains a faster (but also weaker) RNG on Windows, disabled at compilation time. Search "#if 1" at the end of random.c. It uses my linear congruential generator (LCG) initialized with gettimeofday() and getpid() (which are known to be weak) instead of CryptoGen. Using the LCG, the startup overhead is between 1 and 2%.
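The weak fallback described here can be sketched as follows (using the classic MSVC rand() LCG constants as an illustration; they are not necessarily the exact constants in random-6.patch):

```python
import os
import time

def lcg_bytes(n):
    """Generate n pseudo-random bytes from a 32-bit linear congruential
    generator seeded with the clock and the pid. Fast to start up, but
    cryptographically weak: both seed inputs are guessable."""
    x = (int(time.time() * 1e6) ^ (os.getpid() << 16)) & 0xffffffff
    out = bytearray()
    while len(out) < n:
        x = (x * 214013 + 2531011) & 0xffffffff   # MSVC rand() constants
        out.append((x >> 16) & 0xff)
    return bytes(out)
```

The trade-off is exactly the one under discussion: the 12% CryptoGen startup cost disappears, but the hash secret becomes one an attacker has a far better chance of guessing.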
msg151159 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-13 00:48
SafeDict.py: with this solution, the hash of the key has to be recomputed at each access to the dict (creation, get, set); the hash is not cached in the string object.
msg151167 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2012-01-13 10:17
Added some small comments in http://bugs.python.org/review/13703/show.
msg151353 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-16 12:45
The vulnerability is known since 2003 (Usenix 2003): read "Denial of
Service via Algorithmic Complexity Attacks" by Scott A. Crosby and Dan
S. Wallach.
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf

This paper compares Perl 5.8 hash function, MD5, UHASH (UMAC
universal), CW (Carter-Wegman) and XOR12. Read more about UMAC:
http://en.wikipedia.org/wiki/UMAC
"A UMAC has provable cryptographic strength and is usually a lot less
computationally intensive than other MACs."

oCERT advisory #2011-003: multiple implementations denial-of-service
via hash algorithm collision
http://www.ocert.org/advisories/ocert-2011-003.html

nRuns advisory:
http://www.nruns.com/_downloads/advisory28122011.pdf

CRuby 1.8.7 fix (use a randomized hash function):
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8_7/string.c?r1=34151&r2=34150&pathrev=34151
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=revision&revision=34151

JRuby uses Murmurhash and a hash (random) "seed" since JRuby 1.6.5.1:
https://github.com/jruby/jruby/commit/c1c9f95ed29cb93806fbc90e9eaabb9c406581e5
https://github.com/jruby/jruby/commit/2fc3a13c4af99be7f25f7dfb6ae3459505bb7c61
http://jruby.org/2011/12/27/jruby-1-6-5-1
JRUBY-6324: random seed for srand is not initialized properly:
https://github.com/jruby/jruby/commit/f7041c2636f46e398e3994fba2045e14a890fc14

Murmurhash:
https://sites.google.com/site/murmurhash/
pyhash implements Murmurhash:
http://code.google.com/p/pyfasthash/
msg151401 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2012-01-16 18:29
> The vulnerability is known since 2003 (Usenix 2003): read "Denial of
> Service via Algorithmic Complexity Attacks" by Scott A. Crosby and Dan
> S. Wallach.

Crosby started a meaningful thread on python-dev at that time similar to the current one:

  http://mail.python.org/pipermail/python-dev/2003-May/035874.html

It includes some good insight into the problem.
msg151402 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-16 18:58
Eric Snow wrote:
> 
> Eric Snow <ericsnowcurrently@gmail.com> added the comment:
> 
>> The vulnerability is known since 2003 (Usenix 2003): read "Denial of
>> Service via Algorithmic Complexity Attacks" by Scott A. Crosby and Dan
>> S. Wallach.
> 
> Crosby started a meaningful thread on python-dev at that time similar to the current one:
> 
>   http://mail.python.org/pipermail/python-dev/2003-May/035874.html
> 
> It includes some good insight into the problem.

Thanks for the pointer. Some interesting postings...

Vulnerability of applications:
http://mail.python.org/pipermail/python-dev/2003-May/035887.html

Speed of hashing, portability and practical aspects:
http://mail.python.org/pipermail/python-dev/2003-May/035902.html

Changing the hash function:
http://mail.python.org/pipermail/python-dev/2003-May/035911.html
http://mail.python.org/pipermail/python-dev/2003-May/035915.html
msg151419 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 01:53
Patch version 7:
 - Make PyOS_URandom() private (renamed to _PyOS_URandom)
 - os.urandom() releases the GIL during I/O in its implementation reading /dev/urandom
 - move _Py_unicode_hash_secret_t documentation into unicode_hash()

I also moved the test fixes into a separate patch: random_fix-tests.patch.
msg151422 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 02:10
Some tests are still failing with my 2 patches:
 - test_dis
 - test_inspect
 - test_json
 - test_packaging
 - test_ttk_textonly
 - test_urllib
msg151448 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 12:21
Patch version 8: the whole test suite now pass successfully.

The remaining question is if CryptoGen should be used instead of the
weak LCG initialized by gettimeofday() and getpid(). According to
Martin von Loewis, we must link Python statically against advapi32.dll.
That should speed up the startup.
msg151449 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 12:36
Hum, test_runpy fails with a segfault and/or a recursion limit error
because of my hack of rerunning regrtest.py to set the PYTHONHASHSEED
environment variable. The fork should be disabled if main() of
regrtest.py is called directly. Example:

diff --git a/Lib/test/regrtest.py b/Lib/test/regrtest.py
--- a/Lib/test/regrtest.py
+++ b/Lib/test/regrtest.py
@@ -258,7 +258,7 @@ def main(tests=None, testdir=None, verbo
          findleaks=False, use_resources=None, trace=False, coverdir='coverage',
          runleaks=False, huntrleaks=False, verbose2=False, print_slow=False,
          random_seed=None, use_mp=None, verbose3=False, forever=False,
-         header=False, failfast=False, match_tests=None):
+         header=False, failfast=False, match_tests=None, allow_fork=False):
     """Execute a test suite.

     This also parses command-line options and modifies its behavior
@@ -559,6 +559,11 @@ def main(tests=None, testdir=None, verbo
         except ValueError:
             print("Couldn't find starting test (%s), using all tests" % start)
     if randomize:
+        hashseed = os.getenv('PYTHONHASHSEED')
+        if (not hashseed and allow_fork):
+            os.environ['PYTHONHASHSEED'] = str(random_seed)
+            os.execv(sys.executable, [sys.executable] + sys.argv)
+            return
         random.seed(random_seed)
         print("Using random seed", random_seed)
         random.shuffle(selected)
@@ -1809,4 +1814,4 @@ if __name__ == '__main__':
     # change the CWD, the original CWD will be used. The original CWD is
     # available from support.SAVEDCWD.
     with support.temp_cwd(TESTCWD, quiet=True):
-        main()
+        main(allow_fork=True)

As Antoine wrote on IRC, regrtest.py should be changed later.
msg151468 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-17 16:23
#13712 contains a patch for test_packaging.
msg151472 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 16:35
> #13712 contains a patch for test_packaging.

It doesn't look related to the randomized hash function. random-8.patch
contains a fix for test_packaging.
msg151474 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-17 16:46
>> #13712 contains a patch for test_packaging.
> It doesn't look related to randomized hash function.
Trust me.  (If you read the whole report you’ll see why it looks unrelated: instead of sorting things like your patch does, mine addresses a more serious behavior bug).

> random-8.patch contains a fix to test_packaging.
I know, but mine is a bit better.
msg151484 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-17 19:59
To be more explicit about Marc-Andre Lemburg's msg151121 (which I agree with):

Count the collisions on a single lookup. 
If they exceed a threshold, do something different.

Marc-Andre's strawman proposal was threshold=1000, and raise.  It would be just as easy to say "whoa!  5 collisions -- time to use the alternative hash instead" (and, possibly, to issue a warning).

Even that slight tuning removes the biggest objection, because it won't ever actually fail.

Note that the use of a (presumably stronger 2nd) hash wouldn't come into play until (and unless) there was a problem for that specific key in that specific dictionary.  For the normal case, nothing changes -- unless we take advantage of the existence of a 2nd hash to simplify the first few rounds of collision resolution.  (Linear probing is more cache-friendly, but also more vulnerable to worst-case behavior -- but if probing stops at 4 or 8, that may not matter much.)  For quick scripts, the 2nd hash will almost certainly never be needed, so startup won't pay the penalty.

The only down side I see is that the 2nd (presumably randomized) hash won't be cached without another slot, which takes more memory and shouldn't be done in a bugfix release.
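A sketch of that per-lookup counting (a toy open-addressed table with invented names; CPython's real probe sequence is more involved than plain linear probing):

```python
FALLBACK = object()   # sentinel: "retry this lookup with the 2nd hash"

def lookup(buckets, key, mask, limit=5):
    """Probe a toy open-addressed table (a power-of-two list of
    (key, value) pairs or None). After 'limit' collisions, instead of
    raising, tell the caller to switch to the alternative hash."""
    i = hash(key) & mask
    for _ in range(limit):
        entry = buckets[i]
        if entry is None:
            raise KeyError(key)
        if entry[0] == key:
            return entry[1]
        i = (i + 1) & mask        # linear probing: cache-friendly
    return FALLBACK               # "whoa! 5 collisions" -- use 2nd hash
```

The normal case never reaches the fallback, so quick scripts pay nothing; only a dictionary under attack (or with pathological data) ever computes the second hash.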
msg151519 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-18 06:16
I like what you've done in #13704 better than what I see in random-8.patch so far.  See the code review comments I've left on both issues.
msg151528 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-18 10:01
> I like what you've done in #13704 better than what I see in random-8.patch so far.  see the code review comments i've left on both issues.

I didn't write the "3106cc0a2024.diff" patch attached to #13704; I just
clicked on the button to generate a patch from the repository.
Christian Heimes wrote the patch.

I don't really like "3106cc0a2024.diff", we don't need Mersenne
Twister to initialize the hash secret. The patch doesn't allow setting
a fixed secret if you need the same secret for a group of processes.
msg151560 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-18 18:59
STINNER Victor wrote:
> 
> Patch version 7:
>  - Make PyOS_URandom() private (renamed to _PyOS_URandom)
>  - os.urandom() releases the GIL for I/O operation for its implementation reading /dev/urandom
>  - move _Py_unicode_hash_secret_t documentation into unicode_hash()
> 
> I moved also fixes for tests in a separated patch: random_fix-tests.patch.

Don't you think that the number of corrections you have to apply in order
to get the tests working again shows how much impact such a change would
have on real-world applications?

Perhaps we should start to think about a compromise: make both the
collision counting and the hash seeding optional and let the user
decide which option is best.

BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
which needlessly complicates the code and doesn't add any additional
protection against hash value collisions.
msg151561 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-18 19:08
On Wed, Jan 18, 2012 at 10:59 AM, Marc-Andre Lemburg <report@bugs.python.org
> wrote:

>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> STINNER Victor wrote:
> >
> > Patch version 7:
> >  - Make PyOS_URandom() private (renamed to _PyOS_URandom)
> >  - os.urandom() releases the GIL for I/O operation for its
> implementation reading /dev/urandom
> >  - move _Py_unicode_hash_secret_t documentation into unicode_hash()
> >
> > I moved also fixes for tests in a separated patch:
> random_fix-tests.patch.
>
> Don't you think that the number of corrections you have to apply in order
> to get the tests working again shows how much impact such a change would
> have in real-world applications ?
>
> Perhaps we should start to think about a compromise: make both the
> collision counting and the hash seeding optional and let the user
> decide which option is best.
>

I like this, esp. if for old releases the collision counting is on by
default and the hash seeding is off by default, while in 3.3 both should be
on by default. Different env vars or flags should be used to enable/disable
them.

> BTW: The patch still includes the unnecessary
> _Py_unicode_hash_secret.suffix
> which needlessly complicates the code and doesn't add any additional
> protection against hash value collisions.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
>
msg151565 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 21:05
> I like this, esp. if for old releases the collision counting is on by
> default and the hash seeding is off by default, while in 3.3 both should be
> on by default. Different env vars or flags should be used to enable/disable
> them.

I would hope 3.3 only gets randomized hashing. Collision counting is a
hack to make bugfix releases 99.999%-compatible instead of 99.9% ;)
msg151566 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-18 21:10
On Wed, Jan 18, 2012 at 1:05 PM, Antoine Pitrou <report@bugs.python.org> wrote:

>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > I like this, esp. if for old releases the collision counting is on by
> > default and the hash seeding is off by default, while in 3.3 both should
> be
> > on by default. Different env vars or flags should be used to
> enable/disable
> > them.
>
> I would hope 3.3 only gets randomized hashing. Collision counting is a
> hack to make bugfix releases 99.999%-compatible instead of 99.9% ;)
>

Really? I'd expect the difference to be more than 2 nines. The randomized
hashing has two problems: (a) change in dict order; (b) hash varies between
processes. I cannot imagine counterexamples to the collision counting that
weren't constructed specifically as counterexamples.
msg151567 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 21:14
> Really? I'd expect the difference to be more than 2 nines. The randomized
> hashing has two problems: (a) change in dict order; (b) hash varies between
> processes.

Personally I don't think the change in dict order is a problem (hashing
already changes between 32-bit and 64-bit builds, and we sometimes
change the calculation too: it might change *more* often with random
hashes, while it went unnoticed in some cases before). So only (b) is a
problem and I don't think it affects more than 0.01% of
applications/users :)
msg151574 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-18 22:52
> Don't you think that the number of corrections you have to apply in order
> to get the tests working again shows how much impact such a change would
> have in real-world applications ?

Let's see the diffstat:

 Doc/using/cmdline.rst                       |    7
 Include/pythonrun.h                         |    2
 Include/unicodeobject.h                     |    6
 Lib/json/__init__.py                        |    4
 Lib/os.py                                   |   17 -
 Lib/packaging/create.py                     |    7
 Lib/packaging/tests/test_create.py          |   18 -
 Lib/test/mapping_tests.py                   |    2
 Lib/test/regrtest.py                        |    5
 Lib/test/test_builtin.py                    |    1
 Lib/test/test_dis.py                        |   36 ++-
 Lib/test/test_gdb.py                        |   11 -
 Lib/test/test_inspect.py                    |    1
 Lib/test/test_os.py                         |   35 ++-
 Lib/test/test_set.py                        |   25 ++
 Lib/test/test_unicode.py                    |   39 ++++
 Lib/test/test_urllib.py                     |   16 -
 Lib/test/test_urlparse.py                   |    6
 Lib/tkinter/test/test_ttk/test_functions.py |    2
 Makefile.pre.in                             |    1
 Modules/posixmodule.c                       |  126 ++-----------
 Objects/unicodeobject.c                     |   20 +-
 PCbuild/pythoncore.vcproj                   |    4
 Python/pythonrun.c                          |    3
 Python/random.c                             |  268 ++++++++++++++++++++++++++++
 25 files changed, 488 insertions(+), 174 deletions(-)

Except for Lib/packaging/create.py, all other changes are related to the
introduction of the randomized hash function or fix tests... Even the
Lib/packaging/create.py change is related to fixing tests. The test
could be changed differently, but I like the idea of packaging always
producing the same output (e.g. it is more readable for the user if
files are sorted).

I expected to have to do something on multiprocessing, but nope, it
doesn't care about the hash value.

So I expect something similar in applications: no change in the
applications, but a lot of hacks/tricks in tests.

> Perhaps we should start to think about a compromise: make both the
> collision counting and the hash seeding optional and let the user
> decide which option is best.

I don't think that we need two fixes for a single vulnerability (in
the same Python version), one is enough. If we decide to count
collisions, the randomized hash idea can be simply dropped. But we may
use a different fix for Python 3.3 and for stable versions (e.g. count
collisions for stable versions and use randomized hash for 3.3).

> BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
> which needlessly complicates the code and doesn't add any additional
> protection against hash value collisions

How does it complicate the code? It adds an extra XOR to hash(str) and
4 or 8 bytes in memory, that's all. It is more difficult to compute
the secret from hash(str) output if there is a prefix *and* a suffix.
If there is only a prefix, knowing a single hash(str) value is enough
to retrieve the secret directly.
> I don't think it affects more than 0.01% of applications/users :)

It would help to try a patched Python on a real world application like
Django to realize how much code is broken (or not) by a randomized
hash function.
msg151582 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-18 23:23
A possible advantage of having the 3.3 fix available in earlier versions is that people will be able to turn it on and have that be the *only* change -- just as with __future__ imports done one at a time.
msg151583 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-18 23:25
On Wed, Jan 18, 2012 at 1:10 PM, Guido van Rossum
<report@bugs.python.org> wrote:
> On Wed, Jan 18, 2012 at 1:05 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> >
> > I would hope 3.3 only gets randomized hashing. Collision counting is a
> > hack to make bugfix releases 99.999%-compatible instead of 99.9% ;)
> >
>
> Really? I'd expect the difference to be more than 2 nines. The randomized
> hashing has two problems: (a) change in dict order; (b) hash varies between
> processes. I cannot imagine counterexamples to the collision counting that
> weren't constructed specifically as counterexamples.

For the purposes of 3.3 I'd prefer to just have randomized hashing and
not the collision counting in order to keep things from getting too
complicated.  But I will not object if we opt to do both.

As much as the counting idea rubs me wrong, even if it were on by
default I agree that most non-contrived things will never encounter it
and it is easy to document how to work around it by disabling it
should anyone actually be impeded by it.

The concern I have with that approach from a web service point of view
is that it too can be gamed in the more rare server situation of
someone managing to fill a persistent data structure up with enough
colliding values to be _close_ to the limit such that the application
then dies while trying to process all future requests that _hit_ the
limit (a persisting 500 error DOS rather than an exception raised only
in one offending request that deserved that 500 error anyways). Not
nearly as likely a scenario but it is one I'd keep an eye open for
with an attacker hat on.

MvL's suggestion of using AVL trees for hash bucket slots instead of
our linear slot finding algorithm is a better way to fix the ultimate
problem by never devolving into linear behavior at all. It is
naturally more complicated but could likely even be done while
maintaining ABI compatibility. I haven't pondered designs and
performance costs for that. Possibly a memory hit and one or two extra
indirect lookups in the normal case and some additional complexity in
the iteration case.

-gps
msg151584 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 23:30
> MvL's suggestion of using AVL trees for hash bucket slots instead of
> our linear slot finding algorithm is a better way to fix the ultimate
> problem by never devolving into linear behavior at all.

A dict can contain non-orderable keys; I don't know how an AVL tree
can fit into that.
msg151585 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-18 23:31
> A dict can contain non-orderable keys, I don't know how an AVL tree can
> fit into that.

good point!
msg151586 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-18 23:37
> As much as the counting idea rubs me wrong,

FWIW, the original 2003 paper reported that the url-caching system that 
they tested used collision-counting to evade attacks.
msg151589 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-18 23:44
On Wed, Jan 18, 2012 at 3:37 PM, Terry J. Reedy <report@bugs.python.org> wrote:

>
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
>
> > As much as the counting idea rubs me wrong,
>
> FWIW, the original 2003 paper reported that the url-caching system that
> they tested used collision-counting to evade attacks.

You mean as a fix or that they successfully attacked a collision-counting
system?
msg151590 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 23:46
> > As much as the counting idea rubs me wrong,
> 
> FWIW, the original 2003 paper reported that the url-caching system that 
> they tested used collision-counting to evade attacks.

I think that was DJB's DNS server/cache actually.
But deciding to limit collisions in a specific application is not the
same as limiting them in the general case. Python dicts have a lot of
use cases that are not limited to storing URL parameters, domain names
or instance attributes: there is a greater risk of meeting pathological
cases with legitimate keys.
msg151596 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-19 00:46
On Wed, Jan 18, 2012 at 3:46 PM, Antoine Pitrou <report@bugs.python.org> wrote:

>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > > As much as the counting idea rubs me wrong,
> >
> > FWIW, the original 2003 paper reported that the url-caching system that
> > they tested used collision-counting to evade attacks.
>
> I think that was DJB's DNS server/cache actually.
> But deciding to limit collisions in a specific application is not the
> same as limiting them in the general case. Python dicts have a lot of
> use cases that are not limited to storing URL parameters, domain names
> or instance attributes: there is a greater risk of meeting pathological
> cases with legitimate keys.
>

Really? This sounds like FUD to me.
msg151604 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-19 01:15
> You mean as a fix or that they successfully attacked a collision-counting
> system?

Successful anticipation and blocking of the hash attack: after a chain
of 100, DJB's DNS cache 'treats the request as a cache miss'. What is
somewhat special for this app is being able to bail at that point.
Crosby & Wallach still think 'his fix could be improved', I presume by
using one of their recommended hashes.
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf
section 3.2, DJB DNS server; section 5, fixes
msg151617 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-19 13:03
> Even Lib/packaging/create.py change is related to fixing tests. The test can be changed
> differently, but I like the idea of having always the same output in packaging (e.g. it is
> more readable for the user if files are sorted).

See #13712 for why this is a fake fix.
msg151620 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-19 13:13
I tried the collision counting with a low number of collisions:

less than 15 collisions
-----------------------

Fail at startup.

5 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
10 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f

dict((str(k), 0) for k in range(2000000))
-----------------------------------------

15 collisions (32,768 buckets, 18024 used=55.0%): hash=0e4631d2 => 31d2
20 collisions (131,072 buckets, 81568 used=62.2%): hash=12660719 => 719
25 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
30 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
35 collisions => ? (more than 10,000,000 integers)

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyz0123456789'

15 collisions (8192 buckets, 5083 used=62.0%): hash=1526677a => 77a
20 collisions (32768 buckets, 19098 used=58.3%): hash=5d7760e6 => 60e6
25 collisions => <unable to generate a new key>

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 20572 used=62.8%): hash=789fe1e6 => 61e6
20 collisions (2048 buckets, 1297 used=63.3%): hash=2052533d => 33d
25 collisions => nope

random_dict('', 50000, charset, 1, 10)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 18964 used=57.9%): hash=94d7c4f5 => 44f5
20 collisions (32768 buckets, 21548 used=65.8%): hash=acb5b39e => 339e
25 collisions (8192 buckets, 5395 used=65.9%): hash=04d367ae => 7ae
30 collisions => nope

random_dict() comes from the following script:
***
import random

def random_string(charset, minlen, maxlen):
    strlen = random.randint(minlen, maxlen)
    return ''.join(random.choice(charset) for index in xrange(strlen))

def random_dict(prefix, count, charset, minlen, maxlen):
    dico = {}
    keys = set()
    for index in xrange(count):
        for tries in xrange(10000):
            key = prefix + random_string(charset, minlen, maxlen)
            if key in keys:
                continue
            keys.add(key)
            break
        else:
            raise ValueError("unable to generate a new key")
        dico[key] = None
    return dico

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
random_dict('', 50000, charset, 1, 3)
***

I ran the Django test suite. With a limit of 20 collisions, 60 tests
fail. With a limit of 50 collisions, there is no failure. But I don't
think that the test suite uses large data sets.

I also tried the Django test suite with a randomized hash function.
There are 46 failures. Many (all?) are related to the order of dict
keys: repr(dict) or indirectly in HTML output. I didn't analyze all
failures. I suppose that Django can simply run the test suite using
PYTHONHASHSEED=0 (disabling the randomized hash function), at least at
first.
msg151625 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 14:27
STINNER Victor wrote:
> ...
> So I expect something similar in applications: no change in the
> applications, but a lot of hacks/tricks in tests.

Tests usually check the output of an application given a certain
input. If those fail with the randomization, then it's likely that
real-world uses of the application will show the same kinds of
failures, due to the application changing from deterministic to
non-deterministic via the randomization.

>> BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
>> which needlessly complicates the code and doesn't add any additional
>> protection against hash value collisions
> 
> How does it complicate the code? It adds an extra XOR to hash(str) and
> 4 or 8 bytes in memory, that's all. It is more difficult to compute
> the secret from hash(str) output if there is a prefix *and* a suffix.
> If there is only a prefix, knowning a single hash(str) value is just
> enough to retrieve directly the secret.

The suffix only introduces a constant change in all hash values
output, so even if you don't know the suffix, you can still
generate data sets with collisions by knowing just the prefix.

>> I don't think it affects more than 0.01% of applications/users :)
> 
> It would help to try a patched Python on a real world application like
> Django to realize how much code is broken (or not) by a randomized
> hash function.

That would help for both approaches, indeed.

Please note that you'd have to extend the randomization to
all other Python data types as well in order to reach the same level
of security as the collision counting approach.

As-is the randomization patch does not solve the integer key attack and
even though parsers such as JSON and XML-RPC aren't directly affected,
it is well possible that stringified integers such as IDs are converted
back to integers later during processing, thereby triggering the
attack.

Note that the integer attack also applies to other number types
in Python:

--> (hash(3), hash(3.0), hash(3+0j))
(3, 3, 3)

See Tim's post I referenced earlier on for the reasons. Here's
a quick summary ;-) ...

--> {3:1, 3.0:2, 3+0j:3}
{3: 3}
msg151626 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-19 14:31
> Please note, that you'd have to extend the randomization to
> all other Python data types as well in order to reach the same level
> of security as the collision counting approach.

You also have to extend the collision counting to sets, by the way.
msg151628 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 14:37
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Please note, that you'd have to extend the randomization to
>> all other Python data types as well in order to reach the same level
>> of security as the collision counting approach.
> 
> You also have to extend the collision counting to sets, by the way.

Indeed, but that's easy, since the set implementation derives from
the dict implementation.
msg151629 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-19 14:43
Django's tests will *not* be run with PYTHONHASHSEED=0. If they're broken with hash randomization, then they are likely broken on random.choice(["32-bit", "64-bit", "pypy", "jython", "ironpython"]), and we strive to run on all those platforms. If our tests are order-dependent then they're broken, and we'll fix the tests.

Further, most of the failures I can think of would be failures in the tests that wouldn't actually be failures in a real application, such as the rendered HTML being different because a tag's attributes are in a different order.
msg151632 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 15:11
STINNER Victor wrote:
> 
> I tried the collision counting with a low number of collisions:
> ... no false positives with a limit of 50 collisions ...

Thanks for running those tests. Looks like a limit lower than 1000
would already do just fine.

Some timings showing how long it would take to hit a limit:

# 100
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 100))"
100 loops, best of 3: 297 usec per loop

# 250
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 250))"
100 loops, best of 3: 1.46 msec per loop

# 500
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 500))"
100 loops, best of 3: 5.73 msec per loop

# 750
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 750))"
100 loops, best of 3: 12.7 msec per loop

# 1000
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
100 loops, best of 3: 22.4 msec per loop

These timings have to be matched against the size of the payload
needed to trigger those limits.

In any case, the limit needs to be configurable like the hash seed
in the randomization patch.
msg151633 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 15:13
[Reposting, since roundup removed part of the Python output]

M.-A. Lemburg wrote:
> Note that the integer attack also applies to other number types
> in Python:
> 
> --> (hash(3), hash(3.0), hash(3+0j))
> (3, 3, 3)
> 
> See Tim's post I referenced earlier on for the reasons. Here's
> a quick summary ;-) ...
> 
> --> {3:1, 3.0:2, 3+0j:3}
> {3: 3}
msg151647 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-19 18:05
> The suffix only introduces a constant change in all hash values
> output, so even if you don't know the suffix, you can still
> generate data sets with collisions by just having the prefix.

That's true. But without the suffix, I can pretty easily and efficiently guess the prefix by just seeing the result of a few well-chosen and short repr(dict(X)). I suppose that's harder with the suffix.
msg151662 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-20 00:38
Frank Sievertsen wrote:
> 
> Frank Sievertsen <python@sievertsen.de> added the comment:
> 
>> The suffix only introduces a constant change in all hash values
>> output, so even if you don't know the suffix, you can still
>> generate data sets with collisions by just having the prefix.
> 
> That's true. But without the suffix, I can pretty easy and efficient guess the prefix by just seeing the result of a few well-chosen and short repr(dict(X)). I suppose that's harder with the suffix.

Since the hash function is known, it doesn't make things much
harder. Without suffix you just need hash('') to find out what
the prefix is. With suffix, two values are enough.

Say P is your prefix and S your suffix. Let's say you can get the
hash values of A = hash('') and B = hash('\x00').

With Victor's hash function you have (IIRC):

A = hash('')     = P ^ (0<<7) ^ 0 ^ S = P ^ S
B = hash('\x00') = ((P ^ (0<<7)) * 1000003) ^ 0 ^ 1 ^ S = (P * 1000003) ^ 1 ^ S

Let X = A ^ B, then

X = P ^ (P * 1000003) ^ 1

since S ^ S = 0 and 0 ^ Y = Y (for any Y), i.e. the suffix doesn't
make any difference.

For P < 500000, you can then easily calculate P from X
using:

P = X // 1000002

(things obviously get tricky once overflow kicks in)
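The cancellation is easy to check numerically. Below is a minimal sketch of the simplified hash model used in this message (toy_hash, P and S are illustrative names, not CPython's actual implementation):

```python
MASK = (1 << 64) - 1          # assume a 64-bit hash

def toy_hash(s, prefix, suffix):
    # Simplified model of the randomized string hash discussed above:
    # seed with the prefix, mix each character in with the 1000003
    # multiplier, then XOR in the length and the suffix.
    h = prefix
    if s:
        h ^= ord(s[0]) << 7
        for ch in s:
            h = ((h * 1000003) & MASK) ^ ord(ch)
    h ^= len(s)
    return h ^ suffix

P, S = 123456, 0x1234567890ABCDEF     # secrets, "unknown" to the attacker
A = toy_hash('', P, S)
B = toy_hash('\x00', P, S)
X = A ^ B                             # the suffix S cancels out entirely
assert X == P ^ (P * 1000003) ^ 1
assert X // 1000002 == P              # prefix recovered from two hashes
```

The two asserts hold as long as P is small enough that no overflow occurs in the multiply, matching the caveat above.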

Note that for number hashes the randomization doesn't work at all,
since there's no length or feedback loop involved.

With Victor's approach hash(0) would output the whole seed,
but even if the seed is not known, creating an attack data
set is trivial, since hash(x) = P ^ x ^ S.
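A sketch of why the attack stays trivial under hash(x) = P ^ x ^ S (the table size 2**k and the secret values here are illustrative): the bucket index depends only on the low bits of the key, which the attacker fully controls, so the secrets never matter.

```python
# With hash(x) = P ^ x ^ S, the bucket index is (P ^ x ^ S) % 2**k.
# Keys that agree in their low k bits land in the same bucket no
# matter what the secret prefix P and suffix S are.
k = 10                                  # assume a table of 2**k slots
keys = [i << k for i in range(1000)]    # low k bits are all zero
P, S = 0x12345678, 0x9ABCDEF0           # arbitrary, "unknown" secrets
buckets = {(P ^ x ^ S) % (1 << k) for x in keys}
assert len(buckets) == 1                # all 1000 keys share one bucket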
msg151664 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-20 01:11
> Since the hash function is known, it doesn't make things much
> harder. Without suffix you just need hash('') to find out what
> the prefix is. With suffix, two values are enough.

With my patch, hash('') always returns zero. I don't remember who asked
me to do that, but it avoids leaking the secret too easily :-) I wrote
some info on how to compute the secret:
http://bugs.python.org/issue13703#msg150706

I don't see how to compute the secret, but that doesn't mean it is
impossible :-) I suppose that you have to brute-force some bits, at
least if you only have repr(dict), which gives only (indirectly) the
lower bits of the hash.

> (things obviously get tricky once overflow kicks in)

hash() doesn't overflow: if you know the string, you can run the
algorithm backward. To divide, you can compute 1/1000003 mod 2^32 (or
mod 2^64): 2021759595 and 16109806864799210091. So x/1000003 mod 2^32
= x*2021759595 mod 2^32.

See my invert_mod() function of:
https://bitbucket.org/haypo/misc/src/tip/python/mathfunc.py
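The quoted inverses are easy to verify without that script; a minimal check using Euler's theorem (every unit a modulo 2**k satisfies a**(2**(k-1)) == 1, so a**(2**(k-1) - 1) is its inverse):

```python
# Compute the modular inverses of the hash multiplier 1000003.
inv32 = pow(1000003, 2**31 - 1, 2**32)
inv64 = pow(1000003, 2**63 - 1, 2**64)
assert inv32 == 2021759595              # the value quoted above
assert (1000003 * inv32) % 2**32 == 1
assert (1000003 * inv64) % 2**64 == 1

# One multiply step of the hash loop can therefore be undone, which is
# what lets the algorithm be run backwards for a known string:
x = 123456789
step = (x * 1000003) % 2**32            # forward step
assert (step * inv32) % 2**32 == x      # undone by the inverse
```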

> With Victor's approach hash(0) would output the whole seed,
> but even if the seed is not known, creating an attack data
> set is trivial, since hash(x) = P ^ x ^ S.

I suppose that it would be too simple to compute the secret of a
randomized integer hash, so it may be better to leave integers
unchanged. Using a different secret for integers than for strings
would not protect Python against an attack using only integers, but
integer keys are less common than string keys (especially in web
applications).

Anyway, I changed my mind about randomized hash: I now prefer counting
collisions :-)
msg151677 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-20 04:58
>> That's true. But without the suffix, I can pretty easy and efficient
>> guess the prefix by just seeing the result of a few well-chosen and
>> short repr(dict(X)). I suppose that's harder with the suffix.

> Since the hash function is known, it doesn't make things much
> harder. Without suffix you just need hash('') to find out what
> the prefix is. With suffix, two values are enough

This is obvious and absolutely correct!

But that's not what I talked about. I didn't talk about the result of
hash(X), but about the result of repr(dict([(str1, val1), (str2,
val2), ...])), which is more likely to happen and not so trivial
(if you want to know more than the last 8 bits).

IMHO this problem shows that we can't advise using dict() or set() for
(potentially dangerous) user-supplied keys at the moment.

I prefer randomization because it fixes this problem. The
collision-counting->exception prevents the software from becoming slow,
but it doesn't make it work as expected.

Sure, you can catch the exception. But when you get the exception,
probably you wanted to add the items for a reason: Because you want
them to be in the dict and that's how your software works.

Imagine an irc-server using a dict to store the connected users, using
the nicknames as keys. Even if the irc-server catches the unexpected
exception while connecting a new user (when adding his/her name to the
dict), an attacker could connect 999 special-named users to prevent a
specific user from connecting in future.

Collision-counting->exception can make it possible to inhibit a
specific future add to the dict. The outcome is highly application
dependent.

I think it fixes 95% of the attack-vectors, but not all, and it adds a
few new risks. However, of course it's much better than doing nothing
to fix the problem.
msg151679 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 09:03
> A dict can contain non-orderable keys, I don't know how an AVL tree
> can fit into that.

They may be non-orderable, but since they are required to be hashable,
I guess one can build a comparison function with the following:

def cmp(x, y):
    if x == y:
        return 0
    elif hash(x) <= hash(y):
        return -1
    else:
        return 1

It doesn't yield a mathematical order because it lacks the
anti-symmetry property, but it should be enough for a binary search
tree.
msg151680 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-20 09:30
> They may be non-orderable, but since they are required to be hashable,
> I guess one can build a comparison function with the following:

Since we are trying to fix a problem where hash(X) == hash(Y), you
can't make them orderable by using the hash-values and build a binary
tree out of the (equal) hash-values.
msg151681 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 10:39
> Since we are are trying to fix a problem where hash(X) == hash(Y), you
> can't make them orderable by using the hash-values and build a binary
> out of the (equal) hash-values.

That's not what I suggested.
Keys would be considered equal if they are indeed equal (__eq__). The
hash value is just used to know if the key belongs to the left or the
right child tree. With a self-balanced binary search tree, you'd still
get O(log(N)) complexity.

Anyway, I still think that the hash randomization is the right way to
go, simply because it does solve the problem, whereas the collision
counting doesn't: Martin made a very good point on python-dev with his
database example.
msg151682 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-20 10:43
> The hash value is just used to know if the key belongs to the left
> or the right child tree.

Yes, that's what I don't understand: how can you do this when ALL
hash-values are equal?
msg151684 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 10:52
> Yes, that's what I don't understand: How can you do this, when ALL
> hash-values are equal.

You're right, that's stupid.
Short night...
msg151685 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-20 11:17
Charles-François Natali wrote:
> 
> Anyway, I still think that the hash randomization is the right way to
> go, simply because it does solve the problem, whereas the collision
> counting doesn't: Martin made a very good point on python-dev with his
> database example.

For completeness, I quote Martin here:

"""
The main issue with that approach is that it allows a new kind of attack.

An attacker now needs to find 1000 colliding keys, and submit them
one-by-one into a database. The limit will not trigger, as those are
just database insertions.

Now, if the application also has a need to read the entire database
table into a dictionary, that will suddenly break, and not for the
attacker (which would be ok), but for the regular user of the
application or the site administrator.

So it may be that this approach actually simplifies the attack, making
the cure worse than the disease.
"""

Martin is correct in that it is possible to trick an application
into building some data pool which can then be used as indirect
input for an attack.

What I don't see is what's wrong with the application raising
an exception in case it finds such data in an untrusted source
(reading arbitrary amounts of user data from a database is just
as dangerous as reading such data from any other source).

The exception will tell the programmer to be more careful and
patch the application not to read untrusted data without
additional precautions.

It will also tell the maintainer of the application that there
was indeed an attack on the system which may need to be
tracked down.

Note that the collision counting demo patch is trivial - I just
wanted to demonstrate how it works. As already mentioned, there's
room for improvement:

If Python objects were to provide an additional
method for calculating a universal hash value (based on an
integer input parameter), the dictionary in question could
use this to rehash itself and avoid the attack. Think of this
as "randomization when needed". (*)

Since the dict would still detect the problem, it could also
raise a warning to inform the maintainer of the application.

So you get the best of both worlds and randomization would only
kick in when it's really needed to keep the application running.
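As an illustration of this "randomization when needed" idea, here is a toy sketch (all names hypothetical; a real fix would live in the C dict implementation and use a proper per-type universal hash method): a chained hash table that re-salts and rehashes itself once any bucket chain grows past a limit.

```python
import random

def universal_hash(key, salt):
    # Stand-in for the per-object "universal hash" method suggested
    # above; any keyed hash would do here.
    h = salt
    for ch in str(key):
        h = ((h * 1000003) & 0xFFFFFFFF) ^ ord(ch)
    return h

class RehashingDict:
    """Toy chained hash table that re-salts itself when any bucket
    chain grows past LIMIT -- randomization only when needed."""
    LIMIT = 8

    def __init__(self, nbuckets=64):
        self.salt = 0                  # deterministic until attacked
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, key):
        return self.buckets[universal_hash(key, self.salt) % len(self.buckets)]

    def __setitem__(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        if len(chain) > self.LIMIT:    # possible attack: re-salt and rehash
            items = [kv for b in self.buckets for kv in b]
            self.salt = random.getrandbits(32)
            self.buckets = [[] for _ in self.buckets]
            for k, v in items:
                self._chain(k).append((k, v))

    def __getitem__(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)
```

Until the limit is hit, lookup and iteration stay fully deterministic; only a dict that actually sees a pathological collision chain pays for (and benefits from) the randomization.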
msg151689 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-20 12:58
> Note that the collision counting demo patch is trivial - I just
> wanted to demonstrate how it works. As already mentioned, there's
> room for improvement:
>
> If Python objects were to provide an additional
> method for calculating a universal hash value (based on an
> integer input parameter), the dictionary in question could
> use this to rehash itself and avoid the attack. Think of this
> as "randomization when needed".

Yes, the solution can be improved, but maybe not in stable versions
(the patch for stable versions should be short and simple).

If the hash output depends on an argument, the result cannot be
cached. So I suppose that dictionary lookups become slower once the
dictionary switches to the randomized mode. It would require adding an
optional argument to hash functions, or adding a new function to some (or
all?) builtin types.
msg151691 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 14:42
> So you get the best of both worlds and randomization would only
> kick in when it's really needed to keep the application running.

Of course, but then the collision counting approach loses its main
advantage over randomized hashing: smaller patch, easier to backport.
If you need to handle a potentially abnormal number of collisions
anyway, why not account for it upfront, instead of drastically
complicating the algorithm? While the patch is larger, randomization is
conceptually simpler.

The only argument in favor of collision counting is that it will not
break applications relying on dict order: it has been argued several
times that such applications are already broken, but that's of course
not an easy decision to make, especially for stable versions...
msg151699 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-20 17:31
Marc-Andre Lemburg:
>> So you get the best of both worlds and randomization would only
>> kick in when it's really needed to keep the application running.

Charles-François Natali
> The only argument in favor of collision counting is that it will not
> break applications relying on dict order:

There is also the "taxes suck" argument; if hashing is made complex,
then every object (or at least almost every string) pays a price, even
if it will never be stuck in a dict big enough to matter.

With collision counting, there are no additional operations unless and
until there is at least one collision -- in other words, after the
base hash algorithm has already started to fail for that particular
piece of data.

In fact, the base algorithm can be safely simplified further,
precisely because it does not need to be quite as adequate for
reprobes on data that does have at least one collision.
msg151700 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-20 17:39
On Thu, Jan 19, 2012 at 8:58 PM, Frank Sievertsen <report@bugs.python.org> wrote:

>
> Frank Sievertsen <python@sievertsen.de> added the comment:
>
> >> That's true. But without the suffix, I can pretty easy and efficient
> >> guess the prefix by just seeing the result of a few well-chosen and
> >> short repr(dict(X)). I suppose that's harder with the suffix.
>
> > Since the hash function is known, it doesn't make things much
> > harder. Without suffix you just need hash('') to find out what
> > the prefix is. With suffix, two values are enough
>
> This is obvious and absolutely correct!
>
> But it's not what I talked about. I didn't talk about the result of
> hash(X), but about the result of repr(dict([(str: val), (str:
> val)....])), which is more likely to happen and not so trivial
> (if you want to know more than the last 8 bits)
>
> IMHO this problem shows that we can't advise dict() or set() for
> (potentially dangerous) user-supplied keys at the moment.
>
> I prefer randomization because it fixes this problem. The
> collision-counting->exception prevents software from becoming slow,
> but it doesn't make it work as expected.
>

That depends. If collision counting prevents the DoS attack that may be
"work as expected", assuming you believe (as I do) that "real life" data
won't ever have that many collisions.

Note that every web service is vulnerable to some form of DoS where a
sufficient number of malicious requests will keep all available servers
occupied so legitimate requests suffer delays and timeouts. The defense is
to have sufficient capacity so that a potential attacker would need a large
amount of resources to do any real damage. The hash collision attack vastly
reduces the amount of resources needed to bring down a service; crashing
early moves the balance of power significantly back, and that's all we can
ask for.

Sure, you can catch the exception. But when you get the exception,
> probably you wanted to add the items for a reason: Because you want
> them to be in the dict and that's how your software works.
>

No, real data would never make this happen, so it's a "don't care" case (at
least for the vast majority of services). An attacker could also send you
such a large amount of data that your server runs out of memory, or starts
swapping (which is almost worse). But that requires for the attacker to
have enough bandwidth to send you that data. Or they could send you very
many requests. Same requirement.

All we need to guard for here is the unfortunate multiplication of the
attacker's effort due to the behavior of the collision-resolution code in
the dict implementation. Beyond that it's every app for itself.

> Imagine an irc-server using a dict to store the connected users, using
> the nicknames as keys. Even if the irc-server catches the unexpected
> exception while connecting a new user (when adding his/her name to the
> dict), an attacker could connect 999 special-named users to prevent a
> specific user from connecting in future.
>

Or they could use many other tactics. At this point the attack is specific
to this IRC implementation and it's no longer Python's responsibility.

> Collision-counting->exception can make it possible to inhibit a
> specific future add to the dict. The outcome is highly application
> dependent.
>
> I think it fixes 95% of the attack-vectors, but not all and it adds a
> few new risks. However, of course it's much better than doing nothing
> to fix the problem.
>

Right -- it vastly increases the effort needed to attack any particular
service, and does not affect any behavior of existing Python apps.
msg151701 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-20 17:42
On Fri, Jan 20, 2012 at 7:58 AM, STINNER Victor
> If the hash output depends on an argument, the result cannot be
> cached.

They can still be cached in a separate dict based on id, rather than
string contents.

It may also be possible to cache them in the dict itself; for a
string-only dict, the hash of each entry is already cached on the
object, and the cache member of the entry is technically redundant.
Entering a key with the alternative hash can also switch the lookup
function to one that handles that possibility, just as entering a
non-string key currently does.

> It would require adding an
> optional argument to hash functions, or adding a new function to some
> (or all?) builtin types.

For backports, the alternative hashing could be done privately within
dict and set, and would not require new slots on other types.
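Jim's id()-keyed caching idea could look roughly like this (the class and its layout are hypothetical; a real backport would do this in C inside dictobject.c):

```python
class SaltedHashCache:
    """Cache an alternative (salted) hash per object, keyed by id().
    Each entry keeps a reference to the object so its id cannot be
    reused while the cached value is alive."""

    def __init__(self, salt):
        self.salt = salt
        self._cache = {}  # id(obj) -> (obj, salted_hash)

    def salted_hash(self, obj):
        entry = self._cache.get(id(obj))
        if entry is None or entry[0] is not obj:
            entry = (obj, hash((self.salt, obj)))
            self._cache[id(obj)] = entry
        return entry[1]
```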
msg151703 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-20 18:11
I ran the test suite of Twisted 11.1 using a limit of 20 collisions:
there is no test failing because of hash collisions.
msg151707 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-20 22:55
On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> Demo patch implementing the collision limit idea for Python 2.7.
> 
> ----------
> Added file: http://bugs.python.org/file24151/hash-attack.patch
> 

Marc: is this the latest version of your patch?

Whether or not we go with collision counting and/or adding a random salt
to hashes and/or something else, I've had a go at updating your patch.

Although debate on python-dev seems to have turned against the
collision-counting idea, based on flaws reported by Frank Sievertsen
(http://mail.python.org/pipermail/python-dev/2012-January/115726.html),
it seemed to me to be worth at least adding some test cases to flesh out
the approach.  Note that the test cases deliberately avoid containing
"hostile" data.

Am attaching an updated version which:
  * adds various FIXMEs (my patch isn't ready yet, but I wanted to get
more eyes on this)

  * introduces a new TooManyHashCollisions exception, and uses that
rather than KeyError (currently it extends BaseException; am not sure
where it should sit in the exception hierarchy).

  * adds debug text to the above exception, including the repr() and
hash of the key for which the issue was triggered:
  TooManyHashCollisions: 1001 hash collisions within dict at key
ChosenHash(999, 42) with hash 42

  * moves exception-setting to a helper function, to avoid duplicated
code

  * adds a sys.max_dict_collisions, though currently with just a
copy-and-paste of the 1000 value from dictobject.c

  * starts adding a test suite to test_dict.py, using a ChosenHash
helper class (to avoid having to publish hostile data), and a context
manager for ensuring the timings of various operations fall within sane
bounds, so I can do things like this:
        with self.assertFasterThan(seconds=TIME_LIMIT) as cm:
            for i in range(sys.max_dict_collisions - 1):
                key = ChosenHash(i, 42)
                d[key] = 0

The test suite reproduces the TooManyHashCollisions response to a basic
DoS, and also "successfully" fails due to scenario 2 in Frank's email
above (assuming I understood his email correctly).

Presumably this could also incorporate a reproducer for scenario 1 in
this email, though I don't have one yet (but I don't want to make
hostile data public).

The patch doesn't yet do anything for sets.

Hope this is helpful
Dave
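The assertFasterThan helper itself isn't shown in the message; a minimal sketch of what such a timing guard might look like (name and semantics inferred from the usage above):

```python
import time
from contextlib import contextmanager

@contextmanager
def assert_faster_than(seconds):
    """Raise AssertionError if the with-block takes `seconds` or longer."""
    start = time.time()
    yield
    elapsed = time.time() - start
    if elapsed >= seconds:
        raise AssertionError("took %.3fs, expected under %.3fs"
                             % (elapsed, seconds))
```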
msg151714 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 03:16
On Fri, 2012-01-20 at 22:55 +0000, Dave Malcolm wrote:
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
> > Marc-Andre Lemburg <mal@egenix.com> added the comment:
> > 
> > Demo patch implementing the collision limit idea for Python 2.7.
> > 
> > ----------
> > Added file: http://bugs.python.org/file24151/hash-attack.patch
> > 
> 
> Marc: is this the latest version of your patch?
> 
> Whether or not we go with collision counting and/or adding a random salt
> to hashes and/or something else, I've had a go at updating your patch
> 
> Although debate on python-dev seems to have turned against the
> collision-counting idea, based on flaws reported by Frank Sievertsen
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html
> it seemed to me to be worth at least adding some test cases to flesh out
> the approach.  Note that the test cases deliberately avoid containing
> "hostile" data.

I had a brainstorm, and I don't yet know if the following makes sense,
but here's a crude patch with another approach, which might get around
the issues Frank raises.

Rather than count the number of equal-hash collisions within each call
to lookdict, instead keep a per-dict count of the total number of
iterations through the probe sequence (regardless of the hashing),
amortized across all calls to lookdict, and if it looks like we're going
O(n^2) rather than O(n), raise an exception.  Actually, that's not quite
it, but see below...

We potentially have 24 words of per-dictionary storage hiding in the
ma_smalltable area within PyDictObject, which we can use when ma_mask >=
PyDict_MINSIZE (when mp->ma_table != mp->ma_smalltable), without
changing sizeof(PyDictObject) and thus breaking ABI.  I hope there isn't
any code out there that uses this space.  (Anyone know of any?)

This very crude patch uses that area to add per-dict tracking of the
total number of iterations spent probing for a free PyDictEntry whilst
constructing the dictionary.  It rules that if we've gone more than (32
* ma_used) iterations whilst constructing the dictionary (counted across
all ma_lookup calls), then we're degenerating into O(n^2) behavior, and
this triggers an exception.  Any other usage of ma_lookup resets the
count (e.g. when reading values back).  I picked the scaling factor of
32 from out of the air; I hope there's a smarter threshold.  

I'm assuming that an attack scenario tends to involve a dictionary that
goes through a construction phase (which the attacker is aiming to
change from O(N) to O(N^2)), and then a usage phase, whereas there are
other patterns of dictionary usage in which insertion and lookup are
intermingled for which this approach wouldn't raise an exception.

This leads to exceptions like this:

AlgorithmicComplexityError: dict construction used 4951 probes for 99
entries at key 99 with hash 42

(i.e. the act of constructing a dict with 99 entries required traversing
4951 PyDictEntry slots, suggesting someone is sending deliberately
awkward data).

Seems to successfully handle both the original DoS and the second
scenario in Frank's email.  I don't have a reproducer for the first of
Frank's scenarios, but in theory it ought to handle it.  (I hope!)

Have seen two failures within python test suite from this, which I hope
can be fixed by tuning the thresholds and the reset events (they seem to
happen when a large dict is emptied).

May have a performance impact, but I didn't make any attempt to optimize
it (beyond picking a power of two for the scaling factor).

(There may be random bits of the old patch thrown in; sorry)

Thoughts? (apart from "ugh! it's ugly!" yes I know - it's late here)
Dave
msg151731 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 14:27
> Thoughts? (apart from "ugh! it's ugly!" yes I know - it's late here)

Is it guaranteed that no usage pattern can render this protection
ineffective? What if a dict is constructed by intermingling lookups and
inserts?
Similarly, what happens with e.g. the common use case of
defaultdict(list), where you append() after the lookup/insert? Does some
key distribution allow the attack while circumventing the protection?
msg151734 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2012-01-21 15:36
Hashing with a random seed is only marginally slower or more
complicated than the current version.

The patch is big because it moves random number generator initialization 
code around. There's no "per object" tax, and the cost of the random 
number generator initialization is only significant on Windows.
Basically, there's no "tax".

Zbyszek
msg151735 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 17:02
On Sat, 2012-01-21 at 14:27 +0000, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> > Thoughts? (apart from "ugh! it's ugly!" yes I know - it's late here)
> 
> Is it guaranteed that no usage pattern can render this protection
> ineffective? What if a dict is constructed by intermingling lookups and
> inserts?
> Similarly, what happens with e.g. the common use case of
> defaultdict(list), where you append() after the lookup/insert? Does some
> key distribution allow the attack while circumventing the protection?

Yes, I agree that I was making an unrealistic assumption about usage
patterns.  There was also some global state (the "is_inserting"
variable).

I've tweaked the approach somewhat, moved the global to be per-dict, and
am attaching a revised version of the patch:
   amortized-probe-counting-dmalcolm-2012-01-21-003.patch

In this patch, rather than reset the count each time, I keep track of
the total number of calls to insertdict() that have happened for each
"large dict" (i.e. for which ma_table != ma_smalltable), and the total
number of probe iterations that have been needed to service those
insertions/overwrites.  It raises the exception when the *number of
probe iterations per insertion* exceeds a threshold factor (rather than
merely comparing the number of iterations against the current ma_used of
the dict).  I believe this means that it's tracking and checking every
time the dict is modified, and (I hope) protects us against any data
that drives the dict implementation away from linear behavior (because
that's essentially what it's testing for).  [the per-dict stats are
reset each time that it shrinks down to using ma_smalltable again, but I
think at-risk usage patterns in which that occurs are uncommon relative
to those in which it doesn't].

When attacked, this leads to exceptions like this:
AlgorithmicComplexityError: dict construction used 1697 probes whilst
performing 53 insertions (len() == 58) at key 58 with hash 42

i.e we have a dictionary containing 58 keys, which has seen 53
insert/overwrite operations since transitioning to the non-ma_smalltable
representation (at size 6); presumably it has 128 PyDictEntries.
Servicing those 53 operations has required a total 1697 iterations
through the probing loop, or a little over 32 probes per insert.
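In pure-Python terms, the amortized check might look like this toy table (illustrative only: the real patch stores the two counters in the otherwise-unused ma_smalltable area and raises from C):

```python
class AmortizedProbeTable:
    """Toy open-addressing table: count insert/overwrite operations and
    total probe iterations, and raise once the running probes-per-insert
    ratio exceeds FACTOR (32 here, matching the patch's guess)."""

    FACTOR = 32

    def __init__(self, size=128):
        self.slots = [None] * size
        self.insertions = 0   # ops since the table became "large"
        self.probes = 0       # total probe iterations for those ops

    def insert(self, key, value):
        mask = len(self.slots) - 1
        i = hash(key) & mask
        while self.slots[i] is not None and self.slots[i][0] != key:
            self.probes += 1
            i = (i + 1) & mask
        self.slots[i] = (key, value)
        self.insertions += 1
        if self.probes > self.FACTOR * self.insertions:
            raise RuntimeError(
                "dict construction used %d probes whilst performing "
                "%d insertions" % (self.probes, self.insertions))
```

Benign data keeps the ratio near zero, so only degenerate key sets trip the check.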

I just did a full run of the test suite (using run_tests.py), and it
mostly passed the new tests I've added (included the test for scenario 2
from Frank's email).

There were two failures:
======================================================================
FAIL: test_inheritance (test.test_pep352.ExceptionClassTests)
----------------------------------------------------------------------
AssertionError: 1 != 0 : {'AlgorithmicComplexityError'} not accounted
for
----------------------------------------------------------------------
which is obviously fixable (given a decision on where the exception
lives in the hierarchy)

and this one:
test test_mutants crashed -- Traceback (most recent call last):
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/regrtest.py", line 1214, in runtest_inner
    the_package = __import__(abstest, globals(), locals(), [])
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 159, in <module>
    test(100)
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 156, in test
    test_one(random.randrange(1, 100))
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 132, in test_one
    dict2keys = fill_dict(dict2, range(n), n)
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 118, in fill_dict
    Horrid(random.choice(candidates))
AlgorithmicComplexityError: dict construction used 2753 probes whilst
performing 86 insertions (len() == 64) at key Horrid(86) with hash 42
though that seems to be deliberately degenerate code.

Caveats:
* no overflow handling (what happens after 2**32 modifications to a
long-lived dict on a 32-bit build?) - though that's fixable.
* no idea what the scaling factor for the threshold should be (there may
also be a deep mathematical objection here, based on how big-O notation
is defined in terms of an arbitrary scaling factor and limit)
* not optimized; I haven't looked at performance yet
* doesn't cover set(), though that also has spare space (I hope) via its
own smalltable array.

BTW, note that although I've been working on this variant of the
collision counting approach, I'm not opposed to the hash randomization
approach, or to adding extra checks in strategic places within the
stdlib: I'm keen to get some kind of appropriate fix approved by the
upstream Python development community so I can backport it to the
various recent-to-ancient versions of CPython I support in RHEL (and
Fedora), before we start seeing real-world attacks.

Hope this is helpful
Dave
msg151737 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 17:07
(or combination of fixes, of course)
msg151739 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 18:57
> In this patch, rather than reset the count each time, I keep track of
> the total number of calls to insertdict() that have happened for each
> "large dict" (i.e. for which ma_table != ma_smalltable), and the total
> number of probe iterations that have been needed to service those
> insertions/overwrites.  It raises the exception when the *number of
> probe iterations per insertion* exceeds a threshold factor (rather than
> merely comparing the number of iterations against the current ma_used of
> the dict).

This sounds much more robust than the previous attempt.

> When attacked, this leads to exceptions like this:
> AlgorithmicComplexityError: dict construction used 1697 probes whilst
> performing 53 insertions (len() == 58) at key 58 with hash 42

We'll have to discuss the name of the exception and the error message :)

> Caveats:
> * no overflow handling (what happens after 2**32 modifications to a
> long-lived dict on a 32-bit build?) - though that's fixable.

How do you suggest to fix it?

> * no idea what the scaling factor for the threshold should be (there may
> also be a deep mathematical objection here, based on how big-O notation
> is defined in terms of an arbitrary scaling factor and limit)

I'd make the threshold factor a constant, e.g. 64 or 128 (it should not
be too small, to avoid false positives).
We're interested in the actual slowdown factor, which a constant factor
models adequately. It's the slowdown factor which makes a DoS attack
using this technique efficient. Whether or not dict construction truly
degenerates into an O(n**2) operation is less relevant.

There needs to be a way to disable it: an environment variable would be
the minimum IMO.
Also, in 3.3 there should probably be a sys function to enable or
disable it at runtime. Not sure it should be backported since it's a new
API.
msg151744 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 21:07
Well, the old attempt was hardly robust :)

Can anyone see any vulnerabilities in this approach?

Yeah; I was mostly trying to add raw data (to help me debug the
implementation).

I wonder if the dict statistics should be exposed with extra attributes
or a method on the dict; e.g. a __stats__ attribute, something like
this:

LargeDictStats(keys=58, mask=127, insertions=53, iterations=1697)

SmallDictStats(keys=3, mask=7)

or somesuch. Though that's a detail, I think.

> > Caveats:
> > * no overflow handling (what happens after 2**32 modifications to a
> > long-lived dict on a 32-bit build?) - though that's fixable.
> 
> How do you suggest to fix it?

If the dict is heading towards overflow of these counters, it's either
long-lived, or *huge*.

Possible approaches:
(a) use 64-bit counters rather than 32-bit, though that's simply
delaying the inevitable
(b) when one of the counters gets large, divide both of them by a
constant (e.g. 2).  We're interested in their ratio, so dividing both by
a constant preserves this.
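Approach (b) is only a few lines: whenever either counter nears the 32-bit cap, halve both (the cap and helper name here are illustrative):

```python
CAP = 2**32 - 1  # range of a 32-bit counter

def account_probes(probes, insertions, new_probes):
    """Add new_probes and one insertion; halve both counters when either
    nears CAP, which preserves their ratio (the quantity being tested)."""
    probes += new_probes
    insertions += 1
    if probes >= CAP or insertions >= CAP:
        probes >>= 1
        insertions >>= 1
    return probes, insertions
```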

By "a constant" do you mean from the perspective of big-O notation, or
do you mean that it should be hardcoded (I was wondering if it should be
a sys variable/environment variable etc?).

> We're interested in the actual slowdown factor, which a constant factor
> models adequately. It's the slowdown factor which makes a DOS attack
> using this technique efficient. Whether or not dict construction truely
> degenerates into a O(n**2) operation is less relevant.

OK.

> There needs to be a way to disable it: an environment variable would be
> the minimum IMO.

e.g. set it to 0 to enable it, set it to nonzero to set the scale
factor.
Any idea what to call it? 

PYTHONALGORITHMICCOMPLEXITYTHRESHOLD=0 would be quite a mouthful.

OK

BTW, presumably if we do it, we should do it for sets as well?
msg151745 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 22:20
> I wonder if the dict statistics should be exposed with extra attributes
> or a method on the dict; e.g. a __stats__ attribute, something like
> this:
> 
> LargeDictStats(keys=58, mask=127, insertions=53, iterations=1697)
> 
> SmallDictStats(keys=3, mask=7)

Sounds a bit overkill, and it shouldn't be a public API (which
__methods__ are). Even a private API on dicts would quickly become
visible, since dicts are so pervasive.

> > > Caveats:
> > > * no overflow handling (what happens after 2**32 modifications to a
> > > long-lived dict on a 32-bit build?) - though that's fixable.
> > 
> > How do you suggest to fix it?
> 
> If the dict is heading towards overflow of these counters, it's either
> long-lived, or *huge*.
> 
> Possible approaches:
> (a) use 64-bit counters rather than 32-bit, though that's simply
> delaying the inevitable

Well, even assuming one billion lookup probes per second on a single
dictionary, the inevitable will happen in 584 years with a 64-bit
counter (but only 4 seconds with a 32-bit counter).

A real issue, though, may be the cost of 64-bit arithmetic on 32-bit
CPUs.
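The arithmetic behind those figures, assuming 10**9 probes per second against a single dictionary:

```python
rate = 10**9                       # probes per second (assumed)
seconds_per_year = 365.25 * 24 * 3600

years_to_overflow_64 = 2**64 / rate / seconds_per_year   # ~584 years
seconds_to_overflow_32 = 2**32 / rate                    # ~4.3 seconds
```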

> (b) when one of the counters gets large, divide both of them by a
> constant (e.g. 2).  We're interested in their ratio, so dividing both by
> a constant preserves this.

Sounds good, although we may want to pull this outside of the critical
loop.

> By "a constant" do you mean from the perspective of big-O notation, or
> do you mean that it should be hardcoded (I was wondering if it should be
> a sys variable/environment variable etc?).

Hardcoded, as in your patch.

> > There needs to be a way to disable it: an environment variable would be
> > the minimum IMO.
> 
> e.g. set it to 0 to enable it, set it to nonzero to set the scale
> factor.

0 to enable it sounds misleading. I'd say:
- 0 to disable it
- 1 to enable it and use the default scaling factor
- >= 2 to enable it and set the scaling factor

> Any idea what to call it? 

PYTHONDICTPROTECTION?
Most people should either enable or disable it, not change the scaling
factor.

> BTW, presumably if we do it, we should do it for sets as well?

Yeah, and use the same env var / sys function.
msg151747 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 22:41
On Sat, 2012-01-21 at 22:20 +0000, Antoine Pitrou wrote:

> Sounds a bit overkill, and it shouldn't be a public API (which
> __methods__ are). Even a private API on dicts would quickly become
> visible, since dicts are so pervasive.

Fair enough.

> > > > Caveats:
> > > > * no overflow handling (what happens after 2**32 modifications to a
> > > > long-lived dict on a 32-bit build?) - though that's fixable.
> > > 
> > > How do you suggest to fix it?
> > 
> > If the dict is heading towards overflow of these counters, it's either
> > long-lived, or *huge*.
> > 
> > Possible approaches:
> > (a) use 64-bit counters rather than 32-bit, though that's simply
> > delaying the inevitable
> 
> Well, even assuming one billion lookup probes per second on a single
> dictionary, the inevitable will happen in 584 years with a 64-bit
> counter (but only 4 seconds with a 32-bit counter).
> 
> A real issue, though, may be the cost of 64-bit arithmetic on 32-bit
> CPUs.
> 
> > (b) when one of the counters gets large, divide both of them by a
> > constant (e.g. 2).  We're interested in their ratio, so dividing both by
> > a constant preserves this.
> 
> Sounds good, although we may want to pull this outside of the critical
> loop.

OK; I'll look at implementing (b).

Oops, yeah, that was a typo; I meant 0 to disable.

> - 0 to disable it
> - 1 to enable it and use the default scaling factor
> - >= 2 to enable it and set the scaling factor

You said above that it should be hardcoded; if so, how can it be changed
at run-time from an environment variable?  Or am I misunderstanding.

Works for me.

> > BTW, presumably if we do it, we should do it for sets as well?
> 
> Yeah, and use the same env var / sys function.

Despite the "DICT" in the title?  OK.

Thanks for the feedback.
msg151748 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 22:45
> You said above that it should be hardcoded; if so, how can it be changed
> at run-time from an environment variable?  Or am I misunderstanding.

You're right, I used the wrong word. I meant it should be a constant
independently of the dict size. But, indeed, not hard-coded in the
source.

> > > BTW, presumably if we do it, we should do it for sets as well?
> > 
> > Yeah, and use the same env var / sys function.
> 
> Despite the "DICT" in the title?  OK.

Well, dict is the most likely target for these attacks.
msg151753 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-21 23:42
On Sat, Jan 21, 2012 at 2:45 PM, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> You said above that it should be hardcoded; if so, how can it be changed
>> at run-time from an environment variable?  Or am I misunderstanding.
>
> You're right, I used the wrong word. I meant it should be a constant
> independently of the dict size. But, indeed, not hard-coded in the
> source.
>
>> > > BTW, presumably if we do it, we should do it for sets as well?
>> >
>> > Yeah, and use the same env var / sys function.
>>
>> Despite the "DICT" in the title?  OK.
>
> Well, dict is the most likely target for these attacks.
>

While true, I wouldn't make that claim, as there will be applications
using a set in a vulnerable manner. I'd prefer to see any such
environment variable name used to configure this behavior not mention
DICT or SET but just say HASHTABLE.  That is a much better bikeshed
color. ;)

I'm still in the hash seed randomization camp but I'm finding it
interesting all of the creative ways others are trying to "solve" this
problem in a way that could be enabled by default in stable versions
regardless. :)

-gps
msg151754 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-21 23:47
On Sat, Jan 21, 2012 at 5:42 PM, Gregory P. Smith <report@bugs.python.org> wrote:

>
> Gregory P. Smith <greg@krypto.org> added the comment:
>
> On Sat, Jan 21, 2012 at 2:45 PM, Antoine Pitrou <report@bugs.python.org>
> wrote:
> >
> > Antoine Pitrou <pitrou@free.fr> added the comment:
> >
> >> You said above that it should be hardcoded; if so, how can it be changed
> >> at run-time from an environment variable?  Or am I misunderstanding.
> >
> > You're right, I used the wrong word. I meant it should be a constant
> > independently of the dict size. But, indeed, not hard-coded in the
> > source.
> >
> >> > > BTW, presumably if we do it, we should do it for sets as well?
> >> >
> >> > Yeah, and use the same env var / sys function.
> >>
> >> Despite the "DICT" in the title?  OK.
> >
> > Well, dict is the most likely target for these attacks.
> >
>
> While true I wouldn't make that claim as there will be applications
> using a set in a vulnerable manner. I'd prefer to see any such
> environment variable name used to configure this behavior not mention
> DICT or SET but just say HASHTABLE.  That is a much better bikeshed
> color. ;)
>
> I'm still in the hash seed randomization camp but I'm finding it
> interesting all of the creative ways others are trying to "solve" this
> problem in a way that could be enabled by default in stable versions
> regardless. :)
>
> -gps

I'm a little slow, so bear with me, but David, does this counting scheme in
any way address the issue of:

I'm able to put N pieces of data into the database on successive requests,
but then *rendering* that data puts it in a dictionary, which renders that
page unviewable by anyone.
msg151756 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-22 02:13
5 more characters:
PYTHONHASHTABLEPROTECTION
or
PYHASHTABLEPROTECTION
maybe?

I'm in *both* camps: I like hash seed randomization fwiw.  I'm nervous
about enabling either of the approaches by default, but I can see myself
backporting both approaches into RHEL's ancient Python versions,
compiled in, disabled by default, but available at runtime via env vars
(assuming that no major flaws are discovered in my patch e.g.
performance).

I'm sorry if I'm muddying the waters by working on this approach.

Is the hash randomization approach ready to go, or is more work needed?
If the latter, is there a clear TODO list?
(for backporting to 2.*, presumably we'd want PyStringObject to be
randomized; I think this means that PyBytesObject needs to be randomized
also in 3.*; don't we need hash(b'foo') == hash('foo') ?).  Does the
patch need to also randomize the hashes of the numeric types? (I think
not; that may break too much 3rd-party code (NumPy?)).

[If we're bikeshedding,  I prefer the term "salt" to "seed" in the hash
randomization approach: there's a per-process "hash salt", which is
either randomly generated, or comes from the environment, set to 0 to
disable]
msg151758 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-22 03:43
On Sat, Jan 21, 2012 at 3:47 PM, Alex Gaynor <report@bugs.python.org> wrote:
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

This and the problems Frank mentions are my primary concerns about the
counting approach. Without the original suggestion of modifying the
hash and continuing without an exception (which has its own set of
problems), the "valid data python can't process" problem is a pretty
big one. Allowing attackers to poison interactions for other users is
unacceptable.

The other thing I haven't seen mentioned yet is that while it is true
that most web applications do have robust error handling to produce
proper 500s, an unexpected error will usually result in restarting the
server process - something that can carry significant weight by
itself. I would consider it a serious problem if every attack request
required a complete application restart, a la original cgi.

I'm strongly in favor of randomization. While there are many broken
applications in the wild that depend on dictionary ordering, if we
ship with this feature disabled by default for security and bugfix
branches, and enable it for 3.3, users can opt-in to protection as
they need it and as they fix their applications. Users who have broken
applications can still safely apply the security fix (without even
reading the release notes) because it won't change the default
behavior. Distro managers can make an appropriate choice for their
user base. Most importantly, it negates the entire "compute once,
attack everywhere" class of collision problems, even if we haven't
explicitly discovered them.
msg151794 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-23 00:22
@dmalcolm: How did you chose Py_MAX_AVERAGE_PROBES_PER_INSERT=32? Did you try your patch on applications like the test suite of Django or Twisted?
msg151796 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-23 03:48
On Sat, 2012-01-21 at 23:47 +0000, Alex Gaynor wrote:
> Alex Gaynor <alex.gaynor@gmail.com> added the comment:
> 
> On Sat, Jan 21, 2012 at 5:42 PM, Gregory P. Smith <report@bugs.python.org>wrote:
> 
> >
> > Gregory P. Smith <greg@krypto.org> added the comment:
> >
> > On Sat, Jan 21, 2012 at 2:45 PM, Antoine Pitrou <report@bugs.python.org>
> > wrote:
> > >
> > > Antoine Pitrou <pitrou@free.fr> added the comment:
> > >
> > >> You said above that it should be hardcoded; if so, how can it be changed
> > >> at run-time from an environment variable?  Or am I misunderstanding.
> > >
> > > You're right, I used the wrong word. I meant it should be a constant
> > > independently of the dict size. But, indeed, not hard-coded in the
> > > source.
> > >
> > >> > > BTW, presumably if we do it, we should do it for sets as well?
> > >> >
> > >> > Yeah, and use the same env var / sys function.
> > >>
> > >> Despite the "DICT" in the title?  OK.
> > >
> > > Well, dict is the most likely target for these attacks.
> > >
> >
> > While true I wouldn't make that claim as there will be applications
> > using a set in a vulnerable manner. I'd prefer to see any such
> > environment variable name used to configure this behavior not mention
> > DICT or SET but just say HASHTABLE.  That is a much better bikeshed
> > color. ;)
> >
> > I'm still in the hash seed randomization camp but I'm finding it
> > interesting all of the creative ways others are trying to "solve" this
> > problem in a way that could be enabled by default in stable versions
> > regardless. :)
> >
> > -gps
> >
> 
> I'm a little slow, so bear with me, but David, does this counting scheme in
> any way address the issue of:
> 
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

It doesn't address this issue - though if the page is taking many hours
to render, is that in practice less unviewable than everyone getting an
immediate exception with (perhaps) a useful error message?

Unfortunately, given the current scale factor, my patch may make it
worse: in my tests, this approach rejected malicious data much more
quickly than the old collision-counting one, which I thought was a good
thing - but then I realized that this means that an attacker adopting
the strategy you describe would have to do less work to trigger the
exception than to trigger the slowdown.  So I'm not convinced my
approach flies, and I'm leaning towards working on the hash
randomization patch rather than pursuing this.

I need sleep though, so I'm not sure the above is coherent.
Dave
msg151798 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-23 04:04
I arbitrarily started with 50, and then decided a power of two would be
quicker when multiplying.  There wasn't any rigorous analysis behind the
choice of factor.

Though, as noted in msg151796, I've gone off this idea, since I think
the "protection" creates additional avenues of attack.

I think getting some kind of hash randomization patch into the hands of
users ASAP is the way forward here (even if disabled by default).

If we're going to support shipping backported versions of the hash
randomization patch with the randomization disabled, did we decide on a
way of enabling it?  If not, then I propose that those who want to ship
with it disabled by default standardize on (say):

  PYTHONHASHRANDOMIZATION

as an environment variable: if set to nonzero, it enables hash
randomization (reading the random seed as per the 3.3. patch, and
respecting the PYTHONHASHSEED variable if that's also set).  If set to
zero or not present, hash randomization is disabled.

Does that sound sane?

(we can't use PYTHONHASHSEED for this, since if a number is given, that
means "use this number", right?)
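As a sketch, the proposed semantics could be modeled like this (pure-Python; the real logic would live in C at interpreter startup, and the helper name and return convention are illustrative only):

```python
import os

def read_hash_randomization_config(environ=os.environ):
    """Model of the proposed env var semantics (illustrative only).

    Returns None when randomization is disabled, otherwise a seed:
    0 means "pick a random seed at startup", nonzero means "use this
    exact seed" (for reproducible runs)."""
    flag = environ.get("PYTHONHASHRANDOMIZATION", "")
    if flag in ("", "0"):
        return None                  # disabled: deterministic hashes
    seed = environ.get("PYTHONHASHSEED")
    if seed is not None:
        return int(seed)             # explicit seed: reproducible
    return 0                         # enabled, no seed: randomize
```

So e.g. PYTHONHASHRANDOMIZATION=1 alone randomizes, and adding PYTHONHASHSEED=42 makes the randomization reproducible across runs.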

FWIW, I favor hash randomization in 2.* for PyStringObject,
PyUnicodeObject, PyBufferObject, and the 3 datetime classes in
Modules/_datetimemodule.c (see the implementation of generic_hash in
that file), but to not do it for the numeric types.

Sorry; I only tried it on the python test suite (and on a set of
reproducers for the DoS that I've written for RH's in-house test suite).
msg151812 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 13:07
Dave Malcolm wrote:
> 
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
>> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>>
>> Demo patch implementing the collision limit idea for Python 2.7.
>>
>> ----------
>> Added file: http://bugs.python.org/file24151/hash-attack.patch
>>
> 
> Marc: is this the latest version of your patch?

Yes. As mentioned in the above message, it's just a demo of how
the collision limit idea can be implemented.

> Whether or not we go with collision counting and/or adding a random salt
> to hashes and/or something else, I've had a go at updating your patch
> 
> Although debate on python-dev seems to have turned against the
> collision-counting idea, based on flaws reported by Frank Sievertsen
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html
> it seemed to me to be worth at least adding some test cases to flesh out
> the approach.  Note that the test cases deliberately avoid containing
> "hostile" data.

Martin's example is really just a red herring: it doesn't matter
where the hostile data originates or how it gets into the application.
There are many ways an attacker can get the O(n^2) worst case
timing triggered.

Frank's example is an attack on the second possible way to
trigger the O(n^2) behavior. See msg150724 further above where I
listed the two possibilities:

"""
An attack can be based on trying to find many objects with the same
hash value, or trying to find many objects that, as they get inserted
into a dictionary, very often cause collisions due to the collision
resolution algorithm not finding a free slot.
"""

My demo patch only addresses the first variant. In order to cover
the second variant as well, you'd have to count and limit the
number of iterations in the perturb for-loop of the lookdict()
functions where the hash value of the slot does not match the
key's hash value.

Note that the second variant is both a lot less likely to trigger
(due to the dict getting resized on a regular basis) and the
code involved a lot faster than the code for the first
variant (which requires a costly object comparison), so the
limit for the second variant would have to be somewhat higher
than for the first.

BTW: The collision counting patch chunk for the string dicts in my
demo patch is wrong. I've attached a corrected version. In the
original patch it was counting both collision variants with the
same counter and limit.
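For illustration, both variants are easy to provoke with integers on CPython (where int hashing is predictable); this is the idea behind the attached test script, sketched here for Python 3 using sys.hash_info rather than hard-coded 64-bit constants:

```python
import sys

# Variant 1: distinct objects with *equal* hash values.
# CPython hashes positive ints modulo sys.hash_info.modulus
# (2**61 - 1 on 64-bit platforms), so these all hash to 1.
m = sys.hash_info.modulus
equal_hash_keys = [1 + i * m for i in range(5)]
assert len({hash(k) for k in equal_hash_keys}) == 1

# Variant 2: distinct hash values that still map to the *same slot*.
# A small dict has 8 slots and indexes by hash(key) & 7, so these
# keys collide on the first probe even though their hashes differ.
same_slot_keys = [8 * i for i in range(5)]    # hashes 0, 8, 16, 24, 32
assert len({hash(k) for k in same_slot_keys}) == 5
assert len({hash(k) & 7 for k in same_slot_keys}) == 1
```

Variant 1 survives any table resize (equal hashes always collide); variant 2 depends on the current table size, which is why it triggers less often.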
msg151813 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 13:38
Alex Gaynor wrote:
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

I think you're asking a bit much here :-) A broken app is a broken
app, no matter how nicely Python tries to work around it. If an
app puts too much trust into user data, it will be vulnerable
one way or another and regardless of how the user data enters
the app.

These are the collision counting possibilities we've discussed
so far:

With a collision-counting exception you'd get a clear notice that
something in your data and your application is wrong and needs
fixing. The rest of your web app will continue to work fine and
you won't run into a DoS problem taking down all of your web
server.

With the proposed enhancement of collision counting + universal hash
function for Python 3.3, you'd get a warning printed to the logs, the
dict implementation would self-heal and your page is viewable nonetheless.
The admin would then see the log entry and get a chance to fix the
problem.

Note: Even if Python works around the problem successfully, there's no
guarantee that the data doesn't end up being processed by some other
tool in the chain with similar problems. All this is a work-around
for an application bug, nothing more. Silencing the problem
by e.g. using randomization in the string hash algorithm
doesn't really help in identifying the bug.

Overall, I don't think we should make Python's hash function
non-deterministic. Even with the universal hash function idea,
the dict implementation should use a predefined way of determining
the next hash parameter to use, so that running the application
twice against attack data will still result in the same data
output.
msg151814 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-23 13:40
> Frank's example is an attack on the second possible way to
> trigger the O(n^2) behavior. See msg150724 further above where I
> listed the two possibilities:
> 
> """
> An attack can be based on trying to find many objects with the same
> hash value, or trying to find many objects that, as they get inserted
> into a dictionary, very often cause collisions due to the collision
> resolution algorithm not finding a free slot.
> """

No, Frank's examples attack both possible ways.
msg151815 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-23 13:56
> With a collision-counting exception you'd get a clear notice that
> something in your data and your application is wrong and needs
> fixing. The rest of your web app will continue to work fine

Except when it doesn't, because you've also broken batch processing
functions and the like.

> Note: Even if Python works around the problem successfully, there's no
> guarantee that the data doesn't end up being processed by some other
> tool in the chain with similar problems.

Non-Python tools don't use Python's hash functions; they are therefore
not vulnerable to the same data.
msg151825 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 16:43
Here's a version of the collision counting patch that takes both hash
and slot collisions into account.

I've also added a test script which demonstrates both types of
collisions using integer objects (since it's trivial to calculate
their hashes).

To see the collision counting, enable the DEBUG_DICT_COLLISIONS
macro variable.
msg151826 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 16:45
> I've also added a test script which demonstrates both types of
> collisions using integer objects (since it's trivial to calculate
> their hashes).

I forgot to mention: the test script is for 64-bit platforms. It's
easy to adapt it to 32-bit if needed.
msg151847 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-23 21:31
I'm attaching an attempt at backporting haypo's random-8.patch to 2.7

Changes relative to random-8.patch:

   * The randomization is off by default, and must be enabled by setting
     a new environment variable PYTHONHASHRANDOMIZATION to a non-empty string.
     (if so then, PYTHONHASHSEED also still works, if provided, in the same
     way as in haypo's patch)

   * All of the various "Py_hash_t" become "long" again (Py_hash_t was
     added in 3.2: issue9778)

   * I expanded the randomization from just PyUnicodeObject to also cover
     these types:

     * PyStringObject

     * PyBufferObject

     The randomization does not cover numeric types: if we change the hash of
     int so that hash(i) no longer equals i, we also have to change it
     consistently for long, float, complex, decimal.Decimal and
     fractions.Fraction; however, there are 3rd-party numeric types that
     have their own __hash__ implementation that mimics int.__hash__ (see
     e.g. numpy)

     As noted in http://bugs.python.org/issue13703#msg151063 and
     http://bugs.python.org/issue13703#msg151064, it's not possible
     to directly create a dict with integer keys via JSON or XML-RPC.

     This seems like a tradeoff between the risk of attack via other means
     vs breakage induced by not having hash() == hash() for the various
     equivalent numerical representations in pre-existing code.

   * To support my expanded usage of the random secret, I moved:
       
       PyAPI_DATA(_Py_unicode_hash_secret_t) _Py_unicode_hash_secret

     from unicodeobject.h to object.h and renamed it to:

       PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

     This also exposes it for usage by C extension modules, just in case
     they need it (Murphy's Law suggests we will need it if we don't expose
     it).   This is an extension of the API, but warranted, I feel.  My
     plan for downstream RHEL is to add this explicitly to the RPM metadata
     as a "Provides" of the RPM providing libpython.so so that if something
     needs to use it, it can express a "Requires" on it; I assume that
     something similar is possible with .deb)

   * generalized test_unicode.HashTest to support the new env var and the
     additional types.  In my version, get_hash takes a _repr string rather
     than an object, so that I can test it with a buffer().  Arguably the
     tests should thus be moved from test_unicode to somewhere else, but this
     location keeps things consistent with haypo's patch.

     haypo: in random-8.patch, within test_unicode.HashTest.test_null_hash,
     "hash_empty" seems to be misnamed

   * dropped various selftest fixes where the corresponding selftests don't
     exist in 2.7

   * adds a description of the new environment variables to the manpage;
     arguably this should be done for the patch for the default branch also

Caveats:

   * only tested on Linux (Fedora 15 x86_64); not tested on Windows.  Tested
     via "make test" both with and without PYTHONHASHRANDOMIZATION=1

   * not yet benchmarked

 Doc/using/cmdline.rst                      |   28 ++
 Include/object.h                           |    7 
 Include/pythonrun.h                        |    2 
 Lib/lib-tk/test/test_ttk/test_functions.py |    2 
 Lib/os.py                                  |   19 -
 Lib/test/mapping_tests.py                  |    2 
 Lib/test/regrtest.py                       |    5 
 Lib/test/test_gdb.py                       |   15 +
 Lib/test/test_inspect.py                   |    1 
 Lib/test/test_os.py                        |   47 +++-
 Lib/test/test_unicode.py                   |   55 +++++
 Makefile.pre.in                            |    1 
 Misc/python.man                            |   22 ++
 Modules/posixmodule.c                      |  126 ++----------
 Objects/bufferobject.c                     |    8 
 Objects/object.c                           |    2 
 Objects/stringobject.c                     |    8 
 Objects/unicodeobject.c                    |   17 +
 PCbuild/pythoncore.vcproj                  |    4 
 Python/pythonrun.c                         |    2 
 b/Python/random.c                          |  284 +++++++++++++++++++++++++++++
 21 files changed, 510 insertions(+), 147 deletions(-)
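For reference, the string-salting scheme used in random-8.patch (and carried over into this backport) can be modeled in pure Python: the classic multiply-xor string hash, with the random secret's prefix and suffix mixed in at the ends. This is an approximate model (64-bit C arithmetic simulated with a mask), not the patch itself:

```python
MASK = 2**64 - 1   # model unsigned 64-bit C arithmetic

def salted_string_hash(data, prefix, suffix):
    """Approximate model of the randomized string hash: the classic
    multiply-xor loop over the bytes, salted with a random prefix and
    suffix from the per-process secret."""
    if not data:
        return 0   # note: the empty string hashes to 0 regardless of secret
    x = (prefix ^ (data[0] << 7)) & MASK
    for byte in data:
        x = ((1000003 * x) ^ byte) & MASK
    x ^= len(data)
    x ^= suffix
    return x & MASK
```

Within one process (one secret) the hash stays deterministic; across processes with different secrets the values differ, which is what defeats precomputed collision sets. Since the multiplier is odd (hence invertible mod 2**64), different prefixes are guaranteed to give different hashes for the same input.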
msg151850 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 21:39
> To see the collision counting, enable the DEBUG_DICT_COLLISIONS
> macro variable.

Running (part of (*)) the test suite with debugging enabled on a 64-bit
machine shows that slot collisions are much more frequent than
hash collisions, which only account for less than 0.01% of all
collisions.

It also shows that slot collisions in the low 1-10 range are
most frequent, with very few instances of a dict lookup
reaching 20 slot collisions (less than 0.0002% of all
collisions).

The great number of cases with 1 or 2 slot collisions surprised
me. It seems there is still room for improvement in
the perturbation formula.

Due to the large number of 1 or 2 slot collisions, the patch
is going to cause a minor hit to dict lookup performance.
It may make sense to unroll the slot search loop and only
start counting after the third round of misses.

(*) I stopped the run after several hours run-time, producing
some 148GB log data.
msg151867 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-24 00:14
> I think you're asking a bit much here :-) A broken app is a broken
> app, no matter how nice Python tries to work around it. If an
> app puts too much trust into user data, it will be vulnerable
> one way or another and regardless of how the user data enters
> the app.

I notice your patch doesn't include fixes for the entire standard
library to work around this problem. Were you planning on writing
those, or leaving that for others?

As a developer, I honestly don't know how I can state with certainty
that input data is clean or not, until I actually see the error you
propose. I can't check validity before the fact, the way I can check
for invalid unicode before storing it in my database. Once I see the
error (probably only after my application is attacked, certainly not
during development), it's too late. My application can't know which
particular data triggered the error, so it can't delete it. I'm
reduced to trial and error to remove the offending data, or to writing
code that never stores more than 1000 things in a dictionary. And I
have to accept that the standard library may not work on any
particular data I want to process, and must write code that detects
the error state and somehow magically removes the offending data.

The alternative, randomization, simply means that my dictionary
ordering is not stable, something that is already the case.

While I appreciate that the counting approach feels cleaner;
randomization is the only solution that makes practical sense.
msg151869 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-24 00:42
On Mon, Jan 23, 2012 at 4:39 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> Running (part of (*)) the test suite with debugging enabled on a 64-bit
> machine shows that slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.

Even 1 in 10,000 seems pretty high, though I suppose it is a result of
non-random input.  (For a smalldict with 8 == 2^3 slots, on a 64-bit
machine, true hash collisions "should" only account for 1 in 2^61 slot
collisions.)

> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).

Thus the argument that collisions > N implies (possibly malicious)
data that really needs a different hash -- and that this dict instance
in particular should take the hit to use an alternative hash.  (Do
note that this alternative hash could be stored in the hash member of
the PyDictEntry; if anything actually *equal* to the key comes along,
it will have gone through just as many collisions, and therefore also
have been rehashed.)

> The great number of cases with 1 or 2 slot collisions surprised
> me. It seems that there's potential for improvement of
> the perturbation formula left.

In retrospect, this makes sense.

    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        i = (i << 2) + i + perturb + 1;

If two objects collided then they have the same last few bits
in their hashes -- which means they also have the same last few bits
in their initial perturb.  And since the first probe is to slot 6i+1
(which is always odd, mod the power-of-two table size), only half the
slots are even considered until the second probe.

Also note that this explains why Randomization could make the Django
tests fail, even though 64-bit users haven't complained.  The initial
hash(&mask) is the same, and the first probe is the same, and (for a
small enough dict) so are the next several.  In a dict with 2^12
slots, the first 6 tries will be the same ... so I doubt the test
cases have sufficiently large amounts of sufficiently unlucky data to
notice very often -- unless the hash itself is changed, as in the
patch.
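The recurrence quoted above is easy to simulate, which makes the funneling visible directly (a pure-Python model of lookdict's probing, not the C code):

```python
PERTURB_SHIFT = 5

def probe_slots(h, mask, n=6):
    """First n slot indices probed for a key with hash h in a table of
    mask+1 slots, following the quoted recurrence from lookdict()."""
    slots = []
    i = h & mask
    perturb = h
    for _ in range(n):
        slots.append(i & mask)
        i = (i << 2) + i + perturb + 1   # i.e. i = 5*i + perturb + 1
        perturb >>= PERTURB_SHIFT
    return slots

# Hashes 1 and 9 agree in their low bits; in an 8-slot table they probe
# exactly the same slots, illustrating how slot collisions persist.
assert probe_slots(1, 7) == probe_slots(9, 7)

# The second probe lands on slot 6i+1 (mod table size), which is always
# odd, so only half the slots are candidates at that step.
assert all(probe_slots(h, 7)[1] % 2 == 1 for h in range(32))
```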
msg151870 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-24 00:44
On Mon, Jan 23, 2012 at 1:32 PM, Dave Malcolm <report@bugs.python.org> wrote:
>
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
>
> I'm attaching an attempt at backporting haypo's random-8.patch to 2.7
>
> Changes relative to random-8.patch:
>
>   * The randomization is off by default, and must be enabled by setting
>     a new environment variable PYTHONHASHRANDOMIZATION to a non-empty string.
>     (if so then, PYTHONHASHSEED also still works, if provided, in the same
>     way as in haypo's patch)
>
>   * All of the various "Py_hash_t" become "long" again (Py_hash_t was
>     added in 3.2: issue9778)
>
>   * I expanded the randomization from just PyUnicodeObject to also cover
>     these types:
>
>     * PyStringObject
>
>     * PyBufferObject
>
>     The randomization does not cover numeric types: if we change the hash of
>     int so that hash(i) no longer equals i, we also have to change it
>     consistently for long, float, complex, decimal.Decimal and
>     fractions.Fraction; however, there are 3rd-party numeric types that
>     have their own __hash__ implementation that mimics int.__hash__ (see
>     e.g. numpy)
>
>     As noted in http://bugs.python.org/issue13703#msg151063 and
>     http://bugs.python.org/issue13703#msg151064, it's not possible
>     to directly create a dict with integer keys via JSON or XML-RPC.
>
>     This seems like a tradeoff between the risk of attack via other means
>     vs breakage induced by not having hash() == hash() for the various
>     equivalent numerical representations in pre-existing code.

Exactly.  I would NOT worry about hash repeatability for integers and
complex data structures.  It is not at the core of the common problem
(maybe a couple application specific problems but not a general "all
python web apps" severity problem).

Doing it for base byte string and unicode string like objects is
sufficient.  Good catch on doing it for buffer objects, I'd forgotten
about those. ;)  A big flaw with haypo's patch is that it only
considers unicode instead of all byte-string-ish stuff.  (the code in
issue13704 does that better).

>
>   * To support my expanded usage of the random secret, I moved:
>
>       PyAPI_DATA(_Py_unicode_hash_secret_t) _Py_unicode_hash_secret
>
>     from unicodeobject.h to object.h and renamed it to:
>
>       PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;
>
>     This also exposes it for usage by C extension modules, just in case
> >     they need it (Murphy's Law suggests we will need it if we don't expose
>     it).   This is an extension of the API, but warranted, I feel.  My
>     plan for downstream RHEL is to add this explicitly to the RPM metadata
>     as a "Provides" of the RPM providing libpython.so so that if something
>     needs to use it, it can express a "Requires" on it; I assume that
>     something similar is possible with .deb)

Exposing this is good.  There is a hash table implementation within
modules/expat/xmlparse.c that should probably use it as well.

>   * generalized test_unicode.HashTest to support the new env var and the
>     additional types.  In my version, get_hash takes a _repr string rather
>     than an object, so that I can test it with a buffer().  Arguably the
>     tests should thus be moved from test_unicode to somewhere else, but this
>     location keeps things consistent with haypo's patch.
>
>     haypo: in random-8.patch, within test_unicode.HashTest.test_null_hash,
>     "hash_empty" seems to be misnamed

Lets move this to a better location in all patches.  At this point
haypo's patch is not done yet so relevant bits of what you are doing
here is likely to be fed back into the eventual 3.3 tip patch.

-gps
msg151939 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 11:05
I'm attaching a patch which implements a hybrid approach:
  hybrid-approach-dmalcolm-2012-01-25-001.patch

This is a blend of various approaches from the discussion, taking aspects of both hash randomization *and* collision-counting.

It incorporates code from
  amortized-probe-counting-dmalcolm-2012-01-21-003.patch
  backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
  random-8.patch
along with ideas from:
  http://mail.python.org/pipermail/python-dev/2012-January/115812.html

The patch is against the default branch (although my primary goal here is eventual backporting).

As per haypo's random-8.patch, a randomization seed is read at startup.

By default, the existing hash() values are preserved, and no randomization is performed until a dict comes under attack.  This preserves existing behaviors (such as dict ordering) under non-attack conditions.

For large dictionaries, it reuses the ma_smalltable area to track the amortized cost of all modifications to this dictionary.

When the cost exceeds a set threshold, we convert the dictionary's ma_lookup function from lookdict/lookdict_unicode to a "paranoid" variant.  These variants ignore the hash passed in, and instead uses a new function:
   PyObject_RandomizedHash(obj)
to give a second hash value, which is a fixed value for a given object within the process, but not predictable to an attacker for the most high-risk types (PyUnicodeObject and PyBytesObject).

This patch is intended as a base for backporting, and takes it as given that we can't expand PyTypeObject or hide something in one of the Py*Methods tables; iirc we've run out of tp_flags in 2.*, hence we're forced to implement PyObject_RandomizedHash via direct ob_type comparison, for the most high-risk types.  

As noted in http://bugs.python.org/issue13703#msg151870:

> I would NOT worry about hash repeatability for integers and
> complex data structures.  It is not at the core of the common problem
> (maybe a couple application specific problems but not a general "all
> python web apps" severity problem).

> Doing it for base byte string and unicode string like objects is
> sufficient.

[We can of course implement hash randomization by default in 3.3, but I care more about getting a fix into the released branches ASAP]

Upon transition of a dict to paranoid mode, the hash values become unpredictable to an attacker, and all PyDictEntries are rebuilt based on the new hash values.

Handling the awkward case within custom ma_lookup functions allows us to move most of the patch from out of the fast path, and lookdict/lookdict_unicode only need minimal changes (stat gathering for the above cost analysis tracking).

Once a dict has transitioned to paranoid mode, it isn't using PyObject_Hash anymore, and thus isn't using objects' cached hash values; it performs a more expensive calculation, but I believe this calculation is essentially constant-time.

This preserves hash() and dict order for the cases where you're not under attack, and gracefully handles the attack without having to raise an exception: it doesn't introduce any new exception types.

It preserves ABI, assuming no-one else is reusing ma_smalltable.

It is suitable for backporting to 3.2, 2.7, and earlier (I'm investigating fixing this going all the way back to Python 2.2)

Under the old implementation, there were 4 kinds of PyDictObject, given these two booleans:
  * "small vs large", i.e. ma_table == ma_smalltable vs ma_table != ma_smalltable
  * "all keys are str" vs arbitrary keys, i.e. ma_lookup == lookdict_unicode vs lookdict

Under this implementation, this doubles to 8 kinds, adding the boolean:
  * normal hash vs randomized hash (normal vs "paranoid").

This is expressed via the ma_lookup callback, adding two new variants, lookdict_unicode_paranoid and lookdict_paranoid.

Note that if a paranoid dict goes small again (ma_table == ma_smalltable), it stays paranoid.  This is for simplicity: it avoids having to rebuild all of the non-randomized me_hash values again (which could fail).

Naturally the patch adds selftests.  I had to add some diagnostic methods to support them; dict gains _stats() and _make_paranoid() methods, and sys gains a _getrandomizedhash() method.  These could be hidden more thoroughly if need be (see DICT_PROTECTION_TRACKING in dictobject.c).  Amongst other things, the selftests measure wallclock time taken for various dict operations (and so might introduce failures on a heavily-loaded machine, I guess).

Hopefully this approach is a viable way forward.
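The control flow above can be modeled as a pure-Python toy (not the C patch: the class, the threshold handling, and the XOR salting are illustrative only — the real patch switches to a genuinely different hash via PyObject_RandomizedHash rather than XORing a salt, and real dicts also resize):

```python
import random

PERTURB_SHIFT = 5

class HybridDict:
    """Toy open-addressing table modeling the hybrid scheme: use the
    ordinary deterministic hash until one operation sees more than
    `threshold` collisions, then rebuild every entry with a random
    per-instance salt.  Illustrative only: no resizing."""

    def __init__(self, nslots=8, threshold=32):
        self.slots = [None] * nslots      # entries are (key, value) pairs
        self.threshold = threshold
        self.salt = None                  # None => normal deterministic mode

    def _hash(self, key):
        h = hash(key)
        if self.salt is not None:         # "paranoid" mode
            h ^= self.salt
        return h

    def _probe(self, key):
        """Return (slot, collisions) for key's matching or empty slot."""
        mask = len(self.slots) - 1
        h = self._hash(key)
        i = h & mask
        perturb = h
        collisions = 0
        while True:
            entry = self.slots[i & mask]
            if entry is None or entry[0] == key:
                return i & mask, collisions
            collisions += 1
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT

    def __setitem__(self, key, value):
        slot, collisions = self._probe(key)
        self.slots[slot] = (key, value)
        if self.salt is None and collisions > self.threshold:
            # Too much probing: go paranoid and rebuild, analogous to
            # flipping ma_lookup to the *_paranoid variant.
            self.salt = random.getrandbits(64)
            entries = [e for e in self.slots if e is not None]
            self.slots = [None] * len(self.slots)
            for k, v in entries:
                s, _ = self._probe(k)
                self.slots[s] = (k, v)

    def __getitem__(self, key):
        entry = self.slots[self._probe(key)[0]]
        if entry is None:
            raise KeyError(key)
        return entry[1]
```

With threshold=0 and two colliding int keys (0 and 8 in an 8-slot table), the second insert trips the switch; all existing entries remain reachable afterwards, and no exception is ever raised.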

Caveats and TODO items:

TODO: I haven't yet tuned the safety threshold.  According to http://bugs.python.org/issue13703#msg151850:
> slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.
>
> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).

This suggests that the threshold of 32 slot/hash collisions per lookup may already be high enough.

TODO: in a review of an earlier version of the complexity detection idea, Antoine Pitrou suggested making the protection scale factor a run-time configurable value, rather than a #define.  This isn't done yet.

TODO: run more extensive tests (e.g. Django and Twisted), monitoring the worst-case complexity that's encountered

TODO: not yet benchmarked and optimized.  I want to get feedback on the approach before I go in and hand-optimize things (e.g. by hand-inlining check_iter_count, and moving the calculations out of the loop etc).  I believe any performance issues ought to be fixable, in that we can get the cost of this for the "we're not under attack" case to be negligible, and the "under attack" case should transition from O(N^2) to O(N), albeit with a larger constant factor.

TODO: this doesn't cover sets, but assuming this approach works, the patch can be extended to cover it in an analogous way.

TODO: should it cover PyMemoryViewObject, buffer object, etc?

TODO: should it cover the hashing in Modules/expat/xmlparse.c?  FWIW I rip this code out when doing my downstream builds in RHEL and Fedora, and instead dynamically link against a system copy of expat

TODO: only tested on Linux so far (which is all I've got).  Fedora 15 x86_64 fwiw

 Doc/using/cmdline.rst     |    6 
 Include/bytesobject.h     |    2 
 Include/object.h          |    8 
 Include/pythonrun.h       |    2 
 Include/unicodeobject.h   |    2 
 Lib/os.py                 |   17 --
 Lib/test/regrtest.py      |    5 
 Lib/test/test_dict.py     |  298 +++++++++++++++++++++++++++++++++++++
 Lib/test/test_hash.py     |   53 ++++++
 Lib/test/test_os.py       |   35 +++-
 Makefile.pre.in           |    1 
 Modules/posixmodule.c     |  126 ++-------------
 Objects/bytesobject.c     |    7 
 Objects/dictobject.c      |  369 +++++++++++++++++++++++++++++++++++++++++++++-
 Objects/object.c          |   37 ++++
 Objects/unicodeobject.c   |   51 ++++++
 PCbuild/pythoncore.vcproj |    4 
 Python/pythonrun.c        |    3 
 Python/sysmodule.c        |   16 +
 b/Python/random.c         |  268 +++++++++++++++++++++++++++++++++
 20 files changed, 1173 insertions(+), 137 deletions(-)
msg151941 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 12:45
I've found a bug in my patch; insertdict writes the old non-randomized
hash value into me_hash at:
        ep->me_hash = hash;
rather than using the randomized hash, leading to issues when tested
against a real attack.

I'm looking into fixing it.
msg151942 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-25 12:47
On Wed, Jan 25, 2012 at 7:45 AM, Dave Malcolm <report@bugs.python.org> wrote:

>
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
>
> I've found a bug in my patch; insertdict writes the old non-randomized
> hash value into me_hash at:
>        ep->me_hash = hash;
> rather than using the randomized hash, leading to issues when tested
> against a real attack.
>
> I'm looking into fixing it.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
>

What happens if I have a dict with str keys that goes into paranoid mode,
and I then do:

class A(object):
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return self.s == other
    def __hash__(self):
        return hash(self.s)

d[A("some str that's a key in d")]

Is it still able to find the value?
msg151944 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-25 13:12
> Is it still able to find the value?

Probably not. :( 

That's exactly why I stopped thinking about all two-hash-functions or rehashing ideas.
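The failure mode Alex describes can be simulated in pure Python, assuming a toy lookup structure with a made-up secondary hash for str keys (nothing below comes from any of the attached patches):

```python
# Simulation of the lookup failure discussed above: once a dict stores its
# str keys under a secondary hash, a wrapper object whose __hash__
# delegates to plain hash() probes the wrong bucket. hash2() is a
# stand-in for a randomized hash, not the one from any patch.
def hash2(s):
    return hash(s) ^ 0x5BD1E995        # always differs from hash(s)

class ParanoidLookup:
    def __init__(self):
        self.buckets = {}              # effective hash -> [(key, value)]

    def _h(self, key):
        return hash2(key) if type(key) is str else hash(key)

    def put(self, key, value):
        self.buckets.setdefault(self._h(key), []).append((key, value))

    def get(self, key):
        for k, v in self.buckets.get(self._h(key), []):
            if k == key:
                return v
        raise KeyError(key)

class A:                               # Alex's wrapper type from above
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return self.s == other
    def __hash__(self):
        return hash(self.s)

p = ParanoidLookup()
p.put("some str that's a key in d", 1)
p.get("some str that's a key in d")    # found: both sides use hash2
# p.get(A("some str that's a key in d")) raises KeyError: the wrapper
# probes the hash() bucket, not the hash2() bucket.
```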
msg151956 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 17:49
On Wed, 2012-01-25 at 12:45 +0000, Dave Malcolm wrote:
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> I've found a bug in my patch; insertdict writes the old non-randomized
> hash value into me_hash at:
>         ep->me_hash = hash;
> rather than using the randomized hash, leading to issues when tested
> against a real attack.

I'm attaching a revised version of the patch that should fix the above
issue:
  hybrid-approach-dmalcolm-2012-01-25-002.patch

Changes relative to -001.patch:
* updated insertdict() so that when it writes ep->me_hash, it uses the
correct hash value.  Unfortunately there doesn't seem to be a good way
of reusing the value we calculated in the "paranoid" ma_lookup
callbacks without breaking ABI (suggestions welcome), so we call
PyObject_RandomizedHash again.
* slightly reworked the two _paranoid ma_lookup callbacks to capture the
randomized hash as a local variable, in case there's a way of reusing it
in insertdict()
* when lookdict() calls into itself, it now calls mp->ma_lookup instead
* don't generate a fatal error with an unknown ma_lookup callback.

With this, I'm able to insert 200,000 non-equal PyUnicodeObject with
hash()==0 into a dict on a 32-bit build --with-pydebug in 2.2 seconds;
it can retrieve all the values correctly in about 4 seconds [compare
with ~1.5 hours of CPU churn for inserting the same data on an optimized
build without the patch on the same guest].

The amortized ratio of total work done per modification increases
linearly when under an O(N^2) attack, and the dict switches itself to
paranoid mode 56 insertions after ma_table stops using ma_smalltable
(that's when we start tracking stats).  After the transition to paranoid
mode, it drops to an average of a little under 2 probes per insertion
(the amortized ratio seems to be converging to about 1.9 probes per key
insertion at the point where my hostile test data runs out).
msg151959 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 18:05
> I'm attaching a revised version of the patch that should fix the above
> issue:
>   hybrid-approach-dmalcolm-2012-01-25-002.patch

It looks like that approach will break any non-builtin type (in either C
or Python) which can compare equal to bytes or str objects. If that's
the case, then I think the likelihood of acceptance is close to zero.

Also, the level of complication is far higher than in any other of the
proposed approaches so far (I mean those with patches), which isn't
really a good thing.

So I'm rather -1 myself on this approach, and would much prefer to
randomize hashes in all conditions.
msg151960 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-25 18:14
On Wed, Jan 25, 2012 at 6:06 AM, Dave Malcolm <dmalcolm@redhat.com>
added the comment:

>  hybrid-approach-dmalcolm-2012-01-25-001.patch

> As per haypo's random-8.patch, a randomization seed is read at startup.

Why not wait until it is needed?  I suspect a lot of scripts will
never need it for any dict, so why add the overhead to startup?

> Once a dict has transitioned to paranoid mode, it isn't using
> PyObject_Hash anymore, and thus isn't using cached object values

The alternative hashes could be stored in an id-keyed dict

msg151961 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-25 18:29
Sorry; hit the wrong key... intended message below:

On Wed, Jan 25, 2012 at 6:06 AM, Dave Malcolm <dmalcolm@redhat.com>
added the comment:

[lots of good stuff]

>  hybrid-approach-dmalcolm-2012-01-25-001.patch

> As per haypo's random-8.patch, a randomization seed is read at
> startup.

Why not wait until it is needed?  I suspect a lot of scripts will
never need it for any dict, so why add the overhead to startup?

> Once a dict has transitioned to paranoid mode, it isn't using
> PyObject_Hash anymore, and thus isn't using cached object values

The alternative hashes could be stored in an id-keyed
WeakKeyDictionary; that would handle at least the normal case of using
exactly the same string for the lookup.

> Note that if a paranoid dict goes small again
> (ma_table == ma_smalltable), it stays paranoid.

As I read it, that couldn't happen, because paranoid dicts couldn't
shrink at all.  (Not letting them shrink beneath 2*PyDict_MINSIZE does
seem like a reasonable solution.)

Additional TODOs...

The checks for Unicode and Dict should not be exact; it is OK to apply
them to a subclass so long as it uses the same lookdict (and, for
unicode, the same __eq__).

Additional small strings should be excluded from the new hash, to
avoid giving away the secret.  At a minimum, single-char strings
should be excluded, and I would prefer to exclude all strings of
length <= N (where N defaults to 4).
msg151965 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-25 19:04
On Wed, Jan 25, 2012 at 1:05 PM,  Antoine Pitrou <pitrou@free.fr>
added the comment:

> It looks like that approach will break any non-builtin type (in either C
> or Python) which can compare equal to bytes or str objects. If that's
> the case, then I think the likelihood of acceptance is close to zero.

(1)  Isn't that true of *any* patch that changes hashing?  (Thus the
PYTHONHASHSEED=0 escape hatch.)

(2)  I think it would still work for the lookdict_string (or
lookdict_unicode) case ... which is the normal case, and also where
most vulnerabilities should appear.

(3)  If the alternate hash is needed for non-string keys, there is no
perfect resolution, but I suppose you could get closer with

    if obj == str(obj):
        newhash=hash(str(obj))
msg151966 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 19:13
> Jim Jewett <jimjjewett@gmail.com> added the comment:
> 
> On Wed, Jan 25, 2012 at 1:05 PM,  Antoine Pitrou <pitrou@free.fr>
> added the comment:
> 
> > It looks like that approach will break any non-builtin type (in either C
> > or Python) which can compare equal to bytes or str objects. If that's
> > the case, then I think the likelihood of acceptance is close to zero.
> 
> (1)  Isn't that true of *any* patch that changes hashing?  (Thus the
> PYTHONHASHSEED=0 escape hatch.)

If a third-party type wants to compare equal to bytes or str objects,
its __hash__ method will usually end up calling hash() on the equivalent
bytes/str object. That's especially true for Python types (I don't think
anyone wants to re-implement a slow str-alike hash in pure Python).

> (2)  I think it would still work for the lookdict_string (or
> lookdict_unicode) case ... which is the normal case, and also where
> most vulnerabilities should appear.

It would probably still work indeed.

> (3)  If the alternate hash is needed for non-string keys, there is no
> perfect resolution, but I suppose you could get closer with
> 
>     if obj == str(obj):
>         newhash=hash(str(obj))

That may be slowing down things quite a bit. It looks correct though.
msg151967 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 19:19
On Wed, 2012-01-25 at 18:05 +0000, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> > I'm attaching a revised version of the patch that should fix the above
> > issue:
> >   hybrid-approach-dmalcolm-2012-01-25-002.patch
> 
> It looks like that approach will break any non-builtin type (in either C
> or Python) which can compare equal to bytes or str objects. If that's
> the case, then I think the likelihood of acceptance is close to zero.

How?

> Also, the level of complication is far higher than in any other of the
> proposed approaches so far (I mean those with patches), which isn't
> really a good thing.

So would I.  I want something I can backport, though.
msg151970 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 19:28
Le mercredi 25 janvier 2012 à 19:19 +0000, Dave Malcolm a écrit :
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> On Wed, 2012-01-25 at 18:05 +0000, Antoine Pitrou wrote:
> > Antoine Pitrou <pitrou@free.fr> added the comment:
> > 
> > > I'm attaching a revised version of the patch that should fix the above
> > > issue:
> > >   hybrid-approach-dmalcolm-2012-01-25-002.patch
> > 
> > It looks like that approach will break any non-builtin type (in either C
> > or Python) which can compare equal to bytes or str objects. If that's
> > the case, then I think the likelihood of acceptance is close to zero.
> 
> How?

This kind of type, for example:

class C:
    def __hash__(self):
        return hash(self._real_str)

    def __eq__(self, other):
        if isinstance(other, C):
            other = other._real_str
        return self._real_str == other

If I'm not mistaken, looking up C("abc") will stop matching "abc" when
there are too many collisions in one of your dicts.

> > Also, the level of complication is far higher than in any other of the
> > proposed approaches so far (I mean those with patches), which isn't
> > really a good thing.
> 
> So would I.  I want something I can backport, though.

Well, your approach sounds like it subtly and unpredictably changes the
behaviour of dicts when there are too many collisions, so I'm not sure
it's a good idea to backport it, either.

If we don't want to backport full hash randomization, I think I much
prefer raising a BaseException when there are too many collisions,
rather than this kind of (excessively) sophisticated workaround. You
*are* changing a fundamental datatype in a rather complicated way.
msg151973 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 20:23
I think you're right: it will stop matching it during lookup within such
a dict, since the dict will be using the secondary hash for "abc", but
hash() for the C instance.

It will still match outside of the dict, and within other dicts.

So yes, this would be a subtle semantic change when under attack.
Bother.

Having said that, note that within the typical attack scenarios (HTTP
headers, HTTP POST, XML-RPC, JSON), we have a pure-str dict (or
sometimes a pure-bytes dict).  Potentially I could come up with a patch
that only performs this change for such a case (pure-str is easier,
given that we already track this), which would avoid the semantic change
you identify, whilst still providing protection against a wide range of
attacks.

Is it worth me working on this?

> > > Also, the level of complication is far higher than in any other of the
> > > proposed approaches so far (I mean those with patches), which isn't
> > > really a good thing.
> > 
> > So would I.  I want something I can backport, though.
> 
> Well, your approach sounds like it subtly and unpredictably changes the
> behaviour of dicts when there are too many collisions, so I'm not sure
> it's a good idea to backport it, either.
> 
> If we don't want to backport full hash randomization, I think I much
> prefer raising a BaseException when there are too many collisions,
> rather than this kind of (excessively) sophisticated workaround. You
> *are* changing a fundamental datatype in a rather complicated way.

Well, each approach has pros and cons, and we've circled around between
hash randomization vs collision counting vs other approaches for several
weeks.  I've supplied patches for 3 different approaches.

Is this discussion likely to reach a conclusion soon?  Would it be
regarded as rude if I unilaterally ship something close to:
  backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
in RHEL/Fedora, so that my users have some protection they can enable if
they get attacked? (see http://bugs.python.org/msg151847).  If I do
this, I can post the patches here in case other distributors want to
apply them.

As for python.org, who is empowered to make a decision here?  How can we
move this forward?
msg151977 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-25 21:34
For the sake of completeness:
Collision-counting (with Exception) has interesting effects, too.

>>> d={((1<<(65+i))-2**(i+4)): 9 for i in range(1001)}
>>> for i in list(d): 
...  del d[i]

>>> d
{}
>>> 9 in d
False
>>> 0 in d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many slot collisions'
>>> d[9] = 1
>>> d
{9: 1}
>>> d == {0: 1}
False
>>> {0: 1} == d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many slot collisions'
msg151984 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 23:14
> I think you're right: it will stop matching it during lookup within such
> a dict, since the dict will be using the secondary hash for "abc", but
> hash() for the C instance.
> 
> It will still match outside of the dict, and within other dicts.
> 
> So yes, this would be a subtle semantic change when under attack.
> Bother.

Hmm, you're right, perhaps it's not as important as I thought.

By the way, have you run benchmarks on some of your patches?

> Is this discussion likely to reach a conclusion soon?  Would it be
> regarded as rude if I unilaterally ship something close to:
>   backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
> in RHEL/Fedora, so that my users have some protection they can enable if
> they get attacked?

I don't think Fedora shipping its own patches can be considered "rude"
by anyone else than its users. And deciding what is best for your users
is indeed your job as a distro maintainer, not python-dev's.

> As for python.org, who is empowered to make a decision here?  How can we
> move this forward?

I don't know. Guido is empowered if he wants to make a pronouncement.
Otherwise, we have the following data points:

- hash randomization is generally considered the cleanest solution
- hash randomization cannot be enabled by default in bugfix, let alone
security releases
- collision counting can mitigate some of the attacks, although it can
have weaknesses (see Frank's emails) and it comes with its own problems
(breaking the program "later on")

So I'd suggest the following course of action:
- ship and enable some form of collision counting on bugfix and security
releases
- ship and enable hash randomization in 3.3
msg152030 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 21:00
I'd like to propose an entirely different approach: use AVL trees for colliding strings, for dictionaries containing only unicode or byte strings.

A prototype for this is in http://hg.python.org/sandbox/loewis/branches#avl
It is not fully working yet, but I'm now confident that this is a feasible approach.

It has the following advantages over the alternatives:
- performance in case of collisions is O(log(N)), where N is the number of colliding keys
- no new exceptions are raised, except for MemoryError if it runs out of memory for allocating nodes in the tree
- the hash values do not change
- the dictionary order does not change as long as no keys collide on hash values (which for all practical purposes should mean that the dictionary order does not change in all places where it matters)
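The heart of the idea can be sketched in Python, using `bisect` over a sorted list as a stand-in for the AVL tree (the actual prototype is C code in dictobject.c; all names below are illustrative):

```python
import bisect

class CollidingBucket:
    """Holds keys that collide on hash(); a find costs O(log N) comparisons.

    A sorted list plus bisect stands in for the AVL tree here: only the
    search cost models the tree (list.insert itself shifts elements in
    O(N)).  It requires a total order on the keys, which is why the
    prototype restricts itself to dicts containing only str/bytes keys.
    """
    def __init__(self):
        self.keys = []         # kept sorted
        self.values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # replace existing entry
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)  # O(log N) comparisons
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)
```

So even if an attacker supplies N keys with identical hash values, each lookup does logarithmically many key comparisons instead of walking all N colliding entries.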
msg152033 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-26 21:04
On Thu, Jan 26, 2012 at 4:00 PM, Martin v. Löwis <report@bugs.python.org> wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> I'd like to propose an entirely different approach: use AVL trees for
> colliding strings, for dictionaries containing only unicode or byte strings.
> [...]

Martin,

What happens if, instead of putting strings in a dictionary directly, I
have them wrapped in something.  For example, the classes Antoine and I
pasted early.  These define hash and equal as being strings, but don't have
an ordering.

Alex
msg152037 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-26 22:13
On Thu, 2012-01-26 at 21:04 +0000, Alex Gaynor wrote:
> Alex Gaynor <alex.gaynor@gmail.com> added the comment:
> 
> [Martin's AVL proposal quoted in full...]
> 
> Martin,
> 
> What happens if, instead of putting strings in a dictionary directly, I
> have them wrapped in something.  For example, the classes Antoine and I
> pasted early.  These define hash and equal as being strings, but don't have
> an ordering.

[Obviously I'm not Martin, but his idea really interests me]

Looking at:
http://hg.python.org/sandbox/loewis/file/58be269aa0b1/Objects/dictobject.c#l517
as soon as any key insertion or lookup occurs where the key isn't
exactly one of the correct types, the dict flattens any AVL trees back
into the regular flat representation (and switches to lookdict for
ma_lookup), analogous to the existing ma_lookup transition on dicts.

From my reading of the code, if you have a dict purely of bytes/str,
collisions on a hash value lead to the PyDictEntry's me_key being set to
an AVL tree.  All users of the ma_lookup callback within dictobject.c
check to see if they're getting such PyDictEntry back.  If they are,
they call into the tree, which leads to TREE_FIND(), TREE_INSERT() and
TREE_DELETE() invocations as appropriate; ultimately, the AVL macros
call back to within node_cmp():
   PyObject_Compare(left->key, right->key)

[Martin, I'm sorry if I got this wrong]

So *if* I'm reading the code correctly, it might be possible to
generalize it from {str, bytes} to any set of types within which
ordering and equality checking of instances from any type is "sane",
which loosely, would seem to be: that we can reliably compare all
objects from any type within the set, so that we can use the comparisons
to perform a search to hone in on a pair of keys that compare as
"equal", without any chance of raising exceptions, or missing a valid
chance for two objects to be equal etc.

I suspect that you can't plug arbitrary user-defined types into it,
since there's no way to guarantee that ordering and comparison work in
the ways that the AVL lookup requires.

But I could be misreading Martin's code.  [thinking aloud, if a pair of
objects don't implement comparison at the PyObject_Compare level, is it
possible to instead simply compare the addresses of the objects?  I
don't think so, since you have a custom equality implementation in your
UDT, but maybe I've missed something?]

Going higher-level, I feel that there are plenty of attacks against
pure-str/bytes dicts, and having protection against them is worthwhile -
even if there's no direct way to use it to protect the use-case you
describe.

Hope this is helpful
Dave
msg152039 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 22:34
> as soon as any key insertion or lookup occurs where the key isn't
> exactly one of the correct types, the dict flattens any AVL trees back
> into the regular flat representation (and switches to lookdict for
> ma_lookup), analogous to the existing ma_lookup transition on dicts.

Correct.

> TREE_DELETE() invocations as appropriate; ultimately, the AVL macros
> call back to within node_cmp():
>    PyObject_Compare(left->key, right->key)

Correct.

> I suspect that you can't plug arbitrary user-defined types into it,
> since there's no way to guarantee that ordering and comparison work in
> the ways that the AVL lookup requires.

That's all true. It would be desirable to automatically determine which
types also support total order in addition to hashing, alas, there is
currently no protocol for it. On the contrary, Python has moved away
of assuming that all objects form a total order.

> [thinking aloud, if a pair of
> objects don't implement comparison at the PyObject_Compare level, is it
> possible to instead simply compare the addresses of the objects?

2.x has an elaborate logic to provide a total order on objects. It
took the entire 1.x and 2.x series to fix issues with that order, only
to recognize that it is not feasible to provide one - hence the introduction
of rich comparisons and the omission of cmp in 3.x.

For the dictionary, using addresses does not work: the order of objects
needs to be consistent with equality (i.e. x < y and x == y must not
hold simultaneously, and x == y and y < z must imply x < z,
else the tree lookup won't find the equal keys).
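A small illustration of that consistency requirement (plain Python, nothing patch-specific):

```python
# Why an address-based order can't back the tree lookup: the order must
# agree with ==, but two distinct objects (at distinct addresses) can
# compare equal.
a = "".join(["x"] * 1000)   # built at runtime so it isn't shared/interned
b = "".join(["x"] * 1000)
assert a == b               # equal keys...
assert a is not b           # ...at different addresses, id(a) != id(b)
# A tree ordered by id() would route a and b to different positions,
# so a lookup for b could miss an entry stored under a.
```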

> Going higher-level, I feel that there are plenty of attacks against
> pure-str/bytes dicts, and having protection against them is worthwhile -
> even if there's no direct way to use it to protect the use-case you
> describe.

Indeed. This issue doesn't need to fix *all* possible attacks using
hash collisions. Instead, it needs to cover the common case, and it
needs to allow users to rewrite their code so that they can protect
it against this family of attacks.
msg152040 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 22:42
> What happens if, instead of putting strings in a dictionary directly, I
> have them wrapped in something.  For example, the classes Antoine and I
> pasted early.  These define hash and equal as being strings, but don't have
> an ordering.

As Dave has analysed: the dictionary falls back to the current implementation.
So wrt. your question "Is it still able to find the value?", the answer is

Yes, certainly. It's fully backwards compatible, with the limitation
in msg152030 (i.e. the dictionary order may change for dictionaries with
string keys colliding in their hash() values).
msg152041 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-26 22:43
On Thu, Jan 26, 2012 at 5:42 PM, Martin v. Löwis <report@bugs.python.org> wrote:

>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > What happens if, instead of putting strings in a dictionary directly, I
> > have them wrapped in something.  For example, the classes Antoine and I
> > pasted early.  These define hash and equal as being strings, but don't
> have
> > an ordering.
>
> As Dave has analysed: the dictionary falls back to the current
> implementation.
> So wrt. your question "Is it still able to find the value?", the answer is
>
> Yes, certainly. It's fully backwards compatible, with the limitation
> in msg152030 (i.e. the dictionary order may change for dictionaries with
> string keys colliding in their hash() values).
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
>

But using non-__builtin__.str objects (such as UserString) would expose the
user to an attack?
msg152043 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 23:03
> But using non-__builtin__.str objects (such as UserString) would expose the
> user to an attack?

Not necessarily: only if they use these strings as dictionary keys, and only
if they do so in contexts where arbitrary user input is consumed. In these
cases, users need to rewrite their code to replace the keys. Using dictionary
wrappers (such as UserDict), this is possible using only local changes.
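A minimal sketch of the kind of local change Martin describes (the `SaltedKey`/`SaltedDict` names are hypothetical, not from any attached patch): wrap each string key in an object whose hash mixes in a per-process random salt, so an attacker cannot precompute colliding keys.

```python
import os
from collections import UserDict

_SALT = os.urandom(8)  # fresh per process; unknown to an attacker

class SaltedKey:
    """Wraps a str so its hash depends on the per-process salt."""
    __slots__ = ("s",)
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return isinstance(other, SaltedKey) and self.s == other.s
    def __hash__(self):
        return hash(_SALT + self.s.encode("utf-8"))

class SaltedDict(UserDict):
    """UserDict wrapper that salts every string key on the way in."""
    def __setitem__(self, key, value):
        self.data[SaltedKey(key)] = value
    def __getitem__(self, key):
        return self.data[SaltedKey(key)]
    def __delitem__(self, key):
        del self.data[SaltedKey(key)]
    def __contains__(self, key):
        return SaltedKey(key) in self.data
```

Only the code that creates the dictionary changes; everything else keeps using plain strings as keys.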
msg152046 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-26 23:22
I'm sorry then, but I'm a little confused.  I think we pretty clearly
established earlier that requiring users to make changes anywhere they
stored user data would be dangerous, because these locations are often in
libraries or other places where the code creating and modifying the
dictionary has no idea that user data is in it.

The proposed AVL solution fails if it requires users to fundamentally
restructure their data depending on its origin.

We have a solution that is known to work in all cases: hash randomization.
 There were three discussed issues with it:

a) Code assuming a stable ordering to dictionaries
b) Code assuming hashes were stable across runs.
c) Code reimplementing the hashing algorithm of a core datatype that is now
randomized.

I don't think any of these are realistic issues the way "doesn't protect
all cases" is.  (a) was never a documented or intended property; indeed it
breaks all the time if you insert keys in the wrong order, use a different
platform, or anything else changes.  (b) For the same reasons, code
relying on (b) only worked if you didn't change anything, and in practice
I'm convinced neither of these was common (if it ever existed).  Finally (c),
while it's a concern, I've reviewed Django, SQLAlchemy, PyPy, and the
stdlib: there is only one place where compatibility with a core-hash is
attempted, decimal.Decimal.
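The coupling behind (c) is easy to see on a modern interpreter: Python requires equal numbers to hash equally across numeric types, so `decimal.Decimal` reproduces the core numeric hash in its own `__hash__` rather than inventing one, which is exactly why code mirroring a core hash algorithm is sensitive to any change in it:

```python
from decimal import Decimal

# Equal numbers must hash equally across types, so Decimal re-implements
# the interpreter's numeric hash; changing the core algorithm without
# updating Decimal would silently break dict lookups mixing the types.
assert Decimal("1.5") == 1.5 and hash(Decimal("1.5")) == hash(1.5)
assert hash(Decimal(42)) == hash(42)
```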

In summary, I think the case against hash-randomization has been seriously
overstated, and in no way is more dangerous than having a solution that
fails to solve the problem comprehensively.  Further, I think it is
imperative that we reach a consensus on this quickly, as the only reason
this hasn't been widely exploited yet is the lack of availability of the
data; once it becomes available I firmly expect just about every high
profile Python site on the internet (of which there are many) to be
attacked.

On Thu, Jan 26, 2012 at 6:03 PM, Martin v. Löwis <report@bugs.python.org> wrote:

>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > But using non-__builtin__.str objects (such as UserString) would expose
> the
> > user to an attack?
>
> Not necessarily: only if they use these strings as dictionary keys, and
> only
> if they do so in contexts where arbitrary user input is consumed. In these
> cases, users need to rewrite their code to replace the keys. Using
> dictionary
> wrappers (such as UserDict), this is possible using only local changes.
msg152051 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 23:43
> I'm sorry then, but I'm a little confused.  I think we pretty clearly
> established earlier that requiring users to make changes anywhere they
> stored user data would be dangerous, because these locations are often in
> libraries or other places where the code creating and modifying the
> dictionary has no idea it's user data in it.

I don't consider that established for the specific case of string-like
objects. Users can easily determine whether they use string-like objects,
and if so, in what places, and what data gets put into them.

> The proposed AVL solution fails if it requires users to fundamentally
> restructure their data depending on it's origin.

It doesn't fail at all. Users don't *have* to restructure their code,
let alone fundamentally. Their code may currently be vulnerable, yet
not use string-like objects at all. With the proposed solution, such
code will be fixed for good.

It's true that the solution does not fix all cases of the vulnerability,
but neither does any other proposed solution.

> We have a solution that is known to work in all cases: hash randomization.

Well, you *believe* that it fixes the problem, even though it actually
may not, assuming an attacker can somehow reproduce the hash function.

>  There were three discussed issues with it:
>
> a) Code assuming a stable ordering to dictionaries
> b) Code assuming hashes were stable across runs.
> c) Code reimplementing the hashing algorithm of a core datatype that is now
> randomized.
>
> I don't think any of these are realistic issues

I'm fairly certain that code will break in massive ways, despite any
argumentation that it should not. The question really is

Do we break code in a massive way, or do we fix the vulnerability
for most users with no code breakage?

I clearly value compatibility much higher than 100% protection against
a DoS-style attack (against which many other forms of protection are
also available).

> (a) was never a documented or intended property; indeed it
> breaks all the time if you insert keys in the wrong order, use a different
> platform, or anything else changes.

Still, a lot of code relies on dictionary order, and successfully so,
in practice. Practicality beats purity.

> (b) For the same reasons code
> relying on (b) only worked if you didn't change anything

That's not true. You cannot practically change the way string hashing works
other than by changing the interpreter source. Hashes *are* currently stable
across runs.

> and in practice I'm convinced neither of these were common (if ever existed).

Are you willing to bet the trust people have in Python's bug fix policies
on that? I'm not.

> In summary, I think the case against hash-randomization has been seriously
> overstated, and in no way is more dangerous than having a solution that
> fails to solve the problem comprehensively.  Further, I think it is
> imperative that we reach a consensus on this quickly

Well, I cannot be part of a consensus that involves massive code breakage
in a bug fix release. Lacking consensus, either the release managers or
the BDFL will have to pronounce.
msg152057 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-27 01:19
> >  There were three discussed issues with it:
> >
> > a) Code assuming a stable ordering to dictionaries
> > b) Code assuming hashes were stable across runs.
> > c) Code reimplementing the hashing algorithm of a core datatype that is now
> > randomized.
> >
> > I don't think any of these are realistic issues
> 
> I'm fairly certain that code will break in massive ways, despite any
> argumentation that it should not. The question really is
> 
> Do we break code in a massive way, or do we fix the vulnerability
> for most users with no code breakage?
> 
> I clearly value compatibility much higher than 100% protection against
> a DoS-style attack (against which many other forms of protection are
> also available).

If I read your patch correctly, collisions will produce additional
allocations of one distinct PyObject (i.e. AVL node) per colliding key.
That's a pretty massive change in memory consumption for string dicts
(and also in memory fragmentation and cache friendliness, probably). The
performance effect in most situations is likely to be negative too,
despite the better worst-case complexity.

IMO that would be a rather controversial change for a feature release,
let alone a bugfix or security release.

It would be nice to have the release managers' opinions on this issue.
msg152060 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 02:26
> If I read your patch correctly, collisions will produce additional
> allocations of one distinct PyObject (i.e. AVL node) per colliding key.

That's correct.

> That's a pretty massive change in memory consumption for string dicts
> (and also in memory fragmentation and cache friendliness, probably).

That's not correct. It's not a massive change, as colliding hash values
never happen in practice, unless you are being attacked, in which case it
will be one additional PyObject for the set of all colliding keys (i.e.
one object per possibly hundreds of string objects). Even including the
nodes of the tree (one per colliding node) is IMO a moderate increase
in memory usage, in order to solve the vulnerability.

It also doesn't impact memory fragmentation badly, as these objects
are allocated using the Python small objects allocator.

> The
> performance effect in most situations is likely to be negative too,
> despite the better worst-case complexity.

Compared to the status quo? Hardly. In all practical applications,
collisions never happen, so none of the extra code is ever executed -
except for AVL_Check invocations, which are a plain pointer
comparison.

> IMO that would be a rather controversial change for a feature release,
> let alone a bugfix or security release.

Apparently so, but it's not clear to me why that is. That change meets
all criteria of a security fix release nicely, as opposed to the proposed
changes to the hash function, which break existing code.
msg152066 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-27 06:25
>> But using non-__builtin__.str objects (such as UserString) would expose the
>> user to an attack?
>
> Not necessarily: only if they use these strings as dictionary keys, and only
> if they do so in contexts where arbitrary user input is consumed. In these
> cases, users need to rewrite their code to replace the keys. Using dictionary
> wrappers (such as UserDict), this is possible using only local changes.

Could the AVL tree approach be extended to apply to dictionaries
containing keys of any single type that supports comparison?  That
approach would autodetect UserString or similar and support it
properly.

I expect dictionaries with keys of more than one type to be very
rare, and highly unlikely when the values are generated directly
from user input.

(and on top of all of this I believe we're all settled on having per
interpreter hash randomization _as well_ in 3.3; but this AVL tree
approach is one nice option for a backport to fix the major
vulnerability)

-gps
msg152070 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 08:42
> Could the AVL tree approach be extended to apply to dictionaries
> containing keys of any single type that supports comparison?  That
> approach would autodetect UserString or similar and support it
> properly.

I think we would need a place to store the single key type, which,
from an ABI point of view, might be difficult to find (but we could
overload ma_smalltable for that, or reserve ma_table[0]).

In addition, I think it is difficult to determine whether a type
supports comparison, at least in 2.x. For example,

class X:
  def __eq__(self, o):
    return self.a == o.a

makes it possible to create objects x, y and z so that x<y<z, yet x==z.

For 3.x, we could assume that a failure to support comparison
raises an exception, in which case we could just wait for the
exception to happen, and then flatten the dictionary and start
over with the lookup. This would extend even to mixed key
types.
msg152104 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-27 17:45
On Thu, Jan 26, 2012 at 8:19 PM, Antoine Pitrou <report@bugs.python.org> wrote:

> If I read your [Martin v. Löwis' ] patch correctly, collisions will
> produce additional allocations ... That's a pretty massive
> change in memory consumption for string dicts

Not in practice.

The point I first missed is that this triggers only when the hash is
*fully* equal; if the hashes are merely equal after masking, then
today's try-another-slot approach will still be used, even for
strings.

Per ( http://bugs.python.org/issue13703#msg151850 ) Marc-Andre
Lemburg's measurements, full-hash equality explains only 1 in 10,000
collisions.  From a performance standpoint, we can almost ignore a
case that rare; it is almost certainly dwarfed by resizing.

I *am* a bit concerned that the possible contents of a dictentry
change; this could cause easily-missed-in-testing breakage for
anything that treats table as an array.  That said, it doesn't seem
much worse than the search finger, and there seemed to be recent
consensus that even promising an exact dict -- subclasses not allowed
-- didn't mean that direct access was sanctioned.  So it still seems
safer than changing the de-facto iteration order.
msg152112 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 19:32
> I *am* a bit concerned that the possible contents of a dictentry
> change; this could cause easily-missed-in-testing breakage for
> anything that treats table as an array.

This is indeed a concern: the new code needs to be exercised.
I came up with a Py_REDUCE_HASH #define; if set, the dict will only
use the lowest 3 bits of the hash, producing plenty collisions.
In that mode, the branch currently doesn't work at all due to the
remaining bugs.
msg152117 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-27 20:25
[Martin's approach]
> The point I first missed is that this triggers only when the hash is
> *fully* equal; if the hashes are merely equal after masking, then
> today's try-another-slot approach will still be used, even for
> strings.

But then isn't it vulnerable to Frank's first attack as exposed in
http://mail.python.org/pipermail/python-dev/2012-January/115726.html ?
msg152118 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 21:02
> But then isn't it vulnerable to Frank's first attack as exposed in
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html ?

It would be, yes. That's sad.

That could be fixed by indeed creating trees in all cases (i.e. moving
away from open addressing altogether). The memory consumption does not worry
me here; however, dictionary order will change in more cases.

Compatibility could be restored by introducing a threshold for
tree creation: if insertion visits more than N slots, go back to the
original slot and put a tree there. I'd expect that N could be small,
e.g. N==4. Lookup would then have to consider all AVL trees along the
chain of visited slots, but ISTM it could also stop after visiting N
slots.
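Martin's threshold idea can be rendered as a toy (linear probing for brevity; CPython's real probe sequence is perturbed, and the actual tree creation is elided here — `"tree"` is just a stand-in for "put an AVL tree at the original slot"):

```python
N = 4  # max slots visited before falling back to a tree

def insert(table, key, value, hash_fn=hash):
    """Open-addressing insert that gives up after N probes (sketch)."""
    mask = len(table) - 1
    start = hash_fn(key) & mask
    for probe in range(N):
        slot = (start + probe) & mask
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return ("slot", slot)
    # More than N slots visited: go back to the original slot and
    # (in the real scheme) put an AVL tree there instead.
    return ("tree", start)

table = [None] * 8
# Force every key onto slot 0 to mimic a collision attack:
for i, k in enumerate(["a", "b", "c", "d"]):
    assert insert(table, k, i, hash_fn=lambda s: 0) == ("slot", i)
assert insert(table, "e", 4, hash_fn=lambda s: 0) == ("tree", 0)
```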
msg152125 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-27 21:42
On Fri, 2012-01-27 at 21:02 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> > But then isn't it vulnerable to Frank's first attack as exposed in
> > http://mail.python.org/pipermail/python-dev/2012-January/115726.html ?
> 
> It would be, yes. That's sad.
> 
> That could be fixed by indeed creating trees in all cases (i.e. moving
> away from open addressing altogether). The memory consumption does not worry
> me here; however, dictionary order will change in more cases.
> 
> Compatibility could be restored by introducing a threshold for
> tree creation: if insertion visits more than N slots, go back to the
> original slot and put a tree there. I'd expect that N could be small,
> e.g. N==4. Lookup would then have to consider all AVL trees along the
> chain of visited slots, but ISTM it could also stop after visiting N
> slots.

Perhaps we could combine my attack-detection code from 
  http://bugs.python.org/issue13703#msg151714
with Martin's AVL approach?  Use the ma_smalltable to track stats, and
when a dict detects that it's under attack,  *if* all the keys are
AVL-compatible, it could transition to full-AVL mode.  [I believe that
my patch successfully handles both of Frank's attacks, but I don't have
the test data - I'd be very grateful to receive a copy (securely)].

[See hybrid-approach-dmalcolm-2012-01-25-002.patch for the latest
version of attack-detection; I'm working on a rewrite in which I
restrict it to working just on pure-str dicts.  With that idea, when a
dict detects that it's under attack, *if* all the keys satisfy this
condition
  (new_hash(keyA) == new_hash(keyB)) iff (hash(keyA) == hash(keyB))
then all hash values get recalculated using new_hash (which is
randomized), which should offer protection in many common attack
scenarios, without the semantic change Alex and Antoine indicated]
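Dave's condition says the randomized hash must induce exactly the same collision groups as the original. A direct, naive check of that condition can be sketched like this (hypothetical helper, not from any attached patch):

```python
def same_collision_structure(keys, old_hash, new_hash):
    """True iff old_hash and new_hash partition `keys` identically,
    i.e. new_hash(a) == new_hash(b) exactly when old_hash(a) == old_hash(b)."""
    def partition(h):
        groups = {}
        for k in keys:
            groups.setdefault(h(k), set()).add(k)
        return {frozenset(g) for g in groups.values()}
    return partition(old_hash) == partition(new_hash)

# Same structure: both hashes group these strings purely by length.
assert same_collision_structure(["a", "bb", "cc"], len, lambda s: 2 * len(s))
# Different structure: a constant hash merges previously distinct groups.
assert not same_collision_structure(["a", "bb", "cc"], len, lambda s: 0)
```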
msg152146 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-28 03:03
For the record, Barry and I agreed on what we'll be doing for stable releases [1]. David says he should have a patch soon.

[1] http://mail.python.org/pipermail/python-dev/2012-January/115892.html
msg152149 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-28 05:13
On Sat, 2012-01-28 at 03:03 +0000, Benjamin Peterson wrote:
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> For the record, Barry and I agreed on what we'll be doing for stable releases [1]. David says he should have a patch soon.
> 
> [1] http://mail.python.org/pipermail/python-dev/2012-January/115892.html

I'm attaching what I've got so far (need sleep).

Attached patch is for 3.1 and adds opt-in hash randomization.

It's based on haypo's work: random-8.patch (thanks haypo!), with
additional changes as seen in my backport of that to 2.7:
http://bugs.python.org/issue13703#msg151847

* The randomization is off by default, and must be enabled by setting
a new environment variable PYTHONHASHRANDOMIZATION to a non-empty
string. (if so then, PYTHONHASHSEED also still works, if provided, in
the same way as in haypo's patch)

* All of the various "Py_hash_t" become "long" again (Py_hash_t was
added in 3.2: issue9778)

* I expanded the randomization from just PyUnicodeObject to also cover
PyBytesObject, and the types within datetime.

* It doesn't cover numeric types; see my explanation in msg151847; also
see http://bugs.python.org/issue13703#msg151870

* It doesn't yet cover the embedded copy of expat.

* I moved the hash tests from test_unicode.py to test_hash.py

* I tweaked the wording of the descriptions of the envvars in
cmdline.rst and the manpage

* I've tested it on a 32-bit box, and it successfully protects against
one set of test data (four cases: assembling then reading back items by
key for a dict vs set, bytes vs str, with 200000 distinct items of data
which all have hash() == 0 in unmodified build; each takes about 1.5
seconds on this --with-pydebug build, vs of the order of hours).

* I haven't yet benchmarked it

* Only tested on Linux (Fedora x86_64 and i686).  I don't know the
impact on windows (e.g. startup time without the envvar vs with the env
vars).
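(As it turned out, released CPython settled on PYTHONHASHSEED alone, with randomization on by default from 3.3.) The reproducibility half of the design is easy to check from a modern interpreter; a quick sketch:

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(s, seed):
    """hash(s) as computed by a new interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", s],
        env=env)
    return int(out)

# A fixed seed makes string hashes reproducible across runs...
assert hash_in_fresh_interpreter("collide", "42") == \
       hash_in_fresh_interpreter("collide", "42")
# ...whereas PYTHONHASHSEED=random (or leaving it unset on 3.3+) re-seeds
# each process, which is what defeats precomputed collision sets.
```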

I'm seeing one failing test:
======================================================================
FAIL: test_clear_dict_in_ref_cycle (__main__.ModuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File
"/home/david/coding/python-hg/cpython-3.1-hash-randomization/Lib/test/test_module.py", line 79, in test_clear_dict_in_ref_cycle
    self.assertEqual(destroyed, [1])
AssertionError: Lists differ: [] != [1]
msg152183 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-28 19:26
This turns out to pass without PYTHONHASHRANDOMIZATION in the
environment, and fail intermittently with it.

Note that "make test" invokes the built python with "-E", so that it
ignores the setting of PYTHONHASHRANDOMIZATION in the environment.

Barry, Benjamin: does fixing this bug require getting the full test
suite to pass with randomization enabled (and fixing the intermittent
failures due to ordering issues), or is it acceptable to "merely" have
full passes without randomizing the hashes?

What do the buildbots do?
msg152186 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-28 20:05
I think we don't need to mess with tests in 2.6/3.1, but everything should pass under 2.7 and 3.2.
msg152199 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-28 23:14
On Sat, 2012-01-28 at 20:05 +0000, Benjamin Peterson wrote:
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> I think we don't need to mess with tests in 2.6/3.1, but everything should pass under 2.7 and 3.2.

New version of the patch for 3.1
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch

This version adds a command-line flag to enable hash-randomization: -R
(given that the -E flag disables env vars and thus disables
PYTHONHASHRANDOMIZATION). See [1] below

[Is there a convenient way to check the length of the usage messages in
Modules/main.c?  I see this comment:
   /* Long usage message, split into parts < 512 bytes */ ]

I reworded the documentation somewhat based on input from Barry and
Antoine.

Also adds a NEWS item.

Passes "make test" on this x86_64 Fedora 15 box, --with-pydebug, though
that's without randomization enabled (it just does it within individual
test cases that explicitly enable it).

No performance testing done yet (though hopefully similar to that of
Victor's patch; see msg151078)

No idea of the impact on Windows users (I don't have a windows dev box).
It still has the stuff from Victor's patch described in msg151158.

How is this looking?
Dave

[1] IRC transcript concerning "-R" follows:
<__ap__> dmalcolm: IMO it would be simpler if there was only one env var
(preferably not too clumsy to type)
<__ap__> also, despite being neither barry nor gutworth, I think the
test suite *should* pass with randomized hashes
<__ap__> :)
<dmalcolm> :)
<__ap__> also the failure you're having is a bit worrying, since
apparently it's not about dict ordering
<dmalcolm> PYTHONHASHSEED exists mostly for selftesting (also for
compat, if you absolutely need to reproduce a specific random dict
ordering)
<__ap__> ok
<__ap__> if -E suppresses hash randomization, I think we should also add
a command-line flag
<__ap__> -R seems untaken
<__ap__> also it'll make things easier for Windows users, I think
msg152200 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-28 23:24
> Passes "make test" on this x86_64 Fedora 15 box, --with-pydebug, though
> that's without randomization enabled (it just does it within individual
> test cases that explicitly enable it).

I think you should check with randomization enabled, if only to see the
nature of the failures and if they are expected.
msg152203 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-28 23:56
> I think you should check with randomization enabled, if only to see the
> nature of the failures and if they are expected.

Including the list of when-enabled expected failures in the release 
notes would help those who compile and test.
msg152204 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-29 00:06
On Sat, 2012-01-28 at 23:56 +0000, Terry J. Reedy wrote:
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> > I think you should check with randomization enabled, if only to see the
> > nature of the failures and if they are expected.
> 
> Including the list of when-enabled expected failures in the release 
> notes would help those who compile and test.

OK, though note that because it's random, I'll have to run it a few
times, and we'll see what shakes out.

Am running with:
$  make test TESTPYTHONOPTS=-R
leading to:
   ./python -E -bb -R ./Lib/test/regrtest.py -l 

BTW, I see:
  Testing with flags: sys.flags(debug=0, division_warning=0, inspect=0,
interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0,
no_site=0, ignore_environment=1, verbose=0, bytes_warning=2)

which doesn't list the new flag.  Should I add it to sys.flags?  (or
does anyone ever do tuple-unpacking of that PyStructSequence and thus
rely on the number of elements?)
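For reference, sys.flags is a struct sequence, so both access styles work, and the field Dave describes did land (as `hash_randomization`); on a modern interpreter this can be checked directly:

```python
import sys

# Attribute access and tuple-style indexing both work on a struct
# sequence, which is why appending a field is low-risk -- but it does
# change the length seen by any code that tuple-unpacks sys.flags.
assert sys.flags[0] == sys.flags.debug
assert hasattr(sys.flags, "hash_randomization")
assert sys.flags.hash_randomization in (0, 1)
```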
msg152270 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-29 22:36
On Jan 28, 2012, at 07:26 PM, Dave Malcolm wrote:

>This turns out to pass without PYTHONHASHRANDOMIZATION in the
>environment, and fail intermittently with it.
>
>Note that "make test" invokes the built python with "-E", so that it
>ignores the setting of PYTHONHASHRANDOMIZATION in the environment.
>
>Barry, Benjamin: does fixing this bug require getting the full test
>suite to pass with randomization enabled (and fixing the intermittent
>failures due to ordering issues), or is it acceptable to "merely" have
>full passes without randomizing the hashes?

I think we at least need to identify (to the best of our ability) the tests
that fail and include them in release notes.  If they're easy to fix, we
should fix them.  Maybe also open a bug report for each failure.

I'm okay though with some tests failing in 2.6 with this environment variable
set.  We needn't go back and fix them in 2.6 (since we're in security-fix only
mode), but I'll bet you'll get almost the same set for 2.7 and there we
*should* fix them, even if it happens after the release.

>What do the buildbots do?

I'm not sure, but as long as the buildbots are green, I'm happy. :)
msg152271 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-29 22:39
Given PYTHONHASHSEED, what is the point of PYTHONHASHRANDOMIZATION?

Alternative:

On startup, python reads a config file with the seed (which defaults to zero).

Add a function to write a random value to that config file for the next startup.
msg152275 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-29 22:50
Barry A. Warsaw wrote:
> Barry A. Warsaw <barry@python.org> added the comment:
> 
> On Jan 28, 2012, at 07:26 PM, Dave Malcolm wrote:
> 
>> This turns out to pass without PYTHONHASHRANDOMIZATION in the
>> environment, and fail intermittently with it.
>>
>> Note that "make test" invokes the built python with "-E", so that it
>> ignores the setting of PYTHONHASHRANDOMIZATION in the environment.
>>
>> Barry, Benjamin: does fixing this bug require getting the full test
>> suite to pass with randomization enabled (and fixing the intermittent
>> failures due to ordering issues), or is it acceptable to "merely" have
>> full passes without randomizing the hashes?
> 
> I think we at least need to identify (to the best of our ability) the tests
> that fail and include them in release notes.  If they're easy to fix, we
> should fix them.  Maybe also open a bug report for each failure.

http://bugs.python.org/issue13903 causes even more tests to fail,
so I'm submitting bug reports for most of the failing tests already.
msg152276 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-29 22:51
> Given PYTHONHASHSEED, what is the point of PYTHONHASHRANDOMIZATION?

How would you do what it does without it? I.e. how would you indicate
that it should randomize the seed, rather than fixing the seed value?

> On startup, python reads a config file with the seed (which defaults to zero).

-1 on configuration files that Python reads at startup (let alone in a
bugfix release).
msg152299 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 01:39
On Sun, 2012-01-29 at 00:06 +0000, Dave Malcolm wrote:

I went ahead and added the flag to sys.flags, so now
  $ make test TESTPYTHONOPTS=-R
shows:
Testing with flags: sys.flags(debug=0, division_warning=0, inspect=0,
interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0,
no_site=0, ignore_environment=1, verbose=0, bytes_warning=2,
hash_randomization=1)

...note the:
  hash_randomization=1
at the end of sys.flags.  (This seems useful for making it absolutely
clear if you're getting randomization or not).  Hopefully I'm not
creating too much work for the other Python implementations.

Am attaching new version of patch for 3.1:
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch
msg152300 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 01:44
On Sat, 2012-01-28 at 23:56 +0000, Terry J. Reedy wrote:
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> > I think you should check with randomization enabled, if only to see the
> > nature of the failures and if they are expected.
> 
> Including the list of when-enabled expected failures in the release 
> notes would help those who compile and test.

Am attaching a patch which fixes various problems that are clearly just
assumptions about dict ordering:
  fix-unittests-broken-by-randomization-dmalcolm-2012-01-29-001.patch

 json/__init__.py                        |    4 +++-
 test/mapping_tests.py                   |    2 +-
 test/test_descr.py                      |   12 +++++++++++-
 test/test_urllib.py                     |    4 +++-
 tkinter/test/test_ttk/test_functions.py |    2 +-
 5 files changed, 19 insertions(+), 5 deletions(-)

Here are the issues that it fixes:
Lib/test/test_descr.py: fix for intermittent failure due to dict repr:
      File "Lib/test/test_descr.py", line 4304, in test_repr
        self.assertEqual(repr(self.C.__dict__), 'dict_proxy({!r})'.format(dict_))
    AssertionError: "dict_proxy({'__module__': 'test.test_descr', '__dict__': <attribute '__dict__' of 'C' objects>, '__doc__': None, '__weakref__': <attribute '__weakref__' of 'C' objects>, 'meth': <function meth at 0x5834be0>})"
                 != "dict_proxy({'__module__': 'test.test_descr', '__doc__': None, '__weakref__': <attribute '__weakref__' of 'C' objects>, 'meth': <function meth at 0x5834be0>, '__dict__': <attribute '__dict__' of 'C' objects>})"

Lib/json/__init__.py: fix (based on haypo's work) for intermittent failure:
    Failed example:
        json.dumps([1,2,3,{'4': 5, '6': 7}], separators=(',', ':'))
    Expected:
        '[1,2,3,{"4":5,"6":7}]'
    Got:
        '[1,2,3,{"6":7,"4":5}]'

Lib/test/mapping_tests.py: fix (based on haypo's work) for intermittent failures of test_collections, test_dict, and test_userdict seen here:
    ======================================================================
    ERROR: test_update (__main__.GeneralMappingTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "Lib/test/mapping_tests.py", line 207, in test_update
        i1 = sorted(d.items())
    TypeError: unorderable types: str() < int()

Lib/test/test_urllib.py: fix (based on haypo's work) for intermittent failure:
    ======================================================================
    FAIL: test_nonstring_seq_values (__main__.urlencode_Tests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "Lib/test/test_urllib.py", line 844, in test_nonstring_seq_values
        urllib.parse.urlencode({"a": {"a": 1, "b": 1}}, True))
    AssertionError: 'a=a&a=b' != 'a=b&a=a'
    ----------------------------------------------------------------------

Lib/tkinter/test/test_ttk/test_functions.py: fix from haypo's patch for intermittent failure:
    Traceback (most recent call last):
      File "Lib/tkinter/test/test_ttk/test_functions.py", line 146, in test_format_elemcreate
        ('a', 'b'), a='x', b='y'), ("test a b", ("-a", "x", "-b", "y")))
    AssertionError: Tuples differ: ('test a b', ('-b', 'y', '-a',... != ('test a b', ('-a', 'x', '-b',...

I see two remaining issues (which this patch doesn't address):
test test_module failed -- Traceback (most recent call last):
  File "Lib/test/test_module.py", line 79, in test_clear_dict_in_ref_cycle
    self.assertEqual(destroyed, [1])
AssertionError: Lists differ: [] != [1]

test_multiprocessing
Exception AssertionError: AssertionError() in <Finalize object, dead> ignored
msg152309 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2012-01-30 07:15
What about PYTHONHASHSEED= -> off, PYTHONHASHSEED=0 -> random, 
PYTHONHASHSEED=n -> n ? I agree with Jim that it's better to have one 
env. variable than two.
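As a sketch, the proposed mapping could be parsed like this (hash_seed_policy is a hypothetical helper; this encodes the proposal in this message, not necessarily the semantics that were eventually committed):

```python
import os

def hash_seed_policy(env=os.environ):
    """Zbyszek's proposal: unset/empty -> randomization off,
    0 -> seed from the platform RNG, any other int n -> fixed seed n."""
    value = env.get("PYTHONHASHSEED", "")
    if value == "":
        return ("off", None)      # deterministic default
    n = int(value)
    if n == 0:
        return ("random", None)   # draw the seed from the platform RNG
    return ("fixed", n)           # reproducible, shareable across a cluster
```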
msg152311 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-30 07:45
> What about PYTHONHASHSEED= -> off, PYTHONHASHSEED=0 -> random,
> PYTHONHASHSEED=n -> n ? I agree with Jim that it's better to have one
> env. variable than two.

Rather than the "" empty string for off I suggest an explicit string
that makes it clear what the meaning is.  PYTHONHASHSEED="disabled"
perhaps.

Agreed; a single env var is preferred if we can have one.  It is more
obvious that the PYTHONHASHSEED env var has no effect when it is set
to a special value than when it is set to something but configured to
be ignored by a _different_ env var.
msg152315 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-30 08:16
> Rather than the "" empty string for off I suggest an explicit string
> that makes it clear what the meaning is.  PYTHONHASHSEED="disabled"
> perhaps.
> 
> Agreed, if we can have a single env var that is preferred.  It is more
> obvious that the PYTHONHASHSEED env var. has no effect when it is set
> to a special value rather than when it is set to something but it is
> configured to be ignored by a _different_ env var.

I think this is bike-shedding. The requirements for environment
variables are
a) with no variable set, it must not do randomization
b) there must be a way to seed from the platform's RNG
Having an explicit seed actually is no requirement, so I'd propose
to drop PYTHONHASHSEED instead.

However, I really suggest to let the patch author (Dave Malcolm)
design the API within the constraints.
msg152335 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 17:31
It's useful for the selftests, so I've kept PYTHONHASHSEED.  However,
I've removed it from the man page; in the only other place it's
mentioned (Doc/using/cmdline.rst) I now explicitly say that it exists
just to serve the interpreter's own selftests.

Am attaching a revised patch, which has the above change, plus some
tweaks to Lib/test/test_hash.py (adds test coverage for the datetime
hash randomization):
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch

Has anyone had a chance to try this patch on Windows?  Martin?  I'm
hoping that it doesn't impose a startup cost in the default
no-randomization case, and that any startup cost in the -R case is
acceptable.
msg152344 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-30 19:55
On Mon, Jan 30, 2012 at 12:31 PM,  Dave Malcolm <dmalcolm@redhat.com>
added the comment:

> It's useful for the selftests, so I've kept PYTHONHASHSEED.

The reason to read PYTHONHASHSEED was so that multiple members of a
cluster could use the same hash.

It would have been nice to have fewer environment variables, but I'll
grant that it is hard to say "use something random that we have *not*
precomputed" without either a config file or a magic value for
PYTHONHASHSEED.

-jJ
msg152352 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 22:22
I slightly messed up the test_hash.py changes.

Revised patch attached:
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch
msg152362 - (view) Author: Martin (gz) Date: 2012-01-30 23:41
> Has anyone had a chance to try this patch on Windows?  Martin?  I'm
> hoping that it doesn't impose a startup cost in the default
> no-randomization case, and that any startup cost in the -R case is
> acceptable.

Just tested as requested. Is the patch against 3.1 for a reason? It
can't really be compared to earlier results, but I get enough weird
outliers that the comparison may not be useful anyway. Also needed the
following change:

-+        chunk = Py_MIN(size, INT_MAX);
++        chunk = size > INT_MAX ? INT_MAX : size;

Summary, looks like extra work in the default case is avoided and
isn't crippling otherwise, though there were a couple of very odd runs
not presented probably due to other disk access.

Vanilla:

>timeit PCbuild\python.exe -c "import sys;print(sys.version)"
3.1.4+ (default, Jan 30 2012, 22:38:52) [MSC v.1500 32 bit (Intel)]

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        10:42 pm, Monday, January 30 2012
Elapsed Time:     0:00:00.218
Process Time:     0:00:00.187
System Calls:     3974
Context Switches: 574
Page Faults:      1696
Bytes Read:       480331
Bytes Written:    0
Bytes Other:      190860


Patched:

>timeit PCbuild\python.exe -c "import sys;print(sys.version)"
3.1.4+ (default, Jan 30 2012, 22:55:06) [MSC v.1500 32 bit (Intel)]

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        10:55 pm, Monday, January 30 2012
Elapsed Time:     0:00:00.218
Process Time:     0:00:00.187
System Calls:     3560
Context Switches: 441
Page Faults:      1660
Bytes Read:       461956
Bytes Written:    0
Bytes Other:      24926


>timeit PCbuild\python.exe -Rc "import sys;print(sys.version)"
3.1.4+ (default, Jan 30 2012, 22:55:06) [MSC v.1500 32 bit (Intel)]

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        11:05 pm, Monday, January 30 2012
Elapsed Time:     0:00:00.249
Process Time:     0:00:00.234
System Calls:     3959
Context Switches: 483
Page Faults:      1847
Bytes Read:       892464
Bytes Written:    0
Bytes Other:      27090
msg152364 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-31 01:34
Am attaching a backport of optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch to 2.6

Randomization covers the str, unicode and buffer types; equality of hashes is preserved for these types.
msg152422 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-01 03:29
On Tue, 2012-01-31 at 01:34 +0000, Dave Malcolm wrote:
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> Am attaching a backport of optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch to 2.6
> 
> Randomization covers the str, unicode and buffer types; equality of hashes is preserved for these types.

I tried benchmarking the 2.6 version of the patch.

I reran "perf.py" 16 times, setting PYTHONHASHRANDOMIZATION=1 and
passing --inherit_env=PYTHONHASHRANDOMIZATION so that the patched
python uses the randomization, with a different hash seed on each run.

Some tests are slightly faster with the patch on some runs; some are
slightly slower, and it appears to vary from run to run.  However, the
amount involved is a few percent.  [compare e.g. with msg151078]

Here's the command I used.
(for i in $(seq 16) ; do echo RUN $i ; (PYTHONHASHRANDOMIZATION=1
python ./perf.py
--inherit_env=PYTHONHASHRANDOMIZATION ../cpython-2.6-clean/python ../cpython-2.6-hash-randomization/python) ; done) | tee results-16.txt

Am attaching the results.
msg152452 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-02 01:18
On Mon, 2012-01-30 at 23:41 +0000, Martin wrote:
> Martin <gzlist@googlemail.com> added the comment:
> 
> > Has anyone had a chance to try this patch on Windows?  Martin?  I'm
> > hoping that it doesn't impose a startup cost in the default
> > no-randomization case, and that any startup cost in the -R case is
> > acceptable.
> 
> Just tested as requested. Is the patch against 3.1 for a reason? Can't
> really be compared to earlier results, but get enough weird outliers
> that that may not be useful anyway. Also needed the following change:
> 
> -+        chunk = Py_MIN(size, INT_MAX);
> ++        chunk = size > INT_MAX ? INT_MAX : size;
> 
> Summary, looks like extra work in the default case is avoided and
> isn't crippling otherwise, though there were a couple of very odd runs
> not presented probably due to other disk access.

Thanks for testing this!

Oops, yes: Py_MIN is only present in "default" [it was added to
Include/Python.h (as PY_MIN) in 72475:8beaa9a37387 for PEP 393, renamed
to Py_MIN in 72489:dacac31460c0, eventually reaching Include/pymacro.h
in 72512:36fc514de7f0]

"orig_size" in win32_urandom was apparently unused, so I've removed it.

I also found and fixed an occasional failure in my 2.6 backport of the
new test_os.URandomTests.get_urandom_subprocess.

Am attaching 4 patches containing the above changes, plus patches to fix
dict/set ordering assumptions that start breaking if you try to run the
test suite with randomization enabled:
   add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch
   fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch
   add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch
   fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch

2.6 also could use the dict-ordering fix for test_symtable that was
fixed in 2.7 as 74256:789d59773801

FWIW I'm seeing this failure in test_urllib2, but I also see it with a
clean checkout of 2.6:
======================================================================
ERROR: test_invalid_redirect (__main__.HandlerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_urllib2.py", line 963, in test_invalid_redirect
    MockHeaders({"location": valid_url}))
  File
"/home/david/coding/python-hg/cpython-2.6-hash-randomization/Lib/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File
"/home/david/coding/python-hg/cpython-2.6-hash-randomization/Lib/urllib2.py", line 218, in __getattr__
    raise AttributeError, attr
AttributeError: timeout
msg152453 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-02 01:30
It looks like it was not yet decided if the CryptGenRandom API or a weak LCG should be used on Windows. Extract of add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch:

+#ifdef MS_WINDOWS
+#if 1
+        (void)win32_urandom((unsigned char *)secret, secret_size, 0);
+#else
+        /* fast but weak RNG (fast initialization, weak seed) */

Does someone know how to link Python to advapi32.dll (on Windows) to avoid GetModuleHandle("advapi32.dll")?
msg152723 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-02-06 06:11
IIUC, Win9x and NT4 are not supported anymore in any of the target releases of the patch, so calling CryptGenRandom should be fine.

In a security fix release, we shouldn't change the linkage procedures, so I recommend that the LoadLibrary dance remains.
msg152730 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-06 09:53
> In a security fix release, we shouldn't change the linkage procedures,
> so I recommend that the LoadLibrary dance remains.

So the overhead in startup time is not an issue?
msg152731 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 10:20
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> In a security fix release, we shouldn't change the linkage procedures,
>> so I recommend that the LoadLibrary dance remains.
> 
> So the overhead in startup time is not an issue?

It is an issue. Not only in terms of startup time, but also
because randomization per default makes Python behave in
non-deterministic ways - which is not what you want from a
programming language or interpreter (unless you explicitly
tell it to behave like that).

I think it would be much better to just let the user
define a hash seed using environment variables for Python
to use and then forget about how this variable value is
determined. If it's not set, Python uses 0 as seed, thereby
disabling the seeding logic.

This approach would have Python behave in a deterministic way
per default and still allow users who wish to use a different
seed, set this to a different value - even on a case by case
basis.

If you absolutely want to add a feature to have the seed set
randomly, you could make a seed value of -1 trigger the use
of a random number source as seed.

I also still firmly believe that the collision counting scheme
should be made available via an environment variable as well.
The user could then set the variable to e.g. 1000 to have it
enabled with limit 1000, or leave it undefined to disable the
collision counting.

With those two tools, users could then choose the method they
find most attractive for their purposes.

By default, they would be disabled, but applications which are
exposed to untrusted user data and use dictionaries for managing
such data could check whether the protections are enabled and
trigger a startup error if needed.
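The collision-counting scheme Marc-Andre describes can be sketched with a toy open-addressing table (illustrative only; the real patch changes the C lookup loop in Objects/dictobject.c, and all names here are invented):

```python
class TooManyCollisions(RuntimeError):
    """Raised when one insertion exceeds the collision limit."""

class CountingDict:
    """Toy open-addressing table sketching the collision-counting
    countermeasure: abort any single insertion whose probe chain
    exceeds `limit`."""

    def __init__(self, limit=1000, size=64):   # size must be a power of two
        self.limit = limit
        self.slots = [None] * size             # (key, value) pairs or None

    def insert(self, key, value):
        mask = len(self.slots) - 1
        i = hash(key) & mask
        probes = 0
        while self.slots[i] is not None and self.slots[i][0] != key:
            probes += 1
            if probes > self.limit:
                raise TooManyCollisions(key)
            i = (i + 1) & mask                 # linear probing for simplicity
        self.slots[i] = (key, value)
```

Note that this only bounds the cost of a single insertion; lookups that repeatedly walk a chain sitting just under the limit are not throttled, which is the weakness debated in the following messages.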
msg152732 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-06 12:22
> It is an issue. Not only in terms of startup time, but also
> because randomization per default makes Python behave in
> non-deterministic ways - which is not what you want from a
> programming language or interpreter (unless you explicitly
> tell it to behave like that).

That's debatable. For example id() is fairly unpredictable across runs
(except for statically-allocated instances).

> I think it would be much better to just let the user
> define a hash seed using environment variables for Python
> to use and then forget about how this variable value is
> determined. If it's not set, Python uses 0 as seed, thereby
> disabling the seeding logic.
> 
> This approach would have Python behave in a deterministic way
> per default and still allow users who wish to use a different
> seed, set this to a different value - even on a case by case
> basis.
> 
> If you absolutely want to add a feature to have the seed set
> randomly, you could make a seed value of -1 trigger the use
> of a random number source as seed.

Having both may indeed be a good idea.

> I also still firmly believe that the collision counting scheme
> should be made available via an environment variable as well.
> The user could then set the variable to e.g. 1000 to have it
> enabled with limit 1000, or leave it undefined to disable the
> collision counting.
> 
> With those two tools, users could then choose the method they
> find most attractive for their purposes.

It's not about being attractive, it's about fixing the security problem.
The simple collision counting approach leaves a gaping hole open, as
demonstrated by Frank.
msg152734 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 13:12
Antoine Pitrou wrote:
> 
> The simple collision counting approach leaves a gaping hole open, as
> demonstrated by Frank.

Could you elaborate on this ?

Note that I've updated the collision counting patch to cover both
possible attack cases I mentioned in http://bugs.python.org/issue13703#msg150724.
If there's another case I'm unaware of, please let me know.
msg152740 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-06 15:47
On Mon, Feb 6, 2012 at 8:12 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> Antoine Pitrou wrote:
>>
>> The simple collision counting approach leaves a gaping hole open, as
>> demonstrated by Frank.

> Could you elaborate on this ?

> Note that I've updated the collision counting patch to cover both
> possible attack cases I mentioned in http://bugs.python.org/issue13703#msg150724.
> If there's another case I'm unaware of, please let me know.

The problematic case is, roughly,

(1)  Find out what N will trigger collision-counting countermeasures.
(2)  Insert N-1 colliding entries, to make it as slow as possible.
(3)  Keep looking up (or updating) the N-1th entry, so that the
slow-as-possible-without-countermeasures path keeps getting rerun.
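On a stock CPython dict the third step can be sketched with integer keys, whose hashes PYTHONHASHSEED does not perturb (N is an arbitrary illustrative value):

```python
import sys
import timeit

# All positive multiples of sys.hash_info.modulus (2**61 - 1 on 64-bit
# builds) share the same hash, so they land in one probe chain.
M = sys.hash_info.modulus
N = 1000
colliding = {x * M: None for x in range(1, N)}   # step 2: N-1 colliding keys
control = {x: None for x in range(1, N)}         # same size, no collisions

# Step 3: keep looking up the key at the end of the chain; every lookup
# walks the whole chain again.
probe = (N - 1) * M
slow = timeit.timeit(lambda: colliding[probe], number=10_000)
fast = timeit.timeit(lambda: control[N - 1], number=10_000)
print(slow > fast)   # the colliding lookup is far slower per call
```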
msg152747 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 17:07
Jim Jewett wrote:
> 
> Jim Jewett <jimjjewett@gmail.com> added the comment:
> 
> On Mon, Feb 6, 2012 at 8:12 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>>
>> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>>
>> Antoine Pitrou wrote:
>>>
>>> The simple collision counting approach leaves a gaping hole open, as
>>> demonstrated by Frank.
> 
>> Could you elaborate on this ?
> 
>> Note that I've updated the collision counting patch to cover both
>> possible attack cases I mentioned in http://bugs.python.org/issue13703#msg150724.
>> If there's another case I'm unaware of, please let me know.
> 
> The problematic case is, roughly,
> 
> (1)  Find out what N will trigger collision-counting countermeasures.
> (2)  Insert N-1 colliding entries, to make it as slow as possible.
> (3)  Keep looking up (or updating) the N-1th entry, so that the
> slow-as-possible-without-countermeasures path keeps getting rerun.

Since N is constant, I don't see how such an "attack" could be used
to trigger the O(n^2) worst-case behavior. Even if you can create n sets
of entries that each fill up N-1 positions, the overall performance
will still be O(n*N*(N-1)/2) = O(n).

So in the end, we're talking about a regular brute force DoS attack,
which requires different measures than dictionary implementation
tricks :-)

BTW: If you set the limit N to e.g. 100 (which is reasonable given
Victor's and my tests), processing one of those sets takes only
0.3 ms on my machine. That's hardly usable as a basis for an
effective DoS attack.
msg152753 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-06 18:31
On Mon, Feb 6, 2012 at 12:07 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> Jim Jewett wrote:

>> The problematic case is, roughly,

>> (1)  Find out what N will trigger collision-counting countermeasures.
>> (2)  Insert N-1 colliding entries, to make it as slow as possible.
>> (3)  Keep looking up (or updating) the N-1th entry, so that the
>> slow-as-possible-without-countermeasures path keeps getting rerun.

> Since N is constant, I don't see how such an "attack" could be used
> to trigger the O(n^2) worst-case behavior.

Agreed; it tops out with a constant, but if it takes only 16 bytes of
input to force another run through a 1000-long collision, that may
still be too much leverage.

> BTW: If you set the limit N to e.g. 100 (which is reasonable given
> Victor's and my tests),

Agreed.  Frankly, I think 5 would be more than reasonable so long as
there is a fallback.

> the time it takes to process one of those
> sets only takes 0.3 ms on my machine. That's hardly usable as basis
> for an effective DoS attack.

So it would take around 3MB to cause a minute's delay...
msg152754 - (view) Author: Frank Sievertsen (fx5) Date: 2012-02-06 18:53
> Agreed; it tops out with a constant, but if it takes only 16 bytes of
> input to force another run through a 1000-long collision, that may
> still be too much leverage.

You should prepare the dict so that you hit the collision run with a one-byte string, or better yet an empty string, not a 16-byte string.

> BTW: If you set the limit N to e.g. 100 (which is reasonable given
> Victor's and my tests),

100 is probably hard to exploit for a DoS attack. However
it makes it much easier to cause unwanted (future?) exceptions in
other apps.

> So it would take around 3Mb to cause a minute's delay...

How did you calculate that?
msg152755 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 18:54
Jim Jewett wrote:
> 
>> BTW: If you set the limit N to e.g. 100 (which is reasonable given
>> Victor's and my tests),
> 
> Agreed.  Frankly, I think 5 would be more than reasonable so long as
> there is a fallback.
> 
>> the time it takes to process one of those
>> sets only takes 0.3 ms on my machine. That's hardly usable as basis
>> for an effective DoS attack.
> 
> So it would take around 3Mb to cause a minute's delay...

I'm not sure how you calculated that number.

Here's what I get: take a dictionary with 100 integer collisions:
d = dict((x*(2**64 - 1), 1) for x in xrange(1, 100))

The repr(d) has 2713 bytes, which is a good approximation of how
much (string) data you have to send in order to trigger the
problem case.

If you can create 3333 distinct integer sequences, you'll get a
processing time of about 1 second on my slow dev machine. The
resulting dict will likely have a repr() of around
60*3333*2713 = 517MB.

So you need to send 517MB to cause my slow dev machine to consume
1 minute of CPU time. Today's servers are at least 10 times as fast as
my aging machine.

If you then take into account that the integer collision dictionary
is a very efficient collision example (size vs. effect), the attack
doesn't really sound practical anymore.
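The Python 3 analogue of this construction (Marc-Andre's example targets Python 2, whose long hash used a different modulus; on Python 3 the numeric hash modulus is exposed as sys.hash_info.modulus, 2**61 - 1 on 64-bit builds):

```python
import sys

# Every positive multiple of the numeric hash modulus hashes to 0,
# and PYTHONHASHSEED does not perturb integer hashes -- which is why
# seeding alone does not cover the integer-collision case.
M = sys.hash_info.modulus
d = dict((x * M, 1) for x in range(1, 100))   # 99 colliding keys

assert len({hash(k) for k in d}) == 1         # all in one bucket
print(len(repr(d)))   # rough size of the payload an attacker must send
```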
msg152758 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-06 19:07
On Mon, 2012-02-06 at 06:11 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> IIUC, Win9x and NT4 are not supported anymore in any of the target releases of the patch, so calling CryptGenRandom should be fine.
> In a security fix release, we shouldn't change the linkage procedures, so I recommend that the LoadLibrary dance remains.

Thanks.

Am attaching tweaked versions of the 2012-02-01 patches, in which I've
removed the indecisive:
#if 1
       (void)win32_urandom((unsigned char *)secret, secret_size, 0);
#else
       /* fast but weak RNG (fast initialization, weak seed) */
       ...etc...
#endif

stuff, and simply use the first clause (win32_urandom) on Windows:
  add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch 
  fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch 
  add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch 
  fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch
msg152760 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-06 19:11
On Mon, 2012-02-06 at 10:20 +0000, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> STINNER Victor wrote:
> > 
> > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> > 
> >> In a security fix release, we shouldn't change the linkage procedures,
> >> so I recommend that the LoadLibrary dance remains.
> > 
> > So the overhead in startup time is not an issue?
> 
> It is an issue. Not only in terms of startup time, but also

msg152362 indicated that there was negligible impact on startup time
when randomization is disabled.  The impact when it *is* enabled is
unclear, but reported there as "isn't crippling".

> because randomization per default makes Python behave in
> non-deterministic ways - which is not what you want from a
> programming language or interpreter (unless you explicitly
> tell it to behave like that).

The release managers have pronounced:
http://mail.python.org/pipermail/python-dev/2012-January/115892.html
Quoting that email:
> 1. Simple hash randomization is the way to go. We think this has the
> best chance of actually fixing the problem while being fairly
> straightforward such that we're comfortable putting it in a stable
> release.
> 2. It will be off by default in stable releases and enabled by an
> envar at runtime. This will prevent code breakage from dictionary
> order changing as well as people depending on the hash stability.
msg152763 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-06 19:34
On Mon, Feb 6, 2012 at 1:53 PM, Frank Sievertsen <report@bugs.python.org> wrote:

>>> BTW: If you set the limit N to e.g. 100 (which is reasonable given
>>> Victor's and my tests),

>> So it would take around 3Mb to cause a minute's delay...

> How did you calculate that?

16 bytes/entry * 3300 entries/second * 60 seconds/minute

But if there is indeed a way to cut that 16 bytes/entry, that is worse.

Switching dict implementations at 5 collisions is still acceptable,
except from a complexity standpoint.

-jJ
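Spelling out the arithmetic (the per-entry and per-second figures are Jim's estimates from the surrounding discussion, not measured values):

```python
bytes_per_entry = 16        # attacker payload per colliding entry (estimate)
entries_per_second = 3300   # processing rate implied by the 0.3 ms/set timing
seconds_per_minute = 60

payload_bytes = bytes_per_entry * entries_per_second * seconds_per_minute
print(payload_bytes)        # 3168000 bytes, i.e. roughly 3 MB
```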
msg152764 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 19:44
Dave Malcolm wrote:
> 
>>> So the overhead in startup time is not an issue?
>>
>> It is an issue. Not only in terms of startup time, but also
>... 
>> because randomization per default makes Python behave in
>> non-deterministic ways - which is not what you want from a
>> programming language or interpreter (unless you explicitly
>> tell it to behave like that).
> 
> The release managers have pronounced:
> http://mail.python.org/pipermail/python-dev/2012-January/115892.html
> Quoting that email:
>> 1. Simple hash randomization is the way to go. We think this has the
>> best chance of actually fixing the problem while being fairly
>> straightforward such that we're comfortable putting it in a stable
>> release.
>> 2. It will be off by default in stable releases and enabled by an
>> envar at runtime. This will prevent code breakage from dictionary
>> order changing as well as people depending on the hash stability.

Right, but that doesn't contradict what I wrote about adding
env vars to fix a seed and optionally enable using a random
seed, or adding collision counting as extra protection for
cases that are not addressed by the hash seeding, such as
collisions caused by 3rd party types or numbers.
msg152767 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 20:14
Marc-Andre Lemburg wrote:
> Dave Malcolm wrote:
>> The release managers have pronounced:
>> http://mail.python.org/pipermail/python-dev/2012-January/115892.html
>> Quoting that email:
>>> 1. Simple hash randomization is the way to go. We think this has the
>>> best chance of actually fixing the problem while being fairly
>>> straightforward such that we're comfortable putting it in a stable
>>> release.
>>> 2. It will be off by default in stable releases and enabled by an
>>> envar at runtime. This will prevent code breakage from dictionary
>>> order changing as well as people depending on the hash stability.
> 
> Right, but that doesn't contradict what I wrote about adding
> env vars to fix a seed and optionally enable using a random
> seed, or adding collision counting as extra protection for
> cases that are not addressed by the hash seeding, such as
> collisions caused by 3rd party types or numbers.

... at least I hope not :-)
msg152768 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-06 20:17
> > Right, but that doesn't contradict what I wrote about adding
> > env vars to fix a seed and optionally enable using a random
> > seed, or adding collision counting as extra protection for
> > cases that are not addressed by the hash seeding, such as
>> collisions caused by 3rd party types or numbers.
> 
> ... at least I hope not :-)

I think the env var part is a good idea (except that -1 as a magic value
to enable randomization isn't great).
msg152769 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 20:24
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>>> Right, but that doesn't contradict what I wrote about adding
>>> env vars to fix a seed and optionally enable using a random
>>> seed, or adding collision counting as extra protection for
>>> cases that are not addressed by the hash seeding, such as
>>> collisions caused by 3rd party types or numbers.
>>
>> ... at least I hope not :-)
> 
> I think the env var part is a good idea (except that -1 as a magic value
> to enable randomization isn't great).

Agreed. Since it's an env var, using "random" would be a better choice.
msg152777 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-06 21:18
>
> > The release managers have pronounced:
> > http://mail.python.org/pipermail/python-dev/2012-January/115892.html
> > Quoting that email:
> >> 1. Simple hash randomization is the way to go. We think this has the
> >> best chance of actually fixing the problem while being fairly
> >> straightforward such that we're comfortable putting it in a stable
> >> release.
> >> 2. It will be off by default in stable releases and enabled by an
> >> envar at runtime. This will prevent code breakage from dictionary
> >> order changing as well as people depending on the hash stability.
>
> Right, but that doesn't contradict what I wrote about adding
> env vars to fix a seed and optionally enable using a random
> seed, or adding collision counting as extra protection for
> cases that are not addressed by the hash seeding, such as
> collisions caused by 3rd party types or numbers.

We won't be back-porting anything more than the hash randomization for
2.6/2.7/3.1/3.2 but we are free to do more in 3.3 if someone can
demonstrate it working well and a need for it.

For me, things like collision counting and tree based collision
buckets when the types are all the same and known comparable make
sense but are really sounding like a lot of additional complexity. I'd
*like* to see active black-box design attack code produced that goes
after something like a wsgi web app written in Python with hash
randomization *enabled* to demonstrate the need before we accept
additional protections like this  for 3.3+.

-gps
msg152780 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 21:41
Gregory P. Smith wrote:
> 
> Gregory P. Smith <greg@krypto.org> added the comment:
> 
>>
>>> The release managers have pronounced:
>>> http://mail.python.org/pipermail/python-dev/2012-January/115892.html
>>> Quoting that email:
>>>> 1. Simple hash randomization is the way to go. We think this has the
>>>> best chance of actually fixing the problem while being fairly
>>>> straightforward such that we're comfortable putting it in a stable
>>>> release.
>>>> 2. It will be off by default in stable releases and enabled by an
>>>> envar at runtime. This will prevent code breakage from dictionary
>>>> order changing as well as people depending on the hash stability.
>>
>> Right, but that doesn't contradict what I wrote about adding
>> env vars to fix a seed and optionally enable using a random
>> seed, or adding collision counting as extra protection for
>> cases that are not addressed by the hash seeding, such as
>> collisions caused by 3rd party types or numbers.
> 
> We won't be back-porting anything more than the hash randomization for
> 2.6/2.7/3.1/3.2 but we are free to do more in 3.3 if someone can
> demonstrate it working well and a need for it.
> 
> For me, things like collision counting and tree based collision
> buckets when the types are all the same and known comparable make
> sense but are really sounding like a lot of additional complexity. I'd
> *like* to see active black-box design attack code produced that goes
> after something like a wsgi web app written in Python with hash
> randomization *enabled* to demonstrate the need before we accept
> additional protections like this  for 3.3+.

I posted several examples for the integer collision attack on this
ticket. The current randomization patch does not address this at all,
the collision counting patch does, which is why I think both are
needed.

Note that my comment was more about the desire to *not* recommend
using random hash seeds per default, but instead advocate using
a random but fixed seed, or at least document that using random
seeds that are set during interpreter startup will cause
problems with repeatability of application runs.
msg152781 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-02-06 21:42
On Mon, Feb 6, 2012 at 4:41 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> [...]

Can't randomization just be applied to integers as well?

Alex
msg152784 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-06 21:53
> Can't randomization just be applied to integers as well?
> 

It could, but see http://bugs.python.org/issue13703#msg151847

Would my patches be more or less likely to get reviewed with vs without
an extension of randomization to integers?
msg152787 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 22:04
Alex Gaynor wrote:
> Can't randomization just be applied to integers as well?

A simple seed xor'ed with the hash won't work, since the attacks
I posted will continue to work (just colliding on a different hash
value).

Using a more elaborate hash algorithm would slow down uses of
numbers as dictionary keys and also be difficult to implement for
non-integer types such as float, longs and complex numbers. The
reason is that Python applications expect x == y => hash(x) == hash(y),
e.g. hash(3) == hash(3L) == hash(3.0) == hash(3+0j).
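The invariant is easy to check (Python 3 syntax; in Python 2 the same holds for 3L as well):

```python
from fractions import Fraction
from decimal import Decimal

# equal numbers compare equal across numeric types...
assert 3 == 3.0 == 3 + 0j == Fraction(3)
# ...so their hashes must agree too, which constrains any change
# to the numeric hash algorithm
assert hash(3) == hash(3.0) == hash(3 + 0j) == hash(Fraction(3)) == hash(Decimal(3))
```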

AFAIK, the randomization patch also doesn't cover tuples, which are
rather common as dictionary keys as well, nor any of the other
more esoteric Python built-in hashable data types (e.g. frozenset)
or hashable data types defined by 3rd party extensions or
applications (simply because it can't).
msg152789 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-02-06 22:07
On Mon, Feb 6, 2012 at 5:04 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> [...]

There's no need to cover any container types, because if their constituent
types are securely hashable then they will be as well.  And of course if
the constituent types are insecure then they're directly vulnerable.

Alex
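This point can be checked empirically: a tuple's hash is derived from its elements' hashes, so randomizing str hashes perturbs tuples too. A rough cross-process check (assuming an interpreter that honors PYTHONHASHSEED; `external_hash` is an illustrative helper, not part of any patch):

```python
import os
import subprocess
import sys

def external_hash(expr, seed):
    """Hash `expr` in a child interpreter running with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash(%s))" % expr], env=env)
    return int(out)

# with a fixed seed the tuple's hash is reproducible across processes...
assert external_hash("('spam', 1)", "0") == external_hash("('spam', 1)", "0")
assert external_hash("('spam', 1)", "1") == external_hash("('spam', 1)", "1")
# ...while a different seed will (almost certainly) move the tuple's hash
# along with the hash of its str element -- no separate tuple seeding needed
```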
msg152797 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 23:00
Alex Gaynor wrote:
> There's no need to cover any container types, because if their constituent
> types are securely hashable then they will be as well.  And of course if
> the constituent types are insecure then they're directly vulnerable.

I wouldn't necessarily take that for granted: since container
types usually calculate their hash based on the hashes of their
elements, it's possible that a clever combination of elements
could lead to a neutralization of the hash seed used by
the elements, thereby reenabling the original attack on the
unprotected interpreter.

Still, because we have far more vulnerable hashable types out there,
trying to find such an attack doesn't really make practical
sense, so protecting containers is indeed not as urgent :-)
msg152811 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-07 15:41
On Mon, 2012-02-06 at 23:00 +0000, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> Alex Gaynor wrote:
> > There's no need to cover any container types, because if their constituent
> > types are securely hashable then they will be as well.  And of course if
> > the constituent types are insecure then they're directly vulnerable.
> 
> I wouldn't necessarily take that for granted: since container
> types usually calculate their hash based on the hashes of their
> elements, it's possible that a clever combination of elements
> could lead to a neutralization of the hash seed used by
> the elements, thereby reenabling the original attack on the
> unprotected interpreter.
> 
> Still, because we have far more vulnerable hashable types out there,
> trying to find such an attack doesn't really make practical
> sense, so protecting containers is indeed not as urgent :-)

FWIW, I'm still awaiting review of my patches.  I don't believe
Marc-Andre's concerns are a sufficient rebuttal to the approach I've
taken.

If anyone is aware of an attack via numeric hashing that's actually
possible, please let me know (privately).  I believe only specific apps
could be affected, and I'm not aware of any such specific apps.
msg152855 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-08 13:10
Dave Malcolm wrote:
> 
> If anyone is aware of an attack via numeric hashing that's actually
> possible, please let me know (privately).  I believe only specific apps
> could be affected, and I'm not aware of any such specific apps.

I'm not sure what you'd like to see.

Any application reading user-provided data from a file, database,
web, etc. is vulnerable to the attack if it uses the numeric data
it reads as keys in a dictionary.

The most common use case for this is a dictionary mapping codes or
IDs to strings or objects, e.g. for caching purposes, to find a list
of unique IDs, checking for duplicates, etc.

This also works indirectly on 32-bit platforms, e.g. via date/time
or IP address values that get converted to key integers.
msg153055 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-02-10 15:30
So modulo my (small) review comments, David's patches are ready to go in.
msg153074 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-10 19:23
Thanks for reviewing, Benjamin.  I'm also reviewing this today.  Sorry
for the delay!

BTW, in the Schadenfreude department: a hash collision DoS "fix" patch for
PHP5 was done poorly and introduced a new security vulnerability that
was just used to let script kiddies root many servers all around the
web:  http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2012-0830
msg153081 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-10 23:01
Review of add-randomization-(...).patch:
 - there is a missing ")" in the doc, near "the types covered by the :option:`-R` option (or its equivalent, :envvar:`PYTHONHASHRANDOMIZATION`."
 - get_hash() in test_hash.py fails completely on Windows: Windows requires some environment variables. Just use env=os.environ.copy() instead of env={}.
 - PYTHONHASHSEED doc is not clear: it should be mentioned that the variable is ignored if PYTHONHASHRANDOMIZATION is not set
 - (Python 2.6) test_hash fails because of "[xxx refs]" in stderr if Python is compiled in debug mode. Add strip_python_stderr() to test_support.py and use it in get_hash().

def strip_python_stderr(stderr):
    """Strip the stderr of a Python process from potential debug output
    emitted by the interpreter.

    This will typically be run on the result of the communicate() method
    of a subprocess.Popen object.
    """
    stderr = re.sub(br"\[\d+ refs\]\r?\n?$", b"", stderr).strip()
    return stderr
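For illustration, here is the helper restated in self-contained form (with the `import re` it needs) together with a quick check of its behavior on debug-build stderr:

```python
import re

def strip_python_stderr(stderr):
    """Strip trailing "[NNN refs]" debug output from a Python
    process's stderr (bytes in, bytes out)."""
    return re.sub(br"\[\d+ refs\]\r?\n?$", b"", stderr).strip()

# the refcount line emitted by --with-pydebug builds is removed...
assert strip_python_stderr(b"some error\n[12345 refs]\n") == b"some error"
# ...and ordinary stderr passes through (modulo surrounding whitespace)
assert strip_python_stderr(b"clean output\n") == b"clean output"
```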

Except for these minor nits, the patches (2.6 and 3.1) look good. I didn't read the test patches: I just ran the tests to exercise them :-) (Or our buildbots will do the work for you.)
msg153082 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-10 23:49
On Fri, Feb 10, 2012 at 6:02 PM, STINNER Victor wrote:

>  - PYTHONHASHSEED doc is not clear: it should be mentioned
> that the variable is ignored if PYTHONHASHRANDOMIZATION
> is not set

*That* is why this two-envvar solution bothers me.

PYTHONHASHSEED has to be a string anyhow, so why not just get rid of
PYTHONHASHRANDOMIZATION?

Use PYTHONHASHSEED=random to use randomization.

Other values that cannot be turned into an integer will be (currently)
undefined.  (You may want to raise a fatal error, on the assumption
that errors should not pass silently.)

A missing PYTHONHASHSEED then has the pleasant interpretation of
defaulting to "0" for backwards compatibility.
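The single-envvar scheme could be sketched as follows. This is a hypothetical helper for illustration only (`read_hash_seed` and its error handling are not part of any patch), but it captures the proposed semantics: missing means 0, "random" means a fresh seed, anything else must parse as an integer:

```python
import os
import random

def read_hash_seed(environ=os.environ):
    """Interpret PYTHONHASHSEED per the single-envvar proposal:
    missing -> 0 (backwards compatible), "random" -> fresh random
    seed, anything else must parse as an integer (else fatal error)."""
    value = environ.get("PYTHONHASHSEED", "0")
    if value == "random":
        return random.SystemRandom().getrandbits(64)
    try:
        return int(value)
    except ValueError:
        raise SystemExit("PYTHONHASHSEED must be 'random' or an integer")

assert read_hash_seed({}) == 0                        # unset: old behavior
assert read_hash_seed({"PYTHONHASHSEED": "42"}) == 42 # fixed, reproducible seed
```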
msg153140 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-11 23:06
On Fri, 2012-02-10 at 23:02 +0000, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> Review of add-randomization-(...).patch:
>  - there is a missing ")" in the doc, near "the types covered by the :option:`-R` option (or its equivalent, :envvar:`PYTHONHASHRANDOMIZATION`."
>  - get_hash() in test_hash.py fails completly on Windows: Windows requires some environment variables. Just use env=os.environ.copy() instead of env={}.
>  - PYTHONHASHSEED doc is not clear: it should be mentionned that the variable is ignored if PYTHONHASHRANDOMIZATION is not set
>  - (Python 2.6) test_hash fails because of "[xxx refs]" in stderr if Python is compiled in debug mode. Add strip_python_stderr() to test_support.py and use it in get_hash().

I'm attaching revised versions of the "add-randomization" patches
incorporating review feedback:
  add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch
  add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch

The other pair of patches are unchanged from before:
  fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch
  fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch

Changes relative to *-2012-02-06-001.patch:
  * changed the wording of the docs relating to PYTHONHASHSEED in
Doc/using/cmdline.rst to:
    * clarify the interaction with PYTHONHASHRANDOMIZATION and -R
    * mentioning another possible use case: "to allow a cluster of
python processes to share hash values." (as per
http://bugs.python.org/issue13703#msg152344 )
    * rewording the awkward "overrides the other setting"
  * I've added a description of PYTHONHASHSEED back to the man page and
to the --help text
  * grammar fixes for "Fail to" in 2.6 version of the patch (were
already fixed in 3.1)
  * restored __VMS randomization, by porting vms_urandom from
Modules/posixmodule.c to Python/random.c (though I have no way of
testing this)
  * changed env = {} to env = os.environ.copy() in get_hash() as noted
by haypo
  * fixed test_hash --with-pydebug as noted by haypo (and test_os),
adding strip_python_stderr from 2.7

I haven't enabled randomization in the Makefile.pre.in
msg153141 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-11 23:09
I'm not quite sure how that would interact with the -R command-line
option for enabling randomization.

The changes to the docs in the latest patch clarifies the meaning of
what I've implemented (I hope).

My view is that we should simply enable hash randomization by default in
3.3

At that point, PYTHONHASHRANDOMIZATION and the -R option become
meaningless (and could be either removed, or silently ignored), and you
have to set PYTHONHASHSEED=0 to get back the old behavior.
msg153143 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-12 01:37
Should -R be required to take a parameter specifying "on" or "off" so
that code using a #! line continues to work as specified across a
change in default behavior when upgrading from 3.2 to 3.3?

#!/usr/bin/python3 -R on
#!/usr/bin/python3 -R off

In 3.3 it would be a good idea to have a command line flag to turn
this off.  Rather than introducing a new flag in 3.3, a parameter that
is explicit regardless of the default avoids that entirely.

before anyone suggests it: I do *not* think -R should accept a value
to use as the seed.  that is unnecessary.
msg153144 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-12 02:11
Comments to be addressed added on the code review.
msg153297 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-13 20:37
On Sun, 2012-02-12 at 02:11 +0000, Gregory P. Smith wrote:
> Gregory P. Smith <greg@krypto.org> added the comment:
> 
> Comments to be addressed added on the code review.

Thanks.  I'm attaching new patches:
  add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch
  add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch

I incorporated the feedback from Gregory P Smith's review.

I haven't changed the way the command-line options or env variables
work, though.

Changes relative to *-2012-02-11-001.patch:
  * added versionadded 2.6.8 and 3.1.5 to hash_randomization/-R within
Docs/library/sys.rst and Docs/using/cmdline.rst (these will need
changing to "2.7.3" and "3.2.3" in the forward ports to the 2.7 and 3.2
branches)
  * fixed line wrapping within the --help text in Modules/main.c
  * reverted text of urandom__doc__
  * added comments about the specialcasing of length 0:
    /*
      We make the hash of the empty string be 0, rather than using
      (prefix ^ suffix), since this slightly obfuscates the hash secret
    */
    (see discussion in http://bugs.python.org/issue13703#msg151664
onwards)
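The special case is easy to observe: in CPython the empty string hashes to 0 regardless of the seed in use, so the secret is never exposed via hash(""):

```python
# hash("") is pinned to 0 so that (prefix ^ suffix) is never revealed
assert hash("") == 0
assert hash(b"") == 0   # the same special case applies to empty bytes
```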

I didn't change the range of values for PYTHONHASHSEED on 64-bit
msg153301 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-13 20:50
Dave Malcolm wrote:
> [new patch]

Please change how the env vars work as discussed earlier on this ticket.

Quick summary:

We only need one env var for the randomization logic: PYTHONHASHSEED.
If not set, 0 is used as seed. If set to a number, a fixed seed
is used. If set to "random", a random seed is generated at
interpreter startup.

Same for the -R cmd line option.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

msg153369 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-14 20:34
On Mon, Feb 13, 2012 at 3:37 PM,  Dave Malcolm
<dmalcolm@redhat.com> added the comment:

>  * added comments about the specialcasing of length 0:
>    /*
>      We make the hash of the empty string be 0, rather than using
>      (prefix ^ suffix), since this slightly obfuscates the hash secret
>    */

Frankly, other short strings may give away even more, because you can
put several into the same dict.

I would prefer that the randomization not kick in until strings are at
least 8 characters, but I think excluding length 1 is a pretty obvious
win.
msg153395 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-02-15 08:25
> Frankly, other short strings may give away even more, because you can
> put several into the same dict.

Please don't make such claims without some reasonable security analysis:
how *exactly* would you derive the hash seed when you have the hash
values of all 256 one-byte strings (or all 2**20 one-char Unicode
strings)?

> I would prefer that the randomization not kick in until strings are at
> least 8 characters, but I think excluding length 1 is a pretty obvious
> win.

-1. It is very easy to create a good number of hash collisions already
with 6-character strings. You are opening the security hole again that
we are attempting to close.
msg153682 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-19 09:14
Attaching a reviewed version for 3.1 with the unified env var PYTHONHASHSEED, incorporating Antoine's and Greg's review comments.
msg153683 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-19 09:21
New version, with the hope that it gets a "review" link.
msg153690 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-19 10:00
New patch fixes failures due to sys.flags backwards compatibility.

With PYTHONHASHSEED=random, at least those tests still fail:
test_descr test_json test_set test_ttk_textonly test_urllib

Do we want to fix them in 3.1?
msg153695 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-19 10:22
> With PYTHONHASHSEED=random, at least those tests still fail:
> test_descr test_json test_set test_ttk_textonly test_urllib
>
> Do we want to fix them in 3.1?

If the failures are caused by the tests depending on dict order (i.e. not real bugs, not changed behavior), then I think we can live with them.
msg153750 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-20 01:05
> With PYTHONHASHSEED=random, at least those tests still fail:
> test_descr test_json test_set test_ttk_textonly test_urllib
> 
> Do we want to fix them in 3.1?

I don't know, but we'll have to fix them in 3.2 to avoid breaking the
buildbots. So we might also fix them in 3.1.
msg153753 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-02-20 01:31
+1 for fixing all tests.
msg153798 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-20 19:01
New changeset f4b7ecf8a5f8 by Georg Brandl in branch '3.1':
Issue #13703: add a way to randomize the hash values of basic types (str, bytes, datetime)
http://hg.python.org/cpython/rev/f4b7ecf8a5f8
msg153802 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-20 20:41
New changeset 4a31f6b11e7a by Georg Brandl in branch '3.2':
Merge from 3.1: Issue #13703: add a way to randomize the hash values of basic types (str, bytes, datetime)
http://hg.python.org/cpython/rev/4a31f6b11e7a
msg153817 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-20 23:37
New changeset ed76dc34b39d by Georg Brandl in branch 'default':
Merge 3.2: Issue #13703 plus some related test suite fixes.
http://hg.python.org/cpython/rev/ed76dc34b39d
msg153833 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-21 01:44
New changeset 6b7704fe1be1 by Barry Warsaw in branch '2.6':
- Issue #13703: oCERT-2011-003: add -R command-line option and PYTHONHASHSEED
http://hg.python.org/cpython/rev/6b7704fe1be1
msg153848 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 06:01
Roundup Robot didn't seem to notice it, but this has also been committed in 2.7:

http://hg.python.org/cpython/rev/a0f43f4481e0
msg153849 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-21 06:03
Yep, the bot only looks at commit messages; it does not inspect merges or other topological information.  That’s why some of us make sure to repeat bug numbers in our merge commit messages.
msg153850 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-21 06:12
But since our workflow is such that commits in X.Y branches always show up in X.Y+1, it doesn't really matter.
msg153852 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 06:35
The bug report is the easiest thing to search for and follow when checking whether something is resolved, so it is nice to have a link to the relevant patch(es) for each branch.  I just wanted to note the major commit here so that all planned branches had a note recorded.  I don't care that it wasn't automatic. :)

For observers: There have been several more commits related to fixing this (test dict/set order fixes, bug/typo/merge oops fixes for the linked-to patches, etc.). Anyone interested in seeing the full list of diffs should look at their specific branch on or around the time of the linked-to changesets.  Too many to list here.
msg153853 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 06:40
Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.

Saying yes "working as intended" is fine by me.

sys.flags.hash_randomization seems to simply indicate that doing something with the hash seed was explicitly specified as opposed to defaulting to off, not that the hash seed was actually chosen randomly.

What this implies for 3.3 after we make hash randomization default to on is that sys.flags.hash_randomization will always be 1.
msg153854 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-21 06:47
That is a good question.  I don't really care either way, but let's say +0 for turning it off when seed == 0.

-R still needs to be made default in 3.3 - that's one reason this issue is still open.
msg153860 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-21 09:47
> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.
>
> Saying yes "working as intended" is fine by me.

It is documented that PYTHONHASHSEED=0 disables the randomization, so
sys.flags.hash_randomization must be False (0).
msg153861 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-21 09:48
Gregory P. Smith wrote:
> 
> Gregory P. Smith <greg@krypto.org> added the comment:
> 
> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.

The flag should probably be removed - simply because
the env var is not a flag, it's a configuration parameter.

Exposing the seed value as sys.hashseed would be better and more useful
to applications.
msg153862 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-21 09:50
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@gmail.com> added the comment:
> 
>> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.
>>
>> Saying yes "working as intended" is fine by me.
> 
> It is documented that PYTHONHASHSEED=0 disables the randomization, so
> sys.flags.hash_randomization must be False (0).

PYTHONHASHSEED=1 will disable randomization as well :-)

Only setting PYTHONHASHSEED=random actually enables randomization.
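These semantics can be verified from the outside (assuming an interpreter that honors PYTHONHASHSEED; `hash_in_child` is an illustrative helper): any fixed integer seed yields reproducible hashes across runs, and only `random` varies between runs:

```python
import os
import subprocess
import sys

def hash_in_child(seed):
    """Hash a fixed string in a child interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('collision'))"], env=env)
    return int(out)

# PYTHONHASHSEED=0 and PYTHONHASHSEED=1 each pin the hash function,
# so repeated runs agree...
assert hash_in_child("0") == hash_in_child("0")
assert hash_in_child("1") == hash_in_child("1")
# ...only PYTHONHASHSEED=random varies from one interpreter run to the next
```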
msg153868 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-21 11:37
> That is a good question.  I don't really care either way, but let's
> say +0 for turning it off when seed == 0.

+1
msg153872 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-21 15:33
On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

>Exposing the seed value as sys.hashseed would be better and more useful
>to applications.

That makes the most sense to me.
msg153873 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-21 15:42
On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

>The flag should probably be removed - simply because
>the env var is not a flag, it's a configuration parameter.
>
>Exposing the seed value as sys.hashseed would be better and more useful
>to applications.

Okay, after chatting with __ap__ on irc, here's what I think the behavior
should be:

sys.flags.hash_randomization should contain just the value given by the -R
flag.  It should only be True if the flag is present, False otherwise.

sys.hash_seed contains the hash seed, set by virtue of the flag or envar.  It
should contain the *actual* seed value used.  E.g. it might be zero, the
explicitly set integer, or the randomly selected seed value in use during this
Python execution if a random seed was requested.

If you really need the envar value, getenv('PYTHONHASHSEED') is good enough
for that.
msg153877 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 16:28
+1 to what barry and __ap__ discussed and settled on.
msg153975 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-22 17:46
I have to amend my suggestion about sys.flags.hash_randomization.  It needs to be non-zero even if $PYTHONHASHSEED is given instead of -R.  Many other flags that also have envars work the same way, e.g. -O and $PYTHONOPTIMIZE.  So hash_randomization has to work the same way.

I'll still work on a patch for exposing the seed in sys.
msg153980 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-22 18:12
Never mind about sys.hash_seed.  See my follow up in python-dev.  I consider this issue is closed wrt the 2.6 branch.
msg154428 - (view) Author: Roger Serwy (roger.serwy) * (Python committer) Date: 2012-02-27 04:34
After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.
msg154430 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-02-27 05:01
Can you paste the error you're getting?

2012/2/26 Roger Serwy <report@bugs.python.org>:
>
> Roger Serwy <roger.serwy@gmail.com> added the comment:
>
> After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.
msg154432 - (view) Author: Roger Serwy (roger.serwy) * (Python committer) Date: 2012-02-27 05:22
It was a false alarm. I didn't recompile python before running it with the latest /Lib files. My apologies.
msg154853 - (view) Author: Chris Rebert (cvrebert) * Date: 2012-03-03 20:36
The Design and History FAQ (will) need a minor corresponding update:
http://docs.python.org/dev/faq/design.html#how-are-dictionaries-implemented
msg155293 - (view) Author: Kurt Seifried (kseifried@redhat.com) Date: 2012-03-10 05:59
I have assigned CVE-2012-1150 for this issue as per http://www.openwall.com/lists/oss-security/2012/03/10/3
msg155472 - (view) Author: Jon Vaughan (jsvaughan) Date: 2012-03-12 20:37
FWIW I upgraded to the Ubuntu Pangolin beta over the weekend, which includes 2.7.3rc1, and I'm also experiencing a problem with urandom.

  File "/usr/lib/python2.7/email/utils.py", line 27, in <module>
    import random
  File "/usr/lib/python2.7/random.py", line 47, in <module>
    from os import urandom as _urandom
ImportError: cannot import name urandom

Given Roger Serwy's comment it sounds like a beta ubuntu problem, but thought it worth mentioning.
msg155527 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-03-12 23:51
> FWIW I upgraded to ubuntu pangolin beta over the weekend,
> which includes 2.7.3rc1, ...
>
>  File "/usr/lib/python2.7/random.py", line 47, in <module>
>    from os import urandom as _urandom
> ImportError: cannot import name urandom

It looks like you are using the random.py from Python 2.7.3 with a Python 2.7.2 interpreter, because os.urandom() is now always available in Python 2.7.3.
msg155680 - (view) Author: Jon Vaughan (jsvaughan) Date: 2012-03-13 22:15
Victor - yes that was it; a mixture of a 2.7.2 virtual env and 2.7.3.  Apologies for any nuisance caused.
msg155681 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-03-13 22:18
Can we close this issue?
msg155682 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-03-13 22:25
I believe so.  This is in all of the release candidates.

The expat/xmlparse.c hash collision DoS issue is being handled on its own via http://bugs.python.org/issue14234.
History
Date  User  Action  Args
2012-03-13 22:25:45  gregory.p.smith  set  status: open -> closed; resolution: fixed; messages: + msg155682
2012-03-13 22:18:57  haypo  set  messages: + msg155681
2012-03-13 22:15:44  jsvaughan  set  messages: + msg155680
2012-03-12 23:51:17  haypo  set  messages: + msg155527
2012-03-12 20:37:34  jsvaughan  set  nosy: + jsvaughan; messages: + msg155472
2012-03-10 05:59:40  kseifried@redhat.com  set  nosy: - kseifried@redhat.com
2012-03-10 05:59:10  kseifried@redhat.com  set  nosy: + kseifried@redhat.com; messages: + msg155293
2012-03-03 20:36:36  cvrebert  set  messages: + msg154853
2012-02-27 05:22:27  roger.serwy  set  messages: + msg154432
2012-02-27 05:01:39  benjamin.peterson  set  messages: + msg154430
2012-02-27 04:34:41  roger.serwy  set  nosy: + roger.serwy; messages: + msg154428
2012-02-23 21:49:36  cvrebert  set  nosy: + cvrebert
2012-02-22 18:12:06  barry  set  messages: + msg153980
2012-02-22 17:46:48  barry  set  messages: + msg153975
2012-02-21 16:28:11  gregory.p.smith  set  messages: + msg153877
2012-02-21 15:42:33  barry  set  messages: + msg153873
2012-02-21 15:33:37  barry  set  messages: + msg153872
2012-02-21 11:37:41  pitrou  set  messages: + msg153868
2012-02-21 09:50:25  lemburg  set  messages: + msg153862
2012-02-21 09:48:43  lemburg  set  messages: + msg153861
2012-02-21 09:47:32  haypo  set  messages: + msg153860
2012-02-21 06:47:35  georg.brandl  set  messages: + msg153854
2012-02-21 06:40:31  gregory.p.smith  set  messages: + msg153853
2012-02-21 06:35:48  gregory.p.smith  set  messages: + msg153852
2012-02-21 06:12:32  georg.brandl  set  messages: + msg153850
2012-02-21 06:03:38  eric.araujo  set  messages: + msg153849
2012-02-21 06:01:56  gregory.p.smith  set  messages: + msg153848
2012-02-21 01:44:32  python-dev  set  messages: + msg153833
2012-02-20 23:37:06  python-dev  set  messages: + msg153817
2012-02-20 20:41:43  python-dev  set  messages: + msg153802
2012-02-20 19:01:40  python-dev  set  nosy: + python-dev; messages: + msg153798
2012-02-20 01:31:03  benjamin.peterson  set  messages: + msg153753
2012-02-20 01:05:03  pitrou  set  messages: + msg153750
2012-02-19 10:22:02  eric.araujo  set  messages: + msg153695
2012-02-19 10:00:46  georg.brandl  set  files: + hash-patch-3.1-gb-03.patch; messages: + msg153690
2012-02-19 09:59:46  georg.brandl  set  files: - hash-patch-3.1-gb.patch
2012-02-19 09:21:53  georg.brandl  set  files: + hash-patch-3.1-gb.patch; messages: + msg153683
2012-02-19 09:21:32  georg.brandl  set  files: - hash-patch-3.1-gb.diff
2012-02-19 09:14:32  georg.brandl  set  files: + hash-patch-3.1-gb.diff; messages: + msg153682
2012-02-15 08:25:01  loewis  set  messages: + msg153395
2012-02-14 20:34:56  Jim.Jewett  set  messages: + msg153369
2012-02-13 20:50:09  lemburg  set  messages: + msg153301
2012-02-13 20:37:13  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch; messages: + msg153297
2012-02-12 02:11:27  gregory.p.smith  set  messages: + msg153144
2012-02-12 01:37:26  gregory.p.smith  set  messages: + msg153143
2012-02-11 23:09:26  dmalcolm  set  messages: + msg153141
2012-02-11 23:06:24  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch; messages: + msg153140
2012-02-10 23:49:17  Jim.Jewett  set  messages: + msg153082
2012-02-10 23:02:00  haypo  set  messages: + msg153081
2012-02-10 19:23:56  gregory.p.smith  set  messages: + msg153074
2012-02-10 15:30:01  benjamin.peterson  set  messages: + msg153055
2012-02-08 13:10:39  lemburg  set  messages: + msg152855
2012-02-07 15:41:36  dmalcolm  set  messages: + msg152811
2012-02-06 23:00:03  lemburg  set  messages: + msg152797
2012-02-06 22:07:39  alex  set  messages: + msg152789
2012-02-06 22:04:28  lemburg  set  messages: + msg152787
2012-02-06 21:53:17  dmalcolm  set  messages: + msg152784
2012-02-06 21:42:27  alex  set  messages: + msg152781
2012-02-06 21:41:04  lemburg  set  messages: + msg152780
2012-02-06 21:18:22  gregory.p.smith  set  messages: + msg152777
2012-02-06 20:24:15  lemburg  set  messages: + msg152769
2012-02-06 20:17:47  pitrou  set  messages: + msg152768
2012-02-06 20:14:40  lemburg  set  messages: + msg152767
2012-02-06 19:44:53  lemburg  set  messages: + msg152764
2012-02-06 19:34:15  Jim.Jewett  set  messages: + msg152763
2012-02-06 19:11:43  dmalcolm  set  messages: + msg152760
2012-02-06 19:07:45  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch, fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch, fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch; messages: + msg152758
2012-02-06 18:54:50  lemburg  set  messages: + msg152755
2012-02-06 18:53:40  fx5  set  messages: + msg152754
2012-02-06 18:31:37  Jim.Jewett  set  messages: + msg152753
2012-02-06 17:07:34  lemburg  set  messages: + msg152747
2012-02-06 15:47:07  Jim.Jewett  set  messages: + msg152740
2012-02-06 13:12:40  lemburg  set  messages: + msg152734
2012-02-06 12:22:27  pitrou  set  messages: + msg152732
2012-02-06 10:20:34  lemburg  set  messages: + msg152731
2012-02-06 09:53:02  haypo  set  messages: + msg152730
2012-02-06 06:11:13  loewis  set  messages: + msg152723
2012-02-02 01:30:44  haypo  set  messages: + msg152453
2012-02-02 01:18:27  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch, fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch, fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch; messages: + msg152452
2012-02-01 03:29:15  dmalcolm  set  files: + results-16.txt; messages: + msg152422
2012-01-31 01:34:15  dmalcolm  set  files: + optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch; messages: + msg152364
2012-01-30 23:41:44  gz  set  messages: + msg152362
2012-01-30 22:22:46  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch; messages: + msg152352
2012-01-30 19:55:53  Jim.Jewett  set  messages: + msg152344
2012-01-30 17:31:17  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch; messages: + msg152335
2012-01-30 08:16:05  loewis  set  messages: + msg152315
2012-01-30 07:45:49  gregory.p.smith  set  messages: + msg152311
2012-01-30 07:15:04  zbysz  set  messages: + msg152309
2012-01-30 01:44:15  dmalcolm  set  files: + unnamed; messages: + msg152300
2012-01-30 01:39:23  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch; messages: + msg152299
2012-01-29 22:51:25  loewis  set  messages: + msg152276
2012-01-29 22:50:20  Mark.Shannon  set  messages: + msg152275
2012-01-29 22:39:15  Jim.Jewett  set  messages: + msg152271
2012-01-29 22:36:59  barry  set  messages: + msg152270
2012-01-29 00:06:29  dmalcolm  set  messages: + msg152204
2012-01-28 23:56:24  terry.reedy  set  messages: + msg152203
2012-01-28 23:24:41  pitrou  set  messages: + msg152200
2012-01-28 23:14:28  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch; messages: + msg152199
2012-01-28 20:05:10  benjamin.peterson  set  messages: + msg152186
2012-01-28 19:26:04  dmalcolm  set  messages: + msg152183
2012-01-28 05:13:39  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch; messages: + msg152149
2012-01-28 03:03:11  benjamin.peterson  set  messages: + msg152146
2012-01-27 21:42:37  dmalcolm  set  messages: + msg152125
2012-01-27 21:02:34  loewis  set  messages: + msg152118
2012-01-27 20:59:39  skorgu  set  nosy: + skorgu
2012-01-27 20:25:13  pitrou  set  messages: + msg152117
2012-01-27 19:32:10  loewis  set  messages: + msg152112
2012-01-27 17:45:10  Jim.Jewett  set  messages: + msg152104
2012-01-27 08:42:52  loewis  set  messages: + msg152070
2012-01-27 06:25:19  gregory.p.smith  set  messages: + msg152066
2012-01-27 02:26:28  loewis  set  messages: + msg152060
2012-01-27 01:19:14  pitrou  set  messages: + msg152057
2012-01-26 23:43:50  loewis  set  messages: + msg152051
2012-01-26 23:22:32  alex  set  messages: + msg152046
2012-01-26 23:03:35  loewis  set  messages: + msg152043
2012-01-26 22:43:57  alex  set  messages: + msg152041
2012-01-26 22:42:24  loewis  set  messages: + msg152040
2012-01-26 22:34:51  loewis  set  messages: + msg152039
2012-01-26 22:13:19  dmalcolm  set  messages: + msg152037
2012-01-26 21:04:28  alex  set  messages: + msg152033
2012-01-26 21:00:16  loewis  set  nosy: + loewis; messages: + msg152030
2012-01-25 23:14:03  pitrou  set  messages: + msg151984
2012-01-25 21:34:39  fx5  set  messages: + msg151977
2012-01-25 20:23:40  dmalcolm  set  messages: + msg151973
2012-01-25 19:28:09  pitrou  set  messages: + msg151970
2012-01-25 19:19:31  dmalcolm  set  messages: + msg151967
2012-01-25 19:13:06  pitrou  set  messages: + msg151966
2012-01-25 19:04:24  Jim.Jewett  set  messages: + msg151965
2012-01-25 18:29:41  Jim.Jewett  set  messages: + msg151961
2012-01-25 18:14:07  Jim.Jewett  set  messages: + msg151960
2012-01-25 18:05:26  pitrou  set  messages: + msg151959
2012-01-25 17:49:18  dmalcolm  set  files: + hybrid-approach-dmalcolm-2012-01-25-002.patch; messages: + msg151956
2012-01-25 13:12:24  fx5  set  messages: + msg151944
2012-01-25 12:47:36  alex  set  messages: + msg151942
2012-01-25 12:45:34  dmalcolm  set  messages: + msg151941
2012-01-25 11:06:01  dmalcolm  set  files: + hybrid-approach-dmalcolm-2012-01-25-001.patch; messages: + msg151939
2012-01-24 00:44:44  gregory.p.smith  set  messages: + msg151870
2012-01-24 00:42:45  Jim.Jewett  set  messages: + msg151869
2012-01-24 00:14:31  PaulMcMillan  set  messages: + msg151867
2012-01-23 21:39:31  lemburg  set  messages: + msg151850
2012-01-23 21:31:59  dmalcolm  set  files: + backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch; messages: + msg151847
2012-01-23 16:45:03  lemburg  set  messages: + msg151826
2012-01-23 16:43:24  lemburg  set  files: + hash-attack-3.patch, integercollision.py; messages: + msg151825
2012-01-23 13:56:33  pitrou  set  messages: + msg151815
2012-01-23 13:40:27  pitrou  set  messages: + msg151814
2012-01-23 13:38:26  lemburg  set  messages: + msg151813
2012-01-23 13:07:25  lemburg  set  files: + hash-attack-2.patch; messages: + msg151812
2012-01-23 04:04:42  dmalcolm  set  messages: + msg151798
2012-01-23 03:48:50  dmalcolm  set  messages: + msg151796
2012-01-23 00:22:36  haypo  set  messages: + msg151794
2012-01-22 11:40:31  haypo  set  files: - random-5.patch
2012-01-22 11:40:30  haypo  set  files: - random-7.patch
2012-01-22 11:40:16  haypo  set  files: - random-fix_tests.patch
2012-01-22 11:40:12  haypo  set  files: - random-6.patch
2012-01-22 03:43:37  PaulMcMillan  set  messages: + msg151758
2012-01-22 02:13:47  dmalcolm  set  messages: + msg151756
2012-01-21 23:47:57  alex  set  messages: + msg151754
2012-01-21 23:42:30  gregory.p.smith  set  messages: + msg151753
2012-01-21 22:45:29  pitrou  set  messages: + msg151748
2012-01-21 22:41:58  dmalcolm  set  messages: + msg151747
2012-01-21 22:20:47  pitrou  set  messages: + msg151745
2012-01-21 21:07:41  dmalcolm  set  messages: + msg151744
2012-01-21 18:57:38  pitrou  set  messages: + msg151739
2012-01-21 17:07:56  dmalcolm  set  messages: + msg151737
2012-01-21 17:02:55  dmalcolm  set  files: + amortized-probe-counting-dmalcolm-2012-01-21-003.patch; messages: + msg151735
2012-01-21 15:36:01  zbysz  set  messages: + msg151734
2012-01-21 14:27:09  pitrou  set  messages: + msg151731
2012-01-21 03:16:24  dmalcolm  set  files: + amortized-probe-counting-dmalcolm-2012-01-20-002.patch; messages: + msg151714
2012-01-20 22:55:15  dmalcolm  set  files: + hash-collision-counting-dmalcolm-2012-01-20-001.patch; messages: + msg151707
2012-01-20 18:11:34  haypo  set  messages: + msg151703
2012-01-20 17:42:07  Jim.Jewett  set  messages: + msg151701
2012-01-20 17:39:25  gvanrossum  set  messages: + msg151700
2012-01-20 17:31:08  Jim.Jewett  set  messages: + msg151699
2012-01-20 14:42:49  neologix  set  messages: + msg151691
2012-01-20 12:58:04  haypo  set  messages: + msg151689
2012-01-20 11:17:32  lemburg  set  messages: + msg151685
2012-01-20 10:52:35  neologix  set  messages: + msg151684
2012-01-20 10:43:09  fx5  set  messages: + msg151682
2012-01-20 10:39:46  neologix  set  messages: + msg151681
2012-01-20 09:30:41  fx5  set  messages: + msg151680
2012-01-20 09:03:16  neologix  set  messages: + msg151679
2012-01-20 04:58:36  fx5  set  messages: + msg151677
2012-01-20 01:11:24  haypo  set  messages: + msg151664
2012-01-20 00:38:01  lemburg  set  messages: + msg151662
2012-01-19 18:05:52  fx5  set  messages: + msg151647
2012-01-19 15:13:20  lemburg  set  messages: + msg151633
2012-01-19 15:11:54  lemburg  set  messages: + msg151632
2012-01-19 14:43:52  alex  set  messages: + msg151629
2012-01-19 14:37:53  lemburg  set  messages: + msg151628
2012-01-19 14:31:43  pitrou  set  messages: + msg151626
2012-01-19 14:27:36  lemburg  set  messages: + msg151625
2012-01-19 13:13:42  haypo  set  messages: + msg151620
2012-01-19 13:03:16  eric.araujo  set  messages: + msg151617
2012-01-19 01:15:24  terry.reedy  set  messages: + msg151604
2012-01-19 00:46:44  gvanrossum  set  messages: + msg151596
2012-01-18 23:46:12  pitrou  set  messages: + msg151590
2012-01-18 23:44:34  gvanrossum  set  messages: + msg151589
2012-01-18 23:37:47  terry.reedy  set  messages: + msg151586
2012-01-18 23:31:25  gregory.p.smith  set  messages: + msg151585
2012-01-18 23:30:12  pitrou  set  messages: + msg151584
2012-01-18 23:25:37  gregory.p.smith  set  messages: + msg151583
2012-01-18 23:23:12  terry.reedy  set  messages: + msg151582
2012-01-18 22:52:46  haypo  set  messages: + msg151574
2012-01-18 21:14:01  pitrou  set  messages: + msg151567
2012-01-18 21:10:50  gvanrossum  set  messages: + msg151566
2012-01-18 21:05:30  pitrou  set  messages: + msg151565
2012-01-18 19:08:19  gvanrossum  set  messages: + msg151561
2012-01-18 18:59:56  lemburg  set  messages: + msg151560
2012-01-18 10:01:42  haypo  set  messages: + msg151528
2012-01-18 06:16:55  gregory.p.smith  set  nosy: + gregory.p.smith; messages: + msg151519
2012-01-17 19:59:51  Jim.Jewett  set  nosy: + Jim.Jewett; messages: + msg151484
2012-01-17 16:46:05  eric.araujo  set  messages: + msg151474
2012-01-17 16:35:50  haypo  set  messages: + msg151472
2012-01-17 16:23:22  eric.araujo  set  messages: + msg151468
2012-01-17 12:36:34  haypo  set  messages: + msg151449
2012-01-17 12:21:33  haypo  set  files: + random-8.patch; messages: + msg151448
2012-01-17 02:10:41  haypo  set  messages: + msg151422
2012-01-17 01:57:17  haypo  set  files: + random-fix_tests.patch
2012-01-17 01:53:50  haypo  set  files: + random-7.patch; messages: + msg151419
2012-01-16 18:58:52  lemburg  set  messages: + msg151402
2012-01-16 18:29:00  eric.snow  set  nosy: + eric.snow; messages: + msg151401
2012-01-16 12:45:16  haypo  set  messages: + msg151353
2012-01-13 10:17:28  zbysz  set  messages: + msg151167
2012-01-13 00:48:55  haypo  set  messages: + msg151159
2012-01-13 00:36:06  haypo  set  files: + bench_startup.py; messages: + msg151158
2012-01-13 00:08:23  haypo  set  files: + random-6.patch; messages: + msg151157
2012-01-12 10:02:06  grahamd  set  nosy: + grahamd; messages: + msg151122
2012-01-12 09:27:35  lemburg  set  messages: + msg151121
2012-01-12 08:53:23  fx5  set  nosy: + fx5; messages: + msg151120
2012-01-11 21:46:12  neologix  set  nosy: + neologix; messages: + msg151092
2012-01-11 19:07:03  pitrou  set  messages: + msg151078
2012-01-11 18:18:16  pitrou  set  messages: + msg151074
2012-01-11 18:05:28  lemburg  set  messages: + msg151073
2012-01-11 17:38:10  lemburg  set  messages: + msg151071
2012-01-11 17:34:32  mark.dickinson  set  nosy: + mark.dickinson; messages: + msg151070
2012-01-11 17:28:00  pitrou  set  messages: + msg151069
2012-01-11 16:03:19  lemburg  set  messages: + msg151065
2012-01-11 15:41:09  lemburg  set  messages: + msg151064
2012-01-11 14:55:54  Mark.Shannon  set  messages: + msg151063
2012-01-11 14:45:34  pitrou  set  messages: + msg151062
2012-01-11 14:34:17  lemburg  set  messages: + msg151061
2012-01-11 09:56:11  haypo  set  messages: + msg151048
2012-01-11 09:28:30  lemburg  set  messages: + msg151047
2012-01-10 23:07:45  haypo  set  files: - random-4.patch
2012-01-10 23:07:43  haypo  set  files: - random-3.patch
2012-01-10 23:07:40  haypo  set  files: - random-2.patch
2012-01-10 23:07:37  haypo  set  files: - random.patch
2012-01-10 23:07:08  haypo  set  files: + random-5.patch; messages: + msg151033
2012-01-10 22:15:05  haypo  set  files: + random-4.patch; messages: + msg151031
2012-01-10 14:26:57  pitrou  set  messages: + msg151017
2012-01-10 11:37:59  haypo  set  files: + random-3.patch; messages: + msg151012
2012-01-09 18:21:49  terry.reedy  set  messages: - msg150846
2012-01-09 12:16:13  lemburg  set  messages: + msg150934
2012-01-09 09:35:36  zbysz  set  nosy: + zbysz
2012-01-08 14:26:13  pitrou  set  messages: + msg150866
2012-01-08 14:23:09  pitrou  set  messages: + msg150865
2012-01-08 12:36:35  terry.reedy  set  messages: - msg150837
2012-01-08 12:35:48  terry.reedy  set  messages: - msg150848
2012-01-08 11:47:18  lemburg  set  messages: + msg150859
2012-01-08 11:33:27  lemburg  set  messages: + msg150857
2012-01-08 10:20:27  PaulMcMillan  set  messages: + msg150856
2012-01-08 05:55:10  v+python  set  messages: + msg150848
2012-01-08 05:37:00  christian.heimes  set  messages: + msg150847
2012-01-08 05:18:55  v+python  set  messages: + msg150846
2012-01-08 02:40:41  PaulMcMillan  set  messages: + msg150840
2012-01-08 00:32:59  v+python  set  messages: + msg150837
2012-01-08 00:21:48  alex  set  messages: + msg150836
2012-01-08 00:19:15  v+python  set  files: + SafeDict.py; messages: + msg150835
2012-01-07 23:53:44  gz  set  nosy: + gz; messages: + msg150832
2012-01-07 23:24:48  tim.peters  set  nosy: + tim.peters; messages: + msg150829
2012-01-07 13:17:34  lemburg  set  messages: + msg150795
2012-01-06 22:03:46  skrah  set  nosy: + skrah
2012-01-06 21:53:34  pitrou  set  messages: + msg150771
2012-01-06 20:56:41  PaulMcMillan  set  messages: + msg150769
2012-01-06 20:50:22  terry.reedy  set  messages: + msg150768
2012-01-06 20:48:08  Arach  set  nosy: + Arach
2012-01-06 19:53:31  PaulMcMillan  set  messages: + msg150766
2012-01-06 17:59:39  lemburg  set  messages: + msg150756
2012-01-06 17:03:08  lemburg  set  messages: + msg150748
2012-01-06 16:35:04  haypo  set  messages: + msg150738
2012-01-06 12:56:48  lemburg  set  messages: + msg150727
2012-01-06 12:56:08  lemburg  set  messages: + msg150726
2012-01-06 12:52:20  lemburg  set  files: + hash-attack.patch; messages: + msg150725
2012-01-06 12:49:16  lemburg  set  messages: + msg150724
2012-01-06 09:31:12  Mark.Shannon  set  messages: + msg150719
2012-01-06 09:08:10  Mark.Shannon  set  messages: + msg150718
2012-01-06 02:57:40  terry.reedy  set  messages: + msg150713
2012-01-06 02:50:28  PaulMcMillan  set  messages: + msg150712
2012-01-06 01:50:07  alex  set  messages: + msg150708
2012-01-06 01:44:17  christian.heimes  set  messages: + msg150707
2012-01-06 01:09:47  haypo  set  messages: + msg150706
2012-01-06 00:23:10  haypo  set  messages: + msg150702
2012-01-05 22:49:32  haypo  set  messages: + msg150699
2012-01-05 21:40:03  PaulMcMillan  set  messages: + msg150694
2012-01-05 20:21:21  v+python  set  nosy: + v+python
2012-01-05 12:41:26  pitrou  set  messages: + msg150668
2012-01-05 10:41:40  Mark.Shannon  set  messages: + msg150665
2012-01-05 10:20:26  christian.heimes  set  messages: + msg150662
2012-01-05 09:43:32  Mark.Shannon  set  messages: + msg150659
2012-01-05 09:01:14  lemburg  set  messages: + msg150656
2012-01-05 06:25:12  Huzaifa.Sidhpurwala  set  nosy: + Huzaifa.Sidhpurwala; messages: + msg150655
2012-01-05 01:17:03  christian.heimes  set  messages: + msg150652
2012-01-05 01:09:05  haypo  set  files: + random-2.patch; messages: + msg150651
2012-01-05 01:05:58  haypo  set  messages: + msg150650
2012-01-05 00:58:43  haypo  set  messages: + msg150649
2012-01-05 00:58:38  christian.heimes  set  messages: + msg150648
2012-01-05 00:57:04  PaulMcMillan  set  messages: + msg150647
2012-01-05 00:53:57  christian.heimes  set  messages: + msg150646
2012-01-05 00:49:03  haypo  set  messages: + msg150645
2012-01-05 00:44:25  PaulMcMillan  set  messages: + msg150644
2012-01-05 00:39:32  pitrou  set  messages: + msg150643
2012-01-05 00:36:51  christian.heimes  set  messages: + msg150642
2012-01-05 00:36:10  haypo  set  messages: + msg150641
2012-01-05 00:31:43  PaulMcMillan  set  messages: + msg150639
2012-01-05 00:11:02  haypo  set  messages: + msg150638
2012-01-05 00:02:47  haypo  set  messages: + msg150637
2012-01-05 00:01:02  pitrou  set  messages: + msg150636
2012-01-04 23:54:25  haypo  set  messages: + msg150635
2012-01-04 23:42:51  haypo  set  files: + random.patch; keywords: + patch; messages: + msg150634
2012-01-04 17:58:10  lemburg  set  messages: + msg150625
2012-01-04 17:44:50  alex  set  messages: + msg150622
2012-01-04 17:41:21  terry.reedy  set  messages: + msg150621
2012-01-04 17:22:42  lemburg  set  messages: + msg150620
2012-01-04 17:18:30  lemburg  set  messages: + msg150619
2012-01-04 16:42:05  lemburg  set  nosy: + lemburg; messages: + msg150616
2012-01-04 15:08:36  barry  set  messages: + msg150613
2012-01-04 14:52:27  eric.araujo  set  nosy: + eric.araujo; messages: + msg150609
2012-01-04 11:02:59  pitrou  set  messages: + msg150601
2012-01-04 09:59:35  Mark.Shannon  set  nosy: + Mark.Shannon
2012-01-04 06:00:38  PaulMcMillan  set  messages: + msg150592
2012-01-04 05:09:59  haypo  set  messages: + msg150589
2012-01-04 05:00:38  jcea  set  nosy: + jcea
2012-01-04 03:08:14  haypo  set  messages: + msg150577
2012-01-04 02:16:27  Zhiping.Deng  set  nosy: + Zhiping.Deng
2012-01-04 02:14:54  pitrou  set  messages: + msg150570
2012-01-04 01:58:04  pitrou  set  messages: + msg150569
2012-01-04 01:54:52  haypo  set  messages: + msg150568
2012-01-04 01:30:01  pitrou  set  messages: + msg150565
2012-01-04 01:00:55  haypo  set  messages: + msg150563
2012-01-04 00:55:05  terry.reedy  set  nosy: + terry.reedy; messages: + msg150562
2012-01-04 00:38:29  christian.heimes  set  messages: + msg150560
2012-01-04 00:33:10  Arfrever  set  nosy: + Arfrever
2012-01-04 00:22:36  haypo  set  messages: + msg150559
2012-01-03 23:52:47  PaulMcMillan  set  nosy: + PaulMcMillan; messages: + msg150558
2012-01-03 22:19:51  alex  set  nosy: + alex
2012-01-03 22:08:19  christian.heimes  set  messages: + msg150543
2012-01-03 22:02:45  barry  set  messages: + msg150541
2012-01-03 21:48:21  dmalcolm  set  nosy: + dmalcolm
2012-01-03 21:43:34  benjamin.peterson  set  messages: + msg150534
2012-01-03 21:20:59  haypo  set  messages: + msg150533
2012-01-03 20:56:19  haypo  set  nosy: + haypo
2012-01-03 20:49:39  barry  set  messages: + msg150532
2012-01-03 20:47:44  gvanrossum  set  messages: + msg150531
2012-01-03 20:31:16  christian.heimes  set  dependencies: + Random number generator in Python core; messages: + msg150529; stage: needs patch
2012-01-03 20:24:32  pitrou  set  messages: + msg150526
2012-01-03 20:19:25  christian.heimes  set  messages: + msg150525; stage: needs patch -> (no value)
2012-01-03 19:48:53  pitrou  set  nosy: + pitrou, christian.heimes; stage: needs patch
2012-01-03 19:43:25  gvanrossum  set  nosy: + gvanrossum
2012-01-03 19:36:49  barry  create