classification
Title: Hash collision security issue
Type: security Stage: needs patch
Components: Interpreter Core Versions: Python 3.3, Python 3.2, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: 13704 Superseder:
Assigned To: Nosy List: Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, cvrebert, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, haypo, jcea, jsvaughan, lemburg, loewis, mark.dickinson, neologix, pitrou, python-dev, roger.serwy, skorgu, skrah, terry.reedy, tim.peters, v+python, zbysz
Priority: release blocker Keywords: patch

Created on 2012-01-03 19:36 by barry, last changed 2012-03-13 22:25 by gregory.p.smith. This issue is now closed.

Files
File name Uploaded Description Edit
hash-attack.patch lemburg, 2012-01-06 12:52
SafeDict.py v+python, 2012-01-08 00:19 SafeDict implementation
bench_startup.py haypo, 2012-01-13 00:36
random-8.patch haypo, 2012-01-17 12:21 review
hash-collision-counting-dmalcolm-2012-01-20-001.patch dmalcolm, 2012-01-20 22:55 review
amortized-probe-counting-dmalcolm-2012-01-20-002.patch dmalcolm, 2012-01-21 03:16 review
amortized-probe-counting-dmalcolm-2012-01-21-003.patch dmalcolm, 2012-01-21 17:02 review
hash-attack-2.patch lemburg, 2012-01-23 13:07
hash-attack-3.patch lemburg, 2012-01-23 16:43
integercollision.py lemburg, 2012-01-23 16:43
backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch dmalcolm, 2012-01-23 21:31 Backport of haypo's random-8.patch to 2.7 review
hybrid-approach-dmalcolm-2012-01-25-001.patch dmalcolm, 2012-01-25 11:05 Hybrid approach to solving dict DoS attack review
hybrid-approach-dmalcolm-2012-01-25-002.patch dmalcolm, 2012-01-25 17:49 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch dmalcolm, 2012-01-28 05:13 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch dmalcolm, 2012-01-28 23:14 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch dmalcolm, 2012-01-30 01:39 review
unnamed dmalcolm, 2012-01-30 01:44 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch dmalcolm, 2012-01-30 17:31 review
optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch dmalcolm, 2012-01-30 22:22 review
optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch dmalcolm, 2012-01-31 01:34 review
results-16.txt dmalcolm, 2012-02-01 03:29
add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch dmalcolm, 2012-02-02 01:18 review
add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch dmalcolm, 2012-02-06 19:07 review
add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch dmalcolm, 2012-02-11 23:06 review
add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch dmalcolm, 2012-02-11 23:06 review
add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch dmalcolm, 2012-02-13 20:37 review
add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch dmalcolm, 2012-02-13 20:37 review
hash-patch-3.1-gb-03.patch georg.brandl, 2012-02-19 10:00 review
Messages (326)
msg150522 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-03 19:36
This is already publicly known and in deep discussion on python-dev.  The proper fix is still TBD.  Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries.

Start here:

http://mail.python.org/pipermail/python-dev/2011-December/115116.html
msg150525 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-03 20:19
I had a short chat with Guido yesterday. I'll try to sum up the conversation. Guido, please correct me if I got something wrong or missed a point.

Guido wants the fix to be as simple and unintrusive as possible, since he wants to provide/apply a patch for Python 2.4 through 3.3. This means any new stuff is off the table unless it's really, really necessary. Say goodbye to my experimental MurmurHash3 patch.

We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.
msg150526 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-03 20:24
> We haven't agreed whether the randomization should be enabled by
> default or disabled by default. IMHO it should be disabled for all
> releases except for the upcoming 3.3 release.

I think on the contrary it must be enabled by default. Leaving security
holes open is wrong.
msg150529 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-03 20:31
> I think on the contrary it must be enabled by default. Leaving security
> holes open is wrong.

We can't foresee the implications of the randomization, and only a small number of deployments are affected by the problem. But I won't start a fight on the matter. ;)
msg150531 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-03 20:47
I'm with Antoine -- turn it on by default.  Maybe there should be a release candidate to test the waters.
msg150532 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-03 20:49
On Jan 03, 2012, at 08:24 PM, Antoine Pitrou wrote:

>I think on the contrary it must be enabled by default. Leaving security
>holes open is wrong.

Unless there's evidence of performance regressions or backward
incompatibilities, I agree.
msg150533 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-03 21:20
> Unless there's evidence of performance regressions
> or backward incompatibilities, I agree.

If hash() is modified, str(dict) and str(set) will change, for example. That may break doctests. Can we consider that applications should not rely (indirectly) on hash values and should just fix (for example) their doctests? Or is it a backward incompatibility?

hash() was already modified in major Python versions.

For this specific issue, I consider that security is more important than str(dict).
msg150534 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-03 21:43
Barry, when this gets fixed, shall we coordinate release times?
msg150541 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-03 22:02
On Jan 03, 2012, at 09:43 PM, Benjamin Peterson wrote:

>Barry, when this gets fixed, shall we coordinate release times?

Yes!
msg150543 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-03 22:08
Randomized hashing destabilizes the unit tests of Python, too. Here are the outputs of four test runs:

11 tests failed:
    test_collections test_dbm test_dis test_gdb test_inspect
    test_packaging test_set test_symtable test_ttk_textonly
    test_urllib test_urlparse

9 tests failed:
    test_dbm test_dis test_gdb test_json test_packaging test_set
    test_symtable test_urllib test_urlparse

10 tests failed:
    test_dbm test_dis test_gdb test_inspect test_packaging test_set
    test_symtable test_ttk_textonly test_urllib test_urlparse

9 tests failed:
    test_collections test_dbm test_dict test_dis test_gdb
    test_packaging test_symtable test_urllib test_urlparse
msg150558 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-03 23:52
I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being obvious candidates).

I'll link my proposed algorithm change here, for the record:
https://gist.github.com/0a91e52efa74f61858b5

I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack.
msg150559 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 00:22
Christian Heimes proposes the following change in his randomhash branch (see issue #13704):

-    x = (Py_uhash_t) *p << 7;
+    x = Py_RndHashSeed + ((Py_uhash_t) *p << 7);
     for (i = 0; i < len; i++)
         x = (1000003U * x) ^ (Py_uhash_t) *p++;
     x ^= (Py_uhash_t) len;

This change doesn't add any security if the attacker can inject any string and retrieve the hash value. You can retrieve Py_RndHashSeed directly using:

Py_RndHashSeed = intmask((hash("a") ^ len("a") ^ ord("a")) * DIVIDE) - (ord("a") << 7)

where intmask() truncates to a long (x mod 2^(long bits)) and DIVIDE = 1/1000003 mod 2^(long bits). For example, DIVIDE=2021759595 for 32 bits long.
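For the record, the DIVIDE constant can be computed directly as a modular inverse. A sketch, assuming a 32-bit C long (`pow(x, -1, m)` requires Python 3.8+; on 2012-era Pythons you would use an extended-Euclid helper instead):

```python
# DIVIDE is the multiplicative inverse of 1000003 modulo 2**LONG_BITS,
# which undoes the "x = 1000003 * x" step of the string hash.
LONG_BITS = 32                            # assumption: 32-bit long
DIVIDE = pow(1000003, -1, 2 ** LONG_BITS)
print(DIVIDE)                             # 2021759595, matching the value above
```

Multiplying a hash intermediate by DIVIDE (mod 2**32) recovers the value before the multiply, which is what makes the seed-recovery formula above work.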
msg150560 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-04 00:38
Victor, please ignore my code related to hash randomization for now. I've deliberately not linked my branch to this bug report. I'm well aware that it's not secure and that it's pretty easy to reverse engineer the seed from a hash of a short string. The code is a proof of concept to detect failing tests and other issues.

I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list.

Paul:
I still think that you should special-case short strings (five or fewer chars sounds good). An attacker can't do much harm with one- to five-char strings, but such short strings may make it too easy to calculate the seed.

16 KB of seed is still a lot. Most CPUs have about 16 to 32, maybe 64 KB of L1 data cache. 1024 to 4096 bytes should increase cache locality and reduce speed impacts.

PS: I'm going to reply to your last mail tomorrow.
msg150562 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-04 00:55
In #13707 I suggest a change to the current hash() entry which is needed independently of this issue, because the default hash (for object()), being tied to id() is already limited to an object's lifetime. But this change will become more imperative if hash() is made run-dependent for numbers and strings.

There does not presently seem to *be* a security hole for 64-bit builds, so if there is any noticeable slowdown on 64-bit builds and it is reasonably easy to tie the default to the bitness, I would think it should be off for such builds.
msg150563 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 01:00
Paul's first proposition (on python-dev) was to replace:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        ...

by:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i])) ^ r[x % len_r]
        ...

This change has a vulnerability similar to the one in Christian's suggested change. The "r" array can be retrieved directly with:

r2 = []
for i in xrange(len(r)):
    s = chr(intmask(i * UNSHIFT7) % len(r))
    h = intmask(hash(s) ^ len(s) ^ ord(s) ^ ((ord(s) << 7) * MOD))
    r2.append(chr(h))
r2 = ''.join(r2)

where UNSHIFT7 = 1/2**7 mod 2^(long bits).

By the way, this change always uses r[0] to hash all strings of one ASCII character (U+0000-U+007F).
msg150565 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 01:30
> I'm in private contact with Paul and we are working together. He has
> done extended research and I'll gladly follow his expertise. I've
> already discussed the issue with small strings, but I can't recall if
> it was a private mail to Paul or a public one to the dev list.

Can all this be discussed on this issue now that it's the official point
of reference? It will avoid the repetition of arguments we see here and
there.

(I don't think special-casing small strings makes sense, because then
you have two algorithms to audit rather than one)
msg150568 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 01:54
> https://gist.github.com/0a91e52efa74f61858b5

Please attach a file directly to the issue, or copy/paste the code into your comment. The interesting part of the code:
---

#Proposed replacement
#--------------------------------------
import os, array
size_exponent = 14 #adjust as a memory/security tradeoff
r = array.array('l', os.urandom(2**size_exponent))
len_r = len(r)

def _hash_string2(s):
    """The algorithm behind compute_hash() for a string or a unicode."""
    length = len(s)
    #print s
    if length == 0:
        return -1
    x = (ord(s[0]) << 7) ^ r[length % len_r]
    i = 0
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        x ^= r[x % len_r]
        i += 1
    x ^= length
    return intmask(x)
---

> r = array.array('l', os.urandom(2**size_exponent))
> len_r = len(r)

r size should not depend on the size of a long. You should write something like:

sizeof_long = ctypes.sizeof(ctypes.c_long)
r_bits = 8
r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
r_mask = 2**r_bits-1

and then replace "% len_r" by "& r_mask".

What is the minimum value of r_bits? For example, would it be safe to use a single long integer? (r_bits=1)
msg150569 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 01:58
> > r = array.array('l', os.urandom(2**size_exponent))
> > len_r = len(r)
> 
> r size should not depend on the size of a long. You should write something like:
> 
> sizeof_long = ctypes.sizeof(ctypes.c_long)
> r_bits = 8
> r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
> r_mask = 2**r_bits-1

The final code will be in C and will use neither ctypes nor array.array.
Arguing about this looks quite pointless IMO.
msg150570 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 02:14
For the record, here is what "man urandom" says about random seed size:

“[...] no cryptographic primitive available today can hope to promise 
more than 256  bits of  security,  so  if  any  program  reads more than 
256 bits (32 bytes) from the kernel random pool per invocation, or per 
reasonable  reseed  interval (not less than one minute), that should be
taken as a sign that its cryptography  is  not  skilfully  implemented.”

In that light, reading a 64 bytes seed from /dev/urandom is already a lot, and 4096 bytes is simply insane.
msg150577 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 03:08
I read that the attack cannot be carried out with current computers (it's too expensive) against 64-bit Python. I tried to change str.__hash__ in 32-bit Python to compute the hash in 64 bits and then truncate the hash to 32 bits: it doesn't change anything, the hash values are the same, so it doesn't improve the security.
msg150589 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 05:09
Yet another random hash function, a simplified version of Paul's function. It always uses exactly 256 bits of entropy, and so 32 bytes of memory, and keeps the r lookup out of the inner loop. I don't expect my function to be secure, but just to give the attacker more work to compute the data for an attack against our dict implementation.

---
import os, array, struct
sizeof_long = struct.calcsize("l")
r_bits = 256
r_len = r_bits // (sizeof_long * 8)
r_mask = r_len - 1
r = array.array('l', os.urandom(r_len * sizeof_long))

def randomhash(s):
    length = len(s)
    if length == 0:
        return -2
    x = ord(s[0])
    x ^= r[x & r_mask]
    x <<= 7
    for ch in s:
        x = intmask(1000003 * x)
        x ^= ord(ch)
    x ^= length
    x ^= r[x & r_mask]
    return intmask(x)
---

The first "x ^= r[x & r_mask]" may be replaced by "x ^= r[(x ^ length) & r_mask]".

The binary shift is done after the first xor with r because 2**7 and r_len are not coprime (2**7 is a multiple of r_len), and so (ord(s[0]) << 7) & r_mask is always zero.

randomhash(s) == hash(s) if we use the same index in the r array twice. I don't know if this case gives useful information.
msg150592 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-04 06:00
A couple of things here:

First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be).

I favor using the same algorithm between 32 and 64 bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64 bit build but varies on 32 bit builds. Additionally, the impracticality of attacking 64 bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant point requiring us to go fix the problem all over again. 

I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult to debug edge-case conditions. It also means that the order of the hashes of long strings will vary while the order of short strings will not - another inconsistency which will encourage bugs.

Thank you Victor for the improvements to the python demonstration code. As Antoine said, it's only demo code, but better demo code is always good.

Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key.

Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then.

Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution.

When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r.
msg150601 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-04 11:02
> Using a fairly small value (4k) should not make the results much worse 
> from a security perspective, but might be problematic from a
> collision/distribution standpoint.

Keep in mind the average L1 data cache size is between 16KB and 64KB. 4KB is already a significant chunk of that.

Given a hash function's typical loop is to feed back the current result into the next computation, I don't see why a small value (e.g. 256 bytes) would be detrimental.
msg150609 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-04 14:52
If test_packaging fails because it relies on dict order / hash details, that’s a bug.  Can you copy the full tb (possibly in another report, I can fix it independently of this issue)?
msg150613 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-04 15:08
On Jan 04, 2012, at 06:00 AM, Paul McMillan wrote:

>Developers would be startled to find that ordering stays consistent on a 64
>bit build but varies on 32 bit builds.

Well, one positive outcome of this issue is that users will finally viscerally
understand that dictionary (and set) order should never be relied upon, even
between successive runs of the same Python executable.
msg150616 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 16:42
Some comments:

1. The security implications in all this are being somewhat overemphasized.

There are many ways you can do a DoS attack on web servers. It's the
responsibility of the web frameworks and servers in use to deal with
the possible cases.

It's a good idea to provide some way to protect against hash
collision attacks, but that will only solve one possible way of
causing a resource attack on a server.

There are other ways you can generate lots of CPU overhead with
little data input (e.g. think of targeting the search feature on
many Zope/Plone sites).

In order to protect against such attacks in general, we'd have to
provide a way to control CPU time and e.g. raise an exception if too
much time is being spent on a simple operation such as a key insertion.
This can be done using timers, signals or even under OS control.

The easiest way to protect against the hash collision attack is by
limiting the POST/GET/HEAD request size.

The second best way would be to limit the number of parameters that a
web framework accepts for POST/GET/HEAD request.

2. Changing the semantics of hashing in a dot release is not allowed.

If randomization of the hash start vector or some other method is
enabled by default in a dot release, this will change the semantics
of any application switching to that dot release.

The hash values of Python objects are not only used by the Python
dictionary implementation, but also by other storage mechanisms
such as on-disk dictionaries, inter-process object exchange via
share memory, memcache, etc.

Hence, if changed, the hash change should be disabled per default
for dot releases and enabled for 3.3.

3. Changing the way strings are hashed doesn't solve the problem.

Hash values of other types can easily be guessed as well, e.g.
take integers which use a trivial hash function.

We'd have to adapt all hash functions of the basic types in Python
or come up with a generic solution using e.g. double-hashing
in the dictionary/set implementations.

4. By just using a random start vector you change the absolute
hash values for specific objects, but not the overall hash sequence
or its period.

An attacker only needs to create many hash collisions, not
specific ones. It's the period of the hash function that's
important in such attacks and that doesn't change when moving to
a different start vector.

5. Hashing needs to be fast.

It's one of the most used operations in Python. Please get experts into
the boat like Tim Peters and Christian Tismer, who both have worked
on the dict implementation and the hash functions, before experimenting
with ad-hoc fixes.

6. Counting collisions could solve the issue without having to
change hashing.

Another idea would be counting the collisions and raising an
exception if the number of collisions exceed a certain
threshold.

Such a change would work for all hashable Python objects and
protect against the attack without changing any hash function.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg150619 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 17:18
Marc-Andre Lemburg wrote:
> 
> 3. Changing the way strings are hashed doesn't solve the problem.
> 
> Hash values of other types can easily be guessed as well, e.g.
> take integers which use a trivial hash function.

Here's an example for integers on a 64-bit machine:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>>> d = dict(g)

This takes ages to complete and only uses very little memory.
The input data is some 32 MB if written down in decimal numbers
- not all that much data either.

32397634
msg150620 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 17:22
The email interface ate part of my reply:

>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>>> s = ''.join(str(x) for x in g)
>>> len(s)
32397634
>>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
>>> d = dict(g)
... lots of time for coffee, pizza, taking a walk, etc. :-)
msg150621 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-04 17:41
To expand on Marc-Andre's point 1: the DOS attack on web servers is possible because servers are generally dumb at the first stage. Upon receiving a post request, all key=value pairs are mindlessly packaged into a hash table that is then passed on to a page handler that typically ignores the invalid keys.

However, most pages do not need any key,value pairs, and the forms that do need them have a pre-defined set of expected and recognized keys. If there were a possibly empty set of keys associated with each page, and the set were checked against posted keys, then a DOS post with thousands of effectively random keys could quickly (in O(1) time) be rejected as erroneous.

In Python, the same effect could be accomplished by associating a class with slots with each page and having the server create an instance of the class. Attempts to create an undefined attribute would then raise an exception. Either way, checking input data for face validity before processing it in a time-consuming way is one possible solution for nearly all web pages and at least some other applications.
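The slots idea can be sketched directly (the page class and helper names here are hypothetical, for illustration only):

```python
class LoginForm:
    """Hypothetical per-page declaration of the accepted form keys."""
    __slots__ = ('username', 'password')

def load_form(form_class, pairs):
    """Bind posted key/value pairs to a fixed-slot page object."""
    form = form_class()
    for key, value in pairs:
        # Any key outside __slots__ raises AttributeError immediately,
        # so a flood of bogus keys is rejected without ever building a
        # dict from attacker-controlled keys.
        setattr(form, key, value)
    return form
```
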
msg150622 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-04 17:44
Except, it's a totally non-scalable approach.  People have vulnerabilities all over their sites which they don't realize.  Some examples:

django-taggit (an application I wrote for handling tags) parses tags out of its input and stores them in a set to check for duplicates. It's vulnerable.

Another site I'm writing accepts JSON POSTs, you can put arbitrary keys in the JSON.  It's vulnerable.
msg150625 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-04 17:58
Marc-Andre Lemburg wrote:
> 
> 1. The security implications in all this are being somewhat overemphasized.
> 
> There are many ways you can do a DoS attack on web servers. It's the
> responsibility of the used web frameworks and servers to deal with
> the possible cases.
> 
> It's a good idea to provide some way to protect against hash
> collision attacks, but that will only solve one possible way of
> causing a resource attack on a server.
> 
> There are other ways you can generate lots of CPU overhead with
> little data input (e.g. think of targeting the search feature on
> many Zope/Plone sites).
> 
> In order to protect against such attacks in general, we'd have to
> provide a way to control CPU time and e.g. raise an exception if too
> much time is being spent on a simple operation such as a key insertion.
> This can be done using timers, signals or even under OS control.
> 
> The easiest way to protect against the hash collision attack is by
> limiting the POST/GET/HEAD request size.

For GET and HEAD, web servers normally already apply such limitations
at rather low levels:

http://stackoverflow.com/questions/686217/maximum-on-http-header-values

So only HTTP methods which carry data in the body part of the HTTP
request are affected, e.g. POST and various WebDAV methods.

> The second best way would be to limit the number of parameters that a
> web framework accepts for POST/GET/HEAD request.

Depending on how parsers are implemented, applications taking
XML/JSON/XML-RPC/etc. as data input may also be vulnerable, e.g.
non-validating XML parsers which place element attributes into
a dictionary or a JSON parser that has to read the JSON version of
the dict I generated earlier on.
msg150634 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 23:42
Work-in-progress patch implementing my randomized hash function (random.patch):
 - add PyOS_URandom() using CryptoGen, SSL (only on VMS!!) or /dev/urandom, with a fallback on a dummy LCG if the OS urandom fails
 - posix.urandom() is always defined and reuses PyOS_URandom()
 - hash(str) is now randomized using two random Py_hash_t values: don't touch the critical loop, only add a prefix and a suffix

Notes:
 - PyOS_URandom() reuses mostly code from Modules/posixmodule.c, except dev_urandom() and fallback_urandom() which are new
 - I removed memset(PyBytes_AS_STRING(result), 0, howMany); from win32_urandom() because it doesn't really change anything: the LCG is used anyway if win32_urandom() fails
 - Python refuses to start if the OS urandom is missing.
 - Python/random.c code may be moved into Python/pythonrun.c if it is an issue to add a new file in old Python versions.
 - If the OS urandom fails to generate the unicode hash secret, no warning is emitted (because the LCG is used). I don't know if a warning is needed in this case.
 - os.urandom() argument is now a Py_ssize_t instead of an int

TODO:
 - add an environment option to ignore the OS urandom and only uses the LCG
 - fix all tests broken because of the randomized hash(str)
 - PyOS_URandom() raises exceptions whereas it is called before creating the interpreter state. I suppose that it cannot work like this.
 - review and test PyOS_URandom()
 - review and test the new randomized hash(str)
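The prefix/suffix scheme from the first bullet can be sketched in pure Python (illustrative only: the real patch works on Py_hash_t in C, and the 64-bit mask and use of os.urandom/struct here are assumptions of this sketch):

```python
import os
import struct

# Two random Py_hash_t-like values, drawn once at process startup.
_PREFIX, _SUFFIX = struct.unpack('QQ', os.urandom(16))
_MASK = 2 ** 64 - 1

def randomized_str_hash(s):
    """String hash with a random prefix and suffix; the loop is untouched."""
    if not s:
        return 0
    x = (_PREFIX ^ (ord(s[0]) << 7)) & _MASK    # mix in the random prefix
    for ch in s:                                # critical loop: unchanged
        x = ((1000003 * x) ^ ord(ch)) & _MASK
    x ^= len(s)
    x ^= _SUFFIX                                # mix in the random suffix
    return x
```

Because only the prefix and suffix are new, the per-character cost is identical to the original hash; the secret never interacts with the loop itself.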
msg150635 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-04 23:54
> add PyOS_URandom() using CryptoGen, SSL (only on VMS!!)
> or /dev/urandom

Oh, OpenSSL (RAND_pseudo_bytes) should be used on Windows, Linux, Mac OS X, etc. if OpenSSL is available. I was just too lazy to add a define or pyconfig.h option to indicate if OpenSSL is available or not. FYI RAND_pseudo_bytes() is now exposed in the ssl module of Python 3.3.

> with a fallback on a dummy LCG

It's the linear congruential generator (LCG) used by Microsoft Visual C++ and PHP:

x(n+1) = (x(n) * 214013 + 2531011) % 2^32

I only use bits 23..16 (bits 15..0 are not really random).
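That generator can be sketched as follows (the function name is made up; this mirrors the recurrence and the bits-23..16 extraction described above):

```python
def msvc_lcg_bytes(seed, n):
    """Return n pseudo-random bytes from the MSVC/PHP-style LCG above."""
    x = seed
    out = bytearray()
    for _ in range(n):
        x = (x * 214013 + 2531011) % 2 ** 32
        out.append((x >> 16) & 0xFF)   # keep bits 16..23; low bits are weak
    return bytes(out)
```

Discarding the low 16 bits matters: the low-order bits of a power-of-two-modulus LCG have very short periods (bit 0 merely alternates), so using all bits, as Victor says PHP does, leaks far more structure.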
msg150636 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 00:01
> > add PyOS_URandom() using CryptoGen, SSL (only on VMS!!)
> > or /dev/urandom
> 
> Oh, OpenSSL (RAND_pseudo_bytes) should be used on Windows, Linux, Mac
> OS X, etc. if OpenSSL is available.

Apart from the large dependency, the OpenSSL license is not
GPL-compatible which may be a problem for some Python-embedding
applications:
http://en.wikipedia.org/wiki/OpenSSL#Licensing

> > with a fallback on a dummy LCG
> 
> It's the Linear congruent generator (LCG) used by Microsoft Visual C++
> and PHP:
> 
> x(n+1) = (x(n) * 214013 + 2531011) % 2^32
> 
> I only use bits 23..16 (bits 15..0 are not really random).

If PHP uses it, I'm confident it is secure.
msg150637 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:02
+            printf("read %i bytes\n", size);

Oops, I forgot a debug message.
msg150638 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:11
> If PHP uses it, I'm confident it is secure.

If I remember correctly, it is only used for the Windows version of PHP, but PHP doesn't implement it correctly because it uses all bits.
msg150639 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 00:31
This is not something that can be fixed by limiting the size of POST/GET. 

Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format-shifting tool) can't convert, but are otherwise perfectly valid for non-python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere. Any application that accepts data from users needs to be considered.

Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.

Any non-trivial Python application which parses data needs the fix. The entire standard library needs the fix if it is to be relied upon by applications which accept data. It makes sense to fix Python.

Of course we must fix all the basic hashing functions in python, not just the string hash. There aren't that many. 

Marc-Andre:
If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create colliding hash functions without knowing the key.

Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea. Additionally, randomizing only some of the characters in longer strings will allow us to improve security without compromising speed significantly.

I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.
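
A toy model of that proposal (the 6-character threshold and 5-character head/tail are from the message above; the mixing itself is a stand-in, not CPython's actual hash):

```python
MASK32 = 0xFFFFFFFF

def partial_random_hash(s, key):
    """Toy hash: strings shorter than 6 chars are hashed without the
    random key; for longer strings, only the first and last 5
    characters are mixed with the key, capping the extra work at a
    maximum of 10 randomized rounds."""
    n = len(s)
    h = 0
    for i, c in enumerate(s):
        v = ord(c)
        if n >= 6 and (i < 5 or i >= n - 5):
            v ^= key[i % len(key)]  # randomize head and tail only
        h = ((h * 1000003) & MASK32) ^ v
    return h ^ n
```

Note the trade-off this makes explicit: short strings hash identically under any key, which is the surprising behavior Christian objects to below.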
msg150641 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:36
"Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea."

My proposal only adds two XORs to hash(str) (outside the loop over Unicode characters), so I expect a negligible overhead. I don't know yet how hard it is to guess the secret from hash(str) output.
msg150642 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 00:36
Thanks Victor!

> - hash(str) is now randomized using two random Py_hash_t values: 
> don't touch the critical loop, only add a prefix and a suffix

At least for Python 2.x hash(str) and hash(unicode) have to yield the same result for ASCII-only strings. 

>  - PyOS_URandom() raises exceptions whereas it is called before
> creating the interpreter state. I suppose that it cannot work like this.

My patch compensates for the issue and calls Py_FatalError() when the random seed hasn't been initialized yet.

You aren't special-casing small strings. I fear that an attacker may guess the seed from several small strings. How about using another initial seed for strings shorter than 4 code points?
msg150643 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 00:39
> You aren't special casing small strings. I fear that an attacker may
> guess the seed from several small strings.

How would (s)he do?
msg150644 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 00:44
> My proposition only adds two XOR to hash(str) (outside the loop on Unicode characters), so I expect a ridiculous overhead. I don't know yet how hard it is to guess the secret from hash(str) output.

It doesn't work much better than a single random seed. Calculating the
hash of a null byte gives you the xor of your two seeds. An attacker
can still cause collisions inside the vulnerable hash function, your
change doesn't negate those internal collisions. Also, strings of all
null bytes collide trivially.
msg150645 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:49
> I fear that an attacker may guess the seed from several small strings

hash(a) ^ hash(b) "removes" the suffix, but I don't see how to guess the prefix from this new value. It doesn't mean that it is not possible, just that I don't have a strong background in cryptography :-)

I don't expect that adding 2 XORs would change our dummy (fast but unsafe) hash function into a cryptographic hash function. We cannot have security for free. If we want a strong cryptographic hash function, it would be much slower (Paul wrote that it would be 4x slower). But we prefer speed over security, so we have to compromise.

I don't know if you can retrieve hash values in practice. I suppose that you can only get hash(str) & (size - 1) with size=size of the dict internal array, so only the lower bits. Using a large dict, you may be able to retrieve more bits of the hash value.
msg150646 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 00:53
Given that a user has an application with an oracle function that returns the hash of a unicode string, an attacker can probe tens of thousands of one- and two-character unicode strings. That should give him/her enough data to calculate both seeds. hash("") already gives away lots of information about the seeds, too.

- hash("") should always return 0

- for small strings we could use a different seed than for larger strings

- for larger strings we could use Paul's algorithm but limit the XOR op to the first and last 16 elements instead of all elements.
msg150647 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 00:57
> - for small strings we could use a different seed than for larger strings

Or just leave them unseeded with our existing algorithm. Shifting them
into a different part of the hash space doesn't really gain us much.

> - for larger strings we could use Paul's algorithm but limit the XOR op to the first and last 16 elements instead of all elements.

Agreed. It does have to be both the first and the last though. We
can't just do one or the other.
msg150648 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 00:58
Paul wrote:
> I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.

It's too surprising for developers when only the strings with 6 or more chars are randomized. Barry made a good point http://bugs.python.org/issue13703#msg150613
msg150649 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 00:58
"Calculating the hash of a null byte gives you the xor of your two seeds."

Not directly, because the prefix is first multiplied by 1000003. So hash("\0") gives you ((prefix * 1000003) % 2^32) ^ suffix.

Example:

$ ./python 
secret={b7abfbbf, db6cbb4d}
Python 3.3.0a0 (default:547e918d7bf5+, Jan  5 2012, 01:36:39) 
>>> hash("")
1824997618
>>> hash("\0")
-227042383
>>> hash("\0"*2)
1946249080
>>> 0xb7abfbbf ^ 0xdb6cbb4d
1824997618
>>> (0xb7abfbbf * 1000003) & 0xffffffff ^ 0xdb6cbb4d
4067924912
>>> hash("\0") & 0xffffffff
4067924913
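
The session above is consistent with a simple model of the patch (a sketch over unsigned 32-bit values; the real patch works at the C level and returns signed hashes):

```python
MASK32 = 0xFFFFFFFF

def randomized_hash(s, prefix, suffix):
    """Model of the patched hash(str): seed the classic 1000003-based
    string hash with a random prefix, XOR in the length, then XOR a
    random suffix. (The hash('') == 0 special case added in patch
    version 2 is deliberately omitted here.)"""
    h = prefix
    for c in s:
        h = ((h * 1000003) & MASK32) ^ ord(c)
    h ^= len(s)
    return (h ^ suffix) & MASK32

prefix, suffix = 0xB7ABFBBF, 0xDB6CBB4D  # the secret printed above

assert randomized_hash("", prefix, suffix) == 1824997618    # = prefix ^ suffix
assert randomized_hash("\0", prefix, suffix) == 4067924913  # hash("\0") & 0xffffffff
```

This reproduces every number in the session, including Antoine's observation that hash("") leaks prefix ^ suffix directly.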
msg150650 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 01:05
> At least for Python 2.x hash(str) and hash(unicode) have to yield
> the same result for ASCII only strings. 

Ah yes, I forgot Python 2: I wrote my patch for Python 3.3. The two hash functions should be modified to be randomized.

> hash("") should always return 0

Ok, I can add a special case. Antoine told me that hash("") gives prefix ^ suffix, which is too much information for the attacker :-)

> for small strings we could use a different seed
> than for larger strings

Why? The attack doesn't work with short strings? What do you call a "short string"?
msg150651 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 01:09
Patch version 2:
 - hash("") is always 0
 - Remove a debug message
msg150652 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 01:17
In reply to MAL's message http://bugs.python.org/issue13703#msg150616

> 2. Changing the semantics of hashing in a dot release is not allowed.

I concur with Marc. The change is too intrusive and may cause too much trouble relative to the issue. Also it seems to be unnecessary for platforms with a 64-bit hash.

Marc: Fred told me that ZODB isn't affected. One thing less to worry. ;)


> 5. Hashing needs to be fast.

Good point; we should include Tim and Christian Tismer once we have a solution we can agree upon

PS: I'm missing "Reply to message" and a threaded view for lengthy topics
msg150655 - (view) Author: Huzaifa Sidhpurwala (Huzaifa.Sidhpurwala) Date: 2012-01-05 06:25
I am wondering if a CVE id has been assigned to this security issue yet?
msg150656 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-05 09:01
Paul McMillan wrote:
> 
> This is not something that can be fixed by limiting the size of POST/GET. 
> 
> Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format shifting tool) can't convert, but are otherwise perfectly valid for non-python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere. Any application that accepts data from users needs to be considered.
> 
> Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.
> 
> Any non-trivial python application which parses data needs the fix. The entire standard library needs the fix if is to be relied upon by applications which accept data. It makes sense to fix Python.

Agreed: Limiting the size of POST requests only applies to *web* applications.
Other applications will need other fixes.

Trying to fix the problem in general by tweaking the hash function to
(apparently) make it hard for an attacker to guess a good set of
colliding strings/integers/etc. is not really a good solution. You'd
only be making it harder for script kiddies, but as soon as someone
cryptanalyzes the hash algorithm used, you're lost again.

You'd need to use crypto hash functions or universal hash functions
if you want to achieve good security, but that's not an option for
Python objects, since the hash functions need to be as fast as possible
(which rules out crypto hash functions) and cannot easily drop the invariant
"a=b => hash(a)=hash(b)" (which rules out universal hash functions, AFAICT).

IMO, the strategy to simply cap the number of allowed collisions is
a better way to achieve protection against this particular resource
attack. The probability of having valid data reach such a limit is
low and, if configurable, can be made 0.

> Of course we must fix all the basic hashing functions in python, not just the string hash. There aren't that many. 

... not in Python itself, but if you consider all the types in Python
extensions and classes implementing __hash__ in user code, the number
of hash functions to fix quickly becomes unmanageable.

> Marc-Andre:
> If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create colliding hash functions without knowing the key.

Could you post it on the ticket ?

BTW: I wonder how long it's going to take before someone figures out
that our merge sort based list.sort() is vulnerable as well... its
worst-case performance is O(n log n), making attacks somewhat harder.
The popular quicksort which Python used for a long time has O(n²),
making it much easier to attack, but fortunately, we replaced it
with merge sort in Python 2.3, before anyone noticed ;-)
msg150659 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-05 09:43
What is the mechanism by which the attacker can determine the seeds?
The actual hash value is not directly observable externally.
The attacker can only determine the timing effects of multiple 
insertions into a dict, or have I missed something?

> - hash("") should always return 0

Why should hash("") always return 0?
I can't find it in the docs anywhere.
msg150662 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-05 10:20
It's quite possible that a user has created a function (by mistake or deliberately) that gives away the hash of an arbitrary string. We haven't taught developers that they shouldn't disclose the hash of a string.

> Why should hash("") always return 0?
> I can't find it in the docs anywhere.

hash("") should return something constant that doesn't reveal information about the random seeds. 0 is an arbitrary choice that is as good as anything else. hash("") already returns 0, hence my suggestion for 0.
msg150665 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-05 10:41
But that's not the issue we are supposed to be dealing with.
A single (genuinely random) seed will deal with the attack described in 
the talk and it is (almost) as fast as using 0 as a seed.
Why make things complicated dealing with a hypothetical problem?

>> Why should hash("") always return 0?
>> I can't find it in the docs anywhere.
> 
> hash("") should return something constant that doesn't reveal information about the random seeds. 0 is an arbitrary choice that is as good as anything else. hash("") already returns 0, hence my suggestion for 0.

Is special casing arbitrary values really any more secure?
If we special case "", the attacker will just start using "\0" and so on...

> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
msg150668 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-05 12:41
> I concur with Marc. The change is too intrusive and may cause too much
> trouble for the issue.

Do you know if mod_wsgi et al. are tackling the issue on their side?

> Also it seems to be unnecessary for platforms with 64bit hash.

We still support Python on 32-bit platforms, so this can't be a serious
argument.
If you think that no-one runs a server on a 32-bit kernel nowadays, I
would point out that "no-one" apparently doesn't include ourselves ;-)
msg150694 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-05 21:40
Marc-Andre: Victor already pasted the relevant part of my code:
http://bugs.python.org/issue13703#msg150568
The link to the fuller version, with revision history and a copy of the code before I modified it is here:
https://gist.github.com/0a91e52efa74f61858b5

>Why? The attack doesn't work with short strings? What do you call a "short string"?

Well, the demonstrated collision is for 16-character ASCII strings. Worst case UTF-8, we're looking at 3 manipulable bytes per character, but they may be harder to collide since some of those bytes are fixed.

> only be making it harder for script kiddies, but as soon as someone
> crypt-analysis the used hash algorithm, you're lost again.

Not true. What I propose is to make the amount of information necessary to analyze and generate collisions impractically large. My proposed hash function is certainly broken if you brute force the lookup table. There are undoubtedly other problems with it too. The point is that it's hard enough. We aren't going for perfect security - we're going for enough to make this attack impractical.

What are the downsides to counting collisions? For one thing, it's something that needs to be kept track of on a per-dict basis, and can't be cached the way the hash results are. How do you choose a good value for the limit? If you set it to something conservative, you still pay the collision price every time a dict is created to discover that the keys collide. This means that it's possible to feed in bad data up to exactly the limit, and suddenly the Python app is inexplicably slow. If you set the limit too aggressively, then sometimes valid data gets caught, and Python randomly dies in hard-to-debug ways with an error the programmer has never seen in testing and cannot reproduce.

It adds a new way to kill most python applications, and so programs are going to have to be re-written to cope with it. It also introduces a new place to cause errors - if the WSGI server dies, it's hard for my application to catch that and recover gracefully.

>... not in Python itself, but if you consider all the types in Python
> extensions and classes implementing __hash__ in user code, the number
> of hash functions to fix quickly becomes unmanageable.

When we looked at the Django project, we found we wouldn't have anything to fix, since our hashes end up relying on the Python-internal values eventually. I suspect a lot of other code is similar.

Mark said:
>What is the mechanism by which the attacker can determine the seeds?

The biggest information leak is probably the ordering in which dict entries are returned. This can be used to deduce the underlying hash values. This is much easier than trying to do it via timing.
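
A sketch of that leak. A 2012-era dict iterated its table in slot order, so iteration order was determined by hash(key) & (table_size - 1); the model below computes that order directly (it ignores collision probing, and today's insertion-ordered dicts no longer behave this way):

```python
def slot_order(keys, table_bits=3):
    """Return keys in the order a slot-ordered hash table of size
    2**table_bits would yield them: sorted by hash(key) & mask.
    Anyone who observes this order learns the low hash bits."""
    mask = (1 << table_bits) - 1
    return sorted(keys, key=lambda k: hash(k) & mask)
```

For small ints, hash(n) == n, so the leak is plain: slot_order([6, 1, 4]) comes out [1, 4, 6] regardless of insertion order.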

> But that's not the issue we are supposed to be dealing with.
> A single (genuinely random) seed will deal with the attack described in 
> the talk and it is (almost) as fast as using 0 as a seed.

This is not true. A single random seed shifts the hash table, but does not actually prevent an attacker from generating collisions. Please see my other posts on the topic here and on the mailing list.
msg150699 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-05 22:49
> What I propose is to make the amount of information necessary
> to analyze and generate collisions impractically large.

Not only: the attacker has to compute the collisions for the new seed. I don't know how long that takes; the code to generate collisions is not public yet. I suppose that generating collisions takes longer if we change the hash function to add more instructions (I don't know by how much).

If generating the collisions requires a farm of computers / GPUs / something else and 7 days, it doesn't matter if it's easy to retrieve the secret.

If the attacker wants to precompute collisions for all possible seeds, (s)he will also have to store them. With 64 bits of entropy, even if the data for one seed is just 1 byte, you have to store 2^64 bytes (16,777,216 TB).

It is a problem if it takes less than a day on a desktop PC to generate data for an attack. In that case, it should be difficult to compute the secret.
msg150702 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-06 00:23
Note for myself, random-2.patch: _PyRandom_Init() must generate a prefix and a suffix different than zero (call PyOS_URandom in a loop, and fail after 100 tries).
msg150706 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-06 01:09
"Given that a user has an application with an oracle function that returns the hash of a unicode string, an attacker can probe tenth of thousand one and two character unicode strings. That should give him/her enough data to calculate both seeds. hash("") already gives away lots of infomration about the seeds, too."

Sorry, but I don't see how you compute the secret using these data.

You are right, hash("\0") gives some information about the secret. With my patch, hash("\0")^1 gives: ((prefix * 1000003) & HASH_MASK) ^ suffix.

(hash("\0")^1) ^ (hash("\0\0")^2) gives ((prefix * 1000003) & HASH_MASK) ^ ((prefix * 1000003**2)  & HASH_MASK).
msg150707 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-06 01:44
Either we are really paranoid (I know that I am *g*) or Perl's and Ruby's randomized hashing functions suffer from the issues we are worried about. They don't compensate for hash(''), hash(n * '\0') or hash(shortstring).

Perl 5.12.4 hv.h:

#define PERL_HASH(hash,str,len) \
     STMT_START { \
        register const char * const s_PeRlHaSh_tmp = str; \
        register const unsigned char *s_PeRlHaSh = (const unsigned char *)s_PeRlHaSh_tmp; \
        register I32 i_PeRlHaSh = len; \
        register U32 hash_PeRlHaSh = PERL_HASH_SEED; \
        while (i_PeRlHaSh--) { \
            hash_PeRlHaSh += *s_PeRlHaSh++; \
            hash_PeRlHaSh += (hash_PeRlHaSh << 10); \
            hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6); \
        } \
        hash_PeRlHaSh += (hash_PeRlHaSh << 3); \
        hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11); \
        (hash) = (hash_PeRlHaSh + (hash_PeRlHaSh << 15)); \
    } STMT_END

Ruby 1.8.7-p357 st.c:strhash()

#define CHAR_BIT 8
hash_seed = rb_genrand_int32() # Mersenne Twister

    register unsigned long val = hash_seed;

    while ((c = *string++) != '\0') {
        val = val*997 + c;
        val = (val << 13) | (val >> (sizeof(st_data_t) * CHAR_BIT - 13));
    }

    return val + (val>>5);

I wasn't able to find Java's fix quickly. Anybody else?
msg150708 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-06 01:50
Perl is so paranoid they obscure their variable names!  In all seriousness, both Perl and Ruby are vulnerable to the timing attacks, and as far as I know the JVM maintainers are not patching this themselves, but are telling applications to fix it (I know JRuby switched to MurmurHash).
msg150712 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-06 02:50
As Alex said, Java has refused to fix the issue.

I believe that Ruby 1.9 (at least the master branch code that I looked
at) is using murmurhash2 with a random seed.

In either case, yes, these functions are vulnerable to a number of
attacks. We're solving the problem more completely than they did.
msg150713 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-06 02:57
Those who use or advocate a simple randomized starting hash (Perl, Ruby, perhaps MS, and the CCC presenters) are presuming that the randomized hash values are kept private. Indeed, they should be (and the docs could note this) unless an attacker has direct access to the interpreter. An attacker who does, as in a Python programming class, can much more easily freeze the interpreter by 'accidentally' writing code equivalent to "while True: pass".

I do not think we, as Python developers, should be concerned about esoteric timing attacks. They strike me as a site issue rather than a language issue. As I understand them, they require *large* numbers of probes coupled with responses based on the same hash function. So a site being so probed already has a bit of a problem. And if hashing were randomized per process, and probes were randomly distributed among processes, and processes were periodically killed and restarted with new seeds, could such an attack get anywhere (besides the DoS effect of the probing)? The point of the CCC talk was that with one constant known hash, one could lock up a server for a long while with just one upload.

So I think we should copy Perl and Ruby, do the easy thing, and add a random seed to 3.3 hashing, subject to keeping equality for equal numbers. Let whatever thereby fails, fail, and be updated. For prior versions, add an option for strings and perhaps numbers, and document that some tests will fail if enabled.

We could also consider, for 3.3, making the output of hash() be different from the internal values used for dicts, perhaps by switching random seeds in hash(). So even if someone does return hash(x) values to potential attackers, they are not the values used in dicts. (This would require a slight change in the doc.)
msg150718 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-06 09:08
I agree.

+1 for strings. -0 for numbers.

This might cause problems with dict subclasses and the like,
so I'm -1 on this.
msg150719 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-06 09:31
Without the context, that last message didn't make much sense.

I agree with Terry that we should copy Perl and Ruby (for strings).
I'm -1 on hash() returning a different value than dict uses internally.
msg150724 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:49
Before continuing down the road of adding randomness to hash
functions, please have a good read of the existing dictionary
implementation:

"""
Major subtleties ahead:  Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness.  Python doesn't:  its most
important hash functions (for strings and ints) are very regular in common
cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
>>>

This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints.
The same is approximately true when keys are "consecutive" strings.  So this
gives better-than-random behavior in common cases, and that's very desirable.
...
"""

There's also a file called dictnotes.txt which has more interesting
details about how the implementation is designed.

Please note that the term "collision" is used in a slightly different
way: it refers to trying to find an empty slot in the dictionary
table. Having a collision implies that the hash values of two distinct
objects are the same, but you also get collisions when two distinct
objects with different hash values get mapped to the same table entry.

An attack can be based on trying to find many objects with the same
hash value, or trying to find many objects that, as they get inserted
into a dictionary, very often cause collisions due to the collision
resolution algorithm not finding a free slot.

In both cases, the (slow) object comparisons needed to find an
empty slot are what makes the attack practical, if the application
puts too much trust into large blobs of input data - which is
the actual security issue we're trying to work around here...
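
The table-entry collisions being described come from the probe sequence; a sketch of CPython's perturbed probing (the recurrence and PERTURB_SHIFT follow dictobject.c, simplified to non-negative hashes):

```python
PERTURB_SHIFT = 5

def probe_sequence(h, mask, steps):
    """First few table slots examined for hash h: the initial slot is
    h & mask, and each subsequent slot mixes in higher hash bits via
    `perturb`, so keys sharing only their low bits still separate
    after a few probes."""
    i = h & mask
    perturb = h
    slots = [i]
    for _ in range(steps):
        i = (5 * i + perturb + 1) & mask
        perturb >>= PERTURB_SHIFT
        slots.append(i)
    return slots
```

An attacker who can force many keys onto the same probe chain forces one slow comparison per occupied slot visited, which is the quadratic blow-up at issue.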

Given the dictionary implementation notes, I'm even less certain
that the randomization change is a good idea. It will likely
introduce a performance hit due to both the added complexity in
calculating the hash as well as the reduced cache locality of
the data in the dict table.

I'll upload a patch that demonstrates the collision counting
strategy to show that detecting the problem is easy. Whether
just raising an exception is a good idea, is another issue.

It may be better to change the tp_hash slot in Python 3.3
to take an argument, so that the dict implementation can
use the hash function as universal hash family function
(see http://en.wikipedia.org/wiki/Universal_hash).

The dict implementation could then alter the hash parameter
and recreate the dict table in case the number of collisions
exceeds a certain limit, thereby actively taking action
instead of just relying on randomness solving the issue in
most cases.
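
A toy of that idea (every name, the parameterized hash, the collision limit, and linear probing are inventions of this sketch, standing in for a tp_hash slot that takes a parameter):

```python
MASK32 = 0xFFFFFFFF
COLLISION_LIMIT = 8  # arbitrary choice for the sketch

def param_hash(key, param):
    # stand-in for a hash function drawn from a universal family,
    # selected by `param`
    h = param
    for c in str(key):
        h = ((h * 1000003) & MASK32) ^ ord(c)
    return h

class RehashingTable:
    """Toy open-addressed table that picks a new hash parameter and
    rebuilds itself when one insertion sees too many collisions."""

    def __init__(self, nbits=3):
        self.param = 0
        self.slots = [None] * (1 << nbits)

    def _insert(self, key, value):
        mask = len(self.slots) - 1
        i = param_hash(key, self.param) & mask
        for n in range(COLLISION_LIMIT):
            j = (i + n) & mask  # linear probing for simplicity
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return True
        return False  # collision limit reached

    def put(self, key, value):
        while not self._insert(key, value):
            self.param += 1  # switch to another member of the family
            items = [s for s in self.slots if s is not None]
            self.slots = [None] * len(self.slots)
            for k, v in items:  # sketch: assume old items re-insert fine
                self._insert(k, v)

    def get(self, key):
        mask = len(self.slots) - 1
        i = param_hash(key, self.param) & mask
        for n in range(COLLISION_LIMIT):
            slot = self.slots[(i + n) & mask]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)
```

The design point is the one MAL makes: instead of failing on attack data, the table actively re-parameterizes, so crafted collisions only cost one rebuild.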
msg150725 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:52
Demo patch implementing the collision limit idea for Python 2.7.
msg150726 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:56
The hash-attack.patch solves the problem for the integer case
I posted earlier on and doesn't cause any problems with the
test suite.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many hash collisions'

It also doesn't change the hashing or dict repr in existing
applications.
msg150727 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 12:56
Stupid email interface again... here's the full text:

The hash-attack.patch solves the problem for the integer case
I posted earlier on and doesn't cause any problems with the
test suite.

>>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 100))
>>> d = dict((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many hash collisions'

It also doesn't change the hashing or dict repr in existing
applications.
msg150738 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-06 16:35
hash-attack.patch never decrements the collision counter.
msg150748 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 17:03
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> hash-attack.patch never decrements the collision counter.

Why should it? It's only used as a local variable in the lookup function.

Note that the limit only triggers on a per-key basis. It's not
a limit on the total number of collisions in the table, so you don't
need to keep the number of collisions stored on the object.
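
So the counter can be modeled as a local of the lookup itself, reset on every call (a sketch: the limit, slot layout, and probing are simplified stand-ins for the patch's C code):

```python
MAX_COLLISIONS = 1000  # stand-in for the patch's compile-time limit

def lookup_slot(slots, key, h):
    """Find the slot for `key` in an open-addressed table, raising
    once a single lookup has probed past too many occupied slots.
    The counter is local, so the limit is per-key, not a total over
    the whole table."""
    mask = len(slots) - 1
    i = h & mask
    collisions = 0
    while slots[i] is not None and slots[i][0] != key:
        collisions += 1
        if collisions > MAX_COLLISIONS:
            raise KeyError("too many hash collisions")
        i = (i + 1) & mask  # simplified probing
    return i
```

Normal lookups pay only the cost of the increment; only a probe chain long enough to look like an attack trips the exception.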
msg150756 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-06 17:59
Here's an example of hash-attack.patch catching a deliberate
programming error (hashing all objects to the same value):

http://stackoverflow.com/questions/4865325/counting-collisions-in-a-python-dictionary
(see the second example on the page for @Winston Ewert's solution)

With the patch you get:

Traceback (most recent call last):
  File "testcollisons.py", line 20, in <module>
    d[o] = 1
KeyError: 'too many hash collisions'
msg150766 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-06 19:53
> Those who use or advocate a simple randomized starting hash (Perl, Ruby, perhaps MS, and the CCC presenters) are presuming that the randomized hash values are kept private. Indeed, they should be (and the docs could note this) unless an attacker has direct access to the interpreter.

Except that this is patently untrue. Anytime any programmer iterates
over a dictionary and returns the ordered result to the user in some
form, they're leaking information about the hash value. I hope you're
not suggesting that any programmer who is concerned about security
will make sure to sort the results of every iteration before making it
public in some fashion.

> I do not think we, as Python developers, should be concerned about esoteric timing attacks.

Timing attacks are less esoteric than you think they are. This issue
gets worse, not better, as the internet moves (for better or worse)
towards virtualized computing.

> And if hashing were randomized per process, and probes were randomly distributed among processes, and processes were periodically killed and restarted with new seeds, could such an attack get anywhere...

You're suggesting that in order for a Python application to be secure,
it's a requirement that we randomly kill and restart processes from
time to time? I thought we were going for a good solution here, not a
hacky workaround.

> We could also consider, for 3.3, making the output of hash() be different from the internal values used for dicts, perhaps by switching random seeds in hash(). So even if someone does return hash(x) values to potential attackers, they are not the values used in dicts. (This would require a slight change in the doc.)

This isn't a bad idea, but I'd be fine documenting that the output of
hash() shouldn't be made public.
msg150768 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-06 20:50
"You're suggesting that in order for a Python application to be secure,
it's a requirement that we randomly kill and restart processes from
time to time?"

No, that is not what I said.
msg150769 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-06 20:56
> An attack can be based on trying to find many objects with the same
> hash value, or trying to find many objects that, as they get inserted
> into a dictionary, very often cause collisions due to the collision
> resolution algorithm not finding a free slot.

Yep. Allowing an attacker to produce very large dictionaries is also bad.

> if the application
> puts too much trust into large blobs of input data - which is
> the actual security issues we're trying to work around here...

To be very clear the issue is ANY large blob of data anywhere in the
application, not just on input. An attack could happen after whatever
transform your application runs on the data before returning it.

> I'll upload a patch that demonstrates the collisions counting
> strategy to show that detecting the problem is easy. Whether
> just raising an exception is a good idea, is another issue.

I'm in cautious agreement that collision counting is a better
strategy. The dict implementation performance would suffer from
randomization.

> The dict implementation could then alter the hash parameter
> and recreate the dict table in case the number of collisions
> exceeds a certain limit, thereby actively taking action
> instead of just relying on randomness solving the issue in
> most cases.

This is clever. You basically neuter the attack as you notice it but
everything else is business as usual. I'm concerned that this may end
up being costly in some edge cases (e.g. look up how many collisions
it takes to force the recreation, and then aim for just that many
collisions many times). Unfortunately, each dict object has to
discover for itself that it's full of offending hashes. Another
approach would be to neuter the offending object by changing its hash,
but this would require either returning multiple values, or fixing up
existing dictionaries, neither of which seems feasible.
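
[Editor's note: the counting idea under discussion can be sketched in pure Python. This is a toy open-addressing table, not CPython's C implementation; the class name, the limit, and the resize-free design are all illustrative. The probe sequence mimics CPython's 5*i + perturb + 1 recurrence.]

```python
MAX_COLLISIONS = 1000  # illustrative cap, in the spirit of lemburg's patch

class CountingTable:
    """Toy open-addressing table: a single insert that has to step past
    more than MAX_COLLISIONS occupied slots aborts with an exception.
    The toy never resizes, which keeps the sketch short."""

    def __init__(self, size=8):
        # each slot holds (hash, key, value) or None
        self._slots = [None] * size

    def insert(self, key, value):
        mask = len(self._slots) - 1
        h = hash(key)
        perturb = h & ((1 << 64) - 1)  # treat the hash as unsigned, as CPython does
        i = h & mask
        collisions = 0
        while True:
            slot = self._slots[i]
            if slot is None:
                self._slots[i] = (h, key, value)
                return
            if slot[0] == h and slot[1] == key:
                self._slots[i] = (h, key, value)  # overwrite existing key
                return
            collisions += 1
            if collisions > MAX_COLLISIONS:
                raise KeyError("too many hash collisions")
            # CPython-style probing over the remaining slots
            i = (5 * i + perturb + 1) & mask
            perturb >>= 5
```

Against an attacker this turns the quadratic blow-up into an early, cheap failure; Paul's caveat applies, though, since an attacker can aim for just under the limit, many times over.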
msg150771 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-06 21:53
> I'm in cautious agreement that collision counting is a better
> strategy.

Disagreed. Raising randomly is unacceptable (false positives), especially in a bugfix release.

> The dict implementation performance would suffer from
> randomization.

Benchmarks please. http://hg.python.org/benchmarks/ for example.
msg150795 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-07 13:17
Paul McMillan wrote:
> 
>> I'll upload a patch that demonstrates the collisions counting
>> strategy to show that detecting the problem is easy. Whether
>> just raising an exception is a good idea, is another issue.
> 
> I'm in cautious agreement that collision counting is a better
> strategy. The dict implementation performance would suffer from
> randomization.
> 
>> The dict implementation could then alter the hash parameter
>> and recreate the dict table in case the number of collisions
>> exceeds a certain limit, thereby actively taking action
>> instead of just relying on randomness solving the issue in
>> most cases.
> 
> This is clever. You basically neuter the attack as you notice it but
> everything else is business as usual. I'm concerned that this may end
> up being costly in some edge cases (e.g. look up how many collisions
> it takes to force the recreation, and then aim for just that many
> collisions many times). Unfortunately, each dict object has to
> discover for itself that it's full of offending hashes. Another
> approach would be to neuter the offending object by changing its hash,
> but this would require either returning multiple values, or fixing up
> existing dictionaries, neither of which seems feasible.

I ran some experiments with the collision counting patch and
could not trigger it in normal applications, not even in cases
that are documented in the dict implementation to have a poor
collision resolution behavior (integers with zeros in the low bits).
The probability of having to deal with dictionaries that create
over a thousand collisions for one of the key objects in a
real life application appears to be very very low.

Still, it may cause problems with existing applications for the
Python dot releases, so it's probably safer to add it in a
disabled-per-default form there (using an environment variable
to adjust the setting). For 3.3 it could be enabled per default
and it would also make sense to allow customizing the limit
using a sys module setting.

The idea with adding a parameter to the hash method/slot in order
to have objects provide a hash family function instead of a fixed
unparametrized hash function would probably have to be implemented
as additional hash method, e.g. .__uhash__() and tp_uhash ("u"
for universal).

The builtin types should then grow such methods
in order to make hashing safe against such attacks. For objects
defined in 3rd party extensions, we would need to encourage
implementing the slot/method as well. If it's not implemented,
the dict implementation would have to fallback to raising an
exception.

Please note that I'm just sketching things here. I don't have
time to work on a full-blown patch, just wanted to show what
I meant with the collision counting idea and demonstrate that
it actually works as intended.
msg150829 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-01-07 23:24
[Marc-Andre]
> BTW: I wonder how long it's going to take before
> someone figures out that our merge sort based
> list.sort() is vulnerable as well... its worst-
> case performance is O(n log n), making attacks
> somewhat harder.

I wouldn't worry about that, because nobody could stir up anguish
about it by writing a paper ;-)

1. O(n log n) is enormously more forgiving than O(n**2).

2. An attacker need not be clever at all:  O(n log n) is not only
sort()'s worst case, it's also its _expected_ case when fed randomly
ordered data.

3. It's provable that no comparison-based sorting algorithm can have
better worst-case asymptotic behavior when fed randomly ordered data.

So if anyone whines about this, tell 'em to go do something useful instead :-)
msg150832 - (view) Author: Martin (gz) Date: 2012-01-07 23:53
I built random-2.patch on my windows xp box (updating the project and fixing some compile errors in random.c was required), and initialising crypto has a noticeable impact on startup time. The numbers vary a fair bit naturally; two representative runs are as follows:

changeset 52796:1ea8b7233fd7 on default branch:

    >timeit %PY3K% -c "import sys;print(sys.version)"
    3.3.0a0 (default, Jan  7 2012, 00:12:45) [MSC v.1500 32 bit (Intel)]

    Version Number:   Windows NT 5.1 (Build 2600)
    Exit Time:        0:16 am, Saturday, January 7 2012
    Elapsed Time:     0:00:00.218
    Process Time:     0:00:00.187
    System Calls:     4193
    Context Switches: 445
    Page Faults:      1886
    Bytes Read:       642542
    Bytes Written:    272
    Bytes Other:      31896

with random-2.patch and fixes applied:

    >timeit %PY3K% -c "import sys;print(sys.version)"
    3.3.0a0 (default, Jan  7 2012, 00:58:32) [MSC v.1500 32 bit (Intel)]

    Version Number:   Windows NT 5.1 (Build 2600)
    Exit Time:        0:59 am, Saturday, January 7 2012
    Elapsed Time:     0:00:00.296
    Process Time:     0:00:00.234
    System Calls:     4712
    Context Switches: 642
    Page Faults:      2049
    Bytes Read:       1059381
    Bytes Written:    272
    Bytes Other:      34544

This is with hot caches, cold will likely be worse, but a smaller percentage change. On a faster box, or with an SSD, or win 7, the delta will likely be smaller too.

A 50-100ms slow down is consistent with the difference on Python 2.7 between calling `os.urandom(1)` or not. However, the baseline is faster with Python 2, frequently dipping under 100ms, so there this change could double the runtime of trivial scripts.
msg150835 - (view) Author: Glenn Linderman (v+python) Date: 2012-01-08 00:19
Given Martin's comment (msg150832) I guess I should add my suggestion to this issue, at least for the record.

Rather than change hash functions, randomization could be added to those dicts that are subject to attack because they store user-supplied key values.  The list so far seems to be urllib.parse, cgi, and json.  Some have claimed there are many more, but without enumeration.  These three are clearly related to the documented issue.

The technique would be to wrap dict and add a short random prefix to each key value, preventing the attacker from supplying keys that are known to collide... and even if he successfully stumbles on a set that does collide on one request, it is unlikely to collide on a subsequent request with a different prefix string.

The technique is fully backward compatible with all applications except those that contain potential vulnerabilities as described by the researchers. The technique adds no startup or runtime overhead to any application that doesn't contain the potential vulnerabilities.  Due to the per-request randomization, the complexity of creating a sequence of sets of keys that may collide is enormous, and requires that such a set of keys happen to arrive on a request in the right sequence where the predicted prefix randomization would be used to cause the collisions to occur.  This might be possible on a lightly loaded system, but is less likely on a system with heavy load, which are more interesting to attack.

Serhiy Storchaka provided a sample implementation on python-dev, copied below and attached as a file (but it is not a patch).

# -*- coding: utf-8 -*-
from collections import MutableMapping
import random


class SafeDict(dict, MutableMapping):

    def __init__(self, *args, **kwds):
        dict.__init__(self)
        self._prefix = str(random.getrandbits(64))
        self.update(*args, **kwds)

    def clear(self):
        dict.clear(self)
        self._prefix = str(random.getrandbits(64))

    def _safe_key(self, key):
        return self._prefix + repr(key), key

    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, self._safe_key(key))
        except KeyError as e:
            e.args = (key,)
            raise e

    def __setitem__(self, key, value):
        dict.__setitem__(self, self._safe_key(key), value)

    def __delitem__(self, key):
        try:
            dict.__delitem__(self, self._safe_key(key))
        except KeyError as e:
            e.args = (key,)
            raise e

    def __iter__(self):
        for skey, key in dict.__iter__(self):
            yield key

    def __contains__(self, key):
        return dict.__contains__(self, self._safe_key(key))

    setdefault = MutableMapping.setdefault
    update = MutableMapping.update
    pop = MutableMapping.pop
    popitem = MutableMapping.popitem
    keys = MutableMapping.keys
    values = MutableMapping.values
    items = MutableMapping.items

    def __repr__(self):
        return '{%s}' % ', '.join('%s: %s' % (repr(k), repr(v))
            for k, v in self.items())

    def copy(self):
        return self.__class__(self)

    @classmethod
    def fromkeys(cls, iterable, value=None):
        d = cls()
        for key in iterable:
            d[key] = value
        return d

    def __eq__(self, other):
        return all(k in other and other[k] == v for k, v in self.items()) and \
            all(k in self and self[k] == v for k, v in other.items())

    def __ne__(self, other):
        return not self == other
msg150836 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-08 00:21
You're seriously underestimating the number of vulnerable dicts.  It has nothing to do with the module, and everything to do with the origin of the data.  There's tons of user code that's vulnerable too.
msg150840 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-08 02:40
> Alex, I agree the issue has to do with the origin of the data, but the modules listed are the ones that deal with the data supplied by this particular attack.

They deal directly with the data. Do any of them pass the data
further, or does the data stop with them? A short and very incomplete
list of vulnerable standard lib modules includes: every single parsing
library (json, xml, html, plus all the third party libraries that do
that), all of numpy (because it processes data which probably came
from a user [yes, integers can trigger the vulnerability]), difflib,
the math module, most database adaptors, anything that parses metadata
(including commonly used third party libs like PIL), the tarfile lib
along with other compressed format handlers, the csv module,
robotparser, plistlib, argparse, pretty much everything under the
heading of "18. Internet Data Handling" (email, mailbox, mimetypes,
etc.), "19. Structured Markup Processing Tools", "20. Internet
Protocols and Support", "21. Multimedia Services", "22.
Internationalization", TKinter, and all the os calls that handle
filenames. The list is impossibly large, even if we completely ignore
user code. This MUST be fixed at a language level.

I challenge you to find me 15 standard lib components that are certain
to never handle user-controlled input.

> Note that changing the hash algorithm for a persistent process, even though each process may have a different seed or randomized source, allows attacks for the life of that process, if an attack vector can be created during its lifetime. This is not a problem for systems where each request is handled by a different process, but is a problem for systems where processes are long-running and handle many requests.

This point has been made many times now. I urge you to read the entire
thread on the mailing list. Your implementation is impractical because
your "safe" implementation completely ignores all hash caching (each
entry must be re-hashed for that dict). Your implementation is still
vulnerable in exactly the way you mentioned if you ever have any kind
of long-lived dict in your program thread.

> You have entered the class of people that claim lots of vulnerabilities, without enumeration.

I have enumerated. Stop making this argument.
msg150847 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2012-01-08 05:36
Glenn, you have reached a point where you stop bike-shedding and start to troll by attacking people. Please calm down. I'm sure that you are just worried about the future of Python and all the bad things that might be introduced by a fix for the issue.

Please trust us! Paul, Victor, Antoine and several more involved developers are professional Python devs and have been for years. Most of them do Python development for a living. We won't kill the snake that pays our bills. ;) Ultimately it's Guido's choice, too. 

Martin:
Ouch, the startup impact is large! Have we reached a point where "one size fits all" doesn't work any longer? It's getting harder to have just one executable for 500ms scripts and server processes that last for weeks.

Marc-Andre:
Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.
msg150856 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-08 10:20
> Christian Heimes added the comment:
> Ouch, the startup impact is large! Have we reached a point where "one size fits all" doesn't work any longer? It's getting harder to have just one executable for 500ms scripts and server processes that last for weeks.

This concerns me too, and is one reason I think the collision counting
code might be the winning solution. Randomness is hard to do correctly
and is expensive. If we can avoid it, we should try very hard to do
so...

> Christian Heimes said to Marc-Andre:
> Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.

Interesting point, though I think we might be able to work it out so
that we're only adding instructions when there's actually a detected
collision. I'll be interested to see what the benchmarks (and real
world) have to say about the impacts of randomization as compared to
the existing black-magic optimization of the hash function.
msg150857 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-08 11:33
Tim Peters wrote:
> 
> Tim Peters <tim.peters@gmail.com> added the comment:
> 
> [Marc-Andre]
>> BTW: I wonder how long it's going to take before
>> someone figures out that our merge sort based
>> list.sort() is vulnerable as well... its worst-
>> case performance is O(n log n), making attacks
>> somewhat harder.
> 
> I wouldn't worry about that, because nobody could stir up anguish
> about it by writing a paper ;-)
> 
> 1. O(n log n) is enormously more forgiving than O(n**2).
> 
> 2. An attacker need not be clever at all:  O(n log n) is not only
> sort()'s worst case, it's also its _expected_ case when fed randomly
> ordered data.
> 
> 3. It's provable that no comparison-based sorting algorithm can have
> better worst-case asymptotic behavior when fed randomly ordered data.
> 
> So if anyone whines about this, tell 'em to go do something useful instead :-)

Right on all accounts :-)
msg150859 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-08 11:47
Christian Heimes wrote:
> Marc-Andre:
> Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.

I haven't done any profiling on this yet, but will run some
tests.

The lookup functions in the dict implementation are optimized
to make the first non-collision case fast. The patch doesn't touch this
loop. The only change is in the collision case, where an increment
and comparison is added (and then only after the comparison which
is the real cost factor in the loop). I did add a printf() to
see how often this case occurs - it's a surprisingly rare case,
which suggests that Tim, Christian and all the others that have
invested considerable time into the implementation have done
a really good job here.

BTW: I noticed that a rather obvious optimization appears to be
missing from the Python dict initialization code: when passing in
a list of (key, value) pairs, the implementation doesn't make
use of the available length information and still starts with an
empty (small) dict table and then iterates over the pairs, increasing
the table size as necessary. It would be better to start with a
table that is presized to O(len(data)). The dict implementation
already provides such a function, but it's not being used
in the dict(pair_list) case. Anyway, just an aside.
msg150865 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-08 14:23
> Randomness is hard to do correctly
> and is expensive. If we can avoid it, we should try very hard to do
> so...

os.urandom() is actually cheaper on Windows 7 here:

1000000 loops, best of 3: 1.78 usec per loop

than on Linux:

$ ./python -m timeit -s "import os" "os.urandom(16)"
100000 loops, best of 3: 4.85 usec per loop
$ ./python -m timeit -s "import os; f=os.open('/dev/urandom', os.O_RDONLY)" "os.read(f, 16)"
100000 loops, best of 3: 2.35 usec per loop

(note that the os.read timing is optimistic since I'm not checking the
return value!)

I don't know if the patch's startup overhead has to do with initializing
the crypto context or simply with looking up the symbols in advapi32.dll.
Perhaps we should link explicitly against advapi32.dll as suggested by
Martin?
msg150866 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-08 14:26
Again, Roundup ate up some of the text:

>PCbuild\amd64\python.exe  -m timeit -s "import os" "os.urandom(16)"
1000000 loops, best of 3: 1.81 usec per loop

(for the record, the Roundup issue is at http://psf.upfronthosting.co.za/roundup/meta/issue264 )
msg150934 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-09 12:16
Marc-Andre Lemburg wrote:
> 
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> Christian Heimes wrote:
>> Marc-Andre:
>> Have you profiled your suggestion? I'm interested in the speed implications. My gut feeling is that your idea could be slower, since you have added more instructions to a tight loop that is executed on every lookup, insert, update and deletion of a dict key. The hash modification could have a smaller impact, since the hash is cached. I'm merely speculating here until we have some numbers to compare.
> 
> I haven't done any profiling on this yet, but will run some
> tests.

I ran pybench and pystone: neither shows a significant change.

I wish we had a simple to run benchmark based on Django to allow
checking such changes against real world applications. Not that I
expect different results from such a benchmark...

To check the real world impact, I guess it would be best to
run a few websites with the patch for a week and see whether the
collision exception gets raised.
msg151012 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-10 11:37
Version 3 of my patch:
 - Add PYTHONHASHSEED environment variable to get a fixed seed or to
disable the randomized hash function (PYTHONHASHSEED=0)
 - Add tests on the randomized hash function
 - Add more tests on os.urandom()
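
[Editor's note: the seed control described above behaves like this. A sketch assuming a Python build carrying this patch, or any later release that shipped PYTHONHASHSEED, is installed as python3:]

```shell
# Two runs with the same fixed seed agree on hash('abc') ...
a=$(PYTHONHASHSEED=42 python3 -c "print(hash('abc'))")
b=$(PYTHONHASHSEED=42 python3 -c "print(hash('abc'))")
test "$a" = "$b" && echo "fixed seed: reproducible"

# ... while the default per-process random seed gives different values
c=$(python3 -c "print(hash('abc'))")
d=$(python3 -c "print(hash('abc'))")
test "$c" != "$d" && echo "random seed: differs per process"

# PYTHONHASHSEED=0 disables randomization entirely.
```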
msg151017 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-10 14:26
> Version 3 of my patch:
>  - Add PYTHONHASHSEED environment variable to get a fixed seed or to
> disable the randomized hash function (PYTHONHASHSEED=0)
>  - Add tests on the randomized hash function
>  - Add more tests on os.urandom()

You forgot random.c.

+        PyErr_SetString(PyExc_RuntimeError, "Fail to generate random
bytes");

I would use an OSError and preserve the errno.

+    def test_null_hash(self):
+        # PYTHONHASHSEED=0 disables the randomized hash
+        self.assertEqual(self.get_hash("abc", 0), -1600925533)
+
+    def test_fixed_hash(self):
+        # test a fixed seed for the randomized hash
+        self.assertEqual(self.get_hash("abc", 42), -206076799)

This is portable on both 32-bit and 64-bit builds?
msg151031 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-10 22:15
Patch version 4:
 - os.urandom() raises again exceptions on failure
 - drop support of VMS (which used RAND_pseudo_bytes from OpenSSL): I don't see how to link Python/random.c to libcrypto on VMS, I don't have VMS, and I don't see how it was working, because posixmodule.c was not linked to libcrypto either!?
 - fix test_dict, test_gdb, test_builtin
 - win32_urandom() handles size bigger than INT_MAX using a loop (it may be DWORD max instead?)
 - _PyRandom_Init() does nothing if it is called twice, to fix a _testembed failure (don't change the Unicode secret, because Python stores some strings somewhere and never destroys them)
msg151033 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-10 23:07
Patch version 5 fixes test_unicode for 64-bit system.
msg151047 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 09:28
STINNER Victor wrote:
> 
> Patch version 5 fixes test_unicode for 64-bit system.

Victor, I don't think the randomization idea is going anywhere. The
code has many issues:

 * it is exceedingly complex
 * the method would need to be implemented for all hashable
   Python types
 * it causes startup time to increase (you need urandom data for
   every single hashable Python data type)
 * it causes run-time to increase due to changes in the hash
   algorithm (more operations in the tight loop)
 * causes different processes in a multi-process setup to use different
   hashes for the same object
 * doesn't appear to work well with embedded interpreters that
   are regularly restarted (AFAIK, some objects persist across
   restarts and those will have wrong hash values in the newly started
   instances)

The most important issue, though, is that it doesn't really
protect Python against the attack - it only makes it less
likely that an adversary will find the init vector (or a way
around having to find it via cryptanalysis).

OTOH, the collision counting patch is very simple, doesn't have
the performance issues and provides real protection against the
attack. Even better still, it can detect programming errors in
hash method implementations.

IMO, it would be better to put efforts into refining the collision
detection patch (perhaps adding support for the universal hash
method slot I mentioned) and run some real life tests with it.
msg151048 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-11 09:56
>  * it is exceedingly complex

Which part exactly? For hash(str), it just adds two extra XORs.
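
[Editor's note: the two XORs in question can be modelled in pure Python. This is a sketch of the patch's scheme, not the C code: CPython's classic string hash with a random prefix folded into the initial state and a random suffix XORed into the result; the 64-bit mask is illustrative.]

```python
def randomized_hash(data: bytes, prefix: int, suffix: int) -> int:
    # Classic CPython string hash (x = x * 1000003 ^ c), plus the two
    # extra XORs from the patch: `prefix` seeds the initial state and
    # `suffix` is folded into the final value.
    mask = (1 << 64) - 1          # pretend we are on a 64-bit build
    if not data:
        return 0                  # the hash of an empty string stays 0
    x = (prefix ^ (data[0] << 7)) & mask
    for c in data:
        x = ((x * 1000003) ^ c) & mask
    x ^= len(data)
    return (x ^ suffix) & mask
```

Because the per-character loop is unchanged, the per-byte cost is identical to the unseeded hash; only two XORs are added outside the loop.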

>  * the method would need to be implemented for all hashable Python types

It was already discussed, and it was said that only hash(str) need to
be modified.

>  * it causes startup time to increase (you need urandom data for
>   every single hashable Python data type)

My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do
you have a benchmark showing a difference?

I didn't try my patch on Windows yet.

>  * it causes run-time to increase due to changes in the hash
>   algorithm (more operations in the tight loop)

I posted a micro-benchmark on hash(str) on python-dev: the overhead is
nil. Do you have numbers showing that the overhead is not nil?

>  * causes different processes in a multi-process setup to use different
>   hashes for the same object

Correct. If you need to get the same hash, you can disable the
randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g.
PYTHONHASHSEED=42).

>  * doesn't appear to work well with embedded interpreters that
>   are regularly restarted (AFAIK, some objects persist across
>   restarts and those will have wrong hash values in the newly started
>   instances)

test_capi runs _testembed, which restarts an embedded interpreter 3
times, and the test passes (with my patch version 5). Can you write a
script showing the problem if there is a real problem?

In an older version of my patch, the hash secret was recreated at each
initialization. I changed my patch to only generate the secret once.

> The most important issue, though, is that it doesn't really
> protect Python against the attack - it only makes it less
> likely that an adversary will find the init vector (or a way
> around having to find it via cryptanalysis).

I agree that the patch is not perfect. As written in the patch, it
just makes the attack more complex. I consider that it is enough.

Perl has a simpler protection than the one proposed in my patch. Is
Perl vulnerable to the hash collision vulnerability?
msg151061 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 14:34
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>>  * it is exceedingly complex
> 
> Which part exactly? For hash(str), it just adds two extra XORs.

I'm not talking specifically about your patch, but the whole idea
and the needed changes in general.

>>  * the method would need to be implemented for all hashable Python types
> 
> It was already discussed, and it was said that only hash(str) need to
> be modified.

Really ? What about the much simpler attack on integer hash values ?

You only have to send a specially crafted JSON dictionary with integer
keys to a Python web server providing JSON interfaces in order to
trigger the integer hash attack.

The same goes for the other Python data types.
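
[Editor's note: the integer case is easy to demonstrate. CPython reduces an int's hash modulo a fixed prime, exposed in later versions as sys.hash_info.modulus (2**61 - 1 on typical 64-bit builds), so colliding integer keys can be generated in bulk. A sketch:]

```python
import sys

# CPython hashes an int by reducing it modulo a fixed prime, so any two
# ints that differ by a multiple of that prime share a hash value.
M = sys.hash_info.modulus  # 2**61 - 1 on typical 64-bit builds

colliding_keys = [1 + i * M for i in range(1000)]
assert len({hash(k) for k in colliding_keys}) == 1

# Inserting them all forces the worst-case quadratic probing the attack
# exploits -- harmless at this size, painful at attack scale.
d = {k: None for k in colliding_keys}
assert len(d) == 1000
```

(As Mark Shannon notes below, the JSON vector specifically does not apply, since JSON object keys decode as strings.)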

>>  * it causes startup time to increase (you need urandom data for
>>   every single hashable Python data type)
> 
> My patch reads 8 or 16 bytes from /dev/urandom which doesn't block. Do
> you have a benchmark showing a difference?
> 
> I didn't try my patch on Windows yet.

Your patch only implements the simple idea of adding an init
vector and a fixed suffix vector (which you don't need since
it doesn't prevent hash collisions).

I don't think that's good enough, since
it doesn't change how the hash algorithm works on the actual
data, but instead just shifts the algorithm to a different
sequence. If you apply the same logic to the integer hash
function, you'll see that more clearly.

Paul's algorithm is much more secure in this respect, but it
requires more random startup data.

>>  * it causes run-time to increase due to changes in the hash
>>   algorithm (more operations in the tight loop)
> 
> I posted a micro-benchmark on hash(str) on python-dev: the overhead is
> nil. Do you have numbers showing that the overhead is not nil?

For the simple solution, that's an expected result, but if you want
more safety, then you'll see a hit due to the random data getting
XOR'ed in every single loop.

>>  * causes different processes in a multi-process setup to use different
>>   hashes for the same object
> 
> Correct. If you need to get the same hash, you can disable the
> randomized hash (PYTHONHASHSEED=0) or use a fixed seed (e.g.
> PYTHONHASHSEED=42).

So you have the choice of being able to work in a multi-process
environment and be vulnerable to the attack or not. I think we
can do better :-)

Note that web servers written in Python tend to be long running
processes, so an attacker has lots of time to test various
seeds.

>>  * doesn't appear to work well with embedded interpreters that
>>   are regularly restarted (AFAIK, some objects persist across
>>   restarts and those will have wrong hash values in the newly started
>>   instances)
> 
> test_capi runs _testembed, which restarts an embedded interpreter 3
> times, and the test passes (with my patch version 5). Can you write a
> script showing the problem if there is a real problem?
> 
> In an older version of my patch, the hash secret was recreated at each
> initialization. I changed my patch to only generate the secret once.

Ok, that should fix the case.

Two more issue that I forgot:

 * enabling randomized hashing can make debugging a lot harder, since
   it's rather difficult to reproduce the same state in a controlled
   way (unless you record the hash seed somewhere in the logs)

and even though applications should not rely on the order of dict
repr()s or str()s, they do often enough:

 * randomized hashing will result in repr() and str() of dictionaries
   to be random as well

>> The most important issue, though, is that it doesn't really
>> protect Python against the attack - it only makes it less
>> likely that an adversary will find the init vector (or a way
>> around having to find it via cryptanalysis).
> 
> I agree that the patch is not perfect. As written in the patch, it
> just makes the attack more complex. I consider that it is enough.

Wouldn't you rather see a fix that works for all hash functions
and Python objects? One that doesn't cause performance
issues?

The collision counting idea has this potential.

> Perl has a simpler protection than the one proposed in my patch. Is
> Perl vulnerable to the hash collision vulnerability?

I don't know what Perl did or how hashing works in Perl, so I cannot
comment on the effect of their fix. FWIW, I don't think that we
should use Perl or Java as reference here.
msg151062 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 14:45
> OTOH, the collision counting patch is very simple, doesn't have
> the performance issues and provides real protection against the
> attack.

I don't know about real protection: you can still slow down dict
construction by 1000x (the number of allowed collisions per lookup),
which can be enough when combined with a brute-force DoS.

Also, how about false positives? Having legitimate programs break
because of legitimate data would be a disaster.
msg151063 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-11 14:55
>>>  * the method would need to be implemented for all hashable Python types
>> It was already discussed, and it was said that only hash(str) need to
>> be modified.
> 
> Really ? What about the much simpler attack on integer hash values ?
> 
> You only have to send a specially crafted JSON dictionary with integer
> keys to a Python web server providing JSON interfaces in order to
> trigger the integer hash attack.

JSON objects are decoded as dicts with string keys, integer keys are
not possible.

 >>> json.loads(json.dumps({1:2}))
{'1': 2}
msg151064 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 15:41
Mark Shannon wrote:
> 
> Mark Shannon <mark@hotpy.org> added the comment:
> 
>>>>  * the method would need to be implemented for all hashable Python types
>>> It was already discussed, and it was said that only hash(str) need to
>>> be modified.
>>
>> Really ? What about the much simpler attack on integer hash values ?
>>
>> You only have to send a specially crafted JSON dictionary with integer
>> keys to a Python web server providing JSON interfaces in order to
>> trigger the integer hash attack.
> 
> JSON objects are decoded as dicts with string keys, integer keys are
> not possible.
> 
>  >>> json.loads(json.dumps({1:2}))
> {'1': 2}

Thanks for the correction. Looks like XML-RPC also doesn't accept
integers as dict keys. That's good :-)

However, as Paul already noted, such attacks can also occur in other
places or parsers in an application, e.g. when decoding FORM parameters
that use integers to signal a line or parameter position (example:
value_1=2&value_2=3...) which are then converted into a dictionary
mapping the position integer to the data.

marshal and pickle are vulnerable, but then you normally don't expose
those to untrusted data.
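A hypothetical parser of the shape Marc-Andre describes (the `value_N` naming and the `positions` helper are invented for illustration, not taken from any real framework) shows how attacker-controlled integer keys can arise:

```python
from urllib.parse import parse_qsl

def positions(query):
    """Map 'value_1=2&value_2=3' to {1: '2', 2: '3'} -- integer keys,
    so the attacker directly controls the values fed to hash(int)."""
    result = {}
    for name, value in parse_qsl(query):
        prefix, _, index = name.rpartition("_")
        if prefix == "value" and index.isdigit():
            result[int(index)] = value
    return result

assert positions("value_1=2&value_2=3") == {1: "2", 2: "3"}
```

Any such application-level conversion reopens the integer-hash attack surface even though JSON and XML-RPC themselves only produce string keys.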
msg151065 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 16:03
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> OTOH, the collision counting patch is very simple, doesn't have
>> the performance issues and provides real protection against the
>> attack.
> 
> I don't know about real protection: you can still slow down dict
> construction by 1000x (the number of allowed collisions per lookup),
> which can be enough combined with a brute-force DOS.

On my slow dev machine 1000 collisions run in around 22ms:

python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
100 loops, best of 3: 22.4 msec per loop

Using this for a DOS attack would be rather noisy, much unlike
sending a single POST.

Note that the choice of 1000 as the limit is rather arbitrary. I just
chose it because it's high enough that it's very unlikely to be
hit by an application that is not written to trigger it, and low
enough to still provide good run-time behavior. Perhaps an
even lower figure would be better.

> Also, how about false positives? Having legitimate programs break
> because of legitimate data would be a disaster.

Yes, which is why the patch should be disabled by default (using
an env var) in dot-releases. It's probably also a good idea to
make the limit configurable to adjust to ones needs.

Still, it is *very* unlikely that you run into real data causing
more than 1000 collisions for a single insert.

For full protection the universal hash method idea would have
to be implemented (adding a parameter to the hash methods, so
that they can be parametrized). This would then allow switching
the dict to an alternative hash implementation resolving the collision
problem, in case the implementation detects high number of
collisions.
msg151069 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 17:28
> On my slow dev machine 1000 collisions run in around 22ms:
> 
> python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
> 100 loops, best of 3: 22.4 msec per loop
> 
> Using this for a DOS attack would be rather noisy, much unlike
> sending a single POST.

Note that sending one POST is not enough, unless the attacker is content
with blocking *one* worker process for a couple of seconds or minutes
(which is a rather tiny attack if you ask me :-)). Also, you can combine
many dicts in a single JSON list, so that the 1000 limit isn't
exceeded for any of the dicts.

So in all cases the attacker would have to send many of these POST
requests in order to overwhelm the target machine. That's how DoS
attacks work AFAIK.

> Yes, which is why the patch should be disabled by default (using
> an env var) in dot-releases. It's probably also a good idea to
> make the limit configurable to adjust to ones needs.

Agreed, if it's disabled by default then it's not a problem, but then
Python is vulnerable by default...
msg151070 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-01-11 17:34
[Antoine]
> Also, how about false positives? Having legitimate programs break
> because of legitimate data would be a disaster.

This worries me, too.

[MAL]
> Yes, which is why the patch should be disabled by default (using
> an env var) in dot-releases.

Are you proposing having it enabled by default in Python 3.3?
msg151071 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 17:38
Mark Dickinson wrote:
> 
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> [Antoine]
>> Also, how about false positives? Having legitimate programs break
>> because of legitimate data would be a disaster.
> 
> This worries me, too.
> 
> [MAL]
>> Yes, which is why the patch should be disabled by default (using
>> an env var) in dot-releases.
> 
> Are you proposing having it enabled by default in Python 3.3?

Possibly, yes. Depends on whether anyone comes up with a problem in
the alpha, beta, RC release cycle.

It would be great to have the universal hash method approach for
Python 3.3. That way Python could heal itself in case it
finds too many collisions. My guess is that it's still better
to raise an exception, though, since it would uncover either
attacks or programming errors.
msg151073 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-11 18:05
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> On my slow dev machine 1000 collisions run in around 22ms:
>>
>> python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
>> 100 loops, best of 3: 22.4 msec per loop
>>
>> Using this for a DOS attack would be rather noisy, much unlike
>> sending a single POST.
> 
> Note that sending one POST is not enough, unless the attacker is content
> with blocking *one* worker process for a couple of seconds or minutes
> (which is a rather tiny attack if you ask me :-)). Also, you can combine
> many dicts in a single JSON list, so that the 1000 limit isn't
> overreached for any of the dicts.

Right, but such an approach only scales linearly and doesn't
exhibit the quadratic nature of the collision resolution.

The above with 10000 items takes 5 seconds on my machine.
The same with 100000 items is still running after 16 minutes.

> So in all cases the attacker would have to send many of these POST
> requests in order to overwhelm the target machine. That's how DOS
> attacks work AFAIK.

Depends :-) Hiding a few tens of such requests in the input stream
of a busy server is easy. Doing the same with thousands of requests
is a lot harder.

FWIW: The dict string for the above 100000-item case is just some
263kB, 114kB if gzip compressed.

>> Yes, which is why the patch should be disabled by default (using
>> an env var) in dot-releases. It's probably also a good idea to
>> make the limit configurable to adjust to ones needs.
> 
> Agreed if it's disabled by default then it's not a problem, but then
> Python is vulnerable by default...

Yes, but at least the user has an option to switch on the added
protection. We'd need some field data to come to a decision.
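The quadratic behaviour measured above can be reproduced without crafted string payloads by using a toy key type that forces every insert into the same bucket (a sketch of the effect only, not of the actual attack):

```python
import time

class Collider:
    """Toy key: every instance hashes to the same bucket, forcing the
    dict into its worst-case probe sequence."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return isinstance(other, Collider) and self.n == other.n

def build_time(n):
    # Time the construction of a dict with n all-colliding keys.
    start = time.perf_counter()
    {Collider(i): None for i in range(n)}
    return time.perf_counter() - start
```

Doubling the key count roughly quadruples the build time, which is why a payload of a few hundred kilobytes can keep a worker busy for minutes.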
msg151074 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 18:18
> [MAL]
> > Yes, which is why the patch should be disabled by default (using
> > an env var) in dot-releases.
> 
> Are you proposing having it enabled by default in Python 3.3?

I would personally prefer 3.3 and even 3.2 to have proper randomization
(either Paul's or Victor's or another proposal). Victor's proposal makes
fixing other hash functions very simple (there could even be helper
macros). The only serious concern IMO is startup time under Windows;
someone with Windows-fu should investigate that.

2.x maintainers might want to be more conservative, although disabling a
fix (the collision counter) by default doesn't sound very wise or
helpful to me.
(for completeness, the collision counter must also be added to sets,
btw)

It would be nice to hear from distro maintainers here.
msg151078 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-11 19:07
I've benchmarked Victor's patch and got the following results:

Report on Linux localhost.localdomain 2.6.38.8-desktop-9.mga #1 SMP Tue Dec 20 09:45:44 UTC 2011 x86_64 x86_64
Total CPU cores: 4

### call_simple ###
Min: 0.223778 -> 0.209204: 1.07x faster
Avg: 0.227634 -> 0.212437: 1.07x faster
Significant (t=15.40)
Stddev: 0.00291 -> 0.00248: 1.1768x smaller
Timeline: http://tinyurl.com/87vkdps

### fastpickle ###
Min: 0.484052 -> 0.499832: 1.03x slower
Avg: 0.487370 -> 0.507909: 1.04x slower
Significant (t=-8.40)
Stddev: 0.00261 -> 0.00481: 1.8446x larger
Timeline: http://tinyurl.com/7ntcudz

### float ###
Min: 0.052819 -> 0.051540: 1.02x faster
Avg: 0.054304 -> 0.052922: 1.03x faster
Significant (t=3.89)
Stddev: 0.00125 -> 0.00126: 1.0101x larger
Timeline: http://tinyurl.com/7rqfurw

### formatted_logging ###
Min: 0.252709 -> 0.257303: 1.02x slower
Avg: 0.254741 -> 0.259967: 1.02x slower
Significant (t=-4.90)
Stddev: 0.00155 -> 0.00181: 1.1733x larger
Timeline: http://tinyurl.com/8xu2zdt

### normal_startup ###
Min: 0.450661 -> 0.435943: 1.03x faster
Avg: 0.454536 -> 0.438212: 1.04x faster
Significant (t=9.41)
Stddev: 0.00327 -> 0.00209: 1.5661x smaller
Timeline: http://tinyurl.com/8ygw272

### nqueens ###
Min: 0.269426 -> 0.255306: 1.06x faster
Avg: 0.270105 -> 0.255844: 1.06x faster
Significant (t=28.63)
Stddev: 0.00071 -> 0.00086: 1.2219x larger
Timeline: http://tinyurl.com/823dwzo

### regex_compile ###
Min: 0.390307 -> 0.380736: 1.03x faster
Avg: 0.391959 -> 0.382025: 1.03x faster
Significant (t=8.93)
Stddev: 0.00194 -> 0.00156: 1.2395x smaller
Timeline: http://tinyurl.com/72shbzh

### silent_logging ###
Min: 0.060115 -> 0.057777: 1.04x faster
Avg: 0.060241 -> 0.058019: 1.04x faster
Significant (t=13.29)
Stddev: 0.00010 -> 0.00036: 3.4695x larger
Timeline: http://tinyurl.com/76bfguf

### unpack_sequence ###
Min: 0.000043 -> 0.000046: 1.07x slower
Avg: 0.000044 -> 0.000047: 1.06x slower
Significant (t=-107.47)
Stddev: 0.00000 -> 0.00000: 1.1231x larger
Timeline: http://tinyurl.com/6us6yys

The following not significant results are hidden, use -v to show them:
call_method, call_method_slots, call_method_unknown, fastunpickle, iterative_count, json_dump, json_load, nbody, pidigits, regex_effbot, regex_v8, richards, simple_logging, startup_nosite, threaded_count.


In short, any difference is in the noise.
msg151092 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-11 21:46
I must be missing something, but how is raising an exception when a collision threshold is reached a good thing?
Basically, we're just exchanging one DoS for another (just feed the server process ad-hoc data and it'll commit suicide). Sure, the caller can catch the exception to detect this, but what for? Restart the process, so that the attacker can just try again?
Also, there's the potential of perfectly legit applications breaking.
IMHO, randomization is the way to go, so that an attacker cannot generate a set of colliding values beforehand, which renders the attack impractical. The same idea is behind ASLR in modern kernels, and AFAICT it has been chosen by other implementations.
If such a patch has a negligible performance impact, then it should definitely be enabled by default. People who want deterministic hashing (maybe to bypass an application bug, or just because they want determinism) can disable it if they really want to.
msg151120 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-12 08:53
I'd like to add a few notes:

1. both 32-bit and 64-bit python are vulnerable
2. collision-counting will break other things
3. imho, randomization is the way to go, enabled by default.
4. do we need a steady hash-function later again?

I created ~500KB of colliding strings for both 32-bit and 64-bit python.
It works impressively well:

32bit: ~500KB payload keeps Django busy for >30 minutes.
64bit: ~500KB payload keeps Django busy for 5 minutes.

Django is more vulnerable than python-dict alone, because it
* converts the strings to unicode first, making the comparison more expensive
* does 5 dict-lookups per key.

So Python's dict of str alone is probably ~10x faster. Of course it's much harder to create the payload for 64-bit python than for 32-bit, but it works for both.

The collision-counting idea makes some sense in the web environment, but for other software types it can cause serious problems.

I don't want my software to stop working because someone managed to enter 1000 bad strings into it. Think of software that handles customer names or filenames. We don't want it to break completely just because someone entered a few clever names.

Randomization fixes most of these problems.

However, it breaks the steadiness of hash(X) between two runs of the same software. There's probably code out there that assumes that hash(X) always returns the same value: database- or serialization-modules, for example.

There might be good reasons to also have a steady hash-function available. The broken code is hard to fix if no such function is available at all. Maybe it's possible to add a second steady hash-function later again?

For the moment I think the best way is to turn on randomization of hash() by default, but having a way to turn it off.
msg151121 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-12 09:27
Frank Sievertsen wrote:
> 
> I don't want my software to stop working because someone managed to enter 1000 bad strings into it. Think of a software that handles names of customers or filenames. We don't want it to break completely just because someone entered a few clever names.

Collision counting is just a simple way to trigger an action. As I mentioned
in my proposal on this ticket, raising an exception is just one way to deal
with the problem in case excessive collisions are found. A better way is to
add a universal hash method, so that the dict can adapt to the data and
modify the hash functions for just that dict (without breaking other
dicts or changing the standard hash functions).
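A toy model of that universal-hash idea might look as follows (purely an illustrative sketch, not CPython's dict implementation; the class names and the per-hash-value counting strategy are invented):

```python
import random

class _Salted:
    """Wrapper mixing a per-dict salt into a key's hash."""
    __slots__ = ("salt", "key")
    def __init__(self, salt, key):
        self.salt, self.key = salt, key
    def __hash__(self):
        return hash((self.salt, self.key))
    def __eq__(self, other):
        return isinstance(other, _Salted) and self.key == other.key

class AdaptiveDict:
    """Count inserts that share a hash value and, past LIMIT, re-key
    every entry with a randomly salted hash instead of raising."""
    LIMIT = 1000

    def __init__(self):
        self._d = {}
        self._salt = None        # None => still using the plain hash()
        self._per_hash = {}

    def _wrap(self, key):
        return key if self._salt is None else _Salted(self._salt, key)

    def __setitem__(self, key, value):
        if self._salt is None:
            h = hash(key)
            self._per_hash[h] = self._per_hash.get(h, 0) + 1
            if self._per_hash[h] > self.LIMIT:
                self._rekey()
        self._d[self._wrap(key)] = value

    def __getitem__(self, key):
        return self._d[self._wrap(key)]

    def _rekey(self):
        # "Self-heal": switch this dict (and only this dict) to a
        # salted hash, leaving other dicts and hash() itself untouched.
        self._salt = random.getrandbits(64)
        self._d = {_Salted(self._salt, k): v for k, v in self._d.items()}
        self._per_hash.clear()
```

Only the dict that sees the pathological data pays for the re-keying; everything else keeps the standard, deterministic hash functions.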

Note that raising an exception doesn't completely break your software.
It just signals a severe problem with the input data and a likely
attack on your software. As such, it's no different than turning on DOS
attack prevention in your router.

In case you do get an exception, a web server will simply return a 500 error
and continue working normally.

For other applications, you may see a failure notice in your logs. If
you're sure that there are no possible ways to attack the application using
such data, then you can simply disable the feature to prevent such
exceptions.

> Randomization fixes most of these problems.

See my list of issues with this approach (further up on this ticket).

> However, it breaks the steadiness of hash(X) between two runs of the same software. There's probably code out there that assumes that hash(X) always returns the same value: database- or serialization-modules, for example.
> 
> There might be good reasons to also have a steady hash-function available. The broken code is hard to fix if no such a function is available at all. Maybe it's possible to add a second steady hash-functions later again?

This is one of the issues I mentioned.

> For the moment I think the best way is to turn on randomization of hash() by default, but having a way to turn it off.
msg151122 - (view) Author: Graham Dumpleton (grahamd) Date: 2012-01-12 10:02
Right back at the start it was said:

"""
We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.
"""

with a environment variable PYTHONHASHSEED still being mentioned towards the end.

Be aware that a user being able to set an environment variable which is used on Python interpreter initialisation when using mod_python or mod_wsgi is not as trivial as made out in the leading comment.

Setting an environment variable would have to be done in the Apache init.d scripts or, if the Apache distro still follows Apache Software Foundation conventions, in the 'envvars' file.

Having to do this requires root access and is inconvenient, especially since where it needs to be done differs between every distro.

Where there are other environment variables that are useful to set for interpreter initialisation, mod_wsgi has been changed in the past to add specific directives for the Apache configuration file to set them prior to interpreter initialisation. This at least makes it somewhat easier, but still only of help where you are the admin of the server.

If that approach is necessary, then although mod_wsgi could eventually add such a directive, as mod_python is dead it will never happen for it.

As to another question posed about whether mod_wsgi itself is doing anything to combat this, the answer is no, as I don't believe there is anything it can do. Values like the query string or POST data are simply passed through as-is and always pulled apart by the application.
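For concreteness, the 'envvars' route would look something like this (a hypothetical snippet; the file's location varies by distro, and the accepted values of PYTHONHASHSEED depend on which patch finally lands):

```shell
# Hypothetical addition to Apache's envvars file (e.g. /etc/apache2/envvars):
# every worker that apachectl starts inherits this before any Python
# interpreter is initialised, covering mod_wsgi daemon processes too.
export PYTHONHASHSEED=random   # or a fixed integer for multi-process setups
```

As noted, this requires root access and differs between distros, which is why a dedicated mod_wsgi directive would be more convenient.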
msg151157 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-13 00:08
Patch version 6:

 - remove debug code in dev_urandom() (it always raised an exception, for testing)
 - dev_urandom() raises an exception if open() fails
 - os.urandom() uses again the right exception type and message (instead of a generic exception)
 - os.urandom() is no longer linked to PYTHONHASHSEED
 - replace uint32_t by unsigned int in lcg_urandom() because Visual Studio 8 doesn't provide this type. "unsigned __int32" is available but I prefer to use a more common type. 32 and 64-bit types are supposed to generate the same number sequence (I didn't test).
 - fix more tests
 - regrtest.py restarts the process with PYTHONHASHSEED=randomseed if -r --randomseed=SEED is used
 - fix compilation on Windows (add random.c to the Visual Studio project file)
msg151158 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-13 00:36
I wrote bench_startup.py to measure the startup time on Windows. The precision of the script is quite bad because the Windows timer has a poor resolution (e.g. 15.6 ms on Windows 7) :-/

In release mode, the best startup time is 45.2 ms without random, 50.9 ms with random: an overhead of 5.6 ms (12%).

My script uses PYTHONHASHSEED=0 to disable the initialization of CryptoGen. You may modify the script to compare an unpatched Python with a patched Python for better numbers.

An overhead of 12% is significant. random-6.patch contains a faster (but also weaker) RNG on Windows, disabled at compilation time. Search "#if 1" at the end of random.c. It uses my linear congruential generator (LCG) initialized with gettimeofday() and getpid() (which are known to be weak) instead of CryptoGen. Using the LCG, the startup overhead is between 1 and 2%.
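The weak fallback described here can be sketched as follows (using the classic MSVC rand() LCG constants as an illustration; they are not necessarily the exact constants in random-6.patch):

```python
import os
import time

def lcg_bytes(n):
    """Generate n pseudo-random bytes from a 32-bit linear congruential
    generator seeded with the clock and the pid. Fast to start up, but
    cryptographically weak: both seed inputs are guessable."""
    x = (int(time.time() * 1e6) ^ (os.getpid() << 16)) & 0xffffffff
    out = bytearray()
    while len(out) < n:
        x = (x * 214013 + 2531011) & 0xffffffff   # MSVC rand() constants
        out.append((x >> 16) & 0xff)
    return bytes(out)
```

The trade-off is exactly the one under discussion: the 12% CryptoGen startup cost disappears, but the hash secret becomes one an attacker has a far better chance of guessing.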
msg151159 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-13 00:48
SafeDict.py: with this solution, the hash of the key has to be recomputed at each access to the dict (creation, get, set); the hash is not cached in the string object.
msg151167 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2012-01-13 10:17
Added some small comments in http://bugs.python.org/review/13703/show.
msg151353 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-16 12:45
The vulnerability is known since 2003 (Usenix 2003): read "Denial of
Service via Algorithmic Complexity Attacks" by Scott A. Crosby and Dan
S. Wallach.
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf

This paper compares Perl 5.8 hash function, MD5, UHASH (UMAC
universal), CW (Carter-Wegman) and XOR12. Read more about UMAC:
http://en.wikipedia.org/wiki/UMAC
"A UMAC has provable cryptographic strength and is usually a lot less
computationally intensive than other MACs."

oCERT advisory #2011-003: multiple implementations denial-of-service
via hash algorithm collision
http://www.ocert.org/advisories/ocert-2011-003.html

nRuns advisory:
http://www.nruns.com/_downloads/advisory28122011.pdf

CRuby 1.8.7 fix (use a randomized hash function):
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/branches/ruby_1_8_7/string.c?r1=34151&r2=34150&pathrev=34151
http://svn.ruby-lang.org/cgi-bin/viewvc.cgi?view=revision&revision=34151

JRuby uses Murmurhash and a hash (random) "seed" since JRuby 1.6.5.1:
https://github.com/jruby/jruby/commit/c1c9f95ed29cb93806fbc90e9eaabb9c406581e5
https://github.com/jruby/jruby/commit/2fc3a13c4af99be7f25f7dfb6ae3459505bb7c61
http://jruby.org/2011/12/27/jruby-1-6-5-1
JRUBY-6324: random seed for srand is not initialized properly:
https://github.com/jruby/jruby/commit/f7041c2636f46e398e3994fba2045e14a890fc14

Murmurhash:
https://sites.google.com/site/murmurhash/
pyhash implements Murmurhash:
http://code.google.com/p/pyfasthash/
msg151401 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2012-01-16 18:29
> The vulnerability is known since 2003 (Usenix 2003): read "Denial of
> Service via Algorithmic Complexity Attacks" by Scott A. Crosby and Dan
> S. Wallach.

Crosby started a meaningful thread on python-dev at that time similar to the current one:

  http://mail.python.org/pipermail/python-dev/2003-May/035874.html

It includes some good insight into the problem.
msg151402 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-16 18:58
Eric Snow wrote:
> 
> Eric Snow <ericsnowcurrently@gmail.com> added the comment:
> 
>> The vulnerability is known since 2003 (Usenix 2003): read "Denial of
>> Service via Algorithmic Complexity Attacks" by Scott A. Crosby and Dan
>> S. Wallach.
> 
> Crosby started a meaningful thread on python-dev at that time similar to the current one:
> 
>   http://mail.python.org/pipermail/python-dev/2003-May/035874.html
> 
> It includes some good insight into the problem.

Thanks for the pointer. Some interesting postings...

Vulnerability of applications:
http://mail.python.org/pipermail/python-dev/2003-May/035887.html

Speed of hashing, portability and practical aspects:
http://mail.python.org/pipermail/python-dev/2003-May/035902.html

Changing the hash function:
http://mail.python.org/pipermail/python-dev/2003-May/035911.html
http://mail.python.org/pipermail/python-dev/2003-May/035915.html
msg151419 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 01:53
Patch version 7:
 - Make PyOS_URandom() private (renamed to _PyOS_URandom)
 - os.urandom() releases the GIL during I/O in its implementation reading /dev/urandom
 - move _Py_unicode_hash_secret_t documentation into unicode_hash()

I also moved the test fixes into a separate patch: random_fix-tests.patch.
msg151422 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 02:10
Some tests are still failing with my 2 patches:
 - test_dis
 - test_inspect
 - test_json
 - test_packaging
 - test_ttk_textonly
 - test_urllib
msg151448 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 12:21
Patch version 8: the whole test suite now pass successfully.

The remaining question is if CryptoGen should be used instead of the
weak LCG initialized by gettimeofday() and getpid(). According to
Martin von Loewis, we must link Python statically against advapi32.dll.
That should speed up the startup.
msg151449 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 12:36
Hum, test_runpy fails with a segfault and/or a recursion limit error
because of my hack of rerunning regrtest.py to set the PYTHONHASHSEED
environment variable. The fork should be disabled if main() of
regrtest.py is called directly. Example:

diff --git a/Lib/test/regrtest.py b/Lib/test/regrtest.py
--- a/Lib/test/regrtest.py
+++ b/Lib/test/regrtest.py
@@ -258,7 +258,7 @@ def main(tests=None, testdir=None, verbo
          findleaks=False, use_resources=None, trace=False, coverdir='coverage',
          runleaks=False, huntrleaks=False, verbose2=False, print_slow=False,
          random_seed=None, use_mp=None, verbose3=False, forever=False,
-         header=False, failfast=False, match_tests=None):
+         header=False, failfast=False, match_tests=None, allow_fork=False):
     """Execute a test suite.

     This also parses command-line options and modifies its behavior
@@ -559,6 +559,11 @@ def main(tests=None, testdir=None, verbo
         except ValueError:
             print("Couldn't find starting test (%s), using all tests" % start)
     if randomize:
+        hashseed = os.getenv('PYTHONHASHSEED')
+        if (not hashseed and allow_fork):
+            os.environ['PYTHONHASHSEED'] = str(random_seed)
+            os.execv(sys.executable, [sys.executable] + sys.argv)
+            return
         random.seed(random_seed)
         print("Using random seed", random_seed)
         random.shuffle(selected)
@@ -1809,4 +1814,4 @@ if __name__ == '__main__':
     # change the CWD, the original CWD will be used. The original CWD is
     # available from support.SAVEDCWD.
     with support.temp_cwd(TESTCWD, quiet=True):
-        main()
+        main(allow_fork=True)

As Antoine wrote on IRC, regrtest.py should be changed later.
msg151468 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-17 16:23
#13712 contains a patch for test_packaging.
msg151472 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-17 16:35
> #13712 contains a patch for test_packaging.

It doesn't look related to the randomized hash function. random-8.patch
contains a fix for test_packaging.
msg151474 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-17 16:46
>> #13712 contains a patch for test_packaging.
> It doesn't look related to randomized hash function.
Trust me.  (If you read the whole report you’ll see why it looks unrelated: instead of sorting things like your patch does, mine addresses a more serious behavior bug).

> random-8.patch contains a fix to test_packaging.
I know, but mine is a bit better.
msg151484 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-17 19:59
To be more explicit about Marc-Andre Lemburg's msg151121 (which I agree with):

Count the collisions on a single lookup. 
If they exceed a threshold, do something different.

Marc-Andre's strawman proposal was threshold=1000, and raise.  It would be just as easy to say "whoa!  5 collisions -- time to use the alternative hash instead" (and, possibly, to issue a warning).

Even that slight tuning removes the biggest objection, because it won't ever actually fail.

Note that the use of a (presumably stronger 2nd) hash wouldn't come into play until (and unless) there was a problem for that specific key in that specific dictionary.  For the normal case, nothing changes -- unless we take advantage of the existence of a 2nd hash to simplify the first few rounds of collision resolution.  (Linear probing is more cache-friendly, but also more vulnerable to worst-case behavior -- but if probing stops at 4 or 8, that may not matter much.)  For quick scripts, the 2nd hash will almost certainly never be needed, so startup won't pay the penalty.

The only down side I see is that the 2nd (presumably randomized) hash won't be cached without another slot, which takes more memory and shouldn't be done in a bugfix release.
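A sketch of that per-lookup counting (a toy open-addressed table with invented names; CPython's real probe sequence is more involved than plain linear probing):

```python
FALLBACK = object()   # sentinel: "retry this lookup with the 2nd hash"

def lookup(buckets, key, mask, limit=5):
    """Probe a toy open-addressed table (a power-of-two list of
    (key, value) pairs or None). After 'limit' collisions, instead of
    raising, tell the caller to switch to the alternative hash."""
    i = hash(key) & mask
    for _ in range(limit):
        entry = buckets[i]
        if entry is None:
            raise KeyError(key)
        if entry[0] == key:
            return entry[1]
        i = (i + 1) & mask        # linear probing: cache-friendly
    return FALLBACK               # "whoa! 5 collisions" -- use 2nd hash
```

The normal case never reaches the fallback, so quick scripts pay nothing; only a dictionary under attack (or with pathological data) ever computes the second hash.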
msg151519 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-18 06:16
I like what you've done in #13704 better than what I see in random-8.patch so far.  See the code review comments I've left on both issues.
msg151528 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-18 10:01
> I like what you've done in #13704 better than what I see in random-8.patch so far.  see the code review comments i've left on both issues.

I didn't write the "3106cc0a2024.diff" patch attached to #13704; I just
clicked on the button to generate a patch from the repository.
Christian Heimes wrote the patch.

I don't really like "3106cc0a2024.diff", we don't need Mersenne
Twister to initialize the hash secret. The patch doesn't allow setting
a fixed secret if you need the same secret for a group of processes.
msg151560 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-18 18:59
STINNER Victor wrote:
> 
> Patch version 7:
>  - Make PyOS_URandom() private (renamed to _PyOS_URandom)
>  - os.urandom() releases the GIL for I/O operation for its implementation reading /dev/urandom
>  - move _Py_unicode_hash_secret_t documentation into unicode_hash()
> 
> I moved also fixes for tests in a separated patch: random_fix-tests.patch.

Don't you think that the number of corrections you have to apply in order
to get the tests working again shows how much impact such a change would
have on real-world applications?

Perhaps we should start to think about a compromise: make both the
collision counting and the hash seeding optional and let the user
decide which option is best.

BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
which needlessly complicates the code and doesn't add any additional
protection against hash value collisions.
msg151561 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-18 19:08
On Wed, Jan 18, 2012 at 10:59 AM, Marc-Andre Lemburg <report@bugs.python.org
> wrote:

>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> STINNER Victor wrote:
> >
> > Patch version 7:
> >  - Make PyOS_URandom() private (renamed to _PyOS_URandom)
> >  - os.urandom() releases the GIL for I/O operation for its
> implementation reading /dev/urandom
> >  - move _Py_unicode_hash_secret_t documentation into unicode_hash()
> >
> > I moved also fixes for tests in a separated patch:
> random_fix-tests.patch.
>
> Don't you think that the number of corrections you have to apply in order
> to get the tests working again shows how much impact such a change would
> have in real-world applications ?
>
> Perhaps we should start to think about a compromise: make both the
> collision counting and the hash seeding optional and let the user
> decide which option is best.
>

I like this, esp. if for old releases the collision counting is on by
default and the hash seeding is off by default, while in 3.3 both should be
on by default. Different env vars or flags should be used to enable/disable
them.

> BTW: The patch still includes the unnecessary
> _Py_unicode_hash_secret.suffix
> which needlessly complicates the code and doesn't add any additional
> protection against hash value collisions.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
>
msg151565 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 21:05
> I like this, esp. if for old releases the collision counting is on by
> default and the hash seeding is off by default, while in 3.3 both should be
> on by default. Different env vars or flags should be used to enable/disable
> them.

I would hope 3.3 only gets randomized hashing. Collision counting is a
hack to make bugfix releases 99.999%-compatible instead of 99.9% ;)
msg151566 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-18 21:10
On Wed, Jan 18, 2012 at 1:05 PM, Antoine Pitrou <report@bugs.python.org> wrote:

>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > I like this, esp. if for old releases the collision counting is on by
> > default and the hash seeding is off by default, while in 3.3 both should
> be
> > on by default. Different env vars or flags should be used to
> enable/disable
> > them.
>
> I would hope 3.3 only gets randomized hashing. Collision counting is a
> hack to make bugfix releases 99.999%-compatible instead of 99.9% ;)
>

Really? I'd expect the difference to be more than 2 nines. The randomized
hashing has two problems: (a) change in dict order; (b) hash varies between
processes. I cannot imagine counterexamples to the collision counting that
weren't constructed specifically as counterexamples.
msg151567 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 21:14
> Really? I'd expect the difference to be more than 2 nines. The randomized
> hashing has two problems: (a) change in dict order; (b) hash varies between
> processes.

Personally I don't think the change in dict order is a problem (hashing
already changes between 32-bit and 64-bit builds, and we sometimes
change the calculation too: it might change *more* often with random
hashes, while it went unnoticed in some cases before). So only (b) is a
problem and I don't think it affects more than 0.01% of
applications/users :)
msg151574 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-18 22:52
> Don't you think that the number of corrections you have to apply in order
> to get the tests working again shows how much impact such a change would
> have in real-world applications ?

Let's see the diffstat:

 Doc/using/cmdline.rst                       |    7
 Include/pythonrun.h                         |    2
 Include/unicodeobject.h                     |    6
 Lib/json/__init__.py                        |    4
 Lib/os.py                                   |   17 -
 Lib/packaging/create.py                     |    7
 Lib/packaging/tests/test_create.py          |   18 -
 Lib/test/mapping_tests.py                   |    2
 Lib/test/regrtest.py                        |    5
 Lib/test/test_builtin.py                    |    1
 Lib/test/test_dis.py                        |   36 ++-
 Lib/test/test_gdb.py                        |   11 -
 Lib/test/test_inspect.py                    |    1
 Lib/test/test_os.py                         |   35 ++-
 Lib/test/test_set.py                        |   25 ++
 Lib/test/test_unicode.py                    |   39 ++++
 Lib/test/test_urllib.py                     |   16 -
 Lib/test/test_urlparse.py                   |    6
 Lib/tkinter/test/test_ttk/test_functions.py |    2
 Makefile.pre.in                             |    1
 Modules/posixmodule.c                       |  126 ++-----------
 Objects/unicodeobject.c                     |   20 +-
 PCbuild/pythoncore.vcproj                   |    4
 Python/pythonrun.c                          |    3
 Python/random.c                             |  268 ++++++++++++++++++++++++++++
 25 files changed, 488 insertions(+), 174 deletions(-)

Except for Lib/packaging/create.py, all other changes are related to the
introduction of the randomized hash function or fix tests... Even the
Lib/packaging/create.py change is related to fixing tests. The test
could be changed differently, but I like the idea of packaging always
producing the same output (e.g. it is more readable for the user if
files are sorted).

I expected to have to do something on multiprocessing, but nope, it
doesn't care about the hash value.

So I expect something similar in applications: no change in the
applications, but a lot of hacks/tricks in tests.

> Perhaps we should start to think about a compromise: make both the
> collision counting and the hash seeding optional and let the user
> decide which option is best.

I don't think that we need two fixes for a single vulnerability (in
the same Python version), one is enough. If we decide to count
collisions, the randomized hash idea can be simply dropped. But we may
use a different fix for Python 3.3 and for stable versions (e.g. count
collisions for stable versions and use randomized hash for 3.3).

> BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
> which needlessly complicates the code and doesn't add any additional
> protection against hash value collisions

How does it complicate the code? It adds an extra XOR to hash(str) and
4 or 8 bytes in memory, that's all. It is more difficult to compute
the secret from hash(str) output if there is a prefix *and* a suffix.
If there is only a prefix, knowing a single hash(str) value is enough
to retrieve the secret directly.
> I don't think it affects more than 0.01% of applications/users :)

It would help to try a patched Python on a real world application like
Django to realize how much code is broken (or not) by a randomized
hash function.
msg151582 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-18 23:23
A possible advantage of having the 3.3 fix available in earlier versions is that people will be able to turn it on and have that be the *only* change -- just as with __future__ imports done one at a time.
msg151583 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-18 23:25
On Wed, Jan 18, 2012 at 1:10 PM, Guido van Rossum
<report@bugs.python.org> wrote:
> On Wed, Jan 18, 2012 at 1:05 PM, Antoine Pitrou <report@bugs.python.org> wrote:
> >
> > I would hope 3.3 only gets randomized hashing. Collision counting is a
> > hack to make bugfix releases 99.999%-compatible instead of 99.9% ;)
> >
>
> Really? I'd expect the difference to be more than 2 nines. The randomized
> hashing has two problems: (a) change in dict order; (b) hash varies between
> processes. I cannot imagine counterexamples to the collision counting that
> weren't constructed specifically as counterexamples.

For the purposes of 3.3 I'd prefer to just have randomized hashing and
not the collision counting in order to keep things from getting too
complicated.  But I will not object if we opt to do both.

As much as the counting idea rubs me wrong, even if it were on by
default I agree that most non-contrived things will never encounter it
and it is easy to document how to work around it by disabling it
should anyone actually be impeded by it.

The concern I have with that approach from a web service point of view
is that it too can be gamed in the more rare server situation of
someone managing to fill a persistent data structure up with enough
colliding values to be _close_ to the limit such that the application
then dies while trying to process all future requests that _hit_ the
limit (a persisting 500 error DOS rather than an exception raised only
in one offending request that deserved that 500 error anyways). Not
nearly as likely a scenario but it is one I'd keep an eye open for
with an attacker hat on.

MvL's suggestion of using AVL trees for hash bucket slots instead of
our linear slot finding algorithm is a better way to fix the ultimate
problem by never devolving into linear behavior at all. It is
naturally more complicated but could likely even be done while
maintaining ABI compatibility. I haven't pondered designs and
performance costs for that. Possibly a memory hit and one or two extra
indirect lookups in the normal case and some additional complexity in
the iteration case.

-gps
msg151584 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 23:30
> MvL's suggestion of using AVL trees for hash bucket slots instead of
> our linear slot finding algorithm is a better way to fix the ultimate
> problem by never devolving into linear behavior at all.

A dict can contain non-orderable keys; I don't know how an AVL tree
can fit into that.
msg151585 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-18 23:31
> A dict can contain non-orderable keys, I don't know how an AVL tree can
> fit into that.

good point!
msg151586 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-18 23:37
> As much as the counting idea rubs me wrong,

FWIW, the original 2003 paper reported that the url-caching system that 
they tested used collision-counting to evade attacks.
msg151589 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-18 23:44
On Wed, Jan 18, 2012 at 3:37 PM, Terry J. Reedy <report@bugs.python.org> wrote:

>
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
>
> > As much as the counting idea rubs me wrong,
>
> FWIW, the original 2003 paper reported that the url-caching system that
> they tested used collision-counting to evade attacks.

You mean as a fix or that they successfully attacked a collision-counting
system?
msg151590 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-18 23:46
> > As much as the counting idea rubs me wrong,
> 
> FWIW, the original 2003 paper reported that the url-caching system that 
> they tested used collision-counting to evade attacks.

I think that was DJB's DNS server/cache actually.
But deciding to limit collisions in a specific application is not the
same as limiting them in the general case. Python dicts have a lot of
use cases that are not limited to storing URL parameters, domain names
or instance attributes: there is a greater risk of meeting pathological
cases with legitimate keys.
msg151596 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-19 00:46
On Wed, Jan 18, 2012 at 3:46 PM, Antoine Pitrou <report@bugs.python.org> wrote:

>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> > > As much as the counting idea rubs me wrong,
> >
> > FWIW, the original 2003 paper reported that the url-caching system that
> > they tested used collision-counting to evade attacks.
>
> I think that was DJB's DNS server/cache actually.
> But deciding to limit collisions in a specific application is not the
> same as limiting them in the general case. Python dicts have a lot of
> use cases that are not limited to storing URL parameters, domain names
> or instance attributes: there is a greater risk of meeting pathological
> cases with legitimate keys.
>

Really? This sounds like FUD to me.
msg151604 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-19 01:15
> You mean as a fix or that they successfully attacked a collision-counting
> system?

Successful anticipation and blocking of the hash attack: after a chain
of 100, DJB's DNS cache 'treats the request as a cache miss'. What is
somewhat special for this app is being able to bail at that point.
Crosby & Wallach still think 'his fix could be improved', I presume by
using one of their recommended hashes.
http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf
section 3.2, DJB DNS server; section 5, fixes
msg151617 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-01-19 13:03
> Even Lib/packaging/create.py change is related to fixing tests. The test can be changed
> differently, but I like the idea of having always the same output in packaging (e.g. it is
> more readable for the user if files are sorted).

See #13712 for why this is a fake fix.
msg151620 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-19 13:13
I tried the collision counting with a low number of collisions:

less than 15 collisions
-----------------------

Fail at startup.

5 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f
10 collisions (32 buckets, 21 used=65.6%): hash=ceb3152f => f

dict((str(k), 0) for k in range(2000000))
-----------------------------------------

15 collisions (32,768 buckets, 18024 used=55.0%): hash=0e4631d2 => 31d2
20 collisions (131,072 buckets, 81568 used=62.2%): hash=12660719 => 719
25 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
30 collisions (1,048,576 buckets, 643992 used=61.4%): hash=6a1f6d21 => f6d21
35 collisions => ? (more than 10,000,000 integers)

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyz0123456789'

15 collisions (8192 buckets, 5083 used=62.0%): hash=1526677a => 77a
20 collisions (32768 buckets, 19098 used=58.3%): hash=5d7760e6 => 60e6
25 collisions => <unable to generate a new key>

random_dict('', 50000, charset, 1, 3)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 20572 used=62.8%): hash=789fe1e6 => 61e6
20 collisions (2048 buckets, 1297 used=63.3%): hash=2052533d => 33d
25 collisions => nope

random_dict('', 50000, charset, 1, 10)
--------------------------------------

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'

15 collisions (32768 buckets, 18964 used=57.9%): hash=94d7c4f5 => 44f5
20 collisions (32768 buckets, 21548 used=65.8%): hash=acb5b39e => 339e
25 collisions (8192 buckets, 5395 used=65.9%): hash=04d367ae => 7ae
30 collisions => nope

random_dict() comes from the following script:
***
import random

def random_string(charset, minlen, maxlen):
    strlen = random.randint(minlen, maxlen)
    return ''.join(random.choice(charset) for index in xrange(strlen))

def random_dict(prefix, count, charset, minlen, maxlen):
    dico = {}
    keys = set()
    for index in xrange(count):
        for tries in xrange(10000):
            key = prefix + random_string(charset, minlen, maxlen)
            if key in keys:
                continue
            keys.add(key)
            break
        else:
            raise ValueError("unable to generate a new key")
        dico[key] = None
    return dico

charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.=+_(){}%'
charset = 'abcdefghijklmnopqrstuvwxyz0123456789'
random_dict('', 50000, charset, 1, 3)
***

I ran the Django test suite. With a limit of 20 collisions, 60 tests
fail. With a limit of 50 collisions, there is no failure. But I don't
think that the test suite uses large data sets.

I also tried the Django test suite with a randomized hash function.
There are 46 failures. Many (all?) are related to the order of dict
keys: repr(dict) or indirectly in HTML output. I didn't analyze all
failures. I suppose that Django can simply run the test suite using
PYTHONHASHSEED=0 (disabling the randomized hash function), at least at
first.
msg151625 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 14:27
STINNER Victor wrote:
> ...
> So I expect something similar in applications: no change in the
> applications, but a lot of hacks/tricks in tests.

Tests usually check the output of an application given a certain
input. If those fail with the randomization, then it's likely that
real-world uses of the application will show the same kinds of
failures, due to the application changing from deterministic to
non-deterministic via the randomization.

>> BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
>> which needlessly complicates the code and doesn't add any additional
>> protection against hash value collisions
> 
> How does it complicate the code? It adds an extra XOR to hash(str) and
> 4 or 8 bytes in memory, that's all. It is more difficult to compute
> the secret from hash(str) output if there is a prefix *and* a suffix.
> If there is only a prefix, knowning a single hash(str) value is just
> enough to retrieve directly the secret.

The suffix only introduces a constant change in all hash values
output, so even if you don't know the suffix, you can still
generate data sets with collisions by knowing just the prefix.

>> I don't think it affects more than 0.01% of applications/users :)
> 
> It would help to try a patched Python on a real world application like
> Django to realize how much code is broken (or not) by a randomized
> hash function.

That would help for both approaches, indeed.

Please note that you'd have to extend the randomization to
all other Python data types as well in order to reach the same level
of security as the collision counting approach.

As-is the randomization patch does not solve the integer key attack and
even though parsers such as JSON and XML-RPC aren't directly affected,
it is well possible that stringified integers such as IDs are converted
back to integers later during processing, thereby triggering the
attack.

Note that the integer attack also applies to other number types
in Python:

--> (hash(3), hash(3.0), hash(3+0j))
(3, 3, 3)

See Tim's post I referenced earlier on for the reasons. Here's
a quick summary ;-) ...

--> {3:1, 3.0:2, 3+0j:3}
{3: 3}
msg151626 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-19 14:31
> Please note, that you'd have to extend the randomization to
> all other Python data types as well in order to reach the same level
> of security as the collision counting approach.

You also have to extend the collision counting to sets, by the way.
msg151628 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 14:37
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> Please note, that you'd have to extend the randomization to
>> all other Python data types as well in order to reach the same level
>> of security as the collision counting approach.
> 
> You also have to extend the collision counting to sets, by the way.

Indeed, but that's easy, since the set implementation derives from
the dict implementation.
msg151629 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-19 14:43
Django's tests will *not* be run with PYTHONHASHSEED=0. If they're broken with hash randomization, then they are likely broken on random.choice(["32-bit", "64-bit", "pypy", "jython", "ironpython"]), and we strive to run on all those platforms. If our tests are order-dependent then they're broken, and we'll fix the tests.

Further, most of the failures I can think of would be failures in the tests that wouldn't actually be failures in a real application, such as the rendered HTML being different because a tag's attributes are in a different order.
msg151632 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 15:11
STINNER Victor wrote:
> 
> I tried the collision counting with a low number of collisions:
> ... no false positives with a limit of 50 collisions ...

Thanks for running those tests. Looks like a limit lower than 1000
would already do just fine.

Some timings showing how long it would take to hit a limit:

# 100
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 100))"
100 loops, best of 3: 297 usec per loop

# 250
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 250))"
100 loops, best of 3: 1.46 msec per loop

# 500
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 500))"
100 loops, best of 3: 5.73 msec per loop

# 750
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 750))"
100 loops, best of 3: 12.7 msec per loop

# 1000
python2.7 -m timeit -n 100 "dict((x*(2**64 - 1), 1) for x in xrange(1, 1000))"
100 loops, best of 3: 22.4 msec per loop

These timings have to be matched against the size of the payload
needed to trigger those limits.

In any case, the limit needs to be configurable like the hash seed
in the randomization patch.
msg151633 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-19 15:13
[Reposting, since roundup removed part of the Python output]

M.-A. Lemburg wrote:
> Note that the integer attack also applies to other number types
> in Python:
> 
> --> (hash(3), hash(3.0), hash(3+0j))
> (3, 3, 3)
> 
> See Tim's post I referenced earlier on for the reasons. Here's
> a quick summary ;-) ...
> 
> --> {3:1, 3.0:2, 3+0j:3}
> {3: 3}
msg151647 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-19 18:05
> The suffix only introduces a constant change in all hash values
> output, so even if you don't know the suffix, you can still
> generate data sets with collisions by just having the prefix.

That's true. But without the suffix, I can pretty easily and efficiently guess the prefix by just seeing the result of a few well-chosen and short repr(dict(X)). I suppose that's harder with the suffix.
msg151662 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-20 00:38
Frank Sievertsen wrote:
> 
> Frank Sievertsen <python@sievertsen.de> added the comment:
> 
>> The suffix only introduces a constant change in all hash values
>> output, so even if you don't know the suffix, you can still
>> generate data sets with collisions by just having the prefix.
> 
> That's true. But without the suffix, I can pretty easy and efficient guess the prefix by just seeing the result of a few well-chosen and short repr(dict(X)). I suppose that's harder with the suffix.

Since the hash function is known, it doesn't make things much
harder. Without suffix you just need hash('') to find out what
the prefix is. With suffix, two values are enough.

Say P is your prefix and S your suffix. Let's say you can get the
hash values of A = hash('') and B = hash('\x00').

With Victor's hash function you have (IIRC):

A = hash('')     = P ^ (0<<7) ^ 0 ^ S = P ^ S
B = hash('\x00') = ((P ^ (0<<7)) * 1000003) ^ 0 ^ 1 ^ S = (P * 1000003) ^ 1 ^ S

Let X = A ^ B, then

X = P ^ (P * 1000003) ^ 1

since S ^ S = 0 and 0 ^ Y = Y (for any Y), i.e. the suffix doesn't
make any difference.

For P < 500000, you can then easily calculate P from X
using:

P = X // 1000002

(things obviously get tricky once overflow kicks in)
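The cancellation is easy to check numerically. Below is a minimal sketch of the simplified hash model used in this message (toy_hash, P and S are illustrative names, not CPython's actual implementation):

```python
MASK = (1 << 64) - 1          # assume a 64-bit hash

def toy_hash(s, prefix, suffix):
    # Simplified model of the randomized string hash discussed above:
    # seed with the prefix, mix each character in with the 1000003
    # multiplier, then XOR in the length and the suffix.
    h = prefix
    if s:
        h ^= ord(s[0]) << 7
        for ch in s:
            h = ((h * 1000003) & MASK) ^ ord(ch)
    h ^= len(s)
    return h ^ suffix

P, S = 123456, 0x1234567890ABCDEF     # secrets, "unknown" to the attacker
A = toy_hash('', P, S)
B = toy_hash('\x00', P, S)
X = A ^ B                             # the suffix S cancels out entirely
assert X == P ^ (P * 1000003) ^ 1
assert X // 1000002 == P              # prefix recovered from two hashes
```

The two asserts hold as long as P is small enough that no overflow occurs in the multiply, matching the caveat above.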

Note that for number hashes the randomization doesn't work at all,
since there's no length or feedback loop involved.

With Victor's approach hash(0) would output the whole seed,
but even if the seed is not known, creating an attack data
set is trivial, since hash(x) = P ^ x ^ S.
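A sketch of why the attack stays trivial under hash(x) = P ^ x ^ S (the table size 2**k and the secret values here are illustrative): the bucket index depends only on the low bits of the key, which the attacker fully controls, so the secrets never matter.

```python
# With hash(x) = P ^ x ^ S, the bucket index is (P ^ x ^ S) % 2**k.
# Keys that agree in their low k bits land in the same bucket no
# matter what the secret prefix P and suffix S are.
k = 10                                  # assume a table of 2**k slots
keys = [i << k for i in range(1000)]    # low k bits are all zero
P, S = 0x12345678, 0x9ABCDEF0           # arbitrary, "unknown" secrets
buckets = {(P ^ x ^ S) % (1 << k) for x in keys}
assert len(buckets) == 1                # all 1000 keys share one bucket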
msg151664 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-20 01:11
> Since the hash function is known, it doesn't make things much
> harder. Without suffix you just need hash('') to find out what
> the prefix is. With suffix, two values are enough.

With my patch, hash('') always returns zero. I don't remember who asked
me to do that, but it avoids leaking the secret too easily :-) I wrote
some info on how to compute the secret:
http://bugs.python.org/issue13703#msg150706

I don't see how to compute the secret, but that doesn't mean it is
impossible :-) I suppose that you have to brute-force some bits, at
least if you only have repr(dict), which gives only (indirectly) the
lower bits of the hash.

> (things obviously get tricky once overflow kicks in)

hash() doesn't overflow: if you know the string, you can run the
algorithm backward. To divide, you can compute 1/1000003 mod 2^32 (or
mod 2^64): 2021759595 and 16109806864799210091. So x/1000003 mod 2^32
= x*2021759595 mod 2^32.

See my invert_mod() function of:
https://bitbucket.org/haypo/misc/src/tip/python/mathfunc.py
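The quoted inverses are easy to verify without that script; a minimal check using Euler's theorem (every unit a modulo 2**k satisfies a**(2**(k-1)) == 1, so a**(2**(k-1) - 1) is its inverse):

```python
# Compute the modular inverses of the hash multiplier 1000003.
inv32 = pow(1000003, 2**31 - 1, 2**32)
inv64 = pow(1000003, 2**63 - 1, 2**64)
assert inv32 == 2021759595              # the value quoted above
assert (1000003 * inv32) % 2**32 == 1
assert (1000003 * inv64) % 2**64 == 1

# One multiply step of the hash loop can therefore be undone, which is
# what lets the algorithm be run backwards for a known string:
x = 123456789
step = (x * 1000003) % 2**32            # forward step
assert (step * inv32) % 2**32 == x      # undone by the inverse
```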

> With Victor's approach hash(0) would output the whole seed,
> but even if the seed is not known, creating an attack data
> set is trivial, since hash(x) = P ^ x ^ S.

I suppose that it would be too simple to compute the secret of a
randomized integer hash, so it may be better to leave integers
unchanged. Using a different secret for integers than for strings
would not protect Python against an attack using only integers, but
integer keys are less common than string keys (especially in web
applications).

Anyway, I changed my mind about randomized hash: I now prefer counting
collisions :-)
msg151677 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-20 04:58
>> That's true. But without the suffix, I can pretty easy and efficient
>> guess the prefix by just seeing the result of a few well-chosen and
>> short repr(dict(X)). I suppose that's harder with the suffix.

> Since the hash function is known, it doesn't make things much
> harder. Without suffix you just need hash('') to find out what
> the prefix is. With suffix, two values are enough

This is obvious and absolutely correct!

But that's not what I talked about. I didn't talk about the result of
hash(X), but about the result of repr(dict([(str1, val1), (str2,
val2), ...])), which is more likely to happen and not so trivial
(if you want to know more than the last 8 bits).

IMHO this problem shows that we can't advise using dict() or set() for
(potentially dangerous) user-supplied keys at the moment.

I prefer randomization because it fixes this problem. The
collision-counting->exception prevents the software from becoming slow,
but it doesn't make it work as expected.

Sure, you can catch the exception. But when you get the exception,
probably you wanted to add the items for a reason: Because you want
them to be in the dict and that's how your software works.

Imagine an irc-server using a dict to store the connected users, using
the nicknames as keys. Even if the irc-server catches the unexpected
exception while connecting a new user (when adding his/her name to the
dict), an attacker could connect 999 special-named users to prevent a
specific user from connecting in future.

Collision-counting->exception can make it possible to inhibit a
specific future add to the dict. The outcome is highly application
dependent.

I think it fixes 95% of the attack-vectors, but not all, and it adds a
few new risks. However, of course it's much better than doing nothing
to fix the problem.
msg151679 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 09:03
> A dict can contain non-orderable keys, I don't know how an AVL tree
> can fit into that.

They may be non-orderable, but since they are required to be hashable,
I guess one can build a comparison function with the following:

def cmp(x, y):
    if x == y:
        return 0
    elif hash(x) <= hash(y):
        return -1
    else:
        return 1

It doesn't yield a mathematical order because it lacks the
anti-symmetry property, but it should be enough for a binary search
tree.
msg151680 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-20 09:30
> They may be non-orderable, but since they are required to be hashable,
> I guess one can build a comparison function with the following:

Since we are trying to fix a problem where hash(X) == hash(Y), you
can't make them orderable by using the hash-values and build a binary
tree out of the (equal) hash-values.
msg151681 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 10:39
> Since we are are trying to fix a problem where hash(X) == hash(Y), you
> can't make them orderable by using the hash-values and build a binary
> out of the (equal) hash-values.

That's not what I suggested.
Keys would be considered equal if they are indeed equal (__eq__). The
hash value is just used to know if the key belongs to the left or the
right child tree. With a self-balanced binary search tree, you'd still
get O(log(N)) complexity.

Anyway, I still think that the hash randomization is the right way to
go, simply because it does solve the problem, whereas the collision
counting doesn't: Martin made a very good point on python-dev with his
database example.
msg151682 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-20 10:43
> The hash value is just used to know if the key belongs to the left
> or the right child tree.

Yes, that's what I don't understand: how can you do this when ALL
hash-values are equal?
msg151684 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 10:52
> Yes, that's what I don't understand: How can you do this, when ALL
> hash-values are equal.

You're right, that's stupid.
Short night...
msg151685 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-20 11:17
Charles-François Natali wrote:
> 
> Anyway, I still think that the hash randomization is the right way to
> go, simply because it does solve the problem, whereas the collision
> counting doesn't: Martin made a very good point on python-dev with his
> database example.

For completeness, I quote Martin here:

"""
The main issue with that approach is that it allows a new kind of attack.

An attacker now needs to find 1000 colliding keys, and submit them
one-by-one into a database. The limit will not trigger, as those are
just database insertions.

Now, if the application also has a need to read the entire database
table into a dictionary, that will suddenly break, and not for the
attacker (which would be ok), but for the regular user of the
application or the site administrator.

So it may be that this approach actually simplifies the attack, making
the cure worse than the disease.
"""

Martin is correct in that it is possible to trick an application
into building some data pool which can then be used as indirect
input for an attack.

What I don't see is what's wrong with the application raising
an exception in case it finds such data in an untrusted source
(reading arbitrary amounts of user data from a database is just
as dangerous as reading such data from any other source).

The exception will tell the programmer to be more careful and
patch the application not to read untrusted data without
additional precautions.

It will also tell the maintainer of the application that there
was indeed an attack on the system which may need to be
tracked down.

Note that the collision counting demo patch is trivial - I just
wanted to demonstrate how it works. As already mentioned, there's
room for improvement:

If Python objects were to provide an additional
method for calculating a universal hash value (based on an
integer input parameter), the dictionary in question could
use this to rehash itself and avoid the attack. Think of this
as "randomization when needed". (*)

Since the dict would still detect the problem, it could also
raise a warning to inform the maintainer of the application.

So you get the best of both worlds and randomization would only
kick in when it's really needed to keep the application running.
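As an illustration of this "randomization when needed" idea, here is a toy sketch (all names hypothetical; a real fix would live in the C dict implementation and use a proper per-type universal hash method): a chained hash table that re-salts and rehashes itself once any bucket chain grows past a limit.

```python
import random

def universal_hash(key, salt):
    # Stand-in for the per-object "universal hash" method suggested
    # above; any keyed hash would do here.
    h = salt
    for ch in str(key):
        h = ((h * 1000003) & 0xFFFFFFFF) ^ ord(ch)
    return h

class RehashingDict:
    """Toy chained hash table that re-salts itself when any bucket
    chain grows past LIMIT -- randomization only when needed."""
    LIMIT = 8

    def __init__(self, nbuckets=64):
        self.salt = 0                  # deterministic until attacked
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, key):
        return self.buckets[universal_hash(key, self.salt) % len(self.buckets)]

    def __setitem__(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        if len(chain) > self.LIMIT:    # possible attack: re-salt and rehash
            items = [kv for b in self.buckets for kv in b]
            self.salt = random.getrandbits(32)
            self.buckets = [[] for _ in self.buckets]
            for k, v in items:
                self._chain(k).append((k, v))

    def __getitem__(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)
```

Until the limit is hit, lookup and iteration stay fully deterministic; only a dict that actually sees a pathological collision chain pays for (and benefits from) the randomization.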
msg151689 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-20 12:58
> Note that the collision counting demo patch is trivial - I just
> wanted to demonstrate how it works. As already mentioned, there's
> room for improvement:
>
> If Python objects were to provide an additional
> method for calculating a universal hash value (based on an
> integer input parameter), the dictionary in question could
> use this to rehash itself and avoid the attack. Think of this
> as "randomization when needed".

Yes, the solution can be improved, but maybe not in stable versions
(the patch for stable versions should be short and simple).

If the hash output depends on an argument, the result cannot be
cached. So I suppose that dictionary lookups become slower once the
dictionary switches to the randomized mode. It would require adding an
optional argument to hash functions, or adding a new function to some (or
all?) builtin types.
msg151691 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2012-01-20 14:42
> So you get the best of both worlds and randomization would only
> kick in when it's really needed to keep the application running.

Of course, but then the collision counting approach loses its main
advantage over randomized hashing: smaller patch, easier to backport.
If you need to handle a potentially abnormal number of collisions
anyway, why not account for it upfront, instead of drastically
complicating the algorithm? While the patch is larger, randomization is
conceptually simpler.

The only argument in favor of collision counting is that it will not
break applications relying on dict order: it has been argued several
times that such applications are already broken, but that's of course
not an easy decision to make, especially for stable versions...
msg151699 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-20 17:31
Marc-Andre Lemburg:
>> So you get the best of both worlds and randomization would only
>> kick in when it's really needed to keep the application running.

Charles-François Natali
> The only argument in favor of collision counting is that it will not
> break applications relying on dict order:

There is also the "taxes suck" argument; if hashing is made complex,
then every object (or at least almost every string) pays a price, even
if it will never be stuck in a dict big enough to matter.

With collision counting, there are no additional operations unless and
until there is at least one collision -- in other words, after the
base hash algorithm has already started to fail for that particular
piece of data.

In fact, the base algorithm can be safely simplified further,
precisely because it does not need to be quite as adequate for
reprobes on data that does have at least one collision.
msg151700 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-01-20 17:39
On Thu, Jan 19, 2012 at 8:58 PM, Frank Sievertsen <report@bugs.python.org> wrote:

>
> Frank Sievertsen <python@sievertsen.de> added the comment:
>
> >> That's true. But without the suffix, I can pretty easy and efficient
> >> guess the prefix by just seeing the result of a few well-chosen and
> >> short repr(dict(X)). I suppose that's harder with the suffix.
>
> > Since the hash function is known, it doesn't make things much
> > harder. Without suffix you just need hash('') to find out what
> > the prefix is. With suffix, two values are enough
>
> This is obvious and absolutely correct!
>
> But it's not what I talked about. I didn't talk about the result of
> hash(X), but about the result of repr(dict([(str: val), (str:
> val)....])), which is more likely to happen and not so trivial
> (if you want to know more than the last 8 bits)
>
> IMHO this problem shows that we can't advise dict() or set() for
> (potentially dangerous) user-supplied keys at the moment.
>
> I prefer randomization because it fixes this problem. The
> collision-counting->exception prevents software from becoming slow,
> but it doesn't make it work as expected.
>

That depends. If collision counting prevents the DoS attack that may be
"work as expected", assuming you believe (as I do) that "real life" data
won't ever have that many collisions.

Note that every web service is vulnerable to some form of DoS where a
sufficient number of malicious requests will keep all available servers
occupied so legitimate requests suffer delays and timeouts. The defense is
to have sufficient capacity so that a potential attacker would need a large
amount of resources to do any real damage. The hash collision attack vastly
reduces the amount of resources needed to bring down a service; crashing
early moves the balance of power significantly back, and that's all we can
ask for.

Sure, you can catch the exception. But when you get the exception,
> probably you wanted to add the items for a reason: Because you want
> them to be in the dict and that's how your software works.
>

No, real data would never make this happen, so it's a "don't care" case (at
least for the vast majority of services). An attacker could also send you
such a large amount of data that your server runs out of memory, or starts
swapping (which is almost worse). But that requires for the attacker to
have enough bandwidth to send you that data. Or they could send you very
many requests. Same requirement.

All we need to guard for here is the unfortunate multiplication of the
attacker's effort due to the behavior of the collision-resolution code in
the dict implementation. Beyond that it's every app for itself.

> Imagine an irc-server using a dict to store the connected users, using
> the nicknames as keys. Even if the irc-server catches the unexpected
> exception while connecting a new user (when adding his/her name to the
> dict), an attacker could connect 999 special-named users to prevent a
> specific user from connecting in future.
>

Or they could use many other tactics. At this point the attack is specific
to this IRC implementation and it's no longer Python's responsibility.

> Collision-counting->exception can make it possible to inhibit a
> specific future add to the dict. The outcome is highly application
> dependent.
>
> I think it fixes 95% of the attack-vectors, but not all and it adds a
> few new risks. However, of course it's much better than doing nothing
> to fix the problem.
>

Right -- it vastly increases the effort needed to attack any particular
service, and does not affect any behavior of existing Python apps.
msg151701 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-20 17:42
On Fri, Jan 20, 2012 at 7:58 AM, STINNER Victor
> If the hash output depends on an argument, the result cannot be
> cached.

They can still be cached in a separate dict based on id, rather than
string contents.

It may also be possible to cache them in the dict itself; for a
string-only dict, the hash of each entry is already cached on the
object, and the cache member of the entry is technically redundant.
Entering a key with the alternative hash can also switch the lookup
function to one that handles that possibility, just as entering a
non-string key currently does.

> It would require adding an
> optional argument to hash functions, or adding a new function to some
> (or all?) builtin types.

For backports, the alternative hashing could be done privately within
dict and set, and would not require new slots on other types.
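Jim's id()-keyed caching idea could look roughly like this (the class and its layout are hypothetical; a real backport would do this in C inside dictobject.c):

```python
class SaltedHashCache:
    """Cache an alternative (salted) hash per object, keyed by id().
    Each entry keeps a reference to the object so its id cannot be
    reused while the cached value is alive."""

    def __init__(self, salt):
        self.salt = salt
        self._cache = {}  # id(obj) -> (obj, salted_hash)

    def salted_hash(self, obj):
        entry = self._cache.get(id(obj))
        if entry is None or entry[0] is not obj:
            entry = (obj, hash((self.salt, obj)))
            self._cache[id(obj)] = entry
        return entry[1]
```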
msg151703 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-20 18:11
I ran the test suite of Twisted 11.1 using a limit of 20 collisions:
there is no test failing because of hash collisions.
msg151707 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-20 22:55
On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> Demo patch implementing the collision limit idea for Python 2.7.
> 
> ----------
> Added file: http://bugs.python.org/file24151/hash-attack.patch
> 

Marc: is this the latest version of your patch?

Whether or not we go with collision counting and/or adding a random salt
to hashes and/or something else, I've had a go at updating your patch.

Although debate on python-dev seems to have turned against the
collision-counting idea, based on flaws reported by Frank Sievertsen
(http://mail.python.org/pipermail/python-dev/2012-January/115726.html),
it seemed to me to be worth at least adding some test cases to flesh out
the approach.  Note that the test cases deliberately avoid containing
"hostile" data.

Am attaching an updated version which:
  * adds various FIXMEs (my patch isn't ready yet, but I wanted to get
more eyes on this)

  * introduces a new TooManyHashCollisions exception, and uses that
rather than KeyError (currently it extends BaseException; am not sure
where it should sit in the exception hierarchy).

  * adds debug text to the above exception, including the repr() and
hash of the key for which the issue was triggered:
  TooManyHashCollisions: 1001 hash collisions within dict at key
ChosenHash(999, 42) with hash 42

  * moves exception-setting to a helper function, to avoid duplicated
code

  * adds a sys.max_dict_collisions, though currently with just a
copy-and-paste of the 1000 value from dictobject.c

  * starts adding a test suite to test_dict.py, using a ChosenHash
helper class (to avoid having to publish hostile data), and a context
manager for ensuring the timings of various operations fall within sane
bounds, so I can do things like this:
        with self.assertFasterThan(seconds=TIME_LIMIT) as cm:
            for i in range(sys.max_dict_collisions - 1):
                key = ChosenHash(i, 42)
                d[key] = 0

The test suite reproduces the TooManyHashCollisions response to a basic
DoS, and also "successfully" fails due to scenario 2 in Frank's email
above (assuming I understood his email correctly).

Presumably this could also incorporate a reproducer for scenario 1 in
this email, though I don't have one yet (but I don't want to make
hostile data public).

The patch doesn't yet do anything for sets.

Hope this is helpful
Dave
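The assertFasterThan helper itself isn't shown in the message; a minimal sketch of what such a timing guard might look like (name and semantics inferred from the usage above):

```python
import time
from contextlib import contextmanager

@contextmanager
def assert_faster_than(seconds):
    """Raise AssertionError if the with-block takes `seconds` or longer."""
    start = time.time()
    yield
    elapsed = time.time() - start
    if elapsed >= seconds:
        raise AssertionError("took %.3fs, expected under %.3fs"
                             % (elapsed, seconds))
```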
msg151714 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 03:16
On Fri, 2012-01-20 at 22:55 +0000, Dave Malcolm wrote:
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
> > Marc-Andre Lemburg <mal@egenix.com> added the comment:
> > 
> > Demo patch implementing the collision limit idea for Python 2.7.
> > 
> > ----------
> > Added file: http://bugs.python.org/file24151/hash-attack.patch
> > 
> 
> Marc: is this the latest version of your patch?
> 
> Whether or not we go with collision counting and/or adding a random salt
> to hashes and/or something else, I've had a go at updating your patch
> 
> Although debate on python-dev seems to have turned against the
> collision-counting idea, based on flaws reported by Frank Sievertsen
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html
> it seemed to me to be worth at least adding some test cases to flesh out
> the approach.  Note that the test cases deliberately avoid containing
> "hostile" data.

I had a brainstorm, and I don't yet know if the following makes sense,
but here's a crude patch with another approach, which might get around
the issues Frank raises.

Rather than count the number of equal-hash collisions within each call
to lookdict, instead keep a per-dict count of the total number of
iterations through the probe sequence (regardless of the hashing),
amortized across all calls to lookdict, and if it looks like we're going
O(n^2) rather than O(n), raise an exception.  Actually, that's not quite
it, but see below...

We potentially have 24 words of per-dictionary storage hiding in the
ma_smalltable area within PyDictObject, which we can use when ma_mask >=
PyDict_MINSIZE (when mp->ma_table != mp->ma_smalltable), without
changing sizeof(PyDictObject) and thus breaking ABI.  I hope there isn't
any code out there that uses this space.  (Anyone know of any?)

This very crude patch uses that area to add per-dict tracking of the
total number of iterations spent probing for a free PyDictEntry whilst
constructing the dictionary.  It rules that if we've gone more than (32
* ma_used) iterations whilst constructing the dictionary (counted across
all ma_lookup calls), then we're degenerating into O(n^2) behavior, and
this triggers an exception.  Any other usage of ma_lookup resets the
count (e.g. when reading values back).  I picked the scaling factor of
32 from out of the air; I hope there's a smarter threshold.  

I'm assuming that an attack scenario tends to involve a dictionary that
goes through a construction phase (which the attacker is aiming to
change from O(N) to O(N^2)), and then a usage phase, whereas there are
other patterns of dictionary usage in which insertion and lookup are
intermingled for which this approach wouldn't raise an exception.

This leads to exceptions like this:

AlgorithmicComplexityError: dict construction used 4951 probes for 99
entries at key 99 with hash 42

(i.e. the act of constructing a dict with 99 entries required traversing
4951 PyDictEntry slots, suggesting someone is sending deliberately
awkward data).

Seems to successfully handle both the original DoS and the second
scenario in Frank's email.  I don't have a reproducer for the first of
Frank's scenarios, but in theory it ought to handle it.  (I hope!)

Have seen two failures within python test suite from this, which I hope
can be fixed by tuning the thresholds and the reset events (they seem to
happen when a large dict is emptied).

May have a performance impact, but I didn't make any attempt to optimize
it (beyond picking a power of two for the scaling factor).

(There may be random bits of the old patch thrown in; sorry)

Thoughts? (apart from "ugh! it's ugly!" yes I know - it's late here)
Dave
msg151731 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 14:27
> Thoughts? (apart from "ugh! it's ugly!" yes I know - it's late here)

Is it guaranteed that no usage pattern can render this protection
ineffective? What if a dict is constructed by intermingling lookups and
inserts?
Similarly, what happens with e.g. the common use case of
defaultdict(list), where you append() after the lookup/insert? Does some
key distribution allow the attack while circumventing the protection?
msg151734 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2012-01-21 15:36
Hashing with a random seed is only marginally slower or more
complicated than the current version.

The patch is big because it moves random number generator initialization 
code around. There's no "per object" tax, and the cost of the random 
number generator initialization is only significant on Windows.
Basically, there's no "tax".

Zbyszek
msg151735 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 17:02
On Sat, 2012-01-21 at 14:27 +0000, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> > Thoughts? (apart from "ugh! it's ugly!" yes I know - it's late here)
> 
> Is it guaranteed that no usage pattern can render this protection
> ineffective? What if a dict is constructed by intermingling lookups and
> inserts?
> Similarly, what happens with e.g. the common use case of
> defaultdict(list), where you append() after the lookup/insert? Does some
> key distribution allow the attack while circumventing the protection?

Yes, I agree that I was making an unrealistic assumption about usage
patterns.  There was also some global state (the "is_inserting"
variable).

I've tweaked the approach somewhat, moved the global to be per-dict, and
am attaching a revised version of the patch:
   amortized-probe-counting-dmalcolm-2012-01-21-003.patch

In this patch, rather than reset the count each time, I keep track of
the total number of calls to insertdict() that have happened for each
"large dict" (i.e. for which ma_table != ma_smalltable), and the total
number of probe iterations that have been needed to service those
insertions/overwrites.  It raises the exception when the *number of
probe iterations per insertion* exceeds a threshold factor (rather than
merely comparing the number of iterations against the current ma_used of
the dict).  I believe this means that it's tracking and checking every
time the dict is modified, and (I hope) protects us against any data
that drives the dict implementation away from linear behavior (because
that's essentially what it's testing for).  [the per-dict stats are
reset each time that it shrinks down to using ma_smalltable again, but I
think at-risk usage patterns in which that occurs are uncommon relative
to those in which it doesn't].

When attacked, this leads to exceptions like this:
AlgorithmicComplexityError: dict construction used 1697 probes whilst
performing 53 insertions (len() == 58) at key 58 with hash 42

i.e we have a dictionary containing 58 keys, which has seen 53
insert/overwrite operations since transitioning to the non-ma_smalltable
representation (at size 6); presumably it has 128 PyDictEntries.
Servicing those 53 operations has required a total 1697 iterations
through the probing loop, or a little over 32 probes per insert.
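In pure-Python terms, the amortized check might look like this toy table (illustrative only: the real patch stores the two counters in the otherwise-unused ma_smalltable area and raises from C):

```python
class AmortizedProbeTable:
    """Toy open-addressing table: count insert/overwrite operations and
    total probe iterations, and raise once the running probes-per-insert
    ratio exceeds FACTOR (32 here, matching the patch's guess)."""

    FACTOR = 32

    def __init__(self, size=128):
        self.slots = [None] * size
        self.insertions = 0   # ops since the table became "large"
        self.probes = 0       # total probe iterations for those ops

    def insert(self, key, value):
        mask = len(self.slots) - 1
        i = hash(key) & mask
        while self.slots[i] is not None and self.slots[i][0] != key:
            self.probes += 1
            i = (i + 1) & mask
        self.slots[i] = (key, value)
        self.insertions += 1
        if self.probes > self.FACTOR * self.insertions:
            raise RuntimeError(
                "dict construction used %d probes whilst performing "
                "%d insertions" % (self.probes, self.insertions))
```

Benign data keeps the ratio near zero, so only degenerate key sets trip the check.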

I just did a full run of the test suite (using run_tests.py), and it
mostly passed the new tests I've added (included the test for scenario 2
from Frank's email).

There were two failures:
======================================================================
FAIL: test_inheritance (test.test_pep352.ExceptionClassTests)
----------------------------------------------------------------------
AssertionError: 1 != 0 : {'AlgorithmicComplexityError'} not accounted
for
----------------------------------------------------------------------
which is obviously fixable (given a decision on where the exception
lives in the hierarchy)

and this one:
test test_mutants crashed -- Traceback (most recent call last):
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/regrtest.py", line 1214, in runtest_inner
    the_package = __import__(abstest, globals(), locals(), [])
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 159, in <module>
    test(100)
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 156, in test
    test_one(random.randrange(1, 100))
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 132, in test_one
    dict2keys = fill_dict(dict2, range(n), n)
  File
"/home/david/coding/python-hg/cpython-count-collisions/Lib/test/test_mutants.py", line 118, in fill_dict
    Horrid(random.choice(candidates))
AlgorithmicComplexityError: dict construction used 2753 probes whilst
performing 86 insertions (len() == 64) at key Horrid(86) with hash 42
though that seems to be deliberately degenerate code.

Caveats:
* no overflow handling (what happens after 2**32 modifications to a
long-lived dict on a 32-bit build?) - though that's fixable.
* no idea what the scaling factor for the threshold should be (there may
also be a deep mathematical objection here, based on how big-O notation
is defined in terms of an arbitrary scaling factor and limit)
* not optimized; I haven't looked at performance yet
* doesn't cover set(), though that also has spare space (I hope) via its
own smalltable array.

BTW, note that although I've been working on this variant of the
collision counting approach, I'm not opposed to the hash randomization
approach, or to adding extra checks in strategic places within the
stdlib: I'm keen to get some kind of appropriate fix approved by the
upstream Python development community so I can backport it to the
various recent-to-ancient versions of CPython I support in RHEL (and
Fedora), before we start seeing real-world attacks.

Hope this is helpful
Dave
msg151737 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 17:07
(or combination of fixes, of course)
msg151739 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 18:57
> In this patch, rather than reset the count each time, I keep track of
> the total number of calls to insertdict() that have happened for each
> "large dict" (i.e. for which ma_table != ma_smalltable), and the total
> number of probe iterations that have been needed to service those
> insertions/overwrites.  It raises the exception when the *number of
> probe iterations per insertion* exceeds a threshold factor (rather than
> merely comparing the number of iterations against the current ma_used of
> the dict).

This sounds much more robust than the previous attempt.

> When attacked, this leads to exceptions like this:
> AlgorithmicComplexityError: dict construction used 1697 probes whilst
> performing 53 insertions (len() == 58) at key 58 with hash 42

We'll have to discuss the name of the exception and the error message :)

> Caveats:
> * no overflow handling (what happens after 2**32 modifications to a
> long-lived dict on a 32-bit build?) - though that's fixable.

How do you suggest to fix it?

> * no idea what the scaling factor for the threshold should be (there may
> also be a deep mathematical objection here, based on how big-O notation
> is defined in terms of an arbitrary scaling factor and limit)

I'd make the threshold factor a constant, e.g. 64 or 128 (it should not
be too small, to avoid false positives).
We're interested in the actual slowdown factor, which a constant factor
models adequately. It's the slowdown factor which makes a DoS attack
using this technique efficient. Whether or not dict construction truly
degenerates into an O(n**2) operation is less relevant.

There needs to be a way to disable it: an environment variable would be
the minimum IMO.
Also, in 3.3 there should probably be a sys function to enable or
disable it at runtime. Not sure it should be backported since it's a new
API.
msg151744 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 21:07
Well, the old attempt was hardly robust :)

Can anyone see any vulnerabilities in this approach?

Yeah; I was mostly trying to add raw data (to help me debug the
implementation).

I wonder if the dict statistics should be exposed with extra attributes
or a method on the dict; e.g. a __stats__ attribute, something like
this:

LargeDictStats(keys=58, mask=127, insertions=53, iterations=1697)

SmallDictStats(keys=3, mask=7)

or somesuch. Though that's a detail, I think.

> > Caveats:
> > * no overflow handling (what happens after 2**32 modifications to a
> > long-lived dict on a 32-bit build?) - though that's fixable.
> 
> How do you suggest to fix it?

If the dict is heading towards overflow of these counters, it's either
long-lived, or *huge*.

Possible approaches:
(a) use 64-bit counters rather than 32-bit, though that's simply
delaying the inevitable
(b) when one of the counters gets large, divide both of them by a
constant (e.g. 2).  We're interested in their ratio, so dividing both by
a constant preserves this.
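Approach (b) is only a few lines: whenever either counter nears the 32-bit cap, halve both (the cap and helper name here are illustrative):

```python
CAP = 2**32 - 1  # range of a 32-bit counter

def account_probes(probes, insertions, new_probes):
    """Add new_probes and one insertion; halve both counters when either
    nears CAP, which preserves their ratio (the quantity being tested)."""
    probes += new_probes
    insertions += 1
    if probes >= CAP or insertions >= CAP:
        probes >>= 1
        insertions >>= 1
    return probes, insertions
```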

By "a constant" do you mean from the perspective of big-O notation, or
do you mean that it should be hardcoded (I was wondering if it should be
a sys variable/environment variable etc?).

> We're interested in the actual slowdown factor, which a constant factor
> models adequately. It's the slowdown factor which makes a DOS attack
> using this technique efficient. Whether or not dict construction truely
> degenerates into a O(n**2) operation is less relevant.

OK.

> There needs to be a way to disable it: an environment variable would be
> the minimum IMO.

e.g. set it to 0 to enable it, set it to nonzero to set the scale
factor.
Any idea what to call it? 

PYTHONALGORITHMICCOMPLEXITYTHRESHOLD=0 would be quite a mouthful.

OK

BTW, presumably if we do it, we should do it for sets as well?
msg151745 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 22:20
> I wonder if the dict statistics should be exposed with extra attributes
> or a method on the dict; e.g. a __stats__ attribute, something like
> this:
> 
> LargeDictStats(keys=58, mask=127, insertions=53, iterations=1697)
> 
> SmallDictStats(keys=3, mask=7)

Sounds a bit overkill, and it shouldn't be a public API (which
__methods__ are). Even a private API on dicts would quickly become
visible, since dicts are so pervasive.

> > > Caveats:
> > > * no overflow handling (what happens after 2**32 modifications to a
> > > long-lived dict on a 32-bit build?) - though that's fixable.
> > 
> > How do you suggest to fix it?
> 
> If the dict is heading towards overflow of these counters, it's either
> long-lived, or *huge*.
> 
> Possible approaches:
> (a) use 64-bit counters rather than 32-bit, though that's simply
> delaying the inevitable

Well, even assuming one billion lookup probes per second on a single
dictionary, the inevitable will happen in 584 years with a 64-bit
counter (but only 4 seconds with a 32-bit counter).

A real issue, though, may be the cost of 64-bit arithmetic on 32-bit
CPUs.
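The arithmetic behind those figures, assuming 10**9 probes per second against a single dictionary:

```python
rate = 10**9                       # probes per second (assumed)
seconds_per_year = 365.25 * 24 * 3600

years_to_overflow_64 = 2**64 / rate / seconds_per_year   # ~584 years
seconds_to_overflow_32 = 2**32 / rate                    # ~4.3 seconds
```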

> (b) when one of the counters gets large, divide both of them by a
> constant (e.g. 2).  We're interested in their ratio, so dividing both by
> a constant preserves this.

Sounds good, although we may want to pull this outside of the critical
loop.

> By "a constant" do you mean from the perspective of big-O notation, or
> do you mean that it should be hardcoded (I was wondering if it should be
> a sys variable/environment variable etc?).

Hardcoded, as in your patch.

> > There needs to be a way to disable it: an environment variable would be
> > the minimum IMO.
> 
> e.g. set it to 0 to enable it, set it to nonzero to set the scale
> factor.

0 to enable it sounds misleading. I'd say:
- 0 to disable it
- 1 to enable it and use the default scaling factor
- >= 2 to enable it and set the scaling factor

> Any idea what to call it? 

PYTHONDICTPROTECTION?
Most people should either enable or disable it, not change the scaling
factor.

> BTW, presumably if we do it, we should do it for sets as well?

Yeah, and use the same env var / sys function.
msg151747 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-21 22:41
On Sat, 2012-01-21 at 22:20 +0000, Antoine Pitrou wrote:

> Sounds a bit overkill, and it shouldn't be a public API (which
> __methods__ are). Even a private API on dicts would quickly become
> visible, since dicts are so pervasive.

Fair enough.

> > > > Caveats:
> > > > * no overflow handling (what happens after 2**32 modifications to a
> > > > long-lived dict on a 32-bit build?) - though that's fixable.
> > > 
> > > How do you suggest to fix it?
> > 
> > If the dict is heading towards overflow of these counters, it's either
> > long-lived, or *huge*.
> > 
> > Possible approaches:
> > (a) use 64-bit counters rather than 32-bit, though that's simply
> > delaying the inevitable
> 
> Well, even assuming one billion lookup probes per second on a single
> dictionary, the inevitable will happen in 584 years with a 64-bit
> counter (but only 4 seconds with a 32-bit counter).
> 
> A real issue, though, may be the cost of 64-bit arithmetic on 32-bit
> CPUs.
> 
> > (b) when one of the counters gets large, divide both of them by a
> > constant (e.g. 2).  We're interested in their ratio, so dividing both by
> > a constant preserves this.
> 
> Sounds good, although we may want to pull this outside of the critical
> loop.

OK; I'll look at implementing (b).

Oops, yeah, that was a typo; I meant 0 to disable.

> - 0 to disable it
> - 1 to enable it and use the default scaling factor
> - >= 2 to enable it and set the scaling factor

You said above that it should be hardcoded; if so, how can it be changed
at run-time from an environment variable?  Or am I misunderstanding.

Works for me.

> > BTW, presumably if we do it, we should do it for sets as well?
> 
> Yeah, and use the same env var / sys function.

Despite the "DICT" in the title?  OK.

Thanks for the feedback.
msg151748 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-21 22:45
> You said above that it should be hardcoded; if so, how can it be changed
> at run-time from an environment variable?  Or am I misunderstanding.

You're right, I used the wrong word. I meant it should be a constant
independently of the dict size. But, indeed, not hard-coded in the
source.

> > > BTW, presumably if we do it, we should do it for sets as well?
> > 
> > Yeah, and use the same env var / sys function.
> 
> Despite the "DICT" in the title?  OK.

Well, dict is the most likely target for these attacks.
msg151753 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-21 23:42
On Sat, Jan 21, 2012 at 2:45 PM, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> You said above that it should be hardcoded; if so, how can it be changed
>> at run-time from an environment variable?  Or am I misunderstanding.
>
> You're right, I used the wrong word. I meant it should be a constant
> independently of the dict size. But, indeed, not hard-coded in the
> source.
>
>> > > BTW, presumably if we do it, we should do it for sets as well?
>> >
>> > Yeah, and use the same env var / sys function.
>>
>> Despite the "DICT" in the title?  OK.
>
> Well, dict is the most likely target for these attacks.
>

While true, I wouldn't make that claim, as there will be applications
using a set in a vulnerable manner. I'd prefer to see any such
environment variable name used to configure this behavior not mention
DICT or SET but just say HASHTABLE.  That is a much better bikeshed
color. ;)

I'm still in the hash seed randomization camp but I'm finding it
interesting all of the creative ways others are trying to "solve" this
problem in a way that could be enabled by default in stable versions
regardless. :)

-gps
msg151754 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-21 23:47
On Sat, Jan 21, 2012 at 5:42 PM, Gregory P. Smith <report@bugs.python.org> wrote:

>
> Gregory P. Smith <greg@krypto.org> added the comment:
>
> On Sat, Jan 21, 2012 at 2:45 PM, Antoine Pitrou <report@bugs.python.org>
> wrote:
> >
> > Antoine Pitrou <pitrou@free.fr> added the comment:
> >
> >> You said above that it should be hardcoded; if so, how can it be changed
> >> at run-time from an environment variable?  Or am I misunderstanding.
> >
> > You're right, I used the wrong word. I meant it should be a constant
> > independently of the dict size. But, indeed, not hard-coded in the
> > source.
> >
> >> > > BTW, presumably if we do it, we should do it for sets as well?
> >> >
> >> > Yeah, and use the same env var / sys function.
> >>
> >> Despite the "DICT" in the title?  OK.
> >
> > Well, dict is the most likely target for these attacks.
> >
>
> While true I wouldn't make that claim as there will be applications
> using a set in a vulnerable manner. I'd prefer to see any such
> environment variable name used to configure this behavior not mention
> DICT or SET but just say HASHTABLE.  That is a much better bikeshed
> color. ;)
>
> I'm still in the hash seed randomization camp but I'm finding it
> interesting all of the creative ways others are trying to "solve" this
> problem in a way that could be enabled by default in stable versions
> regardless. :)
>
> -gps

I'm a little slow, so bear with me, but David, does this counting scheme in
any way address the issue of:

I'm able to put N pieces of data into the database on successive requests,
but then *rendering* that data puts it in a dictionary, which renders that
page unviewable by anyone.
msg151756 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-22 02:13
5 more characters:
PYTHONHASHTABLEPROTECTION
or
PYHASHTABLEPROTECTION
maybe?

I'm in *both* camps: I like hash seed randomization fwiw.  I'm nervous
about enabling either of the approaches by default, but I can see myself
backporting both approaches into RHEL's ancient Python versions,
compiled in, disabled by default, but available at runtime via env vars
(assuming that no major flaws are discovered in my patch e.g.
performance).

I'm sorry if I'm muddying the waters by working on this approach.

Is the hash randomization approach ready to go, or is more work needed?
If the latter, is there a clear TODO list?
(for backporting to 2.*, presumably we'd want PyStringObject to be
randomized; I think this means that PyBytesObject needs to be randomized
also in 3.*; don't we need hash(b'foo') == hash('foo') ?).  Does the
patch need to also randomize the hashes of the numeric types? (I think
not; that may break too much 3rd-party code (NumPy?)).

[If we're bikeshedding,  I prefer the term "salt" to "seed" in the hash
randomization approach: there's a per-process "hash salt", which is
either randomly generated, or comes from the environment, set to 0 to
disable]
msg151758 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-22 03:43
On Sat, Jan 21, 2012 at 3:47 PM, Alex Gaynor <report@bugs.python.org> wrote:
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

This and the problems Frank mentions are my primary concerns about the
counting approach. Without the original suggestion of modifying the
hash and continuing without an exception (which has its own set of
problems), the "valid data python can't process" problem is a pretty
big one. Allowing attackers to poison interactions for other users is
unacceptable.

The other thing I haven't seen mentioned yet is that while it is true
that most web applications do have robust error handling to produce
proper 500s, an unexpected error will usually result in restarting the
server process - something that can carry significant weight by
itself. I would consider it a serious problem if every attack request
required a complete application restart, a la original cgi.

I'm strongly in favor of randomization. While there are many broken
applications in the wild that depend on dictionary ordering, if we
ship with this feature disabled by default for security and bugfix
branches, and enable it for 3.3, users can opt-in to protection as
they need it and as they fix their applications. Users who have broken
applications can still safely apply the security fix (without even
reading the release notes) because it won't change the default
behavior. Distro managers can make an appropriate choice for their
user base. Most importantly, it negates the entire "compute once,
attack everywhere" class of collision problems, even if we haven't
explicitly discovered them.
msg151794 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-01-23 00:22
@dmalcolm: How did you chose Py_MAX_AVERAGE_PROBES_PER_INSERT=32? Did you try your patch on applications like the test suite of Django or Twisted?
msg151796 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-23 03:48
On Sat, 2012-01-21 at 23:47 +0000, Alex Gaynor wrote:
> Alex Gaynor <alex.gaynor@gmail.com> added the comment:
> 
> On Sat, Jan 21, 2012 at 5:42 PM, Gregory P. Smith <report@bugs.python.org>wrote:
> 
> >
> > Gregory P. Smith <greg@krypto.org> added the comment:
> >
> > On Sat, Jan 21, 2012 at 2:45 PM, Antoine Pitrou <report@bugs.python.org>
> > wrote:
> > >
> > > Antoine Pitrou <pitrou@free.fr> added the comment:
> > >
> > >> You said above that it should be hardcoded; if so, how can it be changed
> > >> at run-time from an environment variable?  Or am I misunderstanding.
> > >
> > > You're right, I used the wrong word. I meant it should be a constant
> > > independently of the dict size. But, indeed, not hard-coded in the
> > > source.
> > >
> > >> > > BTW, presumably if we do it, we should do it for sets as well?
> > >> >
> > >> > Yeah, and use the same env var / sys function.
> > >>
> > >> Despite the "DICT" in the title?  OK.
> > >
> > > Well, dict is the most likely target for these attacks.
> > >
> >
> > While true I wouldn't make that claim as there will be applications
> > using a set in a vulnerable manner. I'd prefer to see any such
> > environment variable name used to configure this behavior not mention
> > DICT or SET but just say HASHTABLE.  That is a much better bikeshed
> > color. ;)
> >
> > I'm still in the hash seed randomization camp but I'm finding it
> > interesting all of the creative ways others are trying to "solve" this
> > problem in a way that could be enabled by default in stable versions
> > regardless. :)
> >
> > -gps
> >
> 
> I'm a little slow, so bear with me, but David, does this counting scheme in
> any way address the issue of:
> 
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

It doesn't address this issue - though if the page is taking many hours
to render, is that in practice less unviewable than everyone getting an
immediate exception with (perhaps) a useful error message?

Unfortunately, given the current scale factor, my patch may make it
worse: in my tests, this approach rejected malicious data much more
quickly than the old collision-counting one, which I thought was a good
thing - but then I realized that this means that an attacker adopting
the strategy you describe would have to do less work to trigger the
exception than to trigger the slowdown.  So I'm not convinced my
approach flies, and I'm leaning towards working on the hash
randomization patch rather than pursuing this.

I need sleep though, so I'm not sure the above is coherent.
Dave
msg151798 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-23 04:04
I arbitrarily started with 50, and then decided a power of two would be
quicker when multiplying.  There wasn't any rigorous analysis behind the
choice of factor.

Though, as noted in msg151796, I've gone off this idea, since I think
the "protection" creates additional avenues of attack.

I think getting some kind of hash randomization patch into the hands of
users ASAP is the way forward here (even if disabled by default).

If we're going to support shipping backported versions of the hash
randomization patch with the randomization disabled, did we decide on a
way of enabling it?  If not, then I propose that those who want to ship
with it disabled by default standardize on (say):

  PYTHONHASHRANDOMIZATION

as an environment variable: if set to nonzero, it enables hash
randomization (reading the random seed as per the 3.3. patch, and
respecting the PYTHONHASHSEED variable if that's also set).  If set to
zero or not present, hash randomization is disabled.

Does that sound sane?

(we can't use PYTHONHASHSEED for this, since if a number is given, that
means "use this number", right?)
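As a sketch, the proposed semantics could be modeled like this (pure-Python; the real logic would live in C at interpreter startup, and the helper name and return convention are illustrative only):

```python
import os

def read_hash_randomization_config(environ=os.environ):
    """Model of the proposed env var semantics (illustrative only).

    Returns None when randomization is disabled, otherwise a seed:
    0 means "pick a random seed at startup", nonzero means "use this
    exact seed" (for reproducible runs)."""
    flag = environ.get("PYTHONHASHRANDOMIZATION", "")
    if flag in ("", "0"):
        return None                  # disabled: deterministic hashes
    seed = environ.get("PYTHONHASHSEED")
    if seed is not None:
        return int(seed)             # explicit seed: reproducible
    return 0                         # enabled, no seed: randomize
```

So e.g. PYTHONHASHRANDOMIZATION=1 alone randomizes, and adding PYTHONHASHSEED=42 makes the randomization reproducible across runs.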

FWIW, I favor hash randomization in 2.* for PyStringObject,
PyUnicodeObject, PyBufferObject, and the 3 datetime classes in
Modules/_datetimemodule.c (see the implementation of generic_hash in
that file), but to not do it for the numeric types.

Sorry; I only tried it on the python test suite (and on a set of
reproducers for the DoS that I've written for RH's in-house test suite).
msg151812 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 13:07
Dave Malcolm wrote:
> 
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> On Fri, 2012-01-06 at 12:52 +0000, Marc-Andre Lemburg wrote:
>> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>>
>> Demo patch implementing the collision limit idea for Python 2.7.
>>
>> ----------
>> Added file: http://bugs.python.org/file24151/hash-attack.patch
>>
> 
> Marc: is this the latest version of your patch?

Yes. As mentioned in the above message, it's just a demo of how
the collision limit idea can be implemented.

> Whether or not we go with collision counting and/or adding a random salt
> to hashes and/or something else, I've had a go at updating your patch
> 
> Although debate on python-dev seems to have turned against the
> collision-counting idea, based on flaws reported by Frank Sievertsen
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html
> it seemed to me to be worth at least adding some test cases to flesh out
> the approach.  Note that the test cases deliberately avoid containing
> "hostile" data.

Martin's example is really just a red herring: it doesn't matter
where the hostile data originates or how it gets into the application.
There are many ways an attacker can get the O(n^2) worst case
timing triggered.

Frank's example is an attack on the second possible way to
trigger the O(n^2) behavior. See msg150724 further above where I
listed the two possibilities:

"""
An attack can be based on trying to find many objects with the same
hash value, or trying to find many objects that, as they get inserted
into a dictionary, very often cause collisions due to the collision
resolution algorithm not finding a free slot.
"""

My demo patch only addresses the first variant. In order to cover
the second variant as well, you'd have to count and limit the
number of iterations in the perturb for-loop of the lookdict()
functions where the hash value of the slot does not match the
key's hash value.

Note that the second variant is both a lot less likely to trigger
(due to the dict getting resized on a regular basis) and the
code involved a lot faster than the code for the first
variant (which requires a costly object comparison), so the
limit for the second variant would have to be somewhat higher
than for the first.

BTW: The collision counting patch chunk for the string dicts in my
demo patch is wrong. I've attached a corrected version. In the
original patch it was counting both collision variants with the
same counter and limit.
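For illustration, both variants are easy to provoke with integers on CPython (where int hashing is predictable); this is the idea behind the attached test script, sketched here for Python 3 using sys.hash_info rather than hard-coded 64-bit constants:

```python
import sys

# Variant 1: distinct objects with *equal* hash values.
# CPython hashes positive ints modulo sys.hash_info.modulus
# (2**61 - 1 on 64-bit platforms), so these all hash to 1.
m = sys.hash_info.modulus
equal_hash_keys = [1 + i * m for i in range(5)]
assert len({hash(k) for k in equal_hash_keys}) == 1

# Variant 2: distinct hash values that still map to the *same slot*.
# A small dict has 8 slots and indexes by hash(key) & 7, so these
# keys collide on the first probe even though their hashes differ.
same_slot_keys = [8 * i for i in range(5)]    # hashes 0, 8, 16, 24, 32
assert len({hash(k) for k in same_slot_keys}) == 5
assert len({hash(k) & 7 for k in same_slot_keys}) == 1
```

Variant 1 survives any table resize (equal hashes always collide); variant 2 depends on the current table size, which is why it triggers less often.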
msg151813 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 13:38
Alex Gaynor wrote:
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

I think you're asking a bit much here :-) A broken app is a broken
app, no matter how nicely Python tries to work around it. If an
app puts too much trust into user data, it will be vulnerable
one way or another and regardless of how the user data enters
the app.

These are the collision counting possibilities we've discussed
so far:

With a collision-counting exception you'd get a clear notice that
something in your data and your application is wrong and needs
fixing. The rest of your web app will continue to work fine and
you won't run into a DoS problem taking down all of your web
server.

With the proposed enhancement of collision counting + universal hash
function for Python 3.3, you'd get a warning printed to the logs, the
dict implementation would self-heal and your page is viewable nonetheless.
The admin would then see the log entry and get a chance to fix the
problem.

Note: Even if Python works around the problem successfully, there's no
guarantee that the data doesn't end up being processed by some other
tool in the chain with similar problems. All this is a work-around
for an application bug, nothing more. Silencing the problem
by e.g. using randomization in the string hash algorithm
doesn't really help in identifying the bug.

Overall, I don't think we should make Python's hash function
non-deterministic. Even with the universal hash function idea,
the dict implementation should use a predefined way of determining
the next hash parameter to use, so that running the application
twice against attack data will still result in the same data
output.
msg151814 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-23 13:40
> Frank's example is an attack on the second possible way to
> trigger the O(n^2) behavior. See msg150724 further above where I
> listed the two possibilities:
> 
> """
> An attack can be based on trying to find many objects with the same
> hash value, or trying to find many objects that, as they get inserted
> into a dictionary, very often cause collisions due to the collision
> resolution algorithm not finding a free slot.
> """

No, Frank's examples attack both possible ways.
msg151815 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-23 13:56
> With a collision-counting exception you'd get a clear notice that
> something in your data and your application is wrong and needs
> fixing. The rest of your web app will continue to work fine

Except when it doesn't, because you've also broken batch processing
functions and the like.

> Note: Even if Python works around the problem successfully, there's no
> guarantee that the data doesn't end up being processed by some other
> tool in the chain with similar problems.

Non-Python tools don't use Python's hash functions; they are therefore
not vulnerable to the same data.
msg151825 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 16:43
Here's a version of the collision counting patch that takes both hash
and slot collisions into account.

I've also added a test script which demonstrates both types of
collisions using integer objects (since it's trivial to calculate
their hashes).

To see the collision counting, enable the DEBUG_DICT_COLLISIONS
macro variable.
msg151826 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 16:45
> I've also added a test script which demonstrates both types of
> collisions using integer objects (since it's trivial to calculate
> their hashes).

I forgot to mention: the test script is for 64-bit platforms. It's
easy to adapt it to 32-bit if needed.
msg151847 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-23 21:31
I'm attaching an attempt at backporting haypo's random-8.patch to 2.7

Changes relative to random-8.patch:

   * The randomization is off by default, and must be enabled by setting
     a new environment variable PYTHONHASHRANDOMIZATION to a non-empty string.
     (if so then, PYTHONHASHSEED also still works, if provided, in the same
     way as in haypo's patch)

   * All of the various "Py_hash_t" become "long" again (Py_hash_t was
     added in 3.2: issue9778)

   * I expanded the randomization from just PyUnicodeObject to also cover
     these types:

     * PyStringObject

     * PyBufferObject

     The randomization does not cover numeric types: if we change the hash of
     int so that hash(i) no longer equals i, we also have to change it
     consistently for long, float, complex, decimal.Decimal and
     fractions.Fraction; however, there are 3rd-party numeric types that
     have their own __hash__ implementation that mimics int.__hash__ (see
     e.g. numpy)

     As noted in http://bugs.python.org/issue13703#msg151063 and
     http://bugs.python.org/issue13703#msg151064, it's not possible
     to directly create a dict with integer keys via JSON or XML-RPC.

     This seems like a tradeoff between the risk of attack via other means
     vs breakage induced by not having hash() == hash() for the various
     equivalent numerical representations in pre-existing code.

   * To support my expanded usage of the random secret, I moved:
       
       PyAPI_DATA(_Py_unicode_hash_secret_t) _Py_unicode_hash_secret

     from unicodeobject.h to object.h and renamed it to:

       PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

     This also exposes it for usage by C extension modules, just in case
     they need it (Murphy's Law suggests we will need it if we don't expose
     it).   This is an extension of the API, but warranted, I feel.  My
     plan for downstream RHEL is to add this explicitly to the RPM metadata
     as a "Provides" of the RPM providing libpython.so so that if something
     needs to use it, it can express a "Requires" on it; I assume that
     something similar is possible with .deb)

   * generalized test_unicode.HashTest to support the new env var and the
     additional types.  In my version, get_hash takes a _repr string rather
     than an object, so that I can test it with a buffer().  Arguably the
     tests should thus be moved from test_unicode to somewhere else, but this
     location keeps things consistent with haypo's patch.

     haypo: in random-8.patch, within test_unicode.HashTest.test_null_hash,
     "hash_empty" seems to be misnamed

   * dropped various selftest fixes where the corresponding selftests don't
     exist in 2.7

   * adds a description of the new environment variables to the manpage;
     arguably this should be done for the patch for the default branch also

Caveats:

   * only tested on Linux (Fedora 15 x86_64); not tested on Windows.  Tested
     via "make test" both with and without PYTHONHASHRANDOMIZATION=1

   * not yet benchmarked

 Doc/using/cmdline.rst                      |   28 ++
 Include/object.h                           |    7 
 Include/pythonrun.h                        |    2 
 Lib/lib-tk/test/test_ttk/test_functions.py |    2 
 Lib/os.py                                  |   19 -
 Lib/test/mapping_tests.py                  |    2 
 Lib/test/regrtest.py                       |    5 
 Lib/test/test_gdb.py                       |   15 +
 Lib/test/test_inspect.py                   |    1 
 Lib/test/test_os.py                        |   47 +++-
 Lib/test/test_unicode.py                   |   55 +++++
 Makefile.pre.in                            |    1 
 Misc/python.man                            |   22 ++
 Modules/posixmodule.c                      |  126 ++----------
 Objects/bufferobject.c                     |    8 
 Objects/object.c                           |    2 
 Objects/stringobject.c                     |    8 
 Objects/unicodeobject.c                    |   17 +
 PCbuild/pythoncore.vcproj                  |    4 
 Python/pythonrun.c                         |    2 
 b/Python/random.c                          |  284 +++++++++++++++++++++++++++++
 21 files changed, 510 insertions(+), 147 deletions(-)
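For reference, the string-salting scheme used in random-8.patch (and carried over into this backport) can be modeled in pure Python: the classic multiply-xor string hash, with the random secret's prefix and suffix mixed in at the ends. This is an approximate model (64-bit C arithmetic simulated with a mask), not the patch itself:

```python
MASK = 2**64 - 1   # model unsigned 64-bit C arithmetic

def salted_string_hash(data, prefix, suffix):
    """Approximate model of the randomized string hash: the classic
    multiply-xor loop over the bytes, salted with a random prefix and
    suffix from the per-process secret."""
    if not data:
        return 0   # note: the empty string hashes to 0 regardless of secret
    x = (prefix ^ (data[0] << 7)) & MASK
    for byte in data:
        x = ((1000003 * x) ^ byte) & MASK
    x ^= len(data)
    x ^= suffix
    return x & MASK
```

Within one process (one secret) the hash stays deterministic; across processes with different secrets the values differ, which is what defeats precomputed collision sets. Since the multiplier is odd (hence invertible mod 2**64), different prefixes are guaranteed to give different hashes for the same input.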
msg151850 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-01-23 21:39
> To see the collision counting, enable the DEBUG_DICT_COLLISIONS
> macro variable.

Running (part of (*)) the test suite with debugging enabled on a 64-bit
machine shows that slot collisions are much more frequent than
hash collisions, which only account for less than 0.01% of all
collisions.

It also shows that slot collisions in the low 1-10 range are
most frequent, with very few instances of a dict lookup
reaching 20 slot collisions (less than 0.0002% of all
collisions).

The great number of cases with 1 or 2 slot collisions surprised
me. It seems there is still room for improvement in
the perturbation formula.

Due to the large number of 1 or 2 slot collisions, the patch
is going to cause a minor hit to dict lookup performance.
It may make sense to unroll the slot search loop and only
start counting after the third round of misses.

(*) I stopped the run after several hours run-time, producing
some 148GB log data.
msg151867 - (view) Author: Paul McMillan (PaulMcMillan) Date: 2012-01-24 00:14
> I think you're asking a bit much here :-) A broken app is a broken
> app, no matter how nice Python tries to work around it. If an
> app puts too much trust into user data, it will be vulnerable
> one way or another and regardless of how the user data enters
> the app.

I notice your patch doesn't include fixes for the entire standard
library to work around this problem. Were you planning on writing
those, or leaving that for others?

As a developer, I honestly don't know how I can state with certainty
that input data is clean or not, until I actually see the error you
propose. I can't check validity before the fact, the way I can check
for invalid unicode before storing it in my database. Once I see the
error (probably only after my application is attacked, certainly not
during development), it's too late. My application can't know which
particular data triggered the error, so it can't delete it. I'm
reduced to trial and error to remove the offending data, or to writing
code that never stores more than 1000 things in a dictionary. And I
have to accept that the standard library may not work on any
particular data I want to process, and must write code that detects
the error state and somehow magically removes the offending data.

The alternative, randomization, simply means that my dictionary
ordering is not stable, something that is already the case.

While I appreciate that the counting approach feels cleaner;
randomization is the only solution that makes practical sense.
msg151869 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-24 00:42
On Mon, Jan 23, 2012 at 4:39 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> Running (part of (*)) the test suite with debugging enabled on a 64-bit
> machine shows that slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.

Even 1 in 10,000 seems pretty high, though I suppose it is a result of
non-random input.  (For a smalldict with 8 == 2^3 slots, on a 64-bit
machine, true hash collisions "should" only account for 1 in 2^61 slot
collisions.)

> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).

Thus the argument that collisions > N implies (possibly malicious)
data that really needs a different hash -- and that this dict instance
in particular should take the hit to use an alternative hash.  (Do
note that this alternative hash could be stored in the hash member of
the PyDictEntry; if anything actually *equal* to the key comes along,
it will have gone through just as many collisions, and therefore also
have been rehashed.)

> The great number of cases with 1 or 2 slot collisions surprised
> me. It seems that there's potential for improvement of
> the perturbation formula left.

In retrospect, this makes sense.

    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        i = (i << 2) + i + perturb + 1;

If two objects collided then they have the same last few bits
in their hashes -- which means they also have the same last few bits
in their initial perturb.  And since the first probe is to slot 6i+1
(which is always odd, mod the power-of-two table size), only half the
slots are even considered until the second probe.

Also note that this explains why Randomization could make the Django
tests fail, even though 64-bit users haven't complained.  The initial
hash(&mask) is the same, and the first probe is the same, and (for a
small enough dict) so are the next several.  In a dict with 2^12
slots, the first 6 tries will be the same ... so I doubt the test
cases have sufficiently large amounts of sufficiently unlucky data to
notice very often -- unless the hash itself is changed, as in the
patch.
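The recurrence quoted above is easy to simulate, which makes the funneling visible directly (a pure-Python model of lookdict's probing, not the C code):

```python
PERTURB_SHIFT = 5

def probe_slots(h, mask, n=6):
    """First n slot indices probed for a key with hash h in a table of
    mask+1 slots, following the quoted recurrence from lookdict()."""
    slots = []
    i = h & mask
    perturb = h
    for _ in range(n):
        slots.append(i & mask)
        i = (i << 2) + i + perturb + 1   # i.e. i = 5*i + perturb + 1
        perturb >>= PERTURB_SHIFT
    return slots

# Hashes 1 and 9 agree in their low bits; in an 8-slot table they probe
# exactly the same slots, illustrating how slot collisions persist.
assert probe_slots(1, 7) == probe_slots(9, 7)

# The second probe lands on slot 6i+1 (mod table size), which is always
# odd, so only half the slots are candidates at that step.
assert all(probe_slots(h, 7)[1] % 2 == 1 for h in range(32))
```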
msg151870 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-24 00:44
On Mon, Jan 23, 2012 at 1:32 PM, Dave Malcolm <report@bugs.python.org> wrote:
>
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
>
> I'm attaching an attempt at backporting haypo's random-8.patch to 2.7
>
> Changes relative to random-8.patch:
>
>   * The randomization is off by default, and must be enabled by setting
>     a new environment variable PYTHONHASHRANDOMIZATION to a non-empty string.
>     (if so then, PYTHONHASHSEED also still works, if provided, in the same
>     way as in haypo's patch)
>
>   * All of the various "Py_hash_t" become "long" again (Py_hash_t was
>     added in 3.2: issue9778)
>
>   * I expanded the randomization from just PyUnicodeObject to also cover
>     these types:
>
>     * PyStringObject
>
>     * PyBufferObject
>
>     The randomization does not cover numeric types: if we change the hash of
>     int so that hash(i) no longer equals i, we also have to change it
>     consistently for long, float, complex, decimal.Decimal and
>     fractions.Fraction; however, there are 3rd-party numeric types that
>     have their own __hash__ implementation that mimics int.__hash__ (see
>     e.g. numpy)
>
>     As noted in http://bugs.python.org/issue13703#msg151063 and
>     http://bugs.python.org/issue13703#msg151064, it's not possible
>     to directly create a dict with integer keys via JSON or XML-RPC.
>
>     This seems like a tradeoff between the risk of attack via other means
>     vs breakage induced by not having hash() == hash() for the various
>     equivalent numerical representations in pre-existing code.

Exactly.  I would NOT worry about hash repeatability for integers and
complex data structures.  It is not at the core of the common problem
(maybe a couple application specific problems but not a general "all
python web apps" severity problem).

Doing it for base byte string and unicode string like objects is
sufficient.  Good catch on doing it for buffer objects, I'd forgotten
about those. ;)  A big flaw with haypo's patch is that it only
considers unicode instead of all byte-string-ish stuff.  (the code in
issue13704 does that better).

>
>   * To support my expanded usage of the random secret, I moved:
>
>       PyAPI_DATA(_Py_unicode_hash_secret_t) _Py_unicode_hash_secret
>
>     from unicodeobject.h to object.h and renamed it to:
>
>       PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;
>
>     This also exposes it for usage by C extension modules, just in case
> >     they need it (Murphy's Law suggests we will need it if we don't expose
>     it).   This is an extension of the API, but warranted, I feel.  My
>     plan for downstream RHEL is to add this explicitly to the RPM metadata
>     as a "Provides" of the RPM providing libpython.so so that if something
>     needs to use it, it can express a "Requires" on it; I assume that
>     something similar is possible with .deb)

Exposing this is good.  There is a hash table implementation within
modules/expat/xmlparse.c that should probably use it as well.

>   * generalized test_unicode.HashTest to support the new env var and the
>     additional types.  In my version, get_hash takes a _repr string rather
>     than an object, so that I can test it with a buffer().  Arguably the
>     tests should thus be moved from test_unicode to somewhere else, but this
>     location keeps things consistent with haypo's patch.
>
>     haypo: in random-8.patch, within test_unicode.HashTest.test_null_hash,
>     "hash_empty" seems to be misnamed

Lets move this to a better location in all patches.  At this point
haypo's patch is not done yet so relevant bits of what you are doing
here is likely to be fed back into the eventual 3.3 tip patch.

-gps
msg151939 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 11:05
I'm attaching a patch which implements a hybrid approach:
  hybrid-approach-dmalcolm-2012-01-25-001.patch

This is a blend of various approaches from the discussion, taking aspects of both hash randomization *and* collision-counting.

It incorporates code from
  amortized-probe-counting-dmalcolm-2012-01-21-003.patch
  backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
  random-8.patch
along with ideas from:
  http://mail.python.org/pipermail/python-dev/2012-January/115812.html

The patch is against the default branch (although my primary goal here is eventual backporting).

As per haypo's random-8.patch, a randomization seed is read at startup.

By default, the existing hash() values are preserved, and no randomization is performed until a dict comes under attack.  This preserves existing behaviors (such as dict ordering) under non-attack conditions.

For large dictionaries, it reuses the ma_smalltable area to track the amortized cost of all modifications to this dictionary.

When the cost exceeds a set threshold, we convert the dictionary's ma_lookup function from lookdict/lookdict_unicode to a "paranoid" variant.  These variants ignore the hash passed in, and instead uses a new function:
   PyObject_RandomizedHash(obj)
to give a second hash value, which is a fixed value for a given object within the process, but not predictable to an attacker for the most high-risk types (PyUnicodeObject and PyBytesObject).

This patch is intended as a base for backporting, and takes it as given that we can't expand PyTypeObject or hide something in one of the Py*Methods tables; iirc we've run out of tp_flags in 2.*, hence we're forced to implement PyObject_RandomizedHash via direct ob_type comparison, for the most high-risk types.  

As noted in http://bugs.python.org/issue13703#msg151870:

> I would NOT worry about hash repeatability for integers and
> complex data structures.  It is not at the core of the common problem
> (maybe a couple application specific problems but not a general "all
> python web apps" severity problem).

> Doing it for base byte string and unicode string like objects is
> sufficient.

[We can of course implement hash randomization by default in 3.3, but I care more about getting a fix into the released branches ASAP]

Upon transition of a dict to paranoid mode, the hash values become unpredictable to an attacker, and all PyDictEntries are rebuilt based on the new hash values.

Handling the awkward case within custom ma_lookup functions allows us to move most of the patch from out of the fast path, and lookdict/lookdict_unicode only need minimal changes (stat gathering for the above cost analysis tracking).

Once a dict has transitioned to paranoid mode, it isn't using PyObject_Hash anymore, and thus isn't using objects' cached hash values; it performs a more expensive calculation, but I believe this calculation is essentially constant-time.

This preserves hash() and dict order for the cases where you're not under attack, and gracefully handles the attack without having to raise an exception: it doesn't introduce any new exception types.

It preserves ABI, assuming no-one else is reusing ma_smalltable.

It is suitable for backporting to 3.2, 2.7, and earlier (I'm investigating fixing this going all the way back to Python 2.2)

Under the old implementation, there were 4 kinds of PyDictObject, given these two booleans:
  * "small vs large", i.e. ma_table == ma_smalltable vs ma_table != ma_smalltable
  * "all keys are str" vs arbitrary keys, i.e. ma_lookup == lookdict_unicode vs lookdict

Under this implementation, this doubles to 8 kinds, adding the boolean:
  * normal hash vs randomized hash (normal vs "paranoid").

This is expressed via the ma_lookup callback, adding two new variants, lookdict_unicode_paranoid and lookdict_paranoid.

Note that if a paranoid dict goes small again (ma_table == ma_smalltable), it stays paranoid.  This is for simplicity: it avoids having to rebuild all of the non-randomized me_hash values again (which could fail).

Naturally the patch adds selftests.  I had to add some diagnostic methods to support them; dict gains _stats() and _make_paranoid() methods, and sys gains a _getrandomizedhash() method.  These could be hidden more thoroughly if need be (see DICT_PROTECTION_TRACKING in dictobject.c).  Amongst other things, the selftests measure wallclock time taken for various dict operations (and so might introduce failures on a heavily-loaded machine, I guess).

Hopefully this approach is a viable way forward.
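The control flow above can be modeled as a pure-Python toy (not the C patch: the class, the threshold handling, and the XOR salting are illustrative only — the real patch switches to a genuinely different hash via PyObject_RandomizedHash rather than XORing a salt, and real dicts also resize):

```python
import random

PERTURB_SHIFT = 5

class HybridDict:
    """Toy open-addressing table modeling the hybrid scheme: use the
    ordinary deterministic hash until one operation sees more than
    `threshold` collisions, then rebuild every entry with a random
    per-instance salt.  Illustrative only: no resizing."""

    def __init__(self, nslots=8, threshold=32):
        self.slots = [None] * nslots      # entries are (key, value) pairs
        self.threshold = threshold
        self.salt = None                  # None => normal deterministic mode

    def _hash(self, key):
        h = hash(key)
        if self.salt is not None:         # "paranoid" mode
            h ^= self.salt
        return h

    def _probe(self, key):
        """Return (slot, collisions) for key's matching or empty slot."""
        mask = len(self.slots) - 1
        h = self._hash(key)
        i = h & mask
        perturb = h
        collisions = 0
        while True:
            entry = self.slots[i & mask]
            if entry is None or entry[0] == key:
                return i & mask, collisions
            collisions += 1
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT

    def __setitem__(self, key, value):
        slot, collisions = self._probe(key)
        self.slots[slot] = (key, value)
        if self.salt is None and collisions > self.threshold:
            # Too much probing: go paranoid and rebuild, analogous to
            # flipping ma_lookup to the *_paranoid variant.
            self.salt = random.getrandbits(64)
            entries = [e for e in self.slots if e is not None]
            self.slots = [None] * len(self.slots)
            for k, v in entries:
                s, _ = self._probe(k)
                self.slots[s] = (k, v)

    def __getitem__(self, key):
        entry = self.slots[self._probe(key)[0]]
        if entry is None:
            raise KeyError(key)
        return entry[1]
```

With threshold=0 and two colliding int keys (0 and 8 in an 8-slot table), the second insert trips the switch; all existing entries remain reachable afterwards, and no exception is ever raised.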

Caveats and TODO items:

TODO: I haven't yet tuned the safety threshold.  According to http://bugs.python.org/issue13703#msg151850:
> slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.
>
> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).

This suggests that the threshold of 32 slot/hash collisions per lookup may already be high enough.

TODO: in a review of an earlier version of the complexity detection idea, Antoine Pitrou suggested making the protection scale factor a run-time configurable value, rather than a #define.  This isn't done yet.

TODO: run more extensive tests (e.g. Django and Twisted), monitoring the worst-case complexity that's encountered

TODO: not yet benchmarked and optimized.  I want to get feedback on the approach before I go in and hand-optimize things (e.g. by hand-inlining check_iter_count, and moving the calculations out of the loop etc).  I believe any performance issues ought to be fixable, in that we can get the cost of this for the "we're not under attack" case to be negligible, and the "under attack" case should transition from O(N^2) to O(N), albeit with a larger constant factor.

TODO: this doesn't cover sets, but assuming this approach works, the patch can be extended to cover it in an analogous way.

TODO: should it cover PyMemoryViewObject, buffer object, etc?

TODO: should it cover the hashing in Modules/expat/xmlparse.c?  FWIW I rip this code out when doing my downstream builds in RHEL and Fedora, and instead dynamically link against a system copy of expat

TODO: only tested on Linux so far (which is all I've got).  Fedora 15 x86_64 fwiw

 Doc/using/cmdline.rst     |    6 
 Include/bytesobject.h     |    2 
 Include/object.h          |    8 
 Include/pythonrun.h       |    2 
 Include/unicodeobject.h   |    2 
 Lib/os.py                 |   17 --
 Lib/test/regrtest.py      |    5 
 Lib/test/test_dict.py     |  298 +++++++++++++++++++++++++++++++++++++
 Lib/test/test_hash.py     |   53 ++++++
 Lib/test/test_os.py       |   35 +++-
 Makefile.pre.in           |    1 
 Modules/posixmodule.c     |  126 ++-------------
 Objects/bytesobject.c     |    7 
 Objects/dictobject.c      |  369 +++++++++++++++++++++++++++++++++++++++++++++-
 Objects/object.c          |   37 ++++
 Objects/unicodeobject.c   |   51 ++++++
 PCbuild/pythoncore.vcproj |    4 
 Python/pythonrun.c        |    3 
 Python/sysmodule.c        |   16 +
 b/Python/random.c         |  268 +++++++++++++++++++++++++++++++++
 20 files changed, 1173 insertions(+), 137 deletions(-)
msg151941 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 12:45
I've found a bug in my patch; insertdict writes the old non-randomized
hash value into me_hash at:
        ep->me_hash = hash;
rather than using the randomized hash, leading to issues when tested
against a real attack.

I'm looking into fixing it.
msg151942 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-25 12:47
On Wed, Jan 25, 2012 at 7:45 AM, Dave Malcolm <report@bugs.python.org> wrote:

>
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
>
> I've found a bug in my patch; insertdict writes the old non-randomized
> hash value into me_hash at:
>        ep->me_hash = hash;
> rather than using the randomized hash, leading to issues when tested
> against a real attack.
>
> I'm looking into fixing it.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
>

What happens if I have a dict with str keys that goes into paranoid mode,
and I then do:

class A(object):
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return self.s == other
    def __hash__(self):
        return hash(self.s)

d[A("some str that's a key in d")]

Is it still able to find the value?
msg151944 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-25 13:12
> Is it still able to find the value?

Probably not. :( 

That's exactly why I stopped thinking about all two-hash-functions or rehashing ideas.
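The failure mode Alex describes can be simulated in pure Python, assuming a toy lookup structure with a made-up secondary hash for str keys (nothing below comes from any of the attached patches):

```python
# Simulation of the lookup failure discussed above: once a dict stores its
# str keys under a secondary hash, a wrapper object whose __hash__
# delegates to plain hash() probes the wrong bucket. hash2() is a
# stand-in for a randomized hash, not the one from any patch.
def hash2(s):
    return hash(s) ^ 0x5BD1E995        # always differs from hash(s)

class ParanoidLookup:
    def __init__(self):
        self.buckets = {}              # effective hash -> [(key, value)]

    def _h(self, key):
        return hash2(key) if type(key) is str else hash(key)

    def put(self, key, value):
        self.buckets.setdefault(self._h(key), []).append((key, value))

    def get(self, key):
        for k, v in self.buckets.get(self._h(key), []):
            if k == key:
                return v
        raise KeyError(key)

class A:                               # Alex's wrapper type from above
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return self.s == other
    def __hash__(self):
        return hash(self.s)

p = ParanoidLookup()
p.put("some str that's a key in d", 1)
p.get("some str that's a key in d")    # found: both sides use hash2
# p.get(A("some str that's a key in d")) raises KeyError: the wrapper
# probes the hash() bucket, not the hash2() bucket.
```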
msg151956 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 17:49
On Wed, 2012-01-25 at 12:45 +0000, Dave Malcolm wrote:
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> I've found a bug in my patch; insertdict writes the old non-randomized
> hash value into me_hash at:
>         ep->me_hash = hash;
> rather than using the randomized hash, leading to issues when tested
> against a real attack.

I'm attaching a revised version of the patch that should fix the above
issue:
  hybrid-approach-dmalcolm-2012-01-25-002.patch

Changes relative to -001.patch:
* updated insertdict() so that when it writes ep->me_hash, it uses the
correct hash value.  Unfortunately there doesn't seem to be a good way
of reusing the value we calculated in the "paranoid" ma_lookup
callbacks without breaking ABI (suggestions welcome), so we call
PyObject_RandomizedHash again.
* slightly reworked the two _paranoid ma_lookup callbacks to capture the
randomized hash as a local variable, in case there's a way of reusing it
in insertdict()
* when lookdict() calls into itself, it now calls mp->ma_lookup instead
* don't generate a fatal error with an unknown ma_lookup callback.

With this, I'm able to insert 200,000 non-equal PyUnicodeObject with
hash()==0 into a dict on a 32-bit build --with-pydebug in 2.2 seconds;
it can retrieve all the values correctly in about 4 seconds [compare
with ~1.5 hours of CPU churn for inserting the same data on an optimized
build without the patch on the same guest].

The amortized ratio of total work done per modification increases
linearly when under an O(N^2) attack, and the dict switches itself to
paranoid mode 56 insertions after ma_table stops using ma_smalltable
(that's when we start tracking stats).  After the transition to paranoid
mode, it drops to an average of a little under 2 probes per insertion
(the amortized ratio seems to be converging to about 1.9 probes per key
insertion at the point where my hostile test data runs out).
msg151959 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 18:05
> I'm attaching a revised version of the patch that should fix the above
> issue:
>   hybrid-approach-dmalcolm-2012-01-25-002.patch

It looks like that approach will break any non-builtin type (in either C
or Python) which can compare equal to bytes or str objects. If that's
the case, then I think the likelihood of acceptance is close to zero.

Also, the level of complication is far higher than in any other of the
proposed approaches so far (I mean those with patches), which isn't
really a good thing.

So I'm rather -1 myself on this approach, and would much prefer to
randomize hashes in all conditions.
msg151960 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-25 18:14
On Wed, Jan 25, 2012 at 6:06 AM, Dave Malcolm <dmalcolm@redhat.com>
added the comment:

>  hybrid-approach-dmalcolm-2012-01-25-001.patch

> As per haypo's random-8.patch, a randomization seed is read at startup.

Why not wait until it is needed?  I suspect a lot of scripts will
never need it for any dict, so why add the overhead to startup?

> Once a dict has transitioned to paranoid mode, it isn't using
> PyObject_Hash anymore, and thus isn't using cached object values

The alternative hashes could be stored in an id-keyed dict

msg151961 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-25 18:29
Sorry; hit the wrong key... intended message below:

On Wed, Jan 25, 2012 at 6:06 AM, Dave Malcolm <dmalcolm@redhat.com>
added the comment:

[lots of good stuff]

>  hybrid-approach-dmalcolm-2012-01-25-001.patch

> As per haypo's random-8.patch, a randomization seed is read at
> startup.

Why not wait until it is needed?  I suspect a lot of scripts will
never need it for any dict, so why add the overhead to startup?

> Once a dict has transitioned to paranoid mode, it isn't using
> PyObject_Hash anymore, and thus isn't using cached object values

The alternative hashes could be stored in an id-keyed
WeakKeyDictionary; that would handle at least the normal case of using
exactly the same string for the lookup.

> Note that if a paranoid dict goes small again
> (ma_table == ma_smalltable), it stays paranoid.

As I read it, that couldn't happen, because paranoid dicts couldn't
shrink at all.  (Not letting them shrink beneath 2*PyDict_MINSIZE does
seem like a reasonable solution.)

Additional TODOs...

The checks for Unicode and Dict should not be exact; it is OK to apply
them to a subclass so long as it uses the same lookdict (and, for
unicode, the same __eq__).

Additional small strings should be excluded from the new hash, to
avoid giving away the secret.  At a minimum, single-char strings
should be excluded, and I would prefer to exclude all strings of
length <= N (where N defaults to 4).
msg151965 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-25 19:04
On Wed, Jan 25, 2012 at 1:05 PM,  Antoine Pitrou <pitrou@free.fr>
added the comment:

> It looks like that approach will break any non-builtin type (in either C
> or Python) which can compare equal to bytes or str objects. If that's
> the case, then I think the likelihood of acceptance is close to zero.

(1)  Isn't that true of *any* patch that changes hashing?  (Thus the
PYTHONHASHSEED=0 escape hatch.)

(2)  I think it would still work for the lookdict_string (or
lookdict_unicode) case ... which is the normal case, and also where
most vulnerabilities should appear.

(3)  If the alternate hash is needed for non-string keys, there is no
perfect resolution, but I suppose you could get closer with

    if obj == str(obj):
        newhash=hash(str(obj))
msg151966 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 19:13
> Jim Jewett <jimjjewett@gmail.com> added the comment:
> 
> On Wed, Jan 25, 2012 at 1:05 PM,  Antoine Pitrou <pitrou@free.fr>
> added the comment:
> 
> > It looks like that approach will break any non-builtin type (in either C
> > or Python) which can compare equal to bytes or str objects. If that's
> > the case, then I think the likelihood of acceptance is close to zero.
> 
> (1)  Isn't that true of *any* patch that changes hashing?  (Thus the
> PYTHONHASHSEED=0 escape hatch.)

If a third-party type wants to compare equal to bytes or str objects,
its __hash__ method will usually end up calling hash() on the equivalent
bytes/str object. That's especially true for Python types (I don't think
anyone wants to re-implement a slow str-alike hash in pure Python).

> (2)  I think it would still work for the lookdict_string (or
> lookdict_unicode) case ... which is the normal case, and also where
> most vulnerabilities should appear.

It would probably still work indeed.

> (3)  If the alternate hash is needed for non-string keys, there is no
> perfect resolution, but I suppose you could get closer with
> 
>     if obj == str(obj):
>         newhash=hash(str(obj))

That may be slowing down things quite a bit. It looks correct though.
msg151967 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 19:19
On Wed, 2012-01-25 at 18:05 +0000, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> > I'm attaching a revised version of the patch that should fix the above
> > issue:
> >   hybrid-approach-dmalcolm-2012-01-25-002.patch
> 
> It looks like that approach will break any non-builtin type (in either C
> or Python) which can compare equal to bytes or str objects. If that's
> the case, then I think the likelihood of acceptance is close to zero.

How?

> Also, the level of complication is far higher than in any other of the
> proposed approaches so far (I mean those with patches), which isn't
> really a good thing.

So would I.  I want something I can backport, though.
msg151970 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 19:28
Le mercredi 25 janvier 2012 à 19:19 +0000, Dave Malcolm a écrit :
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> On Wed, 2012-01-25 at 18:05 +0000, Antoine Pitrou wrote:
> > Antoine Pitrou <pitrou@free.fr> added the comment:
> > 
> > > I'm attaching a revised version of the patch that should fix the above
> > > issue:
> > >   hybrid-approach-dmalcolm-2012-01-25-002.patch
> > 
> > It looks like that approach will break any non-builtin type (in either C
> > or Python) which can compare equal to bytes or str objects. If that's
> > the case, then I think the likelihood of acceptance is close to zero.
> 
> How?

This kind of type, for example:

class C:
    def __hash__(self):
        return hash(self._real_str)

    def __eq__(self, other):
        if isinstance(other, C):
            other = other._real_str
        return self._real_str == other

If I'm not mistaken, looking up C("abc") will stop matching "abc" when
there are too many collisions in one of your dicts.

> > Also, the level of complication is far higher than in any other of the
> > proposed approaches so far (I mean those with patches), which isn't
> > really a good thing.
> 
> So would I.  I want something I can backport, though.

Well, your approach sounds like it subtly and unpredictably changes the
behaviour of dicts when there are too many collisions, so I'm not sure
it's a good idea to backport it, either.

If we don't want to backport full hash randomization, I think I much
prefer raising a BaseException when there are too many collisions,
rather than this kind of (excessively) sophisticated workaround. You
*are* changing a fundamental datatype in a rather complicated way.
msg151973 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-25 20:23
I think you're right: it will stop matching it during lookup within such
a dict, since the dict will be using the secondary hash for "abc", but
hash() for the C instance.

It will still match outside of the dict, and within other dicts.

So yes, this would be a subtle semantic change when under attack.
Bother.

Having said that, note that within the typical attack scenarios (HTTP
headers, HTTP POST, XML-RPC, JSON), we have a pure-str dict (or
sometimes a pure-bytes dict).  Potentially I could come up with a patch
that only performs this change for such a case (pure-str is easier,
given that we already track this), which would avoid the semantic change
you identify, whilst still providing protection against a wide range of
attacks.

Is it worth me working on this?

> > > Also, the level of complication is far higher than in any other of the
> > > proposed approaches so far (I mean those with patches), which isn't
> > > really a good thing.
> > 
> > So would I.  I want something I can backport, though.
> 
> Well, your approach sounds like it subtly and unpredictably changes the
> behaviour of dicts when there are too many collisions, so I'm not sure
> it's a good idea to backport it, either.
> 
> If we don't want to backport full hash randomization, I think I much
> prefer raising a BaseException when there are too many collisions,
> rather than this kind of (excessively) sophisticated workaround. You
> *are* changing a fundamental datatype in a rather complicated way.

Well, each approach has pros and cons, and we've circled around between
hash randomization vs collision counting vs other approaches for several
weeks.  I've supplied patches for 3 different approaches.

Is this discussion likely to reach a conclusion soon?  Would it be
regarded as rude if I unilaterally ship something close to:
  backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
in RHEL/Fedora, so that my users have some protection they can enable if
they get attacked? (see http://bugs.python.org/msg151847).  If I do
this, I can post the patches here in case other distributors want to
apply them.

As for python.org, who is empowered to make a decision here?  How can we
move this forward?
msg151977 - (view) Author: Frank Sievertsen (fx5) Date: 2012-01-25 21:34
For the sake of completeness:
Collision-counting (with Exception) has interesting effects, too.

>>> d={((1<<(65+i))-2**(i+4)): 9 for i in range(1001)}
>>> for i in list(d): 
...  del d[i]

>>> d
{}
>>> 9 in d
False
>>> 0 in d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many slot collisions'
>>> d[9] = 1
>>> d
{9: 1}
>>> d == {0: 1}
False
>>> {0: 1} == d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'too many slot collisions'
msg151984 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-25 23:14
> I think you're right: it will stop matching it during lookup within such
> a dict, since the dict will be using the secondary hash for "abc", but
> hash() for the C instance.
> 
> It will still match outside of the dict, and within other dicts.
> 
> So yes, this would be a subtle semantic change when under attack.
> Bother.

Hmm, you're right, perhaps it's not as important as I thought.

By the way, have you run benchmarks on some of your patches?

> Is this discussion likely to reach a conclusion soon?  Would it be
> regarded as rude if I unilaterally ship something close to:
>   backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
> in RHEL/Fedora, so that my users have some protection they can enable if
> they get attacked?

I don't think Fedora shipping its own patches can be considered "rude"
by anyone else than its users. And deciding what is best for your users
is indeed your job as a distro maintainer, not python-dev's.

> As for python.org, who is empowered to make a decision here?  How can we
> move this forward?

I don't know. Guido is empowered if he wants to make a pronouncement.
Otherwise, we have the following data points:

- hash randomization is generally considered the cleanest solution
- hash randomization cannot be enabled by default in bugfix, let alone
security releases
- collision counting can mitigate some of the attacks, although it can
have weaknesses (see Frank's emails) and it comes with its own problems
(breaking the program "later on")

So I'd suggest the following course of action:
- ship and enable some form of collision counting on bugfix and security
releases
- ship and enable hash randomization in 3.3
msg152030 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 21:00
I'd like to propose an entirely different approach: use AVL trees for colliding strings, for dictionaries containing only unicode or byte strings.

A prototype for this is in http://hg.python.org/sandbox/loewis/branches#avl
It is not fully working yet, but I'm now confident that this is a feasible approach.

It has the following advantages over the alternatives:
- performance in case of collisions is O(log(N)), where N is the number of colliding keys
- no new exceptions are raised, except for MemoryError if it runs out of memory for allocating nodes in the tree
- the hash values do not change
- the dictionary order does not change as long as no keys collide on hash values (which for all practical purposes should mean that the dictionary order does not change in all places where it matters)
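The heart of the idea can be sketched in Python, using `bisect` over a sorted list as a stand-in for the AVL tree (the actual prototype is C code in dictobject.c; all names below are illustrative):

```python
import bisect

class CollidingBucket:
    """Holds keys that collide on hash(); a find costs O(log N) comparisons.

    A sorted list plus bisect stands in for the AVL tree here: only the
    search cost models the tree (list.insert itself shifts elements in
    O(N)).  It requires a total order on the keys, which is why the
    prototype restricts itself to dicts containing only str/bytes keys.
    """
    def __init__(self):
        self.keys = []         # kept sorted
        self.values = []

    def insert(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # replace existing entry
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def find(self, key):
        i = bisect.bisect_left(self.keys, key)  # O(log N) comparisons
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        raise KeyError(key)
```

So even if an attacker supplies N keys with identical hash values, each lookup does logarithmically many key comparisons instead of walking all N colliding entries.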
msg152033 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-26 21:04
On Thu, Jan 26, 2012 at 4:00 PM, Martin v. Löwis <report@bugs.python.org> wrote:

> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> I'd like to propose an entirely different approach: use AVL trees for
> colliding strings, for dictionaries containing only unicode or byte strings.
> [...]

Martin,

What happens if, instead of putting strings in a dictionary directly, I
have them wrapped in something.  For example, the classes Antoine and I
pasted early.  These define hash and equal as being strings, but don't have
an ordering.

Alex
msg152037 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-26 22:13
On Thu, 2012-01-26 at 21:04 +0000, Alex Gaynor wrote:
> Alex Gaynor <alex.gaynor@gmail.com> added the comment:
> 
> [Martin's AVL proposal quoted in full...]
> 
> Martin,
> 
> What happens if, instead of putting strings in a dictionary directly, I
> have them wrapped in something.  For example, the classes Antoine and I
> pasted early.  These define hash and equal as being strings, but don't have
> an ordering.

[Obviously I'm not Martin, but his idea really interests me]

Looking at:
http://hg.python.org/sandbox/loewis/file/58be269aa0b1/Objects/dictobject.c#l517
as soon as any key insertion or lookup occurs where the key isn't
exactly one of the correct types, the dict flattens any AVL trees back
into the regular flat representation (and switches to lookdict for
ma_lookup), analogous to the existing ma_lookup transition on dicts.

From my reading of the code, if you have a dict purely of bytes/str,
collisions on a hash value lead to the PyDictEntry's me_key being set to
an AVL tree.  All users of the ma_lookup callback within dictobject.c
check to see if they're getting such PyDictEntry back.  If they are,
they call into the tree, which leads to TREE_FIND(), TREE_INSERT() and
TREE_DELETE() invocations as appropriate; ultimately, the AVL macros
call back to within node_cmp():
   PyObject_Compare(left->key, right->key)

[Martin, I'm sorry if I got this wrong]

So *if* I'm reading the code correctly, it might be possible to
generalize it from {str, bytes} to any set of types within which
ordering and equality checking of instances from any type is "sane",
which loosely, would seem to be: that we can reliably compare all
objects from any type within the set, so that we can use the comparisons
to perform a search to hone in on a pair of keys that compare as
"equal", without any chance of raising exceptions, or missing a valid
chance for two objects to be equal etc.

I suspect that you can't plug arbitrary user-defined types into it,
since there's no way to guarantee that ordering and comparison work in
the ways that the AVL lookup requires.

But I could be misreading Martin's code.  [thinking aloud, if a pair of
objects don't implement comparison at the PyObject_Compare level, is it
possible to instead simply compare the addresses of the objects?  I
don't think so, since you have a custom equality implementation in your
UDT, but maybe I've missed something?]

Going higher-level, I feel that there are plenty of attacks against
pure-str/bytes dicts, and having protection against them is worthwhile -
even if there's no direct way to use it to protect the use-case you
describe.

Hope this is helpful
Dave
msg152039 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 22:34
> as soon as any key insertion or lookup occurs where the key isn't
> exactly one of the correct types, the dict flattens any AVL trees back
> into the regular flat representation (and switches to lookdict for
> ma_lookup), analogous to the existing ma_lookup transition on dicts.

Correct.

> TREE_DELETE() invocations as appropriate; ultimately, the AVL macros
> call back to within node_cmp():
>    PyObject_Compare(left->key, right->key)

Correct.

> I suspect that you can't plug arbitrary user-defined types into it,
> since there's no way to guarantee that ordering and comparison work in
> the ways that the AVL lookup requires.

That's all true. It would be desirable to automatically determine which
types also support total order in addition to hashing, alas, there is
currently no protocol for it. On the contrary, Python has moved away
of assuming that all objects form a total order.

> [thinking aloud, if a pair of
> objects don't implement comparison at the PyObject_Compare level, is it
> possible to instead simply compare the addresses of the objects?

2.x has an elaborate logic to provide a total order on objects. It
took the entire 1.x and 2.x series to fix issues with that order, only
to recognize that it is not feasible to provide one - hence the introduction
of rich comparisons and the omission of cmp in 3.x.

For the dictionary, using addresses does not work: the order of objects
needs to be consistent with equality (i.e. x < y and x == y must not
hold simultaneously, and x == y and y < z must imply x < z,
else the tree lookup won't find the equal keys).
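A small illustration of that consistency requirement (plain Python, nothing patch-specific):

```python
# Why an address-based order can't back the tree lookup: the order must
# agree with ==, but two distinct objects (at distinct addresses) can
# compare equal.
a = "".join(["x"] * 1000)   # built at runtime so it isn't shared/interned
b = "".join(["x"] * 1000)
assert a == b               # equal keys...
assert a is not b           # ...at different addresses, id(a) != id(b)
# A tree ordered by id() would route a and b to different positions,
# so a lookup for b could miss an entry stored under a.
```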

> Going higher-level, I feel that there are plenty of attacks against
> pure-str/bytes dicts, and having protection against them is worthwhile -
> even if there's no direct way to use it to protect the use-case you
> describe.

Indeed. This issue doesn't need to fix *all* possible attacks using
hash collisions. Instead, it needs to cover the common case, and it
needs to allow users to rewrite their code so that they can protect
it against this family of attacks.
msg152040 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 22:42
> What happens if, instead of putting strings in a dictionary directly, I
> have them wrapped in something.  For example, the classes Antoine and I
> pasted early.  These define hash and equal as being strings, but don't have
> an ordering.

As Dave has analysed: the dictionary falls back to the current implementation.
So wrt. your question "Is it still able to find the value?", the answer is

Yes, certainly. It's fully backwards compatible, with the limitation
in msg152030 (i.e. the dictionary order may change for dictionaries with
string keys colliding in their hash() values).
msg152041 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-26 22:43
On Thu, Jan 26, 2012 at 5:42 PM, Martin v. Löwis <report@bugs.python.org> wrote:

>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > What happens if, instead of putting strings in a dictionary directly, I
> > have them wrapped in something.  For example, the classes Antoine and I
> > pasted early.  These define hash and equal as being strings, but don't
> have
> > an ordering.
>
> As Dave has analysed: the dictionary falls back to the current
> implementation.
> So wrt. your question "Is it still able to find the value?", the answer is
>
> Yes, certainly. It's fully backwards compatible, with the limitation
> in msg152030 (i.e. the dictionary order may change for dictionaries with
> string keys colliding in their hash() values).
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue13703>
> _______________________________________
>

But using non-__builtin__.str objects (such as UserString) would expose the
user to an attack?
msg152043 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 23:03
> But using non-__builtin__.str objects (such as UserString) would expose the
> user to an attack?

Not necessarily: only if they use these strings as dictionary keys, and only
if they do so in contexts where arbitrary user input is consumed. In these
cases, users need to rewrite their code to replace the keys. Using dictionary
wrappers (such as UserDict), this is possible using only local changes.
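A minimal sketch of the kind of local change Martin describes (the `SaltedKey`/`SaltedDict` names are hypothetical, not from any attached patch): wrap each string key in an object whose hash mixes in a per-process random salt, so an attacker cannot precompute colliding keys.

```python
import os
from collections import UserDict

_SALT = os.urandom(8)  # fresh per process; unknown to an attacker

class SaltedKey:
    """Wraps a str so its hash depends on the per-process salt."""
    __slots__ = ("s",)
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        return isinstance(other, SaltedKey) and self.s == other.s
    def __hash__(self):
        return hash(_SALT + self.s.encode("utf-8"))

class SaltedDict(UserDict):
    """UserDict wrapper that salts every string key on the way in."""
    def __setitem__(self, key, value):
        self.data[SaltedKey(key)] = value
    def __getitem__(self, key):
        return self.data[SaltedKey(key)]
    def __delitem__(self, key):
        del self.data[SaltedKey(key)]
    def __contains__(self, key):
        return SaltedKey(key) in self.data
```

Only the code that creates the dictionary changes; everything else keeps using plain strings as keys.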
msg152046 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-26 23:22
I'm sorry then, but I'm a little confused.  I think we pretty clearly
established earlier that requiring users to make changes anywhere they
stored user data would be dangerous, because these locations are often in
libraries or other places where the code creating and modifying the
dictionary has no idea that user data is in it.

The proposed AVL solution fails if it requires users to fundamentally
restructure their data depending on its origin.

We have a solution that is known to work in all cases: hash randomization.
 There were three discussed issues with it:

a) Code assuming a stable ordering to dictionaries
b) Code assuming hashes were stable across runs.
c) Code reimplementing the hashing algorithm of a core datatype that is now
randomized.

I don't think any of these are realistic issues the way "doesn't protect
all cases" is.  (a) was never a documented or intended property; indeed it
breaks all the time if you insert keys in the wrong order, use a different
platform, or anything else changes.  (b) For the same reasons, code
relying on (b) only worked if you didn't change anything, and in practice
I'm convinced neither of these was common (if it ever existed).  Finally (c),
while it's a concern, I've reviewed Django, SQLAlchemy, PyPy, and the
stdlib: there is only one place where compatibility with a core-hash is
attempted, decimal.Decimal.
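The coupling behind (c) is easy to see on a modern interpreter: Python requires equal numbers to hash equally across numeric types, so `decimal.Decimal` reproduces the core numeric hash in its own `__hash__` rather than inventing one, which is exactly why code mirroring a core hash algorithm is sensitive to any change in it:

```python
from decimal import Decimal

# Equal numbers must hash equally across types, so Decimal re-implements
# the interpreter's numeric hash; changing the core algorithm without
# updating Decimal would silently break dict lookups mixing the types.
assert Decimal("1.5") == 1.5 and hash(Decimal("1.5")) == hash(1.5)
assert hash(Decimal(42)) == hash(42)
```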

In summary, I think the case against hash-randomization has been seriously
overstated, and in no way is more dangerous than having a solution that
fails to solve the problem comprehensively.  Further, I think it is
imperative that we reach a consensus on this quickly, as the only reason
this hasn't been widely exploited yet is the lack of availability of the
data; once it becomes available I firmly expect just about every high
profile Python site on the internet (of which there are many) to be
attacked.

On Thu, Jan 26, 2012 at 6:03 PM, Martin v. Löwis <report@bugs.python.org> wrote:

>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > But using non-__builtin__.str objects (such as UserString) would expose
> the
> > user to an attack?
>
> Not necessarily: only if they use these strings as dictionary keys, and
> only
> if they do so in contexts where arbitrary user input is consumed. In these
> cases, users need to rewrite their code to replace the keys. Using
> dictionary
> wrappers (such as UserDict), this is possible using only local changes.
msg152051 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-26 23:43
> I'm sorry then, but I'm a little confused.  I think we pretty clearly
> established earlier that requiring users to make changes anywhere they
> stored user data would be dangerous, because these locations are often in
> libraries or other places where the code creating and modifying the
> dictionary has no idea it's user data in it.

I don't consider that established for the specific case of string-like
objects. Users can easily determine whether they use string-like objects,
and if so, in what places, and what data gets put into them.

> The proposed AVL solution fails if it requires users to fundamentally
> restructure their data depending on it's origin.

It doesn't fail at all. Users don't *have* to restructure their code,
let alone fundamentally. Their code may currently be vulnerable, yet
not use string-like objects at all. With the proposed solution, such
code will be fixed for good.

It's true that the solution does not fix all cases of the vulnerability,
but neither does any other proposed solution.

> We have a solution that is known to work in all cases: hash randomization.

Well, you *believe* that it fixes the problem, even though it actually
may not, assuming an attacker can somehow reproduce the hash function.

>  There were three discussed issues with it:
>
> a) Code assuming a stable ordering to dictionaries
> b) Code assuming hashes were stable across runs.
> c) Code reimplementing the hashing algorithm of a core datatype that is now
> randomized.
>
> I don't think any of these are realistic issues

I'm fairly certain that code will break in massive ways, despite any
argumentation that it should not. The question really is

Do we break code in a massive way, or do we fix the vulnerability
for most users with no code breakage?

I clearly value compatibility much higher than 100% protection against
a DoS-style attack (against which many other forms of protection are
also available).

> (a) was never a documented or intended property; indeed it
> breaks all the time if you insert keys in the wrong order, use a different
> platform, or anything else changes.

Still, a lot of code relies on dictionary order, and successfully so,
in practice. Practicality beats purity.

> (b) For the same reasons code
> relying on (b) only worked if you didn't change anything

That's not true. You cannot practically change the way string hashing works
other than by changing the interpreter source. Hashes *are* currently stable
across runs.

> and in practice I'm convinced neither of these were common (if ever existed).

Are you willing to bet the trust people have in Python's bug fix policies
on that? I'm not.

> In summary, I think the case against hash-randomization has been seriously
> overstated, and in no way is more dangerous than having a solution that
> fails to solve the problem comprehensively.  Further, I think it is
> imperative that we reach a consensus on this quickly

Well, I cannot be part of a consensus that involves massive code breakage
in a bug fix release. Lacking consensus, either the release managers or
the BDFL will have to pronounce.
msg152057 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-27 01:19
> >  There were three discussed issues with it:
> >
> > a) Code assuming a stable ordering to dictionaries
> > b) Code assuming hashes were stable across runs.
> > c) Code reimplementing the hashing algorithm of a core datatype that is now
> > randomized.
> >
> > I don't think any of these are realistic issues
> 
> I'm fairly certain that code will break in massive ways, despite any
> argumentation that it should not. The question really is
> 
> Do we break code in a massive way, or do we fix the vulnerability
> for most users with no code breakage?
> 
> I clearly value compatibility much higher than 100% protection against
> a DoS-style attack (against which many other forms of protection are
> also available).

If I read your patch correctly, collisions will produce additional
allocations of one distinct PyObject (i.e. AVL node) per colliding key.
That's a pretty massive change in memory consumption for string dicts
(and also in memory fragmentation and cache friendliness, probably). The
performance effect in most situations is likely to be negative too,
despite the better worst-case complexity.

IMO that would be a rather controversial change for a feature release,
let alone a bugfix or security release.

It would be nice to have the release managers' opinions on this issue.
msg152060 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 02:26
> If I read your patch correctly, collisions will produce additional
> allocations of one distinct PyObject (i.e. AVL node) per colliding key.

That's correct.

> That's a pretty massive change in memory consumption for string dicts
> (and also in memory fragmentation and cache friendliness, probably).

That's not correct. It's not a massive change, as colliding hash values
never happen in practice, unless you are being attacked, in which case it
will be one additional PyObject for the set of all colliding keys (i.e.
one object per possibly hundreds of string objects). Even including the
nodes of the tree (one per colliding node) is IMO a moderate increase
in memory usage, in order to solve the vulnerability.

It also doesn't impact memory fragmentation badly, as these objects
are allocated using the Python small objects allocator.

> The
> performance effect in most situations is likely to be negative too,
> despite the better worst-case complexity.

Compared to the status quo? Hardly. In all practical applications,
collisions never happen, so none of the extra code is ever executed -
except for AVL_Check invocations, which are a plain pointer
comparison.

> IMO that would be a rather controversial change for a feature release,
> let alone a bugfix or security release.

Apparently so, but it's not clear to me why that is. That change meets
all criteria of a security fix release nicely, as opposed to the proposed
changes to the hash function, which break existing code.
msg152066 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-27 06:25
>> But using non-__builtin__.str objects (such as UserString) would expose the
>> user to an attack?
>
> Not necessarily: only if they use these strings as dictionary keys, and only
> if they do so in contexts where arbitrary user input is consumed. In these
> cases, users need to rewrite their code to replace the keys. Using dictionary
> wrappers (such as UserDict), this is possible using only local changes.

Could the AVL tree approach be extended to apply to dictionaries
containing keys of any single type that supports comparison?  That
approach would autodetect UserString or similar and support it
properly.

I expect dictionaries with keys of more than one type to be very
rare, and highly unlikely when the values are generated directly
from user input.

(and on top of all of this I believe we're all settled on having per
interpreter hash randomization _as well_ in 3.3; but this AVL tree
approach is one nice option for a backport to fix the major
vulnerability)

-gps
msg152070 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 08:42
> Could the AVL tree approach be extended to apply to dictionaries
> containing keys of any single type that supports comparison?  That
> approach would autodetect UserString or similar and support it
> properly.

I think we would need a place to store the single key type, which,
from an ABI point of view, might be difficult to find (but we could
overload ma_smalltable for that, or reserve ma_table[0]).

In addition, I think it is difficult to determine whether a type
supports comparison, at least in 2.x. For example,

class X:
  def __eq__(self, o):
    return self.a == o.a

makes it possible to create objects x, y and z so that x<y<z, yet x==z.

For 3.x, we could assume that a failure to support comparison
raises an exception, in which case we could just wait for the
exception to happen, and then flatten the dictionary and start
over with the lookup. This would extend even to mixed key
types.
msg152104 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-27 17:45
On Thu, Jan 26, 2012 at 8:19 PM, Antoine Pitrou <report@bugs.python.org> wrote:

> If I read your [Martin v. Löwis' ] patch correctly, collisions will
> produce additional allocations ... That's a pretty massive
> change in memory consumption for string dicts

Not in practice.

The point I first missed is that this triggers only when the hash is
*fully* equal; if the hashes are merely equal after masking, then
today's try-another-slot approach will still be used, even for
strings.

Per ( http://bugs.python.org/issue13703#msg151850 ) Marc-Andre
Lemburg's measurements, full-hash equality explains only 1 in 10,000
collisions.  From a performance standpoint, we can almost ignore a
case that rare; it is almost certainly dwarfed by resizing.

I *am* a bit concerned that the possible contents of a dictentry
change; this could cause easily-missed-in-testing breakage for
anything that treats table as an array.  That said, it doesn't seem
much worse than the search finger, and there seemed to be recent
consensus that even promising an exact dict -- subclasses not allowed
-- didn't mean that direct access was sanctioned.  So it still seems
safer than changing the de-facto iteration order.
msg152112 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 19:32
> I *am* a bit concerned that the possible contents of a dictentry
> change; this could cause easily-missed-in-testing breakage for
> anything that treats table as an array.

This is indeed a concern: the new code needs to be exercised.
I came up with a Py_REDUCE_HASH #define; if set, the dict will only
use the lowest 3 bits of the hash, producing plenty collisions.
In that mode, the branch currently doesn't work at all due to the
remaining bugs.
msg152117 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-27 20:25
[Martin's approach]
> The point I first missed is that this triggers only when the hash is
> *fully* equal; if the hashes are merely equal after masking, then
> today's try-another-slot approach will still be used, even for
> strings.

But then isn't it vulnerable to Frank's first attack as exposed in
http://mail.python.org/pipermail/python-dev/2012-January/115726.html ?
msg152118 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-27 21:02
> But then isn't it vulnerable to Frank's first attack as exposed in
> http://mail.python.org/pipermail/python-dev/2012-January/115726.html ?

It would be, yes. That's sad.

That could be fixed by indeed creating trees in all cases (i.e. moving
away from open addressing altogether). The memory consumption does not worry
me here; however, dictionary order will change in more cases.

Compatibility could be restored by introducing a threshold for
tree creation: if insertion visits more than N slots, go back to the
original slot and put a tree there. I'd expect that N could be small,
e.g. N==4. Lookup would then have to consider all AVL trees along the
chain of visited slots, but ISTM it could also stop after visiting N
slots.
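Martin's threshold idea can be rendered as a toy (linear probing for brevity; CPython's real probe sequence is perturbed, and the actual tree creation is elided here — `"tree"` is just a stand-in for "put an AVL tree at the original slot"):

```python
N = 4  # max slots visited before falling back to a tree

def insert(table, key, value, hash_fn=hash):
    """Open-addressing insert that gives up after N probes (sketch)."""
    mask = len(table) - 1
    start = hash_fn(key) & mask
    for probe in range(N):
        slot = (start + probe) & mask
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return ("slot", slot)
    # More than N slots visited: go back to the original slot and
    # (in the real scheme) put an AVL tree there instead.
    return ("tree", start)

table = [None] * 8
# Force every key onto slot 0 to mimic a collision attack:
for i, k in enumerate(["a", "b", "c", "d"]):
    assert insert(table, k, i, hash_fn=lambda s: 0) == ("slot", i)
assert insert(table, "e", 4, hash_fn=lambda s: 0) == ("tree", 0)
```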
msg152125 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-27 21:42
On Fri, 2012-01-27 at 21:02 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> > But then isn't it vulnerable to Frank's first attack as exposed in
> > http://mail.python.org/pipermail/python-dev/2012-January/115726.html ?
> 
> It would be, yes. That's sad.
> 
> That could be fixed by indeed creating trees in all cases (i.e. moving
> away from open addressing altogether). The memory consumption does not worry
> me here; however, dictionary order will change in more cases.
> 
> Compatibility could be restored by introducing a threshold for
> tree creation: if insertion visits more than N slots, go back to the
> original slot and put a tree there. I'd expect that N could be small,
> e.g. N==4. Lookup would then have to consider all AVL trees along the
> chain of visited slots, but ISTM it could also stop after visiting N
> slots.

Perhaps we could combine my attack-detection code from 
  http://bugs.python.org/issue13703#msg151714
with Martin's AVL approach?  Use the ma_smalltable to track stats, and
when a dict detects that it's under attack,  *if* all the keys are
AVL-compatible, it could transition to full-AVL mode.  [I believe that
my patch successfully handles both of Frank's attacks, but I don't have
the test data - I'd be very grateful to receive a copy (securely)].

[See hybrid-approach-dmalcolm-2012-01-25-002.patch for the latest
version of attack-detection; I'm working on a rewrite in which I
restrict it to working just on pure-str dicts.  With that idea, when a
dict detects that it's under attack, *if* all the keys satisfy this
condition
  (new_hash(keyA) == new_hash(keyB)) iff (hash(keyA) == hash(keyB))
then all hash values get recalculated using new_hash (which is
randomized), which should offer protection in many common attack
scenarios, without the semantic change Alex and Antoine indicated]
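Dave's condition says the randomized hash must induce exactly the same collision groups as the original. A direct, naive check of that condition can be sketched like this (hypothetical helper, not from any attached patch):

```python
def same_collision_structure(keys, old_hash, new_hash):
    """True iff old_hash and new_hash partition `keys` identically,
    i.e. new_hash(a) == new_hash(b) exactly when old_hash(a) == old_hash(b)."""
    def partition(h):
        groups = {}
        for k in keys:
            groups.setdefault(h(k), set()).add(k)
        return {frozenset(g) for g in groups.values()}
    return partition(old_hash) == partition(new_hash)

# Same structure: both hashes group these strings purely by length.
assert same_collision_structure(["a", "bb", "cc"], len, lambda s: 2 * len(s))
# Different structure: a constant hash merges previously distinct groups.
assert not same_collision_structure(["a", "bb", "cc"], len, lambda s: 0)
```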
msg152146 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-28 03:03
For the record, Barry and I agreed on what we'll be doing for stable releases [1]. David says he should have a patch soon.

[1] http://mail.python.org/pipermail/python-dev/2012-January/115892.html
msg152149 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-28 05:13
On Sat, 2012-01-28 at 03:03 +0000, Benjamin Peterson wrote:
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> For the record, Barry and I agreed on what we'll be doing for stable releases [1]. David says he should have a patch soon.
> 
> [1] http://mail.python.org/pipermail/python-dev/2012-January/115892.html

I'm attaching what I've got so far (need sleep).

Attached patch is for 3.1 and adds opt-in hash randomization.

It's based on haypo's work: random-8.patch (thanks haypo!), with
additional changes as seen in my backport of that to 2.7:
http://bugs.python.org/issue13703#msg151847

* The randomization is off by default, and must be enabled by setting
a new environment variable PYTHONHASHRANDOMIZATION to a non-empty
string. (if so then, PYTHONHASHSEED also still works, if provided, in
the same way as in haypo's patch)

* All of the various "Py_hash_t" become "long" again (Py_hash_t was
added in 3.2: issue9778)

* I expanded the randomization from just PyUnicodeObject to also cover
PyBytesObject, and the types within datetime.

* It doesn't cover numeric types; see my explanation in msg151847; also
see http://bugs.python.org/issue13703#msg151870

* It doesn't yet cover the embedded copy of expat.

* I moved the hash tests from test_unicode.py to test_hash.py

* I tweaked the wording of the descriptions of the envvars in
cmdline.rst and the manpage

* I've tested it on a 32-bit box, and it successfully protects against
one set of test data (four cases: assembling then reading back items by
key for a dict vs set, bytes vs str, with 200000 distinct items of data
which all have hash() == 0 in unmodified build; each takes about 1.5
seconds on this --with-pydebug build, vs of the order of hours).

* I haven't yet benchmarked it

* Only tested on Linux (Fedora x86_64 and i686).  I don't know the
impact on windows (e.g. startup time without the envvar vs with the env
vars).
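(As it turned out, released CPython settled on PYTHONHASHSEED alone, with randomization on by default from 3.3.) The reproducibility half of the design is easy to check from a modern interpreter; a quick sketch:

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(s, seed):
    """hash(s) as computed by a new interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", s],
        env=env)
    return int(out)

# A fixed seed makes string hashes reproducible across runs...
assert hash_in_fresh_interpreter("collide", "42") == \
       hash_in_fresh_interpreter("collide", "42")
# ...whereas PYTHONHASHSEED=random (or leaving it unset on 3.3+) re-seeds
# each process, which is what defeats precomputed collision sets.
```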

I'm seeing one failing test:
======================================================================
FAIL: test_clear_dict_in_ref_cycle (__main__.ModuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File
"/home/david/coding/python-hg/cpython-3.1-hash-randomization/Lib/test/test_module.py", line 79, in test_clear_dict_in_ref_cycle
    self.assertEqual(destroyed, [1])
AssertionError: Lists differ: [] != [1]
msg152183 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-28 19:26
This turns out to pass without PYTHONHASHRANDOMIZATION in the
environment, and fail intermittently with it.

Note that "make test" invokes the built python with "-E", so that it
ignores the setting of PYTHONHASHRANDOMIZATION in the environment.

Barry, Benjamin: does fixing this bug require getting the full test
suite to pass with randomization enabled (and fixing the intermittent
failures due to ordering issues), or is it acceptable to "merely" have
full passes without randomizing the hashes?

What do the buildbots do?
msg152186 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-01-28 20:05
I think we don't need to mess with tests in 2.6/3.1, but everything should pass under 2.7 and 3.2.
msg152199 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-28 23:14
On Sat, 2012-01-28 at 20:05 +0000, Benjamin Peterson wrote:
> Benjamin Peterson <benjamin@python.org> added the comment:
> 
> I think we don't need to mess with tests in 2.6/3.1, but everything should pass under 2.7 and 3.2.

New version of the patch for 3.1
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch

This version adds a command-line flag to enable hash-randomization: -R
(given that the -E flag disables env vars and thus disables
PYTHONHASHRANDOMIZATION). See [1] below

[Is there a convenient way to check the length of the usage messages in
Modules/main.c?  I see this comment:
   /* Long usage message, split into parts < 512 bytes */ ]

I reworded the documentation somewhat based on input from Barry and
Antoine.

Also adds a NEWS item.

Passes "make test" on this x86_64 Fedora 15 box, --with-pydebug, though
that's without randomization enabled (it just does it within individual
test cases that explicitly enable it).

No performance testing done yet (though hopefully similar to that of
Victor's patch; see msg151078)

No idea of the impact on Windows users (I don't have a windows dev box).
It still has the stuff from Victor's patch described in msg151158.

How is this looking?
Dave

[1] IRC transcript concerning "-R" follows:
<__ap__> dmalcolm: IMO it would be simpler if there was only one env var
(preferably not too clumsy to type)
<__ap__> also, despite being neither barry nor gutworth, I think the
test suite *should* pass with randomized hashes
<__ap__> :)
<dmalcolm> :)
<__ap__> also the failure you're having is a bit worrying, since
apparently it's not about dict ordering
<dmalcolm> PYTHONHASHSEED exists mostly for selftesting (also for
compat, if you absolutely need to reproduce a specific random dict
ordering)
<__ap__> ok
<__ap__> if -E suppresses hash randomization, I think we should also add
a command-line flag
<__ap__> -R seems untaken
<__ap__> also it'll make things easier for Windows users, I think
msg152200 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-28 23:24
> Passes "make test" on this x86_64 Fedora 15 box, --with-pydebug, though
> that's without randomization enabled (it just does it within individual
> test cases that explicitly enable it).

I think you should check with randomization enabled, if only to see the
nature of the failures and if they are expected.
msg152203 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-01-28 23:56
> I think you should check with randomization enabled, if only to see the
> nature of the failures and if they are expected.

Including the list of when-enabled expected failures in the release 
notes would help those who compile and test.
msg152204 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-29 00:06
On Sat, 2012-01-28 at 23:56 +0000, Terry J. Reedy wrote:
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> > I think you should check with randomization enabled, if only to see the
> > nature of the failures and if they are expected.
> 
> Including the list of when-enabled expected failures in the release 
> notes would help those who compile and test.

OK, though note that because it's random, I'll have to run it a few
times, and we'll see what shakes out.

Am running with:
$  make test TESTPYTHONOPTS=-R
leading to:
   ./python -E -bb -R ./Lib/test/regrtest.py -l 

BTW, I see:
  Testing with flags: sys.flags(debug=0, division_warning=0, inspect=0,
interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0,
no_site=0, ignore_environment=1, verbose=0, bytes_warning=2)

which doesn't list the new flag.  Should I add it to sys.flags?  (or
does anyone ever do tuple-unpacking of that PyStructSequence and thus
rely on the number of elements?)
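For reference, sys.flags is a struct sequence, so both access styles work, and the field Dave describes did land (as `hash_randomization`); on a modern interpreter this can be checked directly:

```python
import sys

# Attribute access and tuple-style indexing both work on a struct
# sequence, which is why appending a field is low-risk -- but it does
# change the length seen by any code that tuple-unpacks sys.flags.
assert sys.flags[0] == sys.flags.debug
assert hasattr(sys.flags, "hash_randomization")
assert sys.flags.hash_randomization in (0, 1)
```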
msg152270 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-01-29 22:36
On Jan 28, 2012, at 07:26 PM, Dave Malcolm wrote:

>This turns out to pass without PYTHONHASHRANDOMIZATION in the
>environment, and fail intermittently with it.
>
>Note that "make test" invokes the built python with "-E", so that it
>ignores the setting of PYTHONHASHRANDOMIZATION in the environment.
>
>Barry, Benjamin: does fixing this bug require getting the full test
>suite to pass with randomization enabled (and fixing the intermittent
>failures due to ordering issues), or is it acceptable to "merely" have
>full passes without randomizing the hashes?

I think we at least need to identify (to the best of our ability) the tests
that fail and include them in release notes.  If they're easy to fix, we
should fix them.  Maybe also open a bug report for each failure.

I'm okay though with some tests failing in 2.6 with this environment variable
set.  We needn't go back and fix them in 2.6 (since we're in security-fix only
mode), but I'll bet you'll get almost the same set for 2.7 and there we
*should* fix them, even if it happens after the release.

>What do the buildbots do?

I'm not sure, but as long as the buildbots are green, I'm happy. :)
msg152271 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-29 22:39
Given PYTHONHASHSEED, what is the point of PYTHONHASHRANDOMIZATION?

Alternative:

On startup, python reads a config file with the seed (which defaults to zero).

Add a function to write a random value to that config file for the next startup.
msg152275 - (view) Author: Mark Shannon (Mark.Shannon) * Date: 2012-01-29 22:50
Barry A. Warsaw wrote:
> Barry A. Warsaw <barry@python.org> added the comment:
> 
> On Jan 28, 2012, at 07:26 PM, Dave Malcolm wrote:
> 
>> This turns out to pass without PYTHONHASHRANDOMIZATION in the
>> environment, and fail intermittently with it.
>>
>> Note that "make test" invokes the built python with "-E", so that it
>> ignores the setting of PYTHONHASHRANDOMIZATION in the environment.
>>
>> Barry, Benjamin: does fixing this bug require getting the full test
>> suite to pass with randomization enabled (and fixing the intermittent
>> failures due to ordering issues), or is it acceptable to "merely" have
>> full passes without randomizing the hashes?
> 
> I think we at least need to identify (to the best of our ability) the tests
> that fail and include them in release notes.  If they're easy to fix, we
> should fix them.  Maybe also open a bug report for each failure.

http://bugs.python.org/issue13903 causes even more tests to fail,
so I'm submitting bug reports for most of the failing tests already.
msg152276 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-29 22:51
> Given PYTHONHASHSEED, what is the point of PYTHONHASHRANDOMIZATION?

How would you do what it does without it? I.e. how would you indicate
that it should randomize the seed, rather than fixing the seed value?

> On startup, python reads a config file with the seed (which defaults to zero).

-1 on configuration files that Python reads at startup (let alone in a
bugfix release).
msg152299 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 01:39
On Sun, 2012-01-29 at 00:06 +0000, Dave Malcolm wrote:

I went ahead and added the flag to sys.flags, so now
  $ make test TESTPYTHONOPTS=-R
shows:
Testing with flags: sys.flags(debug=0, division_warning=0, inspect=0,
interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0,
no_site=0, ignore_environment=1, verbose=0, bytes_warning=2,
hash_randomization=1)

...note the:
  hash_randomization=1
at the end of sys.flags.  (This seems useful for making it absolutely
clear if you're getting randomization or not).  Hopefully I'm not
creating too much work for the other Python implementations.

Am attaching new version of patch for 3.1:
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch
msg152300 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 01:44
On Sat, 2012-01-28 at 23:56 +0000, Terry J. Reedy wrote:
> Terry J. Reedy <tjreedy@udel.edu> added the comment:
> 
> > I think you should check with randomization enabled, if only to see the
> > nature of the failures and if they are expected.
> 
> Including the list of when-enabled expected failures in the release 
> notes would help those who compile and test.

Am attaching a patch which fixes various problems that are clearly just
assumptions about dict ordering:
  fix-unittests-broken-by-randomization-dmalcolm-2012-01-29-001.patch

 json/__init__.py                        |    4 +++-
 test/mapping_tests.py                   |    2 +-
 test/test_descr.py                      |   12 +++++++++++-
 test/test_urllib.py                     |    4 +++-
 tkinter/test/test_ttk/test_functions.py |    2 +-
 5 files changed, 19 insertions(+), 5 deletions(-)

Here are the issues that it fixes:
Lib/test/test_descr.py: fix for intermittent failure due to dict repr:
      File "Lib/test/test_descr.py", line 4304, in test_repr
        self.assertEqual(repr(self.C.__dict__), 'dict_proxy({!r})'.format(dict_))
    AssertionError: "dict_proxy({'__module__': 'test.test_descr', '__dict__': <attribute '__dict__' of 'C' objects>, '__doc__': None, '__weakref__': <attribute '__weakref__' of 'C' objects>, 'meth': <function meth at 0x5834be0>})"
                 != "dict_proxy({'__module__': 'test.test_descr', '__doc__': None, '__weakref__': <attribute '__weakref__' of 'C' objects>, 'meth': <function meth at 0x5834be0>, '__dict__': <attribute '__dict__' of 'C' objects>})"

Lib/json/__init__.py: fix (based on haypo's work) for intermittent failure:
    Failed example:
        json.dumps([1,2,3,{'4': 5, '6': 7}], separators=(',', ':'))
    Expected:
        '[1,2,3,{"4":5,"6":7}]'
    Got:
        '[1,2,3,{"6":7,"4":5}]'

Lib/test/mapping_tests.py: fix (based on haypo's work) for intermittent failures of test_collections, test_dict, and test_userdict seen here:
    ======================================================================
    ERROR: test_update (__main__.GeneralMappingTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "Lib/test/mapping_tests.py", line 207, in test_update
        i1 = sorted(d.items())
    TypeError: unorderable types: str() < int()

Lib/test/test_urllib.py: fix (based on haypo's work) for intermittent failure:
    ======================================================================
    FAIL: test_nonstring_seq_values (__main__.urlencode_Tests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "Lib/test/test_urllib.py", line 844, in test_nonstring_seq_values
        urllib.parse.urlencode({"a": {"a": 1, "b": 1}}, True))
    AssertionError: 'a=a&a=b' != 'a=b&a=a'
    ----------------------------------------------------------------------

Lib/tkinter/test/test_ttk/test_functions.py: fix from haypo's patch for intermittent failure:
    Traceback (most recent call last):
      File "Lib/tkinter/test/test_ttk/test_functions.py", line 146, in test_format_elemcreate
        ('a', 'b'), a='x', b='y'), ("test a b", ("-a", "x", "-b", "y")))
    AssertionError: Tuples differ: ('test a b', ('-b', 'y', '-a',... != ('test a b', ('-a', 'x', '-b',...

I see two remaining issues (which this patch doesn't address):
test test_module failed -- Traceback (most recent call last):
  File "Lib/test/test_module.py", line 79, in test_clear_dict_in_ref_cycle
    self.assertEqual(destroyed, [1])
AssertionError: Lists differ: [] != [1]

test_multiprocessing
Exception AssertionError: AssertionError() in <Finalize object, dead> ignored
msg152309 - (view) Author: Zbyszek Jędrzejewski-Szmek (zbysz) * Date: 2012-01-30 07:15
What about PYTHONHASHSEED= -> off, PYTHONHASHSEED=0 -> random, 
PYTHONHASHSEED=n -> n ? I agree with Jim that it's better to have one 
env. variable than two.
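As a sketch, the proposed mapping could be parsed like this (hash_seed_policy is a hypothetical helper; this encodes the proposal in this message, not necessarily the semantics that were eventually committed):

```python
import os

def hash_seed_policy(env=os.environ):
    """Zbyszek's proposal: unset/empty -> randomization off,
    0 -> seed from the platform RNG, any other int n -> fixed seed n."""
    value = env.get("PYTHONHASHSEED", "")
    if value == "":
        return ("off", None)      # deterministic default
    n = int(value)
    if n == 0:
        return ("random", None)   # draw the seed from the platform RNG
    return ("fixed", n)           # reproducible, shareable across a cluster
```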
msg152311 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-01-30 07:45
> What about PYTHONHASHSEED= -> off, PYTHONHASHSEED=0 -> random,
> PYTHONHASHSEED=n -> n ? I agree with Jim that it's better to have one
> env. variable than two.

Rather than the "" empty string for off I suggest an explicit string
that makes it clear what the meaning is.  PYTHONHASHSEED="disabled"
perhaps.

Agreed; a single env var is preferred if we can have one.  It is more
obvious that the PYTHONHASHSEED env var has no effect when it is set
to a special value than when it is set to something but configured to
be ignored by a _different_ env var.
msg152315 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-01-30 08:16
> Rather than the "" empty string for off I suggest an explicit string
> that makes it clear what the meaning is.  PYTHONHASHSEED="disabled"
> perhaps.
> 
> Agreed, if we can have a single env var that is preferred.  It is more
> obvious that the PYTHONHASHSEED env var. has no effect when it is set
> to a special value rather than when it is set to something but it is
> configured to be ignored by a _different_ env var.

I think this is bike-shedding. The requirements for environment
variables are
a) with no variable set, it must not do randomization
b) there must be a way to seed from the platform's RNG
Having an explicit seed actually is no requirement, so I'd propose
to drop PYTHONHASHSEED instead.

However, I really suggest to let the patch author (Dave Malcolm)
design the API within the constraints.
msg152335 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 17:31
It's useful for the selftests, so I've kept PYTHONHASHSEED.  However,
I've removed it from the man page; in the only other place it's
mentioned (Doc/using/cmdline.rst) I now explicitly say that it exists
just to serve the interpreter's own selftests.

Am attaching a revised patch, which has the above change, plus some
tweaks to Lib/test/test_hash.py (adds test coverage for the datetime
hash randomization):
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch

Has anyone had a chance to try this patch on Windows?  Martin?  I'm
hoping that it doesn't impose a startup cost in the default
no-randomization case, and that any startup cost in the -R case is
acceptable.
msg152344 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-01-30 19:55
On Mon, Jan 30, 2012 at 12:31 PM,  Dave Malcolm <dmalcolm@redhat.com>
added the comment:

> It's useful for the selftests, so I've kept PYTHONHASHSEED.

The reason to read PYTHONHASHSEED was so that multiple members of a
cluster could use the same hash.

It would have been nice to have fewer environment variables, but I'll
grant that it is hard to say "use something random that we have *not*
precomputed" without either a config file or a magic value for
PYTHONHASHSEED.

-jJ
msg152352 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-30 22:22
I slightly messed up the test_hash.py changes.

Revised patch attached:
  optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch
msg152362 - (view) Author: Martin (gz) Date: 2012-01-30 23:41
> Has anyone had a chance to try this patch on Windows?  Martin?  I'm
> hoping that it doesn't impose a startup cost in the default
> no-randomization case, and that any startup cost in the -R case is
> acceptable.

Just tested as requested. Is the patch against 3.1 for a reason? It
can't really be compared to earlier results, but I get enough weird
outliers that the comparison may not be useful anyway. Also needed the
following change:

-+        chunk = Py_MIN(size, INT_MAX);
++        chunk = size > INT_MAX ? INT_MAX : size;

Summary, looks like extra work in the default case is avoided and
isn't crippling otherwise, though there were a couple of very odd runs
not presented probably due to other disk access.

Vanilla:

>timeit PCbuild\python.exe -c "import sys;print(sys.version)"
3.1.4+ (default, Jan 30 2012, 22:38:52) [MSC v.1500 32 bit (Intel)]

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        10:42 pm, Monday, January 30 2012
Elapsed Time:     0:00:00.218
Process Time:     0:00:00.187
System Calls:     3974
Context Switches: 574
Page Faults:      1696
Bytes Read:       480331
Bytes Written:    0
Bytes Other:      190860


Patched:

>timeit PCbuild\python.exe -c "import sys;print(sys.version)"
3.1.4+ (default, Jan 30 2012, 22:55:06) [MSC v.1500 32 bit (Intel)]

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        10:55 pm, Monday, January 30 2012
Elapsed Time:     0:00:00.218
Process Time:     0:00:00.187
System Calls:     3560
Context Switches: 441
Page Faults:      1660
Bytes Read:       461956
Bytes Written:    0
Bytes Other:      24926


>timeit PCbuild\python.exe -Rc "import sys;print(sys.version)"
3.1.4+ (default, Jan 30 2012, 22:55:06) [MSC v.1500 32 bit (Intel)]

Version Number:   Windows NT 5.1 (Build 2600)
Exit Time:        11:05 pm, Monday, January 30 2012
Elapsed Time:     0:00:00.249
Process Time:     0:00:00.234
System Calls:     3959
Context Switches: 483
Page Faults:      1847
Bytes Read:       892464
Bytes Written:    0
Bytes Other:      27090
msg152364 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-01-31 01:34
Am attaching a backport of optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch to 2.6

Randomization covers the str, unicode and buffer types; equality of hashes is preserved for these types.
msg152422 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-01 03:29
On Tue, 2012-01-31 at 01:34 +0000, Dave Malcolm wrote:
> Dave Malcolm <dmalcolm@redhat.com> added the comment:
> 
> Am attaching a backport of optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch to 2.6
> 
> Randomization covers the str, unicode and buffer types; equality of hashes is preserved for these types.

I tried benchmarking the 2.6 version of the patch.

I reran "perf.py" 16 times, setting PYTHONHASHRANDOMIZATION=1 and
passing --inherit_env=PYTHONHASHRANDOMIZATION so that the patched
python uses the randomization, with a different hash seed on each run.

Some tests are slightly faster with the patch on some runs; some are
slightly slower, and it appears to vary from run to run.  However, the
amount involved is a few percent.  [compare e.g. with msg151078]

Here's the command I used.
(for i in $(seq 16) ; do echo RUN $i ; (PYTHONHASHRANDOMIZATION=1
python ./perf.py
--inherit_env=PYTHONHASHRANDOMIZATION ../cpython-2.6-clean/python ../cpython-2.6-hash-randomization/python) ; done) | tee results-16.txt

Am attaching the results.
msg152452 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-02 01:18
On Mon, 2012-01-30 at 23:41 +0000, Martin wrote:
> Martin <gzlist@googlemail.com> added the comment:
> 
> > Has anyone had a chance to try this patch on Windows?  Martin?  I'm
> > hoping that it doesn't impose a startup cost in the default
> > no-randomization case, and that any startup cost in the -R case is
> > acceptable.
> 
> Just tested as requested. Is the patch against 3.1 for a reason? Can't
> really be compared to earlier results, but get enough weird outliers
> that that may not be useful anyway. Also needed the following change:
> 
> -+        chunk = Py_MIN(size, INT_MAX);
> ++        chunk = size > INT_MAX ? INT_MAX : size;
> 
> Summary, looks like extra work in the default case is avoided and
> isn't crippling otherwise, though there were a couple of very odd runs
> not presented probably due to other disk access.

Thanks for testing this!

Oops, yes: Py_MIN is only present in "default" [it was added to
Include/Python.h (as PY_MIN) in 72475:8beaa9a37387 for PEP 393, renamed
to Py_MIN in 72489:dacac31460c0, eventually reaching Include/pymacro.h
in 72512:36fc514de7f0]

"orig_size" in win32_urandom was apparently unused, so I've removed it.

I also found and fixed an occasional failure in my 2.6 backport of the
new test_os.URandomTests.get_urandom_subprocess.

Am attaching 4 patches containing the above changes, plus patches to fix
dict/set ordering assumptions that start breaking if you try to run the
test suite with randomization enabled:
   add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch
   fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch
   add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch
   fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch

2.6 also could use the dict-ordering fix for test_symtable that was
fixed in 2.7 as 74256:789d59773801

FWIW I'm seeing this failure in test_urllib2, but I also see it with a
clean checkout of 2.6:
======================================================================
ERROR: test_invalid_redirect (__main__.HandlerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Lib/test/test_urllib2.py", line 963, in test_invalid_redirect
    MockHeaders({"location": valid_url}))
  File
"/home/david/coding/python-hg/cpython-2.6-hash-randomization/Lib/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File
"/home/david/coding/python-hg/cpython-2.6-hash-randomization/Lib/urllib2.py", line 218, in __getattr__
    raise AttributeError, attr
AttributeError: timeout
msg152453 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-02 01:30
It looks like it was not yet decided if the CryptGenRandom API or a weak LCG should be used on Windows. Extract of add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch:

+#ifdef MS_WINDOWS
+#if 1
+        (void)win32_urandom((unsigned char *)secret, secret_size, 0);
+#else
+        /* fast but weak RNG (fast initialization, weak seed) */

Does someone know how to link Python to advapi32.dll (on Windows) to avoid GetModuleHandle("advapi32.dll")?
msg152723 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-02-06 06:11
IIUC, Win9x and NT4 are not supported anymore in any of the target releases of the patch, so calling CryptGenRandom should be fine.

In a security fix release, we shouldn't change the linkage procedures, so I recommend that the LoadLibrary dance remains.
msg152730 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-06 09:53
> In a security fix release, we shouldn't change the linkage procedures,
> so I recommend that the LoadLibrary dance remains.

So the overhead in startup time is not an issue?
msg152731 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 10:20
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> In a security fix release, we shouldn't change the linkage procedures,
>> so I recommend that the LoadLibrary dance remains.
> 
> So the overhead in startup time is not an issue?

It is an issue. Not only in terms of startup time, but also
because randomization per default makes Python behave in
non-deterministic ways - which is not what you want from a
programming language or interpreter (unless you explicitly
tell it to behave like that).

I think it would be much better to just let the user
define a hash seed using environment variables for Python
to use and then forget about how this variable value is
determined. If it's not set, Python uses 0 as seed, thereby
disabling the seeding logic.

This approach would have Python behave in a deterministic way
per default and still allow users who wish to use a different
seed, set this to a different value - even on a case by case
basis.

If you absolutely want to add a feature to have the seed set
randomly, you could make a seed value of -1 trigger the use
of a random number source as seed.

I also still firmly believe that the collision counting scheme
should be made available via an environment variable as well.
The user could then set the variable to e.g. 1000 to have it
enabled with limit 1000, or leave it undefined to disable the
collision counting.

With those two tools, users could then choose the method they
find most attractive for their purposes.

By default, they would be disabled, but applications which are
exposed to untrusted user data and use dictionaries for managing
such data could check whether the protections are enabled and
trigger a startup error if needed.
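The collision-counting scheme Marc-Andre describes can be sketched with a toy open-addressing table (illustrative only; the real patch changes the C lookup loop in Objects/dictobject.c, and all names here are invented):

```python
class TooManyCollisions(RuntimeError):
    """Raised when one insertion exceeds the collision limit."""

class CountingDict:
    """Toy open-addressing table sketching the collision-counting
    countermeasure: abort any single insertion whose probe chain
    exceeds `limit`."""

    def __init__(self, limit=1000, size=64):   # size must be a power of two
        self.limit = limit
        self.slots = [None] * size             # (key, value) pairs or None

    def insert(self, key, value):
        mask = len(self.slots) - 1
        i = hash(key) & mask
        probes = 0
        while self.slots[i] is not None and self.slots[i][0] != key:
            probes += 1
            if probes > self.limit:
                raise TooManyCollisions(key)
            i = (i + 1) & mask                 # linear probing for simplicity
        self.slots[i] = (key, value)
```

Note that this only bounds the cost of a single insertion; lookups that repeatedly walk a chain sitting just under the limit are not throttled, which is the weakness debated in the following messages.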
msg152732 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-06 12:22
> It is an issue. Not only in terms of startup time, but also
> because randomization per default makes Python behave in
> non-deterministic ways - which is not what you want from a
> programming language or interpreter (unless you explicitly
> tell it to behave like that).

That's debatable. For example id() is fairly unpredictable across runs
(except for statically-allocated instances).

> I think it would be much better to just let the user
> define a hash seed using environment variables for Python
> to use and then forget about how this variable value is
> determined. If it's not set, Python uses 0 as seed, thereby
> disabling the seeding logic.
> 
> This approach would have Python behave in a deterministic way
> per default and still allow users who wish to use a different
> seed, set this to a different value - even on a case by case
> basis.
> 
> If you absolutely want to add a feature to have the seed set
> randomly, you could make a seed value of -1 trigger the use
> of a random number source as seed.

Having both may indeed be a good idea.

> I also still firmly believe that the collision counting scheme
> should be made available via an environment variable as well.
> The user could then set the variable to e.g. 1000 to have it
> enabled with limit 1000, or leave it undefined to disable the
> collision counting.
> 
> With those two tools, users could then choose the method they
> find most attractive for their purposes.

It's not about being attractive, it's about fixing the security problem.
The simple collision counting approach leaves a gaping hole open, as
demonstrated by Frank.
msg152734 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 13:12
Antoine Pitrou wrote:
> 
> The simple collision counting approach leaves a gaping hole open, as
> demonstrated by Frank.

Could you elaborate on this ?

Note that I've updated the collision counting patch to cover both
possible attack cases I mentioned in http://bugs.python.org/issue13703#msg150724.
If there's another case I'm unaware of, please let me know.
msg152740 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-06 15:47
On Mon, Feb 6, 2012 at 8:12 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> Antoine Pitrou wrote:
>>
>> The simple collision counting approach leaves a gaping hole open, as
>> demonstrated by Frank.

> Could you elaborate on this ?

> Note that I've updated the collision counting patch to cover both
> possible attack cases I mentioned in http://bugs.python.org/issue13703#msg150724.
> If there's another case I'm unaware of, please let me know.

The problematic case is, roughly,

(1)  Find out what N will trigger collision-counting countermeasures.
(2)  Insert N-1 colliding entries, to make it as slow as possible.
(3)  Keep looking up (or updating) the N-1th entry, so that the
slow-as-possible-without-countermeasures path keeps getting rerun.
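On a stock CPython dict the third step can be sketched with integer keys, whose hashes PYTHONHASHSEED does not perturb (N is an arbitrary illustrative value):

```python
import sys
import timeit

# All positive multiples of sys.hash_info.modulus (2**61 - 1 on 64-bit
# builds) share the same hash, so they land in one probe chain.
M = sys.hash_info.modulus
N = 1000
colliding = {x * M: None for x in range(1, N)}   # step 2: N-1 colliding keys
control = {x: None for x in range(1, N)}         # same size, no collisions

# Step 3: keep looking up the key at the end of the chain; every lookup
# walks the whole chain again.
probe = (N - 1) * M
slow = timeit.timeit(lambda: colliding[probe], number=10_000)
fast = timeit.timeit(lambda: control[N - 1], number=10_000)
print(slow > fast)   # the colliding lookup is far slower per call
```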
msg152747 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 17:07
Jim Jewett wrote:
> 
> Jim Jewett <jimjjewett@gmail.com> added the comment:
> 
> On Mon, Feb 6, 2012 at 8:12 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>>
>> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>>
>> Antoine Pitrou wrote:
>>>
>>> The simple collision counting approach leaves a gaping hole open, as
>>> demonstrated by Frank.
> 
>> Could you elaborate on this ?
> 
>> Note that I've updated the collision counting patch to cover both
>> possible attack cases I mentioned in http://bugs.python.org/issue13703#msg150724.
>> If there's another case I'm unaware of, please let me know.
> 
> The problematic case is, roughly,
> 
> (1)  Find out what N will trigger collision-counting countermeasures.
> (2)  Insert N-1 colliding entries, to make it as slow as possible.
> (3)  Keep looking up (or updating) the N-1th entry, so that the
> slow-as-possible-without-countermeasures path keeps getting rerun.

Since N is constant, I don't see how such an "attack" could be used
to trigger the O(n^2) worst-case behavior. Even if you can create n sets
of entries that each fill up N-1 positions, the overall performance
will still be O(n*N*(N-1)/2) = O(n).

So in the end, we're talking about a regular brute force DoS attack,
which requires different measures than dictionary implementation
tricks :-)

BTW: If you set the limit N to e.g. 100 (which is reasonable given
Victor's and my tests), processing one of those sets takes only
0.3 ms on my machine. That's hardly usable as a basis for an
effective DoS attack.
msg152753 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-06 18:31
On Mon, Feb 6, 2012 at 12:07 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
>
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
>
> Jim Jewett wrote:

>> The problematic case is, roughly,

>> (1)  Find out what N will trigger collision-counting countermeasures.
>> (2)  Insert N-1 colliding entries, to make it as slow as possible.
>> (3)  Keep looking up (or updating) the N-1th entry, so that the
>> slow-as-possible-without-countermeasures path keeps getting rerun.

> Since N is constant, I don't see how such an "attack" could be used
> to trigger the O(n^2) worst-case behavior.

Agreed; it tops out with a constant, but if it takes only 16 bytes of
input to force another run through a 1000-long collision, that may
still be too much leverage.

> BTW: If you set the limit N to e.g. 100 (which is reasonable given
> Victor's and my tests),

Agreed.  Frankly, I think 5 would be more than reasonable so long as
there is a fallback.

> the time it takes to process one of those
> sets only takes 0.3 ms on my machine. That's hardly usable as basis
> for an effective DoS attack.

So it would take around 3MB to cause a minute's delay...
msg152754 - (view) Author: Frank Sievertsen (fx5) Date: 2012-02-06 18:53
> Agreed; it tops out with a constant, but if it takes only 16 bytes of
> input to force another run through a 1000-long collision, that may
> still be too much leverage.

You should prepare the dict so that you hit the collision run with a one-byte string, or better yet an empty string, not a 16-byte string.

> BTW: If you set the limit N to e.g. 100 (which is reasonable given
> Victor's and my tests),

100 is probably hard to exploit for a DoS attack. However
it makes it much easier to cause unwanted (future?) exceptions in
other apps.

> So it would take around 3Mb to cause a minute's delay...

How did you calculate that?
msg152755 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 18:54
Jim Jewett wrote:
> 
>> BTW: If you set the limit N to e.g. 100 (which is reasonable given
>> Victor's and my tests),
> 
> Agreed.  Frankly, I think 5 would be more than reasonable so long as
> there is a fallback.
> 
>> the time it takes to process one of those
>> sets only takes 0.3 ms on my machine. That's hardly usable as basis
>> for an effective DoS attack.
> 
> So it would take around 3Mb to cause a minute's delay...

I'm not sure how you calculated that number.

Here's what I get: take a dictionary with 100 integer collisions:
d = dict((x*(2**64 - 1), 1) for x in xrange(1, 100))

The repr(d) has 2713 bytes, which is a good approximation of how
much (string) data you have to send in order to trigger the
problem case.

If you can create 3333 distinct integer sequences, you'll get a
processing time of about 1 second on my slow dev machine. The
resulting dict will likely have a repr() of around
60*3333*2713 = 517MB.

So you need to send 517MB to cause my slow dev machine to consume
1 minute of CPU time. Today's servers are at least 10 times as fast as
my aging machine.

If you then take into account that the integer collision dictionary
is a very efficient collision example (size vs. effect), the attack
doesn't really sound practical anymore.
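The Python 3 analogue of this construction (Marc-Andre's example targets Python 2, whose long hash used a different modulus; on Python 3 the numeric hash modulus is exposed as sys.hash_info.modulus, 2**61 - 1 on 64-bit builds):

```python
import sys

# Every positive multiple of the numeric hash modulus hashes to 0,
# and PYTHONHASHSEED does not perturb integer hashes -- which is why
# seeding alone does not cover the integer-collision case.
M = sys.hash_info.modulus
d = dict((x * M, 1) for x in range(1, 100))   # 99 colliding keys

assert len({hash(k) for k in d}) == 1         # all in one bucket
print(len(repr(d)))   # rough size of the payload an attacker must send
```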
msg152758 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-06 19:07
On Mon, 2012-02-06 at 06:11 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> IIUC, Win9x and NT4 are not supported anymore in any of the target releases of the patch, so calling CryptGenRandom should be fine.
> In a security fix release, we shouldn't change the linkage procedures, so I recommend that the LoadLibrary dance remains.

Thanks.

Am attaching tweaked versions of the 2012-02-01 patches, in which I've
removed the indecisive:
#if 1
       (void)win32_urandom((unsigned char *)secret, secret_size, 0);
#else
       /* fast but weak RNG (fast initialization, weak seed) */
       ...etc...
#endif

stuff, and simply use the first clause (win32_urandom) on Windows:
  add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch 
  fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch 
  add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch 
  fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch
msg152760 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-06 19:11
On Mon, 2012-02-06 at 10:20 +0000, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> STINNER Victor wrote:
> > 
> > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> > 
> >> In a security fix release, we shouldn't change the linkage procedures,
> >> so I recommend that the LoadLibrary dance remains.
> > 
> > So the overhead in startup time is not an issue?
> 
> It is an issue. Not only in terms of startup time, but also

msg152362 indicated that there was negligible impact on startup time
when randomization is disabled.  The impact when it *is* enabled is
unclear, but reported there as "isn't crippling".

> because randomization per default makes Python behave in
> non-deterministic ways - which is not what you want from a
> programming language or interpreter (unless you explicitly
> tell it to behave like that).

The release managers have pronounced:
http://mail.python.org/pipermail/python-dev/2012-January/115892.html
Quoting that email:
> 1. Simple hash randomization is the way to go. We think this has the
> best chance of actually fixing the problem while being fairly
> straightforward such that we're comfortable putting it in a stable
> release.
> 2. It will be off by default in stable releases and enabled by an
> envar at runtime. This will prevent code breakage from dictionary
> order changing as well as people depending on the hash stability.
msg152763 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-06 19:34
On Mon, Feb 6, 2012 at 1:53 PM, Frank Sievertsen <report@bugs.python.org> wrote:

>>> BTW: If you set the limit N to e.g. 100 (which is reasonable given
>>> Victor's and my tests),

>> So it would take around 3Mb to cause a minute's delay...

> How did you calculate that?

16 bytes/entry * 3300 entries/second * 60 seconds/minute

But if there is indeed a way to cut that 16 bytes/entry, that is worse.

Switching dict implementations at 5 collisions is still acceptable,
except from a complexity standpoint.

-jJ
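Spelling out the arithmetic (the per-entry and per-second figures are Jim's estimates from the surrounding discussion, not measured values):

```python
bytes_per_entry = 16        # attacker payload per colliding entry (estimate)
entries_per_second = 3300   # processing rate implied by the 0.3 ms/set timing
seconds_per_minute = 60

payload_bytes = bytes_per_entry * entries_per_second * seconds_per_minute
print(payload_bytes)        # 3168000 bytes, i.e. roughly 3 MB
```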
msg152764 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 19:44
Dave Malcolm wrote:
> 
>>> So the overhead in startup time is not an issue?
>>
>> It is an issue. Not only in terms of startup time, but also
>... 
>> because randomization per default makes Python behave in
>> non-deterministic ways - which is not what you want from a
>> programming language or interpreter (unless you explicitly
>> tell it to behave like that).
> 
> The release managers have pronounced:
> http://mail.python.org/pipermail/python-dev/2012-January/115892.html
> Quoting that email:
>> 1. Simple hash randomization is the way to go. We think this has the
>> best chance of actually fixing the problem while being fairly
>> straightforward such that we're comfortable putting it in a stable
>> release.
>> 2. It will be off by default in stable releases and enabled by an
>> envar at runtime. This will prevent code breakage from dictionary
>> order changing as well as people depending on the hash stability.

Right, but that doesn't contradict what I wrote about adding
env vars to fix a seed and optionally enable using a random
seed, or adding collision counting as extra protection for
cases that are not addressed by the hash seeding, such as
collisions caused by 3rd party types or numbers.
msg152767 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 20:14
Marc-Andre Lemburg wrote:
> Dave Malcolm wrote:
>> The release managers have pronounced:
>> http://mail.python.org/pipermail/python-dev/2012-January/115892.html
>> Quoting that email:
>>> 1. Simple hash randomization is the way to go. We think this has the
>>> best chance of actually fixing the problem while being fairly
>>> straightforward such that we're comfortable putting it in a stable
>>> release.
>>> 2. It will be off by default in stable releases and enabled by an
>>> envar at runtime. This will prevent code breakage from dictionary
>>> order changing as well as people depending on the hash stability.
> 
> Right, but that doesn't contradict what I wrote about adding
> env vars to fix a seed and optionally enable using a random
> seed, or adding collision counting as extra protection for
> cases that are not addressed by the hash seeding, such as
> collisions caused by 3rd party types or numbers.

... at least I hope not :-)
msg152768 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-06 20:17
> > Right, but that doesn't contradict what I wrote about adding
> > env vars to fix a seed and optionally enable using a random
> > seed, or adding collision counting as extra protection for
> > cases that are not addressed by the hash seeding, such as
>> collisions caused by 3rd party types or numbers.
> 
> ... at least I hope not :-)

I think the env var part is a good idea (except that -1 as a magic value
to enable randomization isn't great).
msg152769 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 20:24
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>>> Right, but that doesn't contradict what I wrote about adding
>>> env vars to fix a seed and optionally enable using a random
>>> seed, or adding collision counting as extra protection for
>>> cases that are not addressed by the hash seeding, such as
>>> collisions caused by 3rd party types or numbers.
>>
>> ... at least I hope not :-)
> 
> I think the env var part is a good idea (except that -1 as a magic value
> to enable randomization isn't great).

Agreed. Since it's an env var, using "random" would be a better choice.
msg152777 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-06 21:18
>
> > The release managers have pronounced:
> > http://mail.python.org/pipermail/python-dev/2012-January/115892.html
> > Quoting that email:
> >> 1. Simple hash randomization is the way to go. We think this has the
> >> best chance of actually fixing the problem while being fairly
> >> straightforward such that we're comfortable putting it in a stable
> >> release.
> >> 2. It will be off by default in stable releases and enabled by an
> >> envar at runtime. This will prevent code breakage from dictionary
> >> order changing as well as people depending on the hash stability.
>
> Right, but that doesn't contradict what I wrote about adding
> env vars to fix a seed and optionally enable using a random
> seed, or adding collision counting as extra protection for
> cases that are not addressed by the hash seeding, such as
> collisions caused by 3rd party types or numbers.

We won't be back-porting anything more than the hash randomization for
2.6/2.7/3.1/3.2 but we are free to do more in 3.3 if someone can
demonstrate it working well and a need for it.

For me, things like collision counting and tree based collision
buckets when the types are all the same and known comparable make
sense but are really sounding like a lot of additional complexity. I'd
*like* to see active black-box design attack code produced that goes
after something like a wsgi web app written in Python with hash
randomization *enabled* to demonstrate the need before we accept
additional protections like this  for 3.3+.

-gps
msg152780 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 21:41
Gregory P. Smith wrote:
> 
> Gregory P. Smith <greg@krypto.org> added the comment:
> 
>>
>>> The release managers have pronounced:
>>> http://mail.python.org/pipermail/python-dev/2012-January/115892.html
>>> Quoting that email:
>>>> 1. Simple hash randomization is the way to go. We think this has the
>>>> best chance of actually fixing the problem while being fairly
>>>> straightforward such that we're comfortable putting it in a stable
>>>> release.
>>>> 2. It will be off by default in stable releases and enabled by an
>>>> envar at runtime. This will prevent code breakage from dictionary
>>>> order changing as well as people depending on the hash stability.
>>
>> Right, but that doesn't contradict what I wrote about adding
>> env vars to fix a seed and optionally enable using a random
>> seed, or adding collision counting as extra protection for
>> cases that are not addressed by the hash seeding, such as
>> collisions caused by 3rd party types or numbers.
> 
> We won't be back-porting anything more than the hash randomization for
> 2.6/2.7/3.1/3.2 but we are free to do more in 3.3 if someone can
> demonstrate it working well and a need for it.
> 
> For me, things like collision counting and tree based collision
> buckets when the types are all the same and known comparable make
> sense but are really sounding like a lot of additional complexity. I'd
> *like* to see active black-box design attack code produced that goes
> after something like a wsgi web app written in Python with hash
> randomization *enabled* to demonstrate the need before we accept
> additional protections like this  for 3.3+.

I posted several examples for the integer collision attack on this
ticket. The current randomization patch does not address this at all,
the collision counting patch does, which is why I think both are
needed.

Note that my comment was more about the desire to *not* recommend
using random hash seeds per default, but instead advocate using
a random but fixed seed, or at least document that using random
seeds that are set during interpreter startup will cause
problems with repeatability of application runs.
msg152781 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-02-06 21:42
On Mon, Feb 6, 2012 at 4:41 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> [...]

Can't randomization just be applied to integers as well?

Alex
msg152784 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-06 21:53
> Can't randomization just be applied to integers as well?
> 

It could, but see http://bugs.python.org/issue13703#msg151847

Would my patches be more or less likely to get reviewed with vs without
an extension of randomization to integers?
msg152787 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 22:04
Alex Gaynor wrote:
> Can't randomization just be applied to integers as well?

A simple seed xor'ed with the hash won't work, since the attacks
I posted will continue to work (just colliding on a different hash
value).

Using a more elaborate hash algorithm would slow down uses of
numbers as dictionary keys and also be difficult to implement for
non-integer types such as float, longs and complex numbers. The
reason is that Python applications expect x == y => hash(x) == hash(y),
e.g. hash(3) == hash(3L) == hash(3.0) == hash(3+0j).
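The invariant is easy to check (Python 3 syntax; in Python 2 the same holds for 3L as well):

```python
from fractions import Fraction
from decimal import Decimal

# equal numbers compare equal across numeric types...
assert 3 == 3.0 == 3 + 0j == Fraction(3)
# ...so their hashes must agree too, which constrains any change
# to the numeric hash algorithm
assert hash(3) == hash(3.0) == hash(3 + 0j) == hash(Fraction(3)) == hash(Decimal(3))
```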

AFAIK, the randomization patch also doesn't cover tuples, which are
rather common as dictionary keys as well, nor any of the other
more esoteric Python built-in hashable data types (e.g. frozenset)
or hashable data types defined by 3rd party extensions or
applications (simply because it can't).
msg152789 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-02-06 22:07
On Mon, Feb 6, 2012 at 5:04 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> [...]

There's no need to cover any container types, because if their constituent
types are securely hashable then they will be as well.  And of course if
the constituent types are insecure then they're directly vulnerable.

Alex
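This point can be checked empirically: a tuple's hash is derived from its elements' hashes, so randomizing str hashes perturbs tuples too. A rough cross-process check (assuming an interpreter that honors PYTHONHASHSEED; `external_hash` is an illustrative helper, not part of any patch):

```python
import os
import subprocess
import sys

def external_hash(expr, seed):
    """Hash `expr` in a child interpreter running with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash(%s))" % expr], env=env)
    return int(out)

# with a fixed seed the tuple's hash is reproducible across processes...
assert external_hash("('spam', 1)", "0") == external_hash("('spam', 1)", "0")
assert external_hash("('spam', 1)", "1") == external_hash("('spam', 1)", "1")
# ...while a different seed will (almost certainly) move the tuple's hash
# along with the hash of its str element -- no separate tuple seeding needed
```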
msg152797 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-06 23:00
Alex Gaynor wrote:
> There's no need to cover any container types, because if their constituent
> types are securely hashable then they will be as well.  And of course if
> the constituent types are insecure then they're directly vulnerable.

I wouldn't necessarily take that for granted: since container
types usually calculate their hash based on the hashes of their
elements, it's possible that a clever combination of elements
could lead to a neutralization of the hash seed used by
the elements, thereby reenabling the original attack on the
unprotected interpreter.

Still, because we have far more vulnerable hashable types out there,
trying to find such an attack doesn't really make practical
sense, so protecting containers is indeed not as urgent :-)
msg152811 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-07 15:41
On Mon, 2012-02-06 at 23:00 +0000, Marc-Andre Lemburg wrote:
> Marc-Andre Lemburg <mal@egenix.com> added the comment:
> 
> Alex Gaynor wrote:
> > There's no need to cover any container types, because if their constituent
> > types are securely hashable then they will be as well.  And of course if
> > the constituent types are insecure then they're directly vulnerable.
> 
> I wouldn't necessarily take that for granted: since container
> types usually calculate their hash based on the hashes of their
> elements, it's possible that a clever combination of elements
> could lead to a neutralization of the hash seed used by
> the elements, thereby reenabling the original attack on the
> unprotected interpreter.
> 
> Still, because we have far more vulnerable hashable types out there,
> trying to find such an attack doesn't really make practical
> sense, so protecting containers is indeed not as urgent :-)

FWIW, I'm still awaiting review of my patches.  I don't believe
Marc-Andre's concerns are a sufficient rebuttal to the approach I've
taken.

If anyone is aware of an attack via numeric hashing that's actually
possible, please let me know (privately).  I believe only specific apps
could be affected, and I'm not aware of any such specific apps.
msg152855 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-08 13:10
Dave Malcolm wrote:
> 
> If anyone is aware of an attack via numeric hashing that's actually
> possible, please let me know (privately).  I believe only specific apps
> could be affected, and I'm not aware of any such specific apps.

I'm not sure what you'd like to see.

Any application reading user-provided data from a file, database,
web, etc. is vulnerable to the attack if it uses the numeric data
it reads as keys in a dictionary.

The most common use case for this is a dictionary mapping codes or
IDs to strings or objects, e.g. for caching purposes, to find a list
of unique IDs, checking for duplicates, etc.

This also works indirectly on 32-bit platforms, e.g. via date/time
or IP address values that get converted to key integers.
msg153055 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-02-10 15:30
So modulo my (small) review comments, David's patches are ready to go in.
msg153074 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-10 19:23
Thanks for reviewing, Benjamin.  I'm also reviewing this today.  Sorry
for the delay!

BTW, in the Schadenfreude department: a hash collision DoS "fix" patch for
PHP5 was done poorly and introduced a new security vulnerability that
was just used to let script kiddies root many servers all around the
web:  http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2012-0830
msg153081 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-10 23:01
Review of add-randomization-(...).patch:
 - there is a missing ")" in the doc, near "the types covered by the :option:`-R` option (or its equivalent, :envvar:`PYTHONHASHRANDOMIZATION`."
 - get_hash() in test_hash.py fails completely on Windows: Windows requires some environment variables. Just use env=os.environ.copy() instead of env={}.
 - PYTHONHASHSEED doc is not clear: it should be mentioned that the variable is ignored if PYTHONHASHRANDOMIZATION is not set
 - (Python 2.6) test_hash fails because of "[xxx refs]" in stderr if Python is compiled in debug mode. Add strip_python_stderr() to test_support.py and use it in get_hash().

def strip_python_stderr(stderr):
    """Strip the stderr of a Python process from potential debug output
    emitted by the interpreter.

    This will typically be run on the result of the communicate() method
    of a subprocess.Popen object.
    """
    stderr = re.sub(br"\[\d+ refs\]\r?\n?$", b"", stderr).strip()
    return stderr
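For illustration, here is the helper restated in self-contained form (with the `import re` it needs) together with a quick check of its behavior on debug-build stderr:

```python
import re

def strip_python_stderr(stderr):
    """Strip trailing "[NNN refs]" debug output from a Python
    process's stderr (bytes in, bytes out)."""
    return re.sub(br"\[\d+ refs\]\r?\n?$", b"", stderr).strip()

# the refcount line emitted by --with-pydebug builds is removed...
assert strip_python_stderr(b"some error\n[12345 refs]\n") == b"some error"
# ...and ordinary stderr passes through (modulo surrounding whitespace)
assert strip_python_stderr(b"clean output\n") == b"clean output"
```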

Except for these minor nits, the patches (2.6 and 3.1) look good. I didn't read the test patches: I just ran the tests to exercise them :-) (Or our buildbots will do the work for you.)
msg153082 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-10 23:49
On Fri, Feb 10, 2012 at 6:02 PM, STINNER Victor wrote:

>  - PYTHONHASHSEED doc is not clear: it should be mentioned
> that the variable is ignored if PYTHONHASHRANDOMIZATION
> is not set

*That* is why this two-envvar solution bothers me.

PYTHONHASHSEED has to be a string anyhow, so why not just get rid of
PYTHONHASHRANDOMIZATION?

Use PYTHONHASHSEED=random to use randomization.

Other values that cannot be turned into an integer will be (currently)
undefined.  (You may want to raise a fatal error, on the assumption
that errors should not pass silently.)

A missing PYTHONHASHSEED then has the pleasant interpretation of
defaulting to "0" for backwards compatibility.
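The single-envvar scheme could be sketched as follows. This is a hypothetical helper for illustration only (`read_hash_seed` and its error handling are not part of any patch), but it captures the proposed semantics: missing means 0, "random" means a fresh seed, anything else must parse as an integer:

```python
import os
import random

def read_hash_seed(environ=os.environ):
    """Interpret PYTHONHASHSEED per the single-envvar proposal:
    missing -> 0 (backwards compatible), "random" -> fresh random
    seed, anything else must parse as an integer (else fatal error)."""
    value = environ.get("PYTHONHASHSEED", "0")
    if value == "random":
        return random.SystemRandom().getrandbits(64)
    try:
        return int(value)
    except ValueError:
        raise SystemExit("PYTHONHASHSEED must be 'random' or an integer")

assert read_hash_seed({}) == 0                        # unset: old behavior
assert read_hash_seed({"PYTHONHASHSEED": "42"}) == 42 # fixed, reproducible seed
```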
msg153140 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-11 23:06
On Fri, 2012-02-10 at 23:02 +0000, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> Review of add-randomization-(...).patch:
>  - there is a missing ")" in the doc, near "the types covered by the :option:`-R` option (or its equivalent, :envvar:`PYTHONHASHRANDOMIZATION`."
>  - get_hash() in test_hash.py fails completly on Windows: Windows requires some environment variables. Just use env=os.environ.copy() instead of env={}.
>  - PYTHONHASHSEED doc is not clear: it should be mentionned that the variable is ignored if PYTHONHASHRANDOMIZATION is not set
>  - (Python 2.6) test_hash fails because of "[xxx refs]" in stderr if Python is compiled in debug mode. Add strip_python_stderr() to test_support.py and use it in get_hash().

I'm attaching revised versions of the "add-randomization" patches
incorporating review feedback:
  add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch
  add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch

The other pair of patches are unchanged from before:
  fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch
  fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch

Changes relative to *-2012-02-06-001.patch:
  * changed the wording of the docs relating to PYTHONHASHSEED in
Doc/using/cmdline.rst to:
    * clarify the interaction with PYTHONHASHRANDOMIZATION and -R
    * mentioning another possible use case: "to allow a cluster of
python processes to share hash values." (as per
http://bugs.python.org/issue13703#msg152344 )
    * rewording the awkward "overrides the other setting"
  * I've added a description of PYTHONHASHSEED back to the man page and
to the --help text
  * grammar fixes for "Fail to" in 2.6 version of the patch (were
already fixed in 3.1)
  * restored __VMS randomization, by porting vms_urandom from
Modules/posixmodule.c to Python/random.c (though I have no way of
testing this)
  * changed env = {} to env = os.environ.copy() in get_hash() as noted
by haypo
  * fixed test_hash --with-pydebug as noted by haypo (and test_os),
adding strip_python_stderr from 2.7

I haven't enabled randomization in the Makefile.pre.in
msg153141 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-11 23:09
I'm not quite sure how that would interact with the -R command-line
option for enabling randomization.

The changes to the docs in the latest patch clarifies the meaning of
what I've implemented (I hope).

My view is that we should simply enable hash randomization by default in
3.3

At that point, PYTHONHASHRANDOMIZATION and the -R option become
meaningless (and could be either removed, or silently ignored), and you
have to set PYTHONHASHSEED=0 to get back the old behavior.
msg153143 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-12 01:37
Should -R be required to take a parameter specifying "on" or "off" so
that code using a #! line continues to work as specified across a
change in default behavior when upgrading from 3.2 to 3.3?

#!/usr/bin/python3 -R on
#!/usr/bin/python3 -R off

In 3.3 it would be a good idea to have a command line flag to turn
this off.  Rather than introducing a new flag in 3.3, a parameter that
is explicit regardless of the default avoids that entirely.

before anyone suggests it: I do *not* think -R should accept a value
to use as the seed.  that is unnecessary.
msg153144 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-12 02:11
Comments to be addressed added on the code review.
msg153297 - (view) Author: Dave Malcolm (dmalcolm) (Python committer) Date: 2012-02-13 20:37
On Sun, 2012-02-12 at 02:11 +0000, Gregory P. Smith wrote:
> Gregory P. Smith <greg@krypto.org> added the comment:
> 
> Comments to be addressed added on the code review.

Thanks.  I'm attaching new patches:
  add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch
  add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch

I incorporated the feedback from Gregory P Smith's review.

I haven't changed the way the command-line options or env variables
work, though.

Changes relative to *-2012-02-11-001.patch:
  * added versionadded 2.6.8 and 3.1.5 to hash_randomization/-R within
Docs/library/sys.rst and Docs/using/cmdline.rst (these will need
changing to "2.7.3" and "3.2.3" in the forward ports to the 2.7 and 3.2
branches)
  * fixed line wrapping within the --help text in Modules/main.c
  * reverted text of urandom__doc__
  * added comments about the specialcasing of length 0:
    /*
      We make the hash of the empty string be 0, rather than using
      (prefix ^ suffix), since this slightly obfuscates the hash secret
    */
    (see discussion in http://bugs.python.org/issue13703#msg151664
onwards)
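The special case is easy to observe: in CPython the empty string hashes to 0 regardless of the seed in use, so the secret is never exposed via hash(""):

```python
# hash("") is pinned to 0 so that (prefix ^ suffix) is never revealed
assert hash("") == 0
assert hash(b"") == 0   # the same special case applies to empty bytes
```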

I didn't change the range of values for PYTHONHASHSEED on 64-bit
msg153301 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-13 20:50
Dave Malcolm wrote:
> [new patch]

Please change how the env vars work as discussed earlier on this ticket.

Quick summary:

We only need one env var for the randomization logic: PYTHONHASHSEED.
If not set, 0 is used as seed. If set to a number, a fixed seed
is used. If set to "random", a random seed is generated at
interpreter startup.

Same for the -R cmd line option.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

msg153369 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-02-14 20:34
On Mon, Feb 13, 2012 at 3:37 PM,  Dave Malcolm
<dmalcolm@redhat.com> added the comment:

>  * added comments about the specialcasing of length 0:
>    /*
>      We make the hash of the empty string be 0, rather than using
>      (prefix ^ suffix), since this slightly obfuscates the hash secret
>    */

Frankly, other short strings may give away even more, because you can
put several into the same dict.

I would prefer that the randomization not kick in until strings are at
least 8 characters, but I think excluding length 1 is a pretty obvious
win.
msg153395 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-02-15 08:25
> Frankly, other short strings may give away even more, because you can
> put several into the same dict.

Please don't make such claims without some reasonable security analysis:
how *exactly* would you derive the hash seed when you have the hash
values of all 256 one-byte strings (or all 2**20 one-char Unicode
strings)?

> I would prefer that the randomization not kick in until strings are at
> least 8 characters, but I think excluding length 1 is a pretty obvious
> win.

-1. It is very easy to create a good number of hash collisions already
with 6-character strings. You are opening the security hole again that
we are attempting to close.
msg153682 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-19 09:14
Attaching a reviewed version for 3.1 with the unified env var PYTHONHASHSEED, incorporating Antoine's and Greg's review comments.
msg153683 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-19 09:21
New version, with the hope that it gets a "review" link.
msg153690 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-19 10:00
New patch fixes failures due to sys.flags backwards compatibility.

With PYTHONHASHSEED=random, at least those tests still fail:
test_descr test_json test_set test_ttk_textonly test_urllib

Do we want to fix them in 3.1?
msg153695 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-19 10:22
> With PYTHONHASHSEED=random, at least those tests still fail:
> test_descr test_json test_set test_ttk_textonly test_urllib
>
> Do we want to fix them in 3.1?

If the failures are caused by the tests depending on dict order (i.e. not real bugs, not changed behavior), then I think we can live with them.
msg153750 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-20 01:05
> With PYTHONHASHSEED=random, at least those tests still fail:
> test_descr test_json test_set test_ttk_textonly test_urllib
> 
> Do we want to fix them in 3.1?

I don't know, but we'll have to fix them in 3.2 to avoid breaking the
buildbots. So we might also fix them in 3.1.
msg153753 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-02-20 01:31
+1 for fixing all tests.
msg153798 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-20 19:01
New changeset f4b7ecf8a5f8 by Georg Brandl in branch '3.1':
Issue #13703: add a way to randomize the hash values of basic types (str, bytes, datetime)
http://hg.python.org/cpython/rev/f4b7ecf8a5f8
msg153802 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-20 20:41
New changeset 4a31f6b11e7a by Georg Brandl in branch '3.2':
Merge from 3.1: Issue #13703: add a way to randomize the hash values of basic types (str, bytes, datetime)
http://hg.python.org/cpython/rev/4a31f6b11e7a
msg153817 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-20 23:37
New changeset ed76dc34b39d by Georg Brandl in branch 'default':
Merge 3.2: Issue #13703 plus some related test suite fixes.
http://hg.python.org/cpython/rev/ed76dc34b39d
msg153833 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-21 01:44
New changeset 6b7704fe1be1 by Barry Warsaw in branch '2.6':
- Issue #13703: oCERT-2011-003: add -R command-line option and PYTHONHASHSEED
http://hg.python.org/cpython/rev/6b7704fe1be1
msg153848 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 06:01
Roundup Robot didn't seem to notice it, but this has also been committed in 2.7:

http://hg.python.org/cpython/rev/a0f43f4481e0
msg153849 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-21 06:03
Yep, the bot only looks at commit messages; it does not inspect merges or other topological information.  That’s why some of us make sure to repeat bug numbers in our merge commit messages.
msg153850 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-21 06:12
But since our workflow is such that commits in X.Y branches always show up in X.Y+1, it doesn't really matter.
msg153852 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 06:35
The bug report is the easiest thing to search for and follow when checking whether something is resolved, so it is nice to have a link to the relevant patch(es) for each branch.  I just wanted to note the major commit here so that all planned branches had a note recorded.  I don't care that it wasn't automatic. :)

For observers: There have been several more commits related to fixing this (test dict/set order fixes, bug/typo/merge oops fixes for the linked-to patches, etc.). Anyone interested in seeing the full list of diffs should look at their specific branch on or around the time of the linked-to changesets.  Too many to list here.
msg153853 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 06:40
Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.

Saying yes "working as intended" is fine by me.

sys.flags.hash_randomization seems to simply indicate that doing something with the hash seed was explicitly specified as opposed to defaulting to off, not that the hash seed was actually chosen randomly.

What this implies for 3.3 after we make hash randomization default to on is that sys.flags.hash_randomization will always be 1.
msg153854 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-02-21 06:47
That is a good question.  I don't really care either way, but let's say +0 for turning it off when seed == 0.

-R still needs to be made default in 3.3 - that's one reason this issue is still open.
msg153860 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-02-21 09:47
> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.
>
> Saying yes "working as intended" is fine by me.

It is documented that PYTHONHASHSEED=0 disables the randomization, so
sys.flags.hash_randomization must be False (0).
msg153861 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-21 09:48
Gregory P. Smith wrote:
> 
> Gregory P. Smith <greg@krypto.org> added the comment:
> 
> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.

The flag should probably be removed - simply because
the env var is not a flag, it's a configuration parameter.

Exposing the seed value as sys.hashseed would be better and more useful
to applications.
msg153862 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2012-02-21 09:50
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@gmail.com> added the comment:
> 
>> Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.
>>
>> Saying yes "working as intended" is fine by me.
> 
> It is documented that PYTHONHASHSEED=0 disables the randomization, so
> sys.flags.hash_randomization must be False (0).

PYTHONHASHSEED=1 will disable randomization as well :-)

Only setting PYTHONHASHSEED=random actually enables randomization.
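These semantics can be verified from the outside (assuming an interpreter that honors PYTHONHASHSEED; `hash_in_child` is an illustrative helper): any fixed integer seed yields reproducible hashes across runs, and only `random` varies between runs:

```python
import os
import subprocess
import sys

def hash_in_child(seed):
    """Hash a fixed string in a child interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('collision'))"], env=env)
    return int(out)

# PYTHONHASHSEED=0 and PYTHONHASHSEED=1 each pin the hash function,
# so repeated runs agree...
assert hash_in_child("0") == hash_in_child("0")
assert hash_in_child("1") == hash_in_child("1")
# ...only PYTHONHASHSEED=random varies from one interpreter run to the next
```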
msg153868 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-02-21 11:37
> That is a good question.  I don't really care either way, but let's
> say +0 for turning it off when seed == 0.

+1
msg153872 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-21 15:33
On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

>Exposing the seed value as sys.hashseed would be better and more useful
>to applications.

That makes the most sense to me.
msg153873 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-21 15:42
On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

>The flag should probably be removed - simply because
>the env var is not a flag, it's a configuration parameter.
>
>Exposing the seed value as sys.hashseed would be better and more useful
>to applications.

Okay, after chatting with __ap__ on irc, here's what I think the behavior
should be:

sys.flags.hash_randomization should contain just the value given by the -R
flag.  It should only be True if the flag is present, False otherwise.

sys.hash_seed contains the hash seed, set by virtue of the flag or envar.  It
should contain the *actual* seed value used.  E.g. it might be zero, the
explicitly set integer, or the randomly selected seed value in use during this
Python execution if a random seed was requested.

If you really need the envar value, getenv('PYTHONHASHSEED') is good enough
for that.
msg153877 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-21 16:28
+1 to what barry and __ap__ discussed and settled on.
msg153975 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-22 17:46
I have to amend my suggestion about sys.flags.hash_randomization.  It needs to be non-zero even if $PYTHONHASHSEED is given instead of -R.  Many other flags that also have envars work the same way, e.g. -O and $PYTHONOPTIMIZE.  So hash_randomization has to work the same way.

I'll still work on a patch for exposing the seed in sys.
msg153980 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2012-02-22 18:12
Never mind about sys.hash_seed.  See my follow up in python-dev.  I consider this issue is closed wrt the 2.6 branch.
msg154428 - (view) Author: Roger Serwy (roger.serwy) * (Python committer) Date: 2012-02-27 04:34
After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.
msg154430 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2012-02-27 05:01
Can you paste the error you're getting?

2012/2/26 Roger Serwy <report@bugs.python.org>:
>
> Roger Serwy <roger.serwy@gmail.com> added the comment:
>
> After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.
msg154432 - (view) Author: Roger Serwy (roger.serwy) * (Python committer) Date: 2012-02-27 05:22
It was a false alarm. I didn't recompile python before running it with the latest /Lib files. My apologies.
msg154853 - (view) Author: Chris Rebert (cvrebert) * Date: 2012-03-03 20:36
The Design and History FAQ (will) need a minor corresponding update:
http://docs.python.org/dev/faq/design.html#how-are-dictionaries-implemented
msg155293 - (view) Author: Kurt Seifried (kseifried@redhat.com) Date: 2012-03-10 05:59
I have assigned CVE-2012-1150 for this issue as per http://www.openwall.com/lists/oss-security/2012/03/10/3
msg155472 - (view) Author: Jon Vaughan (jsvaughan) Date: 2012-03-12 20:37
FWIW I upgraded to the Ubuntu Pangolin beta over the weekend, which includes 2.7.3rc1, and I'm also experiencing a problem with urandom.

  File "/usr/lib/python2.7/email/utils.py", line 27, in <module>
    import random
  File "/usr/lib/python2.7/random.py", line 47, in <module>
    from os import urandom as _urandom
ImportError: cannot import name urandom

Given Roger Serwy's comment it sounds like a beta ubuntu problem, but thought it worth mentioning.
msg155527 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-03-12 23:51
> FWIW I upgraded to ubuntu pangolin beta over the weekend,
> which includes 2.7.3rc1, ...
>
>  File "/usr/lib/python2.7/random.py", line 47, in <module>
>    from os import urandom as _urandom
> ImportError: cannot import name urandom

It looks like you are using the random.py from Python 2.7.3 with a Python 2.7.2 interpreter, because os.urandom() is now always available in Python 2.7.3.
msg155680 - (view) Author: Jon Vaughan (jsvaughan) Date: 2012-03-13 22:15
Victor - yes that was it; a mixture of a 2.7.2 virtual env and 2.7.3.  Apologies for any nuisance caused.
msg155681 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-03-13 22:18
Can we close this issue?
msg155682 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-03-13 22:25
I believe so.  This is in all of the release candidates.

The expat/xmlparse.c hash collision DoS issue is being handled on its own via http://bugs.python.org/issue14234.
History
Date  User  Action  Args
2012-03-13 22:25:45  gregory.p.smith  set  status: open -> closed; resolution: fixed; messages: + msg155682
2012-03-13 22:18:57  haypo  set  messages: + msg155681
2012-03-13 22:15:44  jsvaughan  set  messages: + msg155680
2012-03-12 23:51:17  haypo  set  messages: + msg155527
2012-03-12 20:37:34  jsvaughan  set  nosy: + jsvaughan; messages: + msg155472
2012-03-10 05:59:40  kseifried@redhat.com  set  nosy: - kseifried@redhat.com
2012-03-10 05:59:10  kseifried@redhat.com  set  nosy: + kseifried@redhat.com; messages: + msg155293
2012-03-03 20:36:36  cvrebert  set  messages: + msg154853
2012-02-27 05:22:27  roger.serwy  set  messages: + msg154432
2012-02-27 05:01:39  benjamin.peterson  set  messages: + msg154430
2012-02-27 04:34:41  roger.serwy  set  nosy: + roger.serwy; messages: + msg154428
2012-02-23 21:49:36  cvrebert  set  nosy: + cvrebert
2012-02-22 18:12:06  barry  set  messages: + msg153980
2012-02-22 17:46:48  barry  set  messages: + msg153975
2012-02-21 16:28:11  gregory.p.smith  set  messages: + msg153877
2012-02-21 15:42:33  barry  set  messages: + msg153873
2012-02-21 15:33:37  barry  set  messages: + msg153872
2012-02-21 11:37:41  pitrou  set  messages: + msg153868
2012-02-21 09:50:25  lemburg  set  messages: + msg153862
2012-02-21 09:48:43  lemburg  set  messages: + msg153861
2012-02-21 09:47:32  haypo  set  messages: + msg153860
2012-02-21 06:47:35  georg.brandl  set  messages: + msg153854
2012-02-21 06:40:31  gregory.p.smith  set  messages: + msg153853
2012-02-21 06:35:48  gregory.p.smith  set  messages: + msg153852
2012-02-21 06:12:32  georg.brandl  set  messages: + msg153850
2012-02-21 06:03:38  eric.araujo  set  messages: + msg153849
2012-02-21 06:01:56  gregory.p.smith  set  messages: + msg153848
2012-02-21 01:44:32  python-dev  set  messages: + msg153833
2012-02-20 23:37:06  python-dev  set  messages: + msg153817
2012-02-20 20:41:43  python-dev  set  messages: + msg153802
2012-02-20 19:01:40  python-dev  set  nosy: + python-dev; messages: + msg153798
2012-02-20 01:31:03  benjamin.peterson  set  messages: + msg153753
2012-02-20 01:05:03  pitrou  set  messages: + msg153750
2012-02-19 10:22:02  eric.araujo  set  messages: + msg153695
2012-02-19 10:00:46  georg.brandl  set  files: + hash-patch-3.1-gb-03.patch; messages: + msg153690
2012-02-19 09:59:46  georg.brandl  set  files: - hash-patch-3.1-gb.patch
2012-02-19 09:21:53  georg.brandl  set  files: + hash-patch-3.1-gb.patch; messages: + msg153683
2012-02-19 09:21:32  georg.brandl  set  files: - hash-patch-3.1-gb.diff
2012-02-19 09:14:32  georg.brandl  set  files: + hash-patch-3.1-gb.diff; messages: + msg153682
2012-02-15 08:25:01  loewis  set  messages: + msg153395
2012-02-14 20:34:56  Jim.Jewett  set  messages: + msg153369
2012-02-13 20:50:09  lemburg  set  messages: + msg153301
2012-02-13 20:37:13  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch; messages: + msg153297
2012-02-12 02:11:27  gregory.p.smith  set  messages: + msg153144
2012-02-12 01:37:26  gregory.p.smith  set  messages: + msg153143
2012-02-11 23:09:26  dmalcolm  set  messages: + msg153141
2012-02-11 23:06:24  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch; messages: + msg153140
2012-02-10 23:49:17  Jim.Jewett  set  messages: + msg153082
2012-02-10 23:02:00  haypo  set  messages: + msg153081
2012-02-10 19:23:56  gregory.p.smith  set  messages: + msg153074
2012-02-10 15:30:01  benjamin.peterson  set  messages: + msg153055
2012-02-08 13:10:39  lemburg  set  messages: + msg152855
2012-02-07 15:41:36  dmalcolm  set  messages: + msg152811
2012-02-06 23:00:03  lemburg  set  messages: + msg152797
2012-02-06 22:07:39  alex  set  messages: + msg152789
2012-02-06 22:04:28  lemburg  set  messages: + msg152787
2012-02-06 21:53:17  dmalcolm  set  messages: + msg152784
2012-02-06 21:42:27  alex  set  messages: + msg152781
2012-02-06 21:41:04  lemburg  set  messages: + msg152780
2012-02-06 21:18:22  gregory.p.smith  set  messages: + msg152777
2012-02-06 20:24:15  lemburg  set  messages: + msg152769
2012-02-06 20:17:47  pitrou  set  messages: + msg152768
2012-02-06 20:14:40  lemburg  set  messages: + msg152767
2012-02-06 19:44:53  lemburg  set  messages: + msg152764
2012-02-06 19:34:15  Jim.Jewett  set  messages: + msg152763
2012-02-06 19:11:43  dmalcolm  set  messages: + msg152760
2012-02-06 19:07:45  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch, fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch, fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch; messages: + msg152758
2012-02-06 18:54:50  lemburg  set  messages: + msg152755
2012-02-06 18:53:40  fx5  set  messages: + msg152754
2012-02-06 18:31:37  Jim.Jewett  set  messages: + msg152753
2012-02-06 17:07:34  lemburg  set  messages: + msg152747
2012-02-06 15:47:07  Jim.Jewett  set  messages: + msg152740
2012-02-06 13:12:40  lemburg  set  messages: + msg152734
2012-02-06 12:22:27  pitrou  set  messages: + msg152732
2012-02-06 10:20:34  lemburg  set  messages: + msg152731
2012-02-06 09:53:02  haypo  set  messages: + msg152730
2012-02-06 06:11:13  loewis  set  messages: + msg152723
2012-02-02 01:30:44  haypo  set  messages: + msg152453
2012-02-02 01:18:27  dmalcolm  set  files: + add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch, fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch, add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch, fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch; messages: + msg152452
2012-02-01 03:29:15  dmalcolm  set  files: + results-16.txt; messages: + msg152422
2012-01-31 01:34:15  dmalcolm  set  files: + optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch; messages: + msg152364
2012-01-30 23:41:44  gz  set  messages: + msg152362
2012-01-30 22:22:46  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch; messages: + msg152352
2012-01-30 19:55:53  Jim.Jewett  set  messages: + msg152344
2012-01-30 17:31:17  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch; messages: + msg152335
2012-01-30 08:16:05  loewis  set  messages: + msg152315
2012-01-30 07:45:49  gregory.p.smith  set  messages: + msg152311
2012-01-30 07:15:04  zbysz  set  messages: + msg152309
2012-01-30 01:44:15  dmalcolm  set  files: + unnamed; messages: + msg152300
2012-01-30 01:39:23  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch; messages: + msg152299
2012-01-29 22:51:25  loewis  set  messages: + msg152276
2012-01-29 22:50:20  Mark.Shannon  set  messages: + msg152275
2012-01-29 22:39:15  Jim.Jewett  set  messages: + msg152271
2012-01-29 22:36:59  barry  set  messages: + msg152270
2012-01-29 00:06:29  dmalcolm  set  messages: + msg152204
2012-01-28 23:56:24  terry.reedy  set  messages: + msg152203
2012-01-28 23:24:41  pitrou  set  messages: + msg152200
2012-01-28 23:14:28  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch; messages: + msg152199
2012-01-28 20:05:10  benjamin.peterson  set  messages: + msg152186
2012-01-28 19:26:04  dmalcolm  set  messages: + msg152183
2012-01-28 05:13:39  dmalcolm  set  files: + optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch; messages: + msg152149
2012-01-28 03:03:11  benjamin.peterson  set  messages: + msg152146
2012-01-27 21:42:37  dmalcolm  set  messages: + msg152125
2012-01-27 21:02:34  loewis  set  messages: + msg152118
2012-01-27 20:59:39  skorgu  set  nosy: + skorgu
2012-01-27 20:25:13  pitrou  set  messages: + msg152117
2012-01-27 19:32:10  loewis  set  messages: + msg152112
2012-01-27 17:45:10  Jim.Jewett  set  messages: + msg152104
2012-01-27 08:42:52  loewis  set  messages: + msg152070
2012-01-27 06:25:19  gregory.p.smith  set  messages: + msg152066
2012-01-27 02:26:28  loewis  set  messages: + msg152060
2012-01-27 01:19:14  pitrou  set  messages: + msg152057
2012-01-26 23:43:50  loewis  set  messages: + msg152051
2012-01-26 23:22:32  alex  set  messages: + msg152046
2012-01-26 23:03:35  loewis  set  messages: + msg152043
2012-01-26 22:43:57  alex  set  messages: + msg152041
2012-01-26 22:42:24  loewis  set  messages: + msg152040
2012-01-26 22:34:51  loewis  set  messages: + msg152039
2012-01-26 22:13:19  dmalcolm  set  messages: + msg152037
2012-01-26 21:04:28  alex  set  messages: + msg152033
2012-01-26 21:00:16  loewis  set  nosy: + loewis; messages: + msg152030
2012-01-25 23:14:03  pitrou  set  messages: + msg151984
2012-01-25 21:34:39  fx5  set  messages: + msg151977
2012-01-25 20:23:40  dmalcolm  set  messages: + msg151973
2012-01-25 19:28:09  pitrou  set  messages: + msg151970
2012-01-25 19:19:31  dmalcolm  set  messages: + msg151967
2012-01-25 19:13:06  pitrou  set  messages: + msg151966
2012-01-25 19:04:24  Jim.Jewett  set  messages: + msg151965
2012-01-25 18:29:41  Jim.Jewett  set  messages: + msg151961
2012-01-25 18:14:07  Jim.Jewett  set  messages: + msg151960
2012-01-25 18:05:26  pitrou  set  messages: + msg151959
2012-01-25 17:49:18  dmalcolm  set  files: + hybrid-approach-dmalcolm-2012-01-25-002.patch; messages: + msg151956
2012-01-25 13:12:24  fx5  set  messages: + msg151944
2012-01-25 12:47:36  alex  set  messages: + msg151942
2012-01-25 12:45:34  dmalcolm  set  messages: + msg151941
2012-01-25 11:06:01  dmalcolm  set  files: + hybrid-approach-dmalcolm-2012-01-25-001.patch; messages: + msg151939
2012-01-24 00:44:44  gregory.p.smith  set  messages: + msg151870
2012-01-24 00:42:45  Jim.Jewett  set  messages: + msg151869
2012-01-24 00:14:31  PaulMcMillan  set  messages: + msg151867
2012-01-23 21:39:31  lemburg  set  messages: + msg151850
2012-01-23 21:31:59  dmalcolm  set  files: + backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch; messages: + msg151847
2012-01-23 16:45:03  lemburg  set  messages: + msg151826
2012-01-23 16:43:24  lemburg  set  files: + hash-attack-3.patch, integercollision.py; messages: + msg151825
2012-01-23 13:56:33  pitrou  set  messages: + msg151815
2012-01-23 13:40:27  pitrou  set  messages: + msg151814
2012-01-23 13:38:26  lemburg  set  messages: + msg151813
2012-01-23 13:07:25  lemburg  set  files: + hash-attack-2.patch; messages: + msg151812
2012-01-23 04:04:42  dmalcolm  set  messages: + msg151798
2012-01-23 03:48:50  dmalcolm  set  messages: + msg151796
2012-01-23 00:22:36  haypo  set  messages: + msg151794
2012-01-22 11:40:31  haypo  set  files: - random-5.patch
2012-01-22 11:40:30  haypo  set  files: - random-7.patch
2012-01-22 11:40:16  haypo  set  files: - random-fix_tests.patch
2012-01-22 11:40:12  haypo  set  files: - random-6.patch
2012-01-22 03:43:37  PaulMcMillan  set  messages: + msg151758
2012-01-22 02:13:47  dmalcolm  set  messages: + msg151756
2012-01-21 23:47:57  alex  set  messages: + msg151754
2012-01-21 23:42:30  gregory.p.smith  set  messages: + msg151753
2012-01-21 22:45:29  pitrou  set  messages: + msg151748
2012-01-21 22:41:58  dmalcolm  set  messages: + msg151747
2012-01-21 22:20:47  pitrou  set  messages: + msg151745
2012-01-21 21:07:41  dmalcolm  set  messages: + msg151744
2012-01-21 18:57:38  pitrou  set  messages: + msg151739
2012-01-21 17:07:56  dmalcolm  set  messages: + msg151737
2012-01-21 17:02:55  dmalcolm  set  files: + amortized-probe-counting-dmalcolm-2012-01-21-003.patch; messages: + msg151735
2012-01-21 15:36:01  zbysz  set  messages: + msg151734
2012-01-21 14:27:09  pitrou  set  messages: + msg151731
2012-01-21 03:16:24  dmalcolm  set  files: + amortized-probe-counting-dmalcolm-2012-01-20-002.patch; messages: + msg151714
2012-01-20 22:55:15  dmalcolm  set  files: + hash-collision-counting-dmalcolm-2012-01-20-001.patch; messages: + msg151707
2012-01-20 18:11:34  haypo  set  messages: + msg151703
2012-01-20 17:42:07  Jim.Jewett  set  messages: + msg151701
2012-01-20 17:39:25  gvanrossum  set  messages: + msg151700
2012-01-20 17:31:08  Jim.Jewett  set  messages: + msg151699
2012-01-20 14:42:49  neologix  set  messages: + msg151691
2012-01-20 12:58:04  haypo  set  messages: + msg151689
2012-01-20 11:17:32  lemburg  set  messages: + msg151685
2012-01-20 10:52:35  neologix  set  messages: + msg151684
2012-01-20 10:43:09  fx5  set  messages: + msg151682
2012-01-20 10:39:46  neologix  set  messages: + msg151681
2012-01-20 09:30:41  fx5  set  messages: + msg151680
2012-01-20 09:03:16  neologix  set  messages: + msg151679
2012-01-20 04:58:36  fx5  set  messages: + msg151677
2012-01-20 01:11:24  haypo  set  messages: + msg151664
2012-01-20 00:38:01  lemburg  set  messages: + msg151662
2012-01-19 18:05:52  fx5  set  messages: + msg151647
2012-01-19 15:13:20  lemburg  set  messages: + msg151633
2012-01-19 15:11:54  lemburg  set  messages: + msg151632
2012-01-19 14:43:52  alex  set  messages: + msg151629
2012-01-19 14:37:53  lemburg  set  messages: + msg151628
2012-01-19 14:31:43  pitrou  set  messages: + msg151626
2012-01-19 14:27:36  lemburg  set  messages: + msg151625
2012-01-19 13:13:42  haypo  set  messages: + msg151620
2012-01-19 13:03:16  eric.araujo  set  messages: + msg151617
2012-01-19 01:15:24  terry.reedy  set  messages: + msg151604
2012-01-19 00:46:44  gvanrossum  set  messages: + msg151596
2012-01-18 23:46:12  pitrou  set  messages: + msg151590
2012-01-18 23:44:34  gvanrossum  set  messages: + msg151589
2012-01-18 23:37:47  terry.reedy  set  messages: + msg151586
2012-01-18 23:31:25  gregory.p.smith  set  messages: + msg151585
2012-01-18 23:30:12  pitrou  set  messages: + msg151584
2012-01-18 23:25:37  gregory.p.smith  set  messages: + msg151583
2012-01-18 23:23:12  terry.reedy  set  messages: + msg151582
2012-01-18 22:52:46  haypo  set  messages: + msg151574
2012-01-18 21:14:01  pitrou  set  messages: + msg151567
2012-01-18 21:10:50  gvanrossum  set  messages: + msg151566
2012-01-18 21:05:30  pitrou  set  messages: + msg151565
2012-01-18 19:08:19  gvanrossum  set  messages: + msg151561
2012-01-18 18:59:56  lemburg  set  messages: + msg151560
2012-01-18 10:01:42  haypo  set  messages: + msg151528
2012-01-18 06:16:55  gregory.p.smith  set  nosy: + gregory.p.smith; messages: + msg151519
2012-01-17 19:59:51  Jim.Jewett  set  nosy: + Jim.Jewett; messages: + msg151484
2012-01-17 16:46:05  eric.araujo  set  messages: + msg151474
2012-01-17 16:35:50  haypo  set  messages: + msg151472
2012-01-17 16:23:22  eric.araujo  set  messages: + msg151468
2012-01-17 12:36:34  haypo  set  messages: + msg151449
2012-01-17 12:21:33  haypo  set  files: + random-8.patch; messages: + msg151448
2012-01-17 02:10:41  haypo  set  messages: + msg151422
2012-01-17 01:57:17  haypo  set  files: + random-fix_tests.patch
2012-01-17 01:53:50  haypo  set  files: + random-7.patch; messages: + msg151419
2012-01-16 18:58:52  lemburg  set  messages: + msg151402
2012-01-16 18:29:00  eric.snow  set  nosy: + eric.snow; messages: + msg151401
2012-01-16 12:45:16  haypo  set  messages: + msg151353
2012-01-13 10:17:28  zbysz  set  messages: + msg151167
2012-01-13 00:48:55  haypo  set  messages: + msg151159
2012-01-13 00:36:06  haypo  set  files: + bench_startup.py; messages: + msg151158
2012-01-13 00:08:23  haypo  set  files: + random-6.patch; messages: + msg151157
2012-01-12 10:02:06  grahamd  set  nosy: + grahamd; messages: + msg151122
2012-01-12 09:27:35  lemburg  set  messages: + msg151121
2012-01-12 08:53:23  fx5  set  nosy: + fx5; messages: + msg151120
2012-01-11 21:46:12  neologix  set  nosy: + neologix; messages: + msg151092
2012-01-11 19:07:03  pitrou  set  messages: + msg151078
2012-01-11 18:18:16  pitrou  set  messages: + msg151074
2012-01-11 18:05:28  lemburg  set  messages: + msg151073
2012-01-11 17:38:10  lemburg  set  messages: + msg151071
2012-01-11 17:34:32  mark.dickinson  set  nosy: + mark.dickinson; messages: + msg151070
2012-01-11 17:28:00  pitrou  set  messages: + msg151069
2012-01-11 16:03:19  lemburg  set  messages: + msg151065
2012-01-11 15:41:09  lemburg  set  messages: + msg151064
2012-01-11 14:55:54  Mark.Shannon  set  messages: + msg151063
2012-01-11 14:45:34  pitrou  set  messages: + msg151062
2012-01-11 14:34:17  lemburg  set  messages: + msg151061
2012-01-11 09:56:11  haypo  set  messages: + msg151048
2012-01-11 09:28:30  lemburg  set  messages: + msg151047
2012-01-10 23:07:45  haypo  set  files: - random-4.patch
2012-01-10 23:07:43  haypo  set  files: - random-3.patch
2012-01-10 23:07:40  haypo  set  files: - random-2.patch
2012-01-10 23:07:37  haypo  set  files: - random.patch
2012-01-10 23:07:08  haypo  set  files: + random-5.patch; messages: + msg151033
2012-01-10 22:15:05  haypo  set  files: + random-4.patch; messages: + msg151031
2012-01-10 14:26:57  pitrou  set  messages: + msg151017
2012-01-10 11:37:59  haypo  set  files: + random-3.patch; messages: + msg151012
2012-01-09 18:21:49  terry.reedy  set  messages: - msg150846
2012-01-09 12:16:13  lemburg  set  messages: + msg150934
2012-01-09 09:35:36  zbysz  set  nosy: + zbysz
2012-01-08 14:26:13  pitrou  set  messages: + msg150866
2012-01-08 14:23:09  pitrou  set  messages: + msg150865
2012-01-08 12:36:35  terry.reedy  set  messages: - msg150837
2012-01-08 12:35:48  terry.reedy  set  messages: - msg150848
2012-01-08 11:47:18  lemburg  set  messages: + msg150859
2012-01-08 11:33:27  lemburg  set  messages: + msg150857
2012-01-08 10:20:27  PaulMcMillan  set  messages: + msg150856
2012-01-08 05:55:10  v+python  set  messages: + msg150848
2012-01-08 05:37:00  christian.heimes  set  messages: + msg150847
2012-01-08 05:18:55  v+python  set  messages: + msg150846
2012-01-08 02:40:41  PaulMcMillan  set  messages: + msg150840
2012-01-08 00:32:59  v+python  set  messages: + msg150837
2012-01-08 00:21:48  alex  set  messages: + msg150836
2012-01-08 00:19:15  v+python  set  files: + SafeDict.py; messages: + msg150835
2012-01-07 23:53:44  gz  set  nosy: + gz; messages: + msg150832
2012-01-07 23:24:48  tim.peters  set  nosy: + tim.peters; messages: + msg150829
2012-01-07 13:17:34  lemburg  set  messages: + msg150795
2012-01-06 22:03:46  skrah  set  nosy: + skrah
2012-01-06 21:53:34  pitrou  set  messages: + msg150771
2012-01-06 20:56:41  PaulMcMillan  set  messages: + msg150769
2012-01-06 20:50:22  terry.reedy  set  messages: + msg150768
2012-01-06 20:48:08  Arach  set  nosy: + Arach
2012-01-06 19:53:31  PaulMcMillan  set  messages: + msg150766
2012-01-06 17:59:39  lemburg  set  messages: + msg150756
2012-01-06 17:03:08  lemburg  set  messages: + msg150748
2012-01-06 16:35:04  haypo  set  messages: + msg150738
2012-01-06 12:56:48  lemburg  set  messages: + msg150727
2012-01-06 12:56:08  lemburg  set  messages: + msg150726
2012-01-06 12:52:20  lemburg  set  files: + hash-attack.patch; messages: + msg150725
2012-01-06 12:49:16  lemburg  set  messages: + msg150724
2012-01-06 09:31:12  Mark.Shannon  set  messages: + msg150719
2012-01-06 09:08:10  Mark.Shannon  set  messages: + msg150718
2012-01-06 02:57:40  terry.reedy  set  messages: + msg150713
2012-01-06 02:50:28  PaulMcMillan  set  messages: + msg150712
2012-01-06 01:50:07  alex  set  messages: + msg150708
2012-01-06 01:44:17  christian.heimes  set  messages: + msg150707
2012-01-06 01:09:47  haypo  set  messages: + msg150706
2012-01-06 00:23:10  haypo  set  messages: + msg150702
2012-01-05 22:49:32  haypo  set  messages: + msg150699
2012-01-05 21:40:03  PaulMcMillan  set  messages: + msg150694
2012-01-05 20:21:21  v+python  set  nosy: + v+python
2012-01-05 12:41:26  pitrou  set  messages: + msg150668
2012-01-05 10:41:40  Mark.Shannon  set  messages: + msg150665
2012-01-05 10:20:26  christian.heimes  set  messages: + msg150662
2012-01-05 09:43:32  Mark.Shannon  set  messages: + msg150659
2012-01-05 09:01:14  lemburg  set  messages: + msg150656
2012-01-05 06:25:12  Huzaifa.Sidhpurwala  set  nosy: + Huzaifa.Sidhpurwala; messages: + msg150655
2012-01-05 01:17:03  christian.heimes  set  messages: + msg150652
2012-01-05 01:09:05  haypo  set  files: + random-2.patch; messages: + msg150651
2012-01-05 01:05:58  haypo  set  messages: + msg150650
2012-01-05 00:58:43  haypo  set  messages: + msg150649
2012-01-05 00:58:38  christian.heimes  set  messages: + msg150648
2012-01-05 00:57:04  PaulMcMillan  set  messages: + msg150647
2012-01-05 00:53:57  christian.heimes  set  messages: + msg150646
2012-01-05 00:49:03  haypo  set  messages: + msg150645
2012-01-05 00:44:25  PaulMcMillan  set  messages: + msg150644
2012-01-05 00:39:32  pitrou  set  messages: + msg150643
2012-01-05 00:36:51  christian.heimes  set  messages: + msg150642
2012-01-05 00:36:10  haypo  set  messages: + msg150641
2012-01-05 00:31:43  PaulMcMillan  set  messages: + msg150639
2012-01-05 00:11:02  haypo  set  messages: + msg150638
2012-01-05 00:02:47  haypo  set  messages: + msg150637
2012-01-05 00:01:02  pitrou  set  messages: + msg150636
2012-01-04 23:54:25  haypo  set  messages: + msg150635
2012-01-04 23:42:51  haypo  set  files: + random.patch; keywords: + patch; messages: + msg150634
2012-01-04 17:58:10  lemburg  set  messages: + msg150625
2012-01-04 17:44:50  alex  set  messages: + msg150622
2012-01-04 17:41:21  terry.reedy  set  messages: + msg150621
2012-01-04 17:22:42  lemburg  set  messages: + msg150620
2012-01-04 17:18:30  lemburg  set  messages: + msg150619
2012-01-04 16:42:05  lemburg  set  nosy: + lemburg; messages: + msg150616
2012-01-04 15:08:36  barry  set  messages: + msg150613
2012-01-04 14:52:27  eric.araujo  set  nosy: + eric.araujo; messages: + msg150609
2012-01-04 11:02:59  pitrou  set  messages: + msg150601
2012-01-04 09:59:35  Mark.Shannon  set  nosy: + Mark.Shannon
2012-01-04 06:00:38  PaulMcMillan  set  messages: + msg150592
2012-01-04 05:09:59  haypo  set  messages: + msg150589
2012-01-04 05:00:38  jcea  set  nosy: + jcea
2012-01-04 03:08:14  haypo  set  messages: + msg150577
2012-01-04 02:16:27  Zhiping.Deng  set  nosy: + Zhiping.Deng
2012-01-04 02:14:54  pitrou  set  messages: + msg150570
2012-01-04 01:58:04  pitrou  set  messages: + msg150569
2012-01-04 01:54:52  haypo  set  messages: + msg150568
2012-01-04 01:30:01  pitrou  set  messages: + msg150565
2012-01-04 01:00:55  haypo  set  messages: + msg150563
2012-01-04 00:55:05  terry.reedy  set  nosy: + terry.reedy; messages: + msg150562
2012-01-04 00:38:29  christian.heimes  set  messages: + msg150560
2012-01-04 00:33:10  Arfrever  set  nosy: + Arfrever
2012-01-04 00:22:36  haypo  set  messages: + msg150559
2012-01-03 23:52:47  PaulMcMillan  set  nosy: + PaulMcMillan; messages: + msg150558
2012-01-03 22:19:51  alex  set  nosy: + alex
2012-01-03 22:08:19  christian.heimes  set  messages: + msg150543
2012-01-03 22:02:45  barry  set  messages: + msg150541
2012-01-03 21:48:21  dmalcolm  set  nosy: + dmalcolm
2012-01-03 21:43:34  benjamin.peterson  set  messages: + msg150534
2012-01-03 21:20:59  haypo  set  messages: + msg150533
2012-01-03 20:56:19  haypo  set  nosy: + haypo
2012-01-03 20:49:39  barry  set  messages: + msg150532
2012-01-03 20:47:44  gvanrossum  set  messages: + msg150531
2012-01-03 20:31:16  christian.heimes  set  dependencies: + Random number generator in Python core; messages: + msg150529; stage: needs patch
2012-01-03 20:24:32  pitrou  set  messages: + msg150526
2012-01-03 20:19:25  christian.heimes  set  messages: + msg150525; stage: needs patch -> (no value)
2012-01-03 19:48:53  pitrou  set  nosy: + pitrou, christian.heimes; stage: needs patch
2012-01-03 19:43:25  gvanrossum  set  nosy: + gvanrossum
2012-01-03 19:36:49  barry  create