Message 151939 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dmalcolm
Recipients	Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date	2012-01-25.11:05:43
SpamBayes Score	4.440892e-16
Marked as misclassified	No
Message-id	<1327489561.69.0.280670857261.issue13703@psf.upfronthosting.co.za>
In-reply-to

Content
I'm attaching a patch which implements a hybrid approach: hybrid-approach-dmalcolm-2012-01-25-001.patch This is a blend of various approaches from the discussion, taking aspects of both hash randomization and collision-counting. It incorporates code from amortized-probe-counting-dmalcolm-2012-01-21-003.patch backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch random-8.patch along with ideas from: http://mail.python.org/pipermail/python-dev/2012-January/115812.html The patch is against the default branch (although my primary goal here is eventual backporting). As per haypo's random-8.patch, a randomization seed is read at startup. By default, the existing hash() values are preserved, and no randomization is performed until a dict comes under attack. This preserves existing behaviors (such as dict ordering) under non-attack conditions. For large dictionaries, it reuses the ma_smalltable area to track the amortized cost of all modifications to this dictionary. When the cost exceeds a set threshold, we convert the dictionary's ma_lookup function from lookdict/lookdict_unicode to a "paranoid" variant. These variants ignore the hash passed in, and instead uses a new function: PyObject_RandomizedHash(obj) to give a second hash value, which is fixed value for a given object within the process, but not predictable to an attacker for the most high-risk types (PyUnicodeObject and PyBytesObject). This patch is intended as a base for backporting, and takes it as given that we can't expand PyTypeObject or hide something in one of the PyMethods tables; iirc we've run out of tp_flags in 2., hence we're forced to implement PyObject_RandomizedHash via direct ob_type comparison, for the most high-risk types. As noted in http://bugs.python.org/issue13703#msg151870: > I would NOT worry about hash repeatability for integers and > complex data structures. It is not at the core of the common problem > (maybe a couple application specific problems but not a general "all > python web apps" severity problem). > Doing it for base byte string and unicode string like objects is > sufficient. [We can of course implement hash randomization by default in 3.3, but I care more about getting a fix into the released branches ASAP] Upon transition of a dict to paranoid mode, the hash values become unpredictable to an attacker, and all PyDictEntries are rebuilt based on the new hash values. Handling the awkward case within custom ma_lookup functions allows us to move most of the patch from out of the fast path, and lookdict/lookdict_unicode only need minimal changes (stat gathering for the above cost analysis tracking). Once a dict has transitioned to paranoid mode, it isn't using PyObject_Hash anymore, and thus isn't using cached object values, performing a more expensive calculation, but I believe this calculation is essentially constant-time. This preserves hash() and dict order for the cases where you're not under attack, and gracefully handles the attack without having to raise an exception: it doesn't introduce any new exception types. It preserves ABI, assuming no-one else is reusing ma_smalltable. It is suitable for backporting to 3.2, 2.7, and earlier (I'm investigating fixing this going all the way back to Python 2.2) Under the old implementation, there were 4 types of PyDictObject, given these two booleans: * "small vs large" i.e ma_table == ma_smalltable vs ma_table != ma_smalltable * "all keys are str" vs arbitary keys i.e ma_lookdict == lookdict_unicode vs lookdict Under this implementation, this doubles to 8 kinds, adding the boolean: * normal hash vs randomized hash (normal vs "paranoid"). This is expressed via the ma_lookdict callback, adding two new variants, lookdict_unicode_paranoid and lookdict_paranoid Note that if a paranoid dict goes small again (ma_table == ma_smalltable), it stays paranoid. This is for simplicity: it avoids having to rebuild all of the non-randomized me_hash values again (which could fail). Naturally the patch adds selftests. I had to add some diagnostic methods to support them; dict gains _stats() and _make_paranoid() methods, and sys gains a _getrandomizedhash() method. These could be hidden more thoroughly if need be (see DICT_PROTECTION_TRACKING in dictobject.c). Amongst other things, the selftests measure wallclock time taken for various dict operations (and so might introduce failures on a heavily-loaded machine, I guess). Hopefully this approach is a viable way forward. Caveats and TODO items: TODO: I haven't yet tuned the safety threshold. According to http://bugs.python.org/issue13703#msg151850: > slot collisions are much more frequent than > hash collisions, which only account for less than 0.01% of all > collisions. > > It also shows that slot collisions in the low 1-10 range are > most frequent, with very few instances of a dict lookup > reaching 20 slot collisions (less than 0.0002% of all > collisions). This suggests that the threshold of 32 slot/hash collisions per lookup may already be high enough. TODO: in a review of an earlier version of the complexity detection idea, Antoine Pitrou suggested that make the protection scale factor be a run-time configurable value, rather than a #define. This isn't done yet. TODO: run more extensive tests (e.g. Django and Twisted), monitoring the worst-case complexity that's encountered TODO: not yet benchmarked and optimized. I want to get feedback on the approach before I go in and hand-optimize things (e.g. by hand-inlining check_iter_count, and moving the calculations out of the loop etc). I believe any performance issues ought to be fixable, in that the we can get the cost of this for the "we're not under attack" case to be negligible, and the "under attack" case should transition from O(N^2) to O(N), albeit it with a larger constant factor. TODO: this doesn't cover sets, but assuming this approach works, the patch can be extended to cover it in an analogous way. TODO: should it cover PyMemoryViewObject, buffer object, etc? TODO: should it cover the hashing in Modules/expat/xmlparse.c? FWIW I rip this code out when doing my downstream builds in RHEL and Fedora, and instead dynamically link against a system copy of expat TODO: only tested on Linux so far (which is all I've got). Fedora 15 x86_64 fwiw Doc/using/cmdline.rst \| 6 Include/bytesobject.h \| 2 Include/object.h \| 8 Include/pythonrun.h \| 2 Include/unicodeobject.h \| 2 Lib/os.py \| 17 -- Lib/test/regrtest.py \| 5 Lib/test/test_dict.py \| 298 +++++++++++++++++++++++++++++++++++++ Lib/test/test_hash.py \| 53 ++++++ Lib/test/test_os.py \| 35 +++- Makefile.pre.in \| 1 Modules/posixmodule.c \| 126 ++------------- Objects/bytesobject.c \| 7 Objects/dictobject.c \| 369 +++++++++++++++++++++++++++++++++++++++++++++- Objects/object.c \| 37 ++++ Objects/unicodeobject.c \| 51 ++++++ PCbuild/pythoncore.vcproj \| 4 Python/pythonrun.c \| 3 Python/sysmodule.c \| 16 + b/Python/random.c \| 268 +++++++++++++++++++++++++++++++++ 20 files changed, 1173 insertions(+), 137 deletions(-)

I'm attaching a patch which implements a hybrid approach:
  hybrid-approach-dmalcolm-2012-01-25-001.patch

This is a blend of various approaches from the discussion, taking aspects of both hash randomization *and* collision-counting.

It incorporates code from
  amortized-probe-counting-dmalcolm-2012-01-21-003.patch
  backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch
  random-8.patch
along with ideas from:
  http://mail.python.org/pipermail/python-dev/2012-January/115812.html

The patch is against the default branch (although my primary goal here is eventual backporting).

As per haypo's random-8.patch, a randomization seed is read at startup.

By default, the existing hash() values are preserved, and no randomization is performed until a dict comes under attack.  This preserves existing behaviors (such as dict ordering) under non-attack conditions.

For large dictionaries, it reuses the ma_smalltable area to track the amortized cost of all modifications to this dictionary.

When the cost exceeds a set threshold, we convert the dictionary's ma_lookup function from lookdict/lookdict_unicode to a "paranoid" variant.  These variants ignore the hash passed in, and instead uses a new function:
   PyObject_RandomizedHash(obj)
to give a second hash value, which is fixed value for a given object within the process, but not predictable to an attacker for the most high-risk types (PyUnicodeObject and PyBytesObject).

This patch is intended as a base for backporting, and takes it as given that we can't expand PyTypeObject or hide something in one of the Py*Methods tables; iirc we've run out of tp_flags in 2.*, hence we're forced to implement PyObject_RandomizedHash via direct ob_type comparison, for the most high-risk types.  

As noted in http://bugs.python.org/issue13703#msg151870:

> I would NOT worry about hash repeatability for integers and
> complex data structures.  It is not at the core of the common problem
> (maybe a couple application specific problems but not a general "all
> python web apps" severity problem).

> Doing it for base byte string and unicode string like objects is
> sufficient.

[We can of course implement hash randomization by default in 3.3, but I care more about getting a fix into the released branches ASAP]

Upon transition of a dict to paranoid mode, the hash values become unpredictable to an attacker, and all PyDictEntries are rebuilt based on the new hash values.

Handling the awkward case within custom ma_lookup functions allows us to move most of the patch from out of the fast path, and lookdict/lookdict_unicode only need minimal changes (stat gathering for the above cost analysis tracking).

Once a dict has transitioned to paranoid mode, it isn't using PyObject_Hash anymore, and thus isn't using cached object values, performing a more expensive calculation, but I believe this calculation is essentially constant-time.

This preserves hash() and dict order for the cases where you're not under attack, and gracefully handles the attack without having to raise an exception: it doesn't introduce any new exception types.

It preserves ABI, assuming no-one else is reusing ma_smalltable.

It is suitable for backporting to 3.2, 2.7, and earlier (I'm investigating fixing this going all the way back to Python 2.2)

Under the old implementation, there were 4 types of PyDictObject, given these two booleans:
  * "small vs large" i.e ma_table == ma_smalltable vs ma_table != ma_smalltable
  * "all keys are str" vs arbitary keys i.e ma_lookdict == lookdict_unicode vs lookdict

Under this implementation, this doubles to 8 kinds, adding the boolean:
  * normal hash vs randomized hash (normal vs "paranoid").

This is expressed via the ma_lookdict callback, adding two new variants, lookdict_unicode_paranoid and lookdict_paranoid

Note that if a paranoid dict goes small again (ma_table == ma_smalltable), it stays paranoid.  This is for simplicity: it avoids having to rebuild all of the non-randomized me_hash values again (which could fail).

Naturally the patch adds selftests.  I had to add some diagnostic methods to support them; dict gains _stats() and _make_paranoid() methods, and sys gains a _getrandomizedhash() method.  These could be hidden more thoroughly if need be (see DICT_PROTECTION_TRACKING in dictobject.c).  Amongst other things, the selftests measure wallclock time taken for various dict operations (and so might introduce failures on a heavily-loaded machine, I guess).

Hopefully this approach is a viable way forward.

Caveats and TODO items:

TODO: I haven't yet tuned the safety threshold.  According to http://bugs.python.org/issue13703#msg151850:
> slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.
>
> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).

This suggests that the threshold of 32 slot/hash collisions per lookup may already be high enough.

TODO: in a review of an earlier version of the complexity detection idea, Antoine Pitrou suggested that make the protection scale factor be a run-time configurable value, rather than a #define.  This isn't done yet.

TODO: run more extensive tests (e.g. Django and Twisted), monitoring the worst-case complexity that's encountered

TODO: not yet benchmarked and optimized.  I want to get feedback on the approach before I go in and hand-optimize things (e.g. by hand-inlining check_iter_count, and moving the calculations out of the loop etc).  I believe any performance issues ought to be fixable, in that the we can get the cost of this for the "we're not under attack" case to be negligible, and the "under attack" case should transition from O(N^2) to O(N), albeit it with a larger constant factor.

TODO: this doesn't cover sets, but assuming this approach works, the patch can be extended to cover it in an analogous way.

TODO: should it cover PyMemoryViewObject, buffer object, etc?

TODO: should it cover the hashing in Modules/expat/xmlparse.c?  FWIW I rip this code out when doing my downstream builds in RHEL and Fedora, and instead dynamically link against a system copy of expat

TODO: only tested on Linux so far (which is all I've got).  Fedora 15 x86_64 fwiw

 Doc/using/cmdline.rst     |    6 
 Include/bytesobject.h     |    2 
 Include/object.h          |    8 
 Include/pythonrun.h       |    2 
 Include/unicodeobject.h   |    2 
 Lib/os.py                 |   17 --
 Lib/test/regrtest.py      |    5 
 Lib/test/test_dict.py     |  298 +++++++++++++++++++++++++++++++++++++
 Lib/test/test_hash.py     |   53 ++++++
 Lib/test/test_os.py       |   35 +++-
 Makefile.pre.in           |    1 
 Modules/posixmodule.c     |  126 ++-------------
 Objects/bytesobject.c     |    7 
 Objects/dictobject.c      |  369 +++++++++++++++++++++++++++++++++++++++++++++-
 Objects/object.c          |   37 ++++
 Objects/unicodeobject.c   |   51 ++++++
 PCbuild/pythoncore.vcproj |    4 
 Python/pythonrun.c        |    3 
 Python/sysmodule.c        |   16 +
 b/Python/random.c         |  268 +++++++++++++++++++++++++++++++++
 20 files changed, 1173 insertions(+), 137 deletions(-)

History
Date	User	Action	Args
2012-01-25 11:06:04	dmalcolm	set	recipients: + dmalcolm, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, Jim.Jewett, PaulMcMillan, fx5
2012-01-25 11:06:01	dmalcolm	set	messageid: <1327489561.69.0.280670857261.issue13703@psf.upfronthosting.co.za>
2012-01-25 11:06:01	dmalcolm	link	issue13703 messages
2012-01-25 11:05:56	dmalcolm	create