Author Jim.Jewett
Recipients Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date 2012-01-25.18:14:06
SpamBayes Score 0.0
Marked as misclassified No
Message-id <>
In-reply-to <>
On Wed, Jan 25, 2012 at 6:06 AM, Dave Malcolm <>
added the comment:

>  hybrid-approach-dmalcolm-2012-01-25-001.patch

> As per haypo's random-8.patch, a randomization seed is read at startup.

Why not wait until it is needed?  I suspect a lot of scripts will
never need it for any dict, so why add the overhead to startup?

> Once a dict has transitioned to paranoid mode, it isn't using
> PyObject_Hash anymore, and thus isn't using cached object values

The alternative hashes could be stored in an id-keyed dict

 performing a more expensive calculation, but I believe this
calculation is essentially constant-time.
> This preserves hash() and dict order for the cases where you're not under attack, and gracefully handles the attack without having to raise an exception: it doesn't introduce any new exception types.
> It preserves ABI, assuming no-one else is reusing ma_smalltable.
> It is suitable for backporting to 3.2, 2.7, and earlier (I'm investigating fixing this going all the way back to Python 2.2)
> Under the old implementation, there were 4 types of PyDictObject, given these two booleans:
>  * "small vs large" i.e ma_table == ma_smalltable vs ma_table != ma_smalltable
>  * "all keys are str" vs arbitary keys i.e ma_lookdict == lookdict_unicode vs lookdict
> Under this implementation, this doubles to 8 kinds, adding the boolean:
>  * normal hash vs randomized hash (normal vs "paranoid").
> This is expressed via the ma_lookdict callback, adding two new variants, lookdict_unicode_paranoid and lookdict_paranoid
> Note that if a paranoid dict goes small again (ma_table == ma_smalltable), it stays paranoid.  This is for simplicity: it avoids having to rebuild all of the non-randomized me_hash values again (which could fail).
> Naturally the patch adds selftests.  I had to add some diagnostic methods to support them; dict gains _stats() and _make_paranoid() methods, and sys gains a _getrandomizedhash() method.  These could be hidden more thoroughly if need be (see DICT_PROTECTION_TRACKING in dictobject.c).  Amongst other things, the selftests measure wallclock time taken for various dict operations (and so might introduce failures on a heavily-loaded machine, I guess).
> Hopefully this approach is a viable way forward.
> Caveats and TODO items:
> TODO: I haven't yet tuned the safety threshold.  According to
>> slot collisions are much more frequent than
>> hash collisions, which only account for less than 0.01% of all
>> collisions.
>> It also shows that slot collisions in the low 1-10 range are
>> most frequent, with very few instances of a dict lookup
>> reaching 20 slot collisions (less than 0.0002% of all
>> collisions).
> This suggests that the threshold of 32 slot/hash collisions per lookup may already be high enough.
> TODO: in a review of an earlier version of the complexity detection idea, Antoine Pitrou suggested that make the protection scale factor be a run-time configurable value, rather than a #define.  This isn't done yet.
> TODO: run more extensive tests (e.g. Django and Twisted), monitoring the worst-case complexity that's encountered
> TODO: not yet benchmarked and optimized.  I want to get feedback on the approach before I go in and hand-optimize things (e.g. by hand-inlining check_iter_count, and moving the calculations out of the loop etc).  I believe any performance issues ought to be fixable, in that the we can get the cost of this for the "we're not under attack" case to be negligible, and the "under attack" case should transition from O(N^2) to O(N), albeit it with a larger constant factor.
> TODO: this doesn't cover sets, but assuming this approach works, the patch can be extended to cover it in an analogous way.
> TODO: should it cover PyMemoryViewObject, buffer object, etc?
> TODO: should it cover the hashing in Modules/expat/xmlparse.c?  FWIW I rip this code out when doing my downstream builds in RHEL and Fedora, and instead dynamically link against a system copy of expat
> TODO: only tested on Linux so far (which is all I've got).  Fedora 15 x86_64 fwiw
>  Doc/using/cmdline.rst     |    6
>  Include/bytesobject.h     |    2
>  Include/object.h          |    8
>  Include/pythonrun.h       |    2
>  Include/unicodeobject.h   |    2
>  Lib/                 |   17 --
>  Lib/test/      |    5
>  Lib/test/     |  298 +++++++++++++++++++++++++++++++++++++
>  Lib/test/     |   53 ++++++
>  Lib/test/       |   35 +++-
>           |    1
>  Modules/posixmodule.c     |  126 ++-------------
>  Objects/bytesobject.c     |    7
>  Objects/dictobject.c      |  369 +++++++++++++++++++++++++++++++++++++++++++++-
>  Objects/object.c          |   37 ++++
>  Objects/unicodeobject.c   |   51 ++++++
>  PCbuild/pythoncore.vcproj |    4
>  Python/pythonrun.c        |    3
>  Python/sysmodule.c        |   16 +
>  b/Python/random.c         |  268 +++++++++++++++++++++++++++++++++
>  20 files changed, 1173 insertions(+), 137 deletions(-)
> ----------
> Added file:
> _______________________________________
> Python tracker <>
> <>
> _______________________________________
Date User Action Args
2012-01-25 18:14:07Jim.Jewettsetrecipients: + Jim.Jewett, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, PaulMcMillan, fx5
2012-01-25 18:14:07Jim.Jewettlinkissue13703 messages
2012-01-25 18:14:06Jim.Jewettcreate