classification
Title: dict RuntimeError workaround
Type: enhancement Stage: resolved
Components: Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Jim.Jewett, asvetlov, gregory.p.smith, gvanrossum, haypo, ncoghlan, pitrou, python-dev, r.david.murray, rhettinger, skrah
Priority: normal Keywords: patch

Created on 2012-03-26 18:16 by Jim.Jewett, last changed 2012-05-13 20:49 by gvanrossum. This issue is now closed.

Files
File name Uploaded Description Edit
hammer_dict.py ncoghlan, 2012-03-31 17:00 Attempt to trigger the 3.3 dict RuntimeError just with threaded code
hammer_dict_eq.py skrah, 2012-03-31 21:34
hammer_dict_switchinterval.py skrah, 2012-04-01 12:59
dictobject.c.patch gvanrossum, 2012-04-01 23:51 review
Messages (31)
msg156843 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-03-26 18:16
Per the 3.3 WhatsNew:
"""issue 14205: A dict lookup now raises a RuntimeError if the dict is modified during the lookup. If you implement your own comparison function for objects used as dict keys and the dict is shared by multiple threads, access to the dict should be protected by a lock."""

This should be easier to do.  My suggestion would be that the change to (general) lookdict (or possibly an additional transition, if you want a less-slow path for non-string non-container builtins) create the lock, and acquire/release that lock.  At a minimum, there should be a dict subclass that does this.

[Note that this is arguably a regression, since previous python versions would just retry, which would be enough to protect innocent but unlucky code.]
msg156846 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-26 18:35
I must admit to being concerned by the possible impact of this change as well.
msg157061 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-03-29 16:12
> I must admit to being concerned by the possible impact of this change as well.

So am I.
msg157080 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-03-29 18:39
On Thu, Mar 29, 2012 at 9:12 AM, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> I must admit to being concerned by the possible impact of this change as well.
>
> So am I.

I think it's time to bring this up in python-dev .
msg157207 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-03-31 17:00
Attached script is a first cut at a multi-threading stress test for the new dict behaviour. It should be significantly worse than anything a real world app is likely to be doing to a dictionary that is shared between threads without any form of synchronisation.

It's late here though, so it's very possible it's only passing because I missed something in the implementation of the stress test.
msg157212 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-31 17:20
Thanks, Nick, this is much better than speculation.
msg157229 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-03-31 21:34
By adding a slow __eq__() method to Nick's script I can trigger the
RuntimeError reliably.
msg157254 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-01 04:00
But the sleep(0.1) forces a thread switch so I consider that still cheating -- nobody in their right mind would consider calling sleep() inside __hash__.

You can probably get it to fail without this by just doing a bunch of random computation in the __hash__ (and using a low sys.setswitchinterval() value).  Or consider creating a key that consists of e.g. a tuple of 100 or 1000 Key instances -- hashing that will call the __hash__ on each of those, wasting a lot of cycles.
msg157270 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 10:10
> But the sleep(0.1) forces a thread switch so I consider that still
> cheating -- nobody in their right mind would consider calling sleep()
> inside __hash__.

Well, cheating is fair game when trying to test borderline cases, isn't
it?
msg157283 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-04-01 12:59
OK, here's a version with a low switch interval. Of course it's also
contrived, but it works.

Generally I'd appreciate the RuntimeError, since it's a hint that 
something needs to be rewritten in an application.


It might be a problem if this started to occur sporadically in a
production app under a high system load. But I haven't managed to
trigger the exception just by increasing the system load. I always
had to use the same low switch interval.
msg157285 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 13:25
> OK, here's a version with a low switch interval. Of course it's also
> contrived, but it works.

The drawback of using setswitchinterval() is that it makes the test less
reusable by other implementations (or perhaps it will succeed without
actually checking the desired behaviour).
msg157313 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-04-01 16:27
Antoine: I don't think the point of this code is to come up with a unit (or other) test for the behavior, but to try to determine empirically whether or not this error is likely to be an issue in naive production code (whether it is existing 3.x code or stuff ported from Python2). Thus the mention of "cheating" (doing things production code would not be doing). 

The answer so far appears to be "no", which is good.

Which in the context of this particular issue then raises the question of whether there is in fact any additional support beyond the normal threading lock facilities that we want to provide in the stdlib.

And the answer to that is thus probably no as well, since code likely to run into the error is also likely to need locking around the dict in question *anyway*.  

Which is what Guido's intuition was to begin with.
msg157315 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-01 16:39
> Antoine: I don't think the point of this code is to come up with a
> unit (or other) test for the behavior, but to try to determine
> empirically whether or not this error is likely to be an issue in
> naive production code (whether it is existing 3.x code or stuff ported
> from Python2). Thus the mention of "cheating" (doing things production
> code would not be doing). 
> 
> The answer so far appears to be "no", which is good.

I find this a bit lacking. Production code is used in all kinds of
settings that we didn't simulate here. Besides, a very sporadic bug is
no better than an easily reproduced one. The tracker already has its
share of people pointing at weird sporadic errors in their log files.

> And the answer to that is thus probably no as well, since code likely
> to run into the error is also likely to need locking around the dict
> in question *anyway*.

Depends on the application really.
msg157330 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-04-01 20:58
> Well, cheating is fair game when trying to test 
> borderline cases, isn't it?

It is fair game (and necessary) when it comes to
exposing bugs that are hard to reproduce.

I'm wary of the original RuntimeError patch because
* it modifies code that has been stable for over a decade
* in order to "fix" a crasher that no one cares about
* while possibly breaking code that currently works
* and breaking it in a way that is hard to reproduce or diagnose.
msg157333 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-04-01 23:01
A thought prompted by Raymond's comment: did we ever try just protecting
the retry as a recursive call? If we can stop the stack blowing up, it
seems to me we'd get the best of both worlds (that is, crashes become
RuntimeError, but naive multi-threaded code is unaffected).
msg157335 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-01 23:44
On Sun, Apr 1, 2012 at 1:58 PM, Raymond Hettinger
<report@bugs.python.org> wrote:
[...]
> I'm wary of the original RuntimeError patch because
[...]

I had retorts to most of what you wrote, but decided not to post them.
Instead, I really want to have a serious discussion about this issue,
without falling back on "if it ain't broke don't fix it". There was
definitely *something* wrong with the original code: there was a known
crasher and it had XXX comments in it, and some folks cared enough to
attempt to fix it.

In the end I see this more as a matter of clarifying the semantics and
limitations of dict operations: *should* the language guarantee that
the built-in dict implementation supports correct lookup even if the
in the middle of the lookup the hash table is resized by another
thread? Is this a reasonable guarantee to impose on all Python
implementations? Because I don't see how this can be guaranteed
without a lock; but I haven't tried very hard, so this is not a
rhetorical question.

- - -

On Sun, Apr 1, 2012 at 4:01 PM, Nick Coghlan <report@bugs.python.org> wrote:
> A thought prompted by Raymond's comment: did we ever try just protecting
> the retry as a recursive call? If we can stop the stack blowing up, it
> seems to me we'd get the best of both worlds (that is, crashes become
> RuntimeError, but naive multi-threaded code is unaffected).

Hm... We considered a variety of fixes in
http://bugs.python.org/issue14205 but I don't think we tried that.
(The nastiness is that it's strictly a C-level recursion -- lookdict()
calls lookdict().)

Another thing we didn't try was manually eliminating the tail
recursion in lookdict() using a goto (and perhaps some variable
assignments). I don't see the potential for an infinite loop as a
problem, since one could be created just as easily by putting "while
True: pass" inside a user-defined __hash__().

PS. I'm not surprised your originall hammer_dict.py didn't provoke the
problem -- it only overrides __hash__(), but lookdict() never calls
that. It only calls the compare function; the hash is computed once
and passed into it.

Finally, the guard in lookdict() looks fishy to me:

  if (ep0 == mp->ma_table && ep->me_key == startkey) {

Can't this evaluate to true when the dict has shrunk dramatically in
size? I think an extra test for "mask == mp->ma_mask" ought to be
added.
msg157337 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-01 23:51
Here's a hack that uses goto instead of recursion to restore the original behavior, less the stack overflow.  With this, hammer_dict_switchinterval.py loops forever (which is I think what it's supposed to do if RuntimeError is never raised).
msg157340 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-02 01:02
Why delete that?

On Sunday, April 1, 2012, Raymond Hettinger wrote:

>
> Changes by Raymond Hettinger <raymond.hettinger@gmail.com <javascript:;>>:
>
>
> ----------
> Removed message: http://bugs.python.org/msg157338
>
> _______________________________________
> Python tracker <report@bugs.python.org <javascript:;>>
> <http://bugs.python.org/issue14417>
> _______________________________________
>
msg157562 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-04-05 11:25
A counter can be added to dictobject.c.patch to avoid an infinite loop.
msg157604 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-05 16:50
> A counter can be added to dictobject.c.patch to avoid an infinite loop.

But why bother? There was no counter originally -- it would continue
until it ran out of stack. Since this can only be triggered if there
is Python code in the __eq__, that means that it is still
interruptable with ^C.

Does this mean I should just check it in? But I asked, and never got
an answer, whether the original stress test had been converted into a
unittest. I'd like that to happen before I check this in. Also there
are probably docs I've missed. Somebody please help!
msg157605 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-05 16:59
> Does this mean I should just check it in? But I asked, and never got
> an answer, whether the original stress test had been converted into a
> unittest. I'd like that to happen before I check this in. Also there
> are probably docs I've missed. Somebody please help!

I think your approach is fine. As far as I can tell, the original
crasher hasn't been converted into an unit test, since Victor's test in
934aaf2191d0 doesn't seem prone to triggering an infinite loop.

As far as doc changes go, you should probably remove the what's new
entry from a428f85de29c, and the doc addition from 0255bafbccf2.

Besides, does the test suite pass with your patch? It looks like
Victor's test, which checks that RuntimeError is raised, should now
fail.
msg157616 - (view) Author: Jim Jewett (Jim.Jewett) Date: 2012-04-05 20:17
>> A counter can be added to dictobject.c.patch to avoid
>> an infinite loop.

> But why bother? 

Without a termination condition, how is this different from just reverting the original change?  Is it just the slightly improvement in efficiency?
msg157617 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-05 20:20
> Without a termination condition, how is this different from just
> reverting the original change?

The difference is that it's a (presumably interruptible) infinite loop,
not a stack overflow. There's no crash and therefore no security issue.
msg157629 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-04-05 22:39
> The difference is that it's a (presumably interruptible) infinite loop,
> not a stack overflow. There's no crash and therefore no security issue.

Ok, it looks fair. The patch looks good, but as written by Antoine: tests may need to be fixed.

Is anyone motivated to "convert" removed crashers to unit tests? Not me. Crashers were only written to crash Python. New tests should be written instead.
msg159523 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-28 14:20
Still no progress on this bug. Should I just check in my simple patch? But there's much more to do -- docs, and unittests. Volunteers? It's not hard, just work.
msg159524 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-04-28 14:32
> Still no progress on this bug. Should I just check in my simple patch?
> But there's much more to do -- docs, and unittests. Volunteers? It's
> not hard, just work.

Well, in general the person writing the patch should also write the
tests ;-)
I have this bug in my bookmarks somewhere, so I might come to it and
write a patch if nobody does it before, though.
msg159609 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-29 14:30
I could check it in, but I probably would mess up something (which branches are affected?).  Let me know if you want me to.

The priorities after that would be:

1) update docs (the warning about RuntimeError needs to be moderated)
2) convert stress tests to unittests
msg159612 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-04-29 14:33
Actually my patch doesn't even apply cleanly.  I suspect the dict refactoring for shared keys interfered.  Someone please help!
msg160544 - (view) Author: Roundup Robot (python-dev) Date: 2012-05-13 18:50
New changeset 93748e2d64e3 by Antoine Pitrou in branch 'default':
Issue #14417: Mutating a dict during lookup now restarts the lookup instead of raising a RuntimeError (undoes issue #14205).
http://hg.python.org/cpython/rev/93748e2d64e3
msg160546 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-05-13 18:54
I have committed an updated patch, with a fix to Victor's test. I don't think anything else remains to be done.
msg160575 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-05-13 20:49
Thanks Antoine!
History
Date User Action Args
2012-05-13 20:49:02gvanrossumsetmessages: + msg160575
2012-05-13 18:54:52pitrousetstatus: open -> closed
resolution: fixed
messages: + msg160546

stage: resolved
2012-05-13 18:50:18python-devsetnosy: + python-dev
messages: + msg160544
2012-04-29 14:33:05gvanrossumsetmessages: + msg159612
2012-04-29 14:30:15gvanrossumsetmessages: + msg159609
2012-04-28 14:32:15pitrousetmessages: + msg159524
2012-04-28 14:20:36gvanrossumsetmessages: + msg159523
2012-04-05 22:39:08hayposetmessages: + msg157629
2012-04-05 20:20:23pitrousetmessages: + msg157617
2012-04-05 20:17:46Jim.Jewettsetmessages: + msg157616
2012-04-05 16:59:25pitrousetmessages: + msg157605
2012-04-05 16:50:06gvanrossumsetmessages: + msg157604
2012-04-05 11:25:02hayposetmessages: + msg157562
2012-04-02 01:02:37gvanrossumsetmessages: + msg157340
2012-04-01 23:58:29rhettingersetmessages: - msg157338
2012-04-01 23:56:54rhettingersetmessages: + msg157338
2012-04-01 23:51:27gvanrossumsetfiles: + dictobject.c.patch
keywords: + patch
messages: + msg157337
2012-04-01 23:44:30gvanrossumsetmessages: + msg157335
2012-04-01 23:01:12ncoghlansetmessages: + msg157333
2012-04-01 20:58:10rhettingersetnosy: + rhettinger
messages: + msg157330
2012-04-01 16:39:06pitrousetmessages: + msg157315
2012-04-01 16:27:04r.david.murraysetmessages: + msg157313
2012-04-01 13:25:40pitrousetmessages: + msg157285
2012-04-01 12:59:57skrahsetfiles: + hammer_dict_switchinterval.py

messages: + msg157283
2012-04-01 10:10:08pitrousetmessages: + msg157270
2012-04-01 04:00:01gvanrossumsetmessages: + msg157254
2012-03-31 21:34:11skrahsetfiles: + hammer_dict_eq.py
nosy: + skrah
messages: + msg157229

2012-03-31 20:40:27Arfreversetnosy: + Arfrever
2012-03-31 17:20:57r.david.murraysetmessages: + msg157212
2012-03-31 17:00:06ncoghlansetfiles: + hammer_dict.py
nosy: + ncoghlan
messages: + msg157207

2012-03-30 22:58:08gregory.p.smithsetnosy: + gregory.p.smith
2012-03-29 20:06:16asvetlovsetnosy: + asvetlov
2012-03-29 18:39:30gvanrossumsetmessages: + msg157080
2012-03-29 16:12:11pitrousetnosy: + pitrou
messages: + msg157061
2012-03-27 00:56:25hayposetnosy: + haypo
2012-03-26 19:52:06hayposetnosy: + gvanrossum
2012-03-26 18:35:48r.david.murraysetnosy: + r.david.murray
messages: + msg156846
2012-03-26 18:16:51Jim.Jewettcreate