Title: Logger race condition - loses lines if removeHandler called from another thread while logging
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.4, Python 3.5, Python 2.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benspiller, vinay.sajip
Priority: normal Keywords:

Created on 2018-11-07 17:43 by benspiller, last changed 2019-06-20 17:28 by vinay.sajip.

File name Uploaded Description Edit benspiller, 2018-11-07 17:43
Messages (6)
msg329429 - (view) Author: Ben Spiller (benspiller) * Date: 2018-11-07 17:43
I just came across a fairly serious thread-safety / race condition bug in the logging.Loggers class, which causes random log lines to be lost i.e. not get passed to some of the registered handlers, if (other, unrelated) handlers are being added/removed using add/removeHandler from another thread during logging. This potentially affects all log handler classes, though for timing reasons I've found it easiest to reproduce with the logging.FileHandler. 

See attached test program that reproduces this. 

I did some debugging and looks like although add/removeHandler are protected by _acquireLock(), they modify the self.handlers list in-place, and the callHandlers method iterates over self.handlers with no locking - so if you're unlucky you can end up with some of your handlers not being called. 

A trivial way to fix the bug is by editing callHandlers and copying the list before iterating over it:
- for hdlr in c.handlers:
+ for hdlr in list(c.handlers):

However since that could affect the performance of routine log statements a better fix is probably to change the implementation of add/removeHandler to avoid in-place modification of self.handlers so that (as a result of the GIL) it'll be safe to iterate over the list in callHandlers, e.g. change removeHandler like this:

- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = hdlr

(and the equivalent in addHandler)
msg346058 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-19 15:35
The other alternative would be to lock around callHandlers(). With the change you propose to addHandler/removeHandler, there are no errors, but the output of your test program is (with the lock acquisition/release in place):

Thread finished after 468 iterations
Thread finished after 488 iterations
Thread finished after 476 iterations
Thread finished after 462 iterations

If the lock acquisition/release is removed from addHandler/removeHandler, then there are still no errors, and the output is:

Thread finished after 479 iterations
Thread finished after 453 iterations
Thread finished after 468 iterations
Thread finished after 469 iterations

If I leave addHandler/removeHandler as they were and add locking around callHandlers(), there are no errors and the output is:

Thread finished after 566 iterations
Thread finished after 608 iterations
Thread finished after 612 iterations
Thread finished after 605 iterations

This seems to suggest that locking around callHandlers() is better (more performant) than the copying of handler lists involved in your suggestion. Any comments on that? Also it will work in non-GIL environments like Jython/IronPython.

The callHandlers() locking will of course add a cost even for those situations where handlers aren't added and removed in multi-threading scenarios, but I haven't benchmarked it yet. It's generally not good practice to add/remove handlers willy-nilly in threads, though of course it's not forbidden (e.g. the configuration functionality allows a thread to listen for configuration changes at runtime and then reconfigure logging, which adds/removes handlers in the thread. However, loggers are disabled while the reconfiguration is in progress, and some loss of messages would be expected to be tolerable in such a case).
msg346119 - (view) Author: Ben Spiller (benspiller) * Date: 2019-06-20 14:03
I'd definitely suggest we go for a solution that doesn't hit performance of normal logging when you're not adding/removing things, being as that's the more common case. I guess that's the reason why callHandlers was originally implemented without grabbing the mutex, and we should probably keep it that way. Logging can be a performance-critical part of some applications and I feel more comfortable about the fix (and more confident it won't get vetoed :)) if we can avoid changing callHandlers(). 

You make a good point about ensuring the solution works for non-GIL python versions. I thought about it some more... correct me if I'm wrong but as far as I can see the second idea I suggested should do that, i.e.
- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = hdlr

... which effectively changes the model so that the _value_ of the self.handlers list is immutable (only which list the self.handlers reference points to changes), so without relying on any GIL locking callHandlers will still see the old list or the new list but never see an inconsistent value, since such a list never exists. That solves the read-write race condition; we'd still want to keep the existing locking in add/removeHandler which prevents write-write race conditions. 

What do you think?
msg346135 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-20 16:49
> I'd definitely suggest we go for a solution that doesn't hit performance of normal logging

I agree, but correctness is important. I'm tending to the following:

1. Introduce a lockCallHandling module-level variable, defaulting to False to maximise logging performance (and follow current behaviour) and settable to True for situations such as your example, where adding/removing handlers from threads is wanted.

2. In Logger.handle, add an extra check for lockCallHandling to decide whether to lock/release around callHandlers().

BTW a simple benchmark of performance of locking around callHanders vs. not locking, using a FileHandler and default StreamHandler, showed an average difference of ~ 480 usec per iteration (mean time of 47.310 secs unlocked vs. 47.784 locked, for a million iterations of logger.debug() - a 1% increase).

> the second idea I suggested should do that

Yes, but the copying seems to make things slower, as suggested by the output of your script (in terms of the iteration counts in a fixed amount of time).
msg346139 - (view) Author: Ben Spiller (benspiller) * Date: 2019-06-20 17:07
Interesting conversation. :)

Yes I agree correctness is definitely top priority. :) I'd go further and say I'd prefer correctness to be always there automatically, rather than something the user must remember to enable by setting a flag such as lockCallHandling. (as an aside, adding a separate extra code path and option like that would require a bit more doc and testing changes than just fixing the bug by making the self.handlers list immutable, which is a small and simple change not needing extra doc). 

I'm also not convinced it's worth optimizing the performance of add/remove logger (which sounds like it's the goal of the callHandlers-locking approach you suggest, if I'm understanding correctly?) - since in a realistic application is always that's going to be vastly less frequent than invoking callhandlers. Especially if it reduces performance of the main logging, which is invoked much more often. Though admittedly the 1% regression you quoted isn't so bad (assuming that's true in CPython/IronPython/Jython/others). The test program I provided is a contrived way of quickly reproducing the race condition, but I certainly wouldn't use it for measuring or optimizing performance as it wasn't designed for that - the ratio of add/remove loggers to callhandlers calls is likely to be unrepresentative of a real application, and there's vastly more contention on calling add/remove loggers than you'd see in the wild. 

Do you see any downsides to the immutable self.handlers approach, other than performance of add/remove logger being a little lower?

Personally I think we're on safer ground if we permit add/remove logger be slightly slower (but at least correct! correctness trumps performance), but only if we avoid regressing the more important performance of logging itself. 

Does that seem reasonable?
msg346143 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2019-06-20 17:28
> I'd prefer correctness to be always there automatically, rather than something the user must remember to enable by setting a flag such as lockCallHandling

Yes, I agree, I was just looking at different approaches while mulling all this over. Correctness-by-default can of course always be implemented by setting lockCallHandling to True by default, which is what I have currently in my benchmark testing.

Re. your change:

- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = hdlr

Did you mean self.handlers = newhandlers in that last line? That's what I assumed you meant, but you've posted that twice now, so I'm not sure what you're getting at.

I agree that having a slower add/removeHandler is acceptable - certainly better than slowing down Logger.handle(). But I'm not sure about

> so without relying on any GIL locking callHandlers will still see the old list or the new list

on platforms other than CPython, since we can't be sure where thread pre-emption might happen (i.e. it could happen inside a source code line boundary) and locking at least has the benefit of having defined semantics.

The other issue is where someone might have subclassed and overridden add/removeHandler methods, for whatever reason. I know it's unlikely, but one never knows.

I think we're broadly on the same page. I'm just not yet sure what the best approach is :-)
Date User Action Args
2019-06-20 17:28:20vinay.sajipsetmessages: + msg346143
2019-06-20 17:07:09benspillersetmessages: + msg346139
2019-06-20 16:49:42vinay.sajipsetmessages: + msg346135
2019-06-20 14:03:34benspillersetmessages: + msg346119
2019-06-19 15:35:08vinay.sajipsetmessages: + msg346058
2018-11-07 17:44:59ned.deilysetnosy: + vinay.sajip
2018-11-07 17:43:57benspillercreate