
classification
Title: logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential)
Type: behavior Stage: commit review
Components: Library (Lib) Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: cagney, ebizzinfotech, gregory.p.smith, hugh, lukasz.langa, ned.deily
Priority: release blocker Keywords: 3.7regression, patch

Created on 2019-04-05 08:15 by gregory.p.smith, last changed 2019-05-07 20:32 by gregory.p.smith. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12704 merged gregory.p.smith, 2019-04-06 08:00
PR 13170 merged gregory.p.smith, 2019-05-07 19:18
Messages (19)
msg339472 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-05 08:15
I'm splitting a discussion buried in the way too long thread of https://bugs.python.org/issue6721 out into its own specific issue, to treat as a 3.7 release blocker for a rollback or repair decision before 3.7.4.

https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d

I believe that change, released in 3.7.1, is leading to a behavior regression for an application (the Fedora installer's libreswan kvmrunner?).  Full details can be found in the messages of the other issue starting with:
  https://bugs.python.org/issue6721#msg329474

TL;DR - logging.Handler instances each have their own threading.RLock.
libreswan implements at least one logging.Handler subclass.  That subclass's custom emit() implementation directly calls potentially many other sub-handlers' emit() methods.  Some of those emit() methods (such as logging.StreamHandler's) call flush(), which acquires the handler's lock.

So they've got a dependency between these two locks: the first must be acquired before the second.

But the logging module APIs have no concept of sub-handlers and lock ordering.

I see many flaws in the libreswan code's design (I'm already ignoring the futility of mixing threading + fork), but this still caused a behavior regression in the stable 3.7 release.

(more comments coming as followups to avoid a wall of text with too many topics)
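To make the lock-ordering hazard concrete, here is a minimal sketch of the pattern described above (the class name is hypothetical; this is not the actual libreswan code):

```python
import io
import logging

class FanOutHandler(logging.Handler):
    """Hypothetical handler modeled on the pattern described above:
    its emit() calls directly into sub-handlers' emit() methods."""

    def __init__(self, sub_handlers):
        super().__init__()  # creates this handler's own RLock
        self.sub_handlers = sub_handlers

    def emit(self, record):
        # Handler.handle() has already acquired *this* handler's lock.
        # StreamHandler.emit() calls flush(), which acquires the
        # sub-handler's lock: an ordering dependency between two locks
        # that the logging module knows nothing about.
        for sub in self.sub_handlers:
            sub.emit(record)

buf = io.StringIO()
log = logging.getLogger("bpo36533-demo")
log.propagate = False
log.addHandler(FanOutHandler([logging.StreamHandler(buf)]))
log.warning("fans out to sub-handlers")
```

In normal single-process operation this works; the deadlock potential only appears when fork-time code acquires those same locks in a different order.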
msg339474 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-05 08:23
That custom DebugHandler's emit() implementation that calls into one or more sub-handlers suggests that libreswan _might_ be able to fix this in the custom DebugHandler by implementing custom acquire() and release() methods... BUT that approach is fundamentally flawed: it requires acquiring multiple locks in a single atomic operation OR guaranteeing that they will always be acquired in a single specific order.

Given the logging module's fork-time code maintains no order, even implementing custom acquire() and release() methods in your custom DebugHandler would not be sufficient, as sub-handler locks could be acquired first during fork.  Also, that assumes your sub-handlers are not also used directly elsewhere within all possible coexisting logger configurations.

A _gross workaround_ that libreswan, or anything else in this boat, could implement to satisfy that constraint would be to force all potentially related handlers to share the same single RLock instance.  :(
msg339475 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-05 08:41
Within the logging module we could go beyond using a WeakSet and maintain an ordering.  But we'd need a way for Handler subclasses to indicate what order matters; that'd require a new API (not suitable for 3.7).

An ordering itself may be insufficient as code can implement arbitrary Handler code with cyclic graphs of handlers calling directly into other handlers such that no possible order could ever satisfy them all.

This pulls us back into "stop having multiple locks" for any code that creates relationships between handlers.  Which is good advice for Handler authors and something we should document if it isn't already.

Side note: We should also document that Handler.emit() as an API MUST be called with the Handler's lock acquired via Handler.acquire() first.  That is the undocumented contract; logging.Logger.handle()'s code always calls into handler emit() that way.  The libreswan code calling other handlers' emit() methods without doing that smells wrong.  But that has no relation to their current problem: even if it did, it'd still deadlock, just sooner rather than within the other handler's emit()->flush()->acquire() call chain.

Logging library users are better off never directly calling another handler.  Put messages onto a queue, with a dedicated thread processing those into the relevant handlers.  Only block waiting for those queued items to have been processed _after_ releasing your own lock, if you want to maintain the synchronicity of logging API calls returning only after the message has been handled.
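That queue-based pattern already exists in the stdlib as logging.handlers.QueueHandler + QueueListener (since 3.2); a minimal sketch:

```python
import io
import logging
import logging.handlers
import queue

buf = io.StringIO()
q = queue.SimpleQueue()

# The listener's dedicated thread is the only code that calls the real
# handler, so no application thread ever holds that handler's lock.
listener = logging.handlers.QueueListener(q, logging.StreamHandler(buf))
listener.start()

log = logging.getLogger("queued-demo")
log.propagate = False
log.addHandler(logging.handlers.QueueHandler(q))  # emit() only enqueues
log.warning("handled on the listener thread")

listener.stop()  # drains the queue before returning
```

QueueHandler.emit() does no I/O, so its lock is held only momentarily and never nests inside another handler's lock.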
msg339480 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-05 09:02
Now to back up:  Why was this application using fork() in a threaded application at all?  That is a fundamental flaw.  Should we be doing work to support things that so far merely _happen_ to work on such broken designs?

Another alternative for the DebugHandler implementation in whatever code implemented it is for it to de-register itself from the logging library's private WeakSet of handler locks to acquire at fork time from its own constructor.  We don't have a public API for this so the workaround doing that on 3.7.1 - 3.7.3 would look like:

   def __init__(self, ...):
       super().__init__(...)
       if hasattr(logging, '_at_fork_acquire_release_weakset'):
           logging._at_fork_acquire_release_weakset.discard(self)

This means it'd still have the bug the code already had on all versions of Python before this logging library "acquire the locks before fork" fix went in: namely, if the forked child tries to log, it could deadlock due to forking while the inherited logging handler lock was held.
msg339481 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-05 09:02
A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do and just blindly re-initialize all of them afterwards under the assumption that they "can't" be protecting anything in a newly forked child process.  Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls.  (ex: to avoid multiple processes writing to the same file or fd at once)
msg339503 - (view) Author: cagney (cagney) Date: 2019-04-05 15:11
On Fri, 5 Apr 2019 at 04:15, Gregory P. Smith <report@bugs.python.org> wrote:
>
>
> New submission from Gregory P. Smith <greg@krypto.org>:
>
> I'm spawning a dicussion buried in the way too long thread of https://bugs.python.org/issue6721 over here into its own specific issue to treat as a 3.7 release blocker for a rollback or repair decision before 3.7.4.
>
> https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d
>
> I believe that was released in 3.7.1 is leading to a behavior regression for an application (the Fedora installer's libreswan kvmrunner?).  Full details can be found in the messages of the other issue starting with:
>   https://bugs.python.org/issue6721#msg329474

Two separate applications have tripped up on this:

- Fedora's installer aka anaconda

I don't know any further details.

- libreswan's test framework aka kvmrunner

kvmrunner uses pexpect
pexpect uses ptyprocess
ptyprocess uses pty.fork() + os.exec*(); but, hey, who knew!  I didn't until this week.
This seems to be a long standing concern:
https://github.com/pexpect/ptyprocess/issues/43

> TL;DR - logging.Handler instances each have their own threading.Rlock.
> libreswan implements at least one logging.Handler subclass.  That subclass's custom emit() implementation directly calls potentially many other sub-handlers emit() methods.  Some of those emit() methods (such as logging.StreamHandler) call flush() which acquires the handler's lock.

# Implement log-level inversion.
#
# Ref: https://docs.python.org/2/howto/logging.html#logging-flow
#
# By default, a parent (root) logger, regardless of its log-level,
# will log all the records logged by a child.  For instance, if a
# child logger is logging at DEBUG-level, then the parent will log
# (display on the console) those DEBUG-level records even when it has
# been configured to log only INFO-level records.  This is because the
# default log-level ("Logger enabled for level of call?") check is
# only applied once at record-creation.
#
# This code allows DEBUG-level logging to a file, while simultaneously
# (the inversion) restricting console log records to just INFO-level
# say.

The logging.Handler's lock ensures that the two sub-streams are kept in order.  It prevents threads interleaving their writes.

> So they've got a dependency between these two locks, the first's must be acquired before the second.
>
> But the logging module APIs have no concept of sub-handlers and lock ordering.
>
> I see many flaws with the libreswan code's design (I'm already ignoring the futility of threading + fork) but this still caused a behavior regression in the stable 3.7 release.

Always first understand the problem.

> (more comments coming as followups to avoid a wall of text with too many topics)
>
msg339505 - (view) Author: cagney (cagney) Date: 2019-04-05 16:05
On Fri, 5 Apr 2019 at 05:02, Gregory P. Smith <report@bugs.python.org> wrote:
>
>
> Gregory P. Smith <greg@krypto.org> added the comment:
>
> A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do and just blindly re-initialize all of them afterwards under the assumption that they "can't" be protecting anything in a newly forked child process.  Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls.  (ex: to avoid multiple processes writing to the same file or fd at once)

FYI, below is a simpler, but related, test.
_at_fork_acquire_release_weakset doesn't seem to be locked:

Ignoring exception from logging atfork <Handler (NOTSET)> release method: cannot release un-acquired lock

Exception ignored in: <function _after_at_fork_weak_calls at 0x7f7307550378>
Traceback (most recent call last):
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 269, in _after_at_fork_weak_calls
    _at_fork_weak_calls('release')
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 254, in _at_fork_weak_calls
    for instance in _at_fork_acquire_release_weakset:
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "./btc.py", line 11, in lockie
    h = logging.Handler()
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 824, in __init__
    self.createLock()
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 847, in createLock
    _register_at_fork_acquire_release(self)
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 250, in _register_at_fork_acquire_release
    _at_fork_acquire_release_weakset.add(instance)
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 83, in add
    self._commit_removals()
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 56, in _commit_removals
    discard(l.pop())
IndexError: pop from empty list

----

import logging
import os
import sys
import threading

def lockie():
    while True:
        # this adds the handler to _at_fork_acquire_release_weakset
        #sys.stdout.write(".")
        #sys.stdout.flush()
        h = logging.Handler()

threading.Thread(target=lockie).start()

for n in range(0,100000):
    if n % 100 == 0:
        sys.stdout.write("%d" % n)
        sys.stdout.flush()
    pid = os.fork()
    if pid != 0:
        os.wait()
    else:
        os._exit(0)
msg339529 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-06 08:03
Here's a PR that goes the re-initialization route rather than attempting to acquire/release anything other than the single module _lock.
msg339650 - (view) Author: cagney (cagney) Date: 2019-04-08 15:32
I think the only pragmatic solution here is to add an optional parameter to logging.basicConfig() that specifies that the logger should use a single global lock; and then start documenting that thread locks and fork() don't work well together.

And note that this solution is pragmatic, not correct (@dhr, when we discussed this offline, pointed out that since python's using threads, a "correct" solution would be to use some sort of inter-process lock).

For instance, if I were to implement emit() as something contrived like:

    with lock:
        s = record.to-string()
        for i in 1 .. s'length:
            b[i] = s[i]
        for i in 1 .. b'length:
            stdout.write(b[i]).flush()
        b = []

then when fork() breaks 'lock' the guarantee that the code is atomic is also broken:

- when the child enters the code 'b' is undefined
- the guarantee that log records don't interleave is lost

while a global lock would help mitigate the first case it really isn't a "correct" fix.
msg339688 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-09 00:44
The logging library has never guaranteed that it wouldn't interleave or duplicate buffered io output when fork() is involved.  We should not start trying to claim that it does.  Too complicated and fragile.  It can't.  Users who want that should implement their own single logging handler that is well behaved in whatever regards they need.  There are a ton of options for them.

A basicConfig setting to switch the mode of operation to a single lock instead of per-handler locks could only happen in 3.8 (new feature).  I'll ponder if it is even feasible to add that after the re-init change goes in.  Even with such a feature, library user authored Handler subclasses would always be free to violate that.  We can't protect people from themselves.
msg339760 - (view) Author: cagney (cagney) Date: 2019-04-09 15:19
I pointed out two issues with breaking the locks.

Your response addressed the less important issue:

- the guarantee that log records don't interleave is lost

Since the documentation seems to be silent, the guarantee (or expectation) is implied - logging systems emit records atomically - and if this isn't true across fork() then the exception should be documented.

but it didn't address the more fundamental problem:

- when the child enters the code 'b' is undefined

i.e., breaking the lock will allow the child to access data in an undefined state.  This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).

For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang.
msg339825 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-10 05:44
> """
Since the documentation seems to be silent the guarantee (or expectation) is implied - logging systems emit records atomically - if this isn't true across fork() then the exception should be documented.
"""

We can add a note to the docs.  As a general rule: documentation being silent means there are zero guarantees.  Behavior of the existing implementations over time is the only source of truth.


> """
but it didn't address the more fundamental problem:

- when the child enters the code 'b' is undefined

i.e., breaking the lock will allow the child to access data in an undefined state.  This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).
"""

Nope.  The contrived emit() pseudocode from msg339650 never defined b in the first place.  That code, if massaged into Python syntax, would raise an UnboundLocalError no matter who executed it.  A fork() from another thread would not change that.  There is no concept of undefined/uninitialized state from the Python perspective.

The motivation for the original fix was entirely deadlock avoidance in code that uses fork.  It failed and introduced an alternate form of deadlock in code that had been lucky up until that point.  Thus my new PR proposing to fix the regression by making that not fail yet still preventing locks from starting out held in the child process.
msg339863 - (view) Author: cagney (cagney) Date: 2019-04-10 13:45
> nope.  the contrived emit() pseudocode from msg339650 never defined b in the first place.  that code, if massaged into python syntax would raise an UnboundLocalError no matter who executed it.  a fork() from another thread would not change that.  There is no concept of undefined/uninitialized state from the Python perspective.

Wow!  It is pseudo code, right? (I figured using Ada-like attributes was a bit of a giveaway.)  Nitpicking it achieves nothing (if you feel that your language needs 'b' pre-defined then, hey!, pre-define it; I'm pretty sure that won't fix the problem of operations such as b[i] = 2 not being atomic, let alone thread-safe, in Python).

Let's instead focus on my point that breaking the locks won't fix the problem.  For instance, the case above.  Or: "For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang."
msg339864 - (view) Author: cagney (cagney) Date: 2019-04-10 13:55
BTW, a non-visible change might be to use a global readers-writer lock where fork() is the writer.

https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock
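A sketch of that idea, assuming emit() paths take the read side and an at-fork hook takes the write side (the stdlib has no readers-writer lock, so this builds one from a Condition; it is reader-preferring, so a steady stream of readers could in principle starve the forking thread):

```python
import threading

class ForkAwareRWLock:
    """Many concurrent readers (handlers emitting) or one writer
    (the thread about to fork()); never both at once."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()  # wake readers and writers
```

Holding the write side across fork() would guarantee no handler is mid-emit when the child is created, at the cost of serializing all logging against fork.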
msg339882 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-04-10 19:08
The problem i am fixing is merely making Python's logging library not the source of this deadlock because it was not the source in the past.  I am solving a regression in CPython behavior between 3.7.0 and 3.7.1 that led to a logging.Handler lock related deadlock in two identified problematic applications.

Breaking the logging.Handler-owned locks does solve that, unless someone has concrete proof from these specific applications otherwise.

It would be helpful if someone *willing to actually be constructive* would test them with my PR's patch applied to their 3.7.3 interpreter as confirmation.
msg341543 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2019-05-06 15:47
This deadlock is a release blocker for 3.8.0. FTR, I don't consider it blocking alphas and betas.
msg341768 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-05-07 16:18
New changeset 64aa6d2000665efb1a2eccae176df9520bf5f5e6 by Gregory P. Smith in branch 'master':
bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704)
https://github.com/python/cpython/commit/64aa6d2000665efb1a2eccae176df9520bf5f5e6
msg341815 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-05-07 20:29
New changeset 3b4b28efbde63502709bede7c5f9403ec6f37428 by Gregory P. Smith in branch '3.7':
[3.7] bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) (GH-13170)
https://github.com/python/cpython/commit/3b4b28efbde63502709bede7c5f9403ec6f37428
msg341816 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-05-07 20:32
The regression should be fixed.  It'd be helpful if owners of applications that were running into this could test their applications with this specific change.
History
Date                 User             Action  Args
2019-05-07 20:32:05  gregory.p.smith  set     status: open -> closed
                                              versions: + Python 3.8
                                              messages: + msg341816
                                              resolution: fixed
                                              stage: patch review -> commit review
2019-05-07 20:29:47  gregory.p.smith  set     messages: + msg341815
2019-05-07 19:18:40  gregory.p.smith  set     pull_requests: + pull_request13086
2019-05-07 18:52:36  gregory.p.smith  set     versions: - Python 3.8
2019-05-07 16:18:25  gregory.p.smith  set     messages: + msg341768
2019-05-06 15:47:09  lukasz.langa     set     messages: + msg341543
2019-04-10 19:08:47  gregory.p.smith  set     messages: + msg339882
2019-04-10 13:55:40  cagney           set     messages: + msg339864
2019-04-10 13:45:47  cagney           set     messages: + msg339863
2019-04-10 12:50:27  vstinner         set     nosy: - vstinner
2019-04-10 05:44:21  gregory.p.smith  set     messages: + msg339825
2019-04-09 15:19:01  cagney           set     messages: + msg339760
2019-04-09 00:44:39  gregory.p.smith  set     messages: + msg339688
2019-04-08 15:32:23  cagney           set     messages: + msg339650
2019-04-06 08:03:15  gregory.p.smith  set     nosy: + lukasz.langa
                                              messages: + msg339529
                                              versions: + Python 3.8
2019-04-06 08:00:55  gregory.p.smith  set     keywords: + patch
                                              stage: patch review
                                              pull_requests: + pull_request12628
2019-04-05 17:03:11  hugh             set     nosy: + hugh
2019-04-05 16:21:38  ned.deily        set     messages: - msg339506
2019-04-05 16:15:16  ebizzinfotech    set     nosy: + ebizzinfotech
                                              messages: + msg339506
2019-04-05 16:05:36  cagney           set     messages: + msg339505
2019-04-05 15:11:07  cagney           set     messages: + msg339503
2019-04-05 09:02:41  gregory.p.smith  set     messages: + msg339481
2019-04-05 09:02:35  gregory.p.smith  set     messages: + msg339480
2019-04-05 08:41:40  gregory.p.smith  set     messages: + msg339475
2019-04-05 08:23:09  gregory.p.smith  set     messages: + msg339474
2019-04-05 08:15:51  gregory.p.smith  create