classification
Title: Python 3.7.9 logging/threading/fork hang
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Adq, gregory.p.smith, izbyshev, terry.reedy, vinay.sajip
Priority: normal Keywords:

Created on 2020-10-20 14:36 by Adq, last changed 2020-10-26 15:58 by izbyshev. This issue is now closed.

Files
File name Uploaded Description Edit
test.py Adq, 2020-10-20 14:36 Test case which hangs.
Messages (9)
msg379132 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 14:36
We've been having some mysterious hangs in our threaded code. Debugging with GDB shows it ends up deadlocking during logging, inside logging._acquireLock().

I googled and found someone else with the same issue: https://stackoverflow.com/questions/62059975/python-multiprocessing-logging-hangs

I'm able to replicate this on python 3.7.9 on two entirely separate machines (one ubuntu, one arch linux), though I had to bump up the number of iterations to 1000.
msg379133 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 14:40
Debugging a bit more shows our code is deadlocked at:
  File "/usr/lib/python3.7/logging/__init__.py", line 219, in _acquireLock
    _lock.acquire()
  File "/usr/lib/python3.7/logging/config.py", line 501, in configure
    logging._acquireLock()
  File "/usr/lib/python3.7/logging/config.py", line 800, in dictConfig
    dictConfigClass(config).configure()
  File "/home/backend/backend/lib/request.py", line 591, in init
    logging.config.dictConfig(getLoggingDict(loglevel=self.loglevel))



Looking at other threads, I found this:
  File "/usr/lib/python3.7/logging/__init__.py", line 219, in _acquireLock
    _lock.acquire()
  File "/usr/lib/python3.7/logging/__init__.py", line 776, in _removeHandlerRef
    acquire()
  Garbage-collecting
  File "/home/backend/backend/lib/common/log/logstash.py", line 13, in fix_log_message
    for k, v in logmessage.items():
  File "/home/backend/backend/lib/common/log/logstash.py", line 14, in fix_log_message
    logmessage[k] = self.fix_log_message(v)
  File "/home/backend/backend/lib/common/log/logstash.py", line 14, in fix_log_message
    logmessage[k] = self.fix_log_message(v)
  File "/home/backend/backend/lib/common/log/logstash.py", line 14, in fix_log_message
    logmessage[k] = self.fix_log_message(v)
  File "/home/backend/backend/lib/common/log/logstash.py", line 66, in format
    return self.serialize(self.fix_log_message(message))
  File "/home/backend/venv/lib/python3.7/site-packages/logstash/handler_tcp.py", line 25, in makePickle
    return self.formatter.format(record) + b'\n'
  File "/usr/lib/python3.7/logging/handlers.py", line 631, in emit
    s = self.makePickle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1378, in info
    self._log(INFO, msg, args, **kwargs)



It looks like it's trying to acquire a lock that is already held, during garbage collection?
msg379135 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 14:51
My colleague reports it does happen on their Mac as well, but that it took about 50 tries for it to trigger.
msg379139 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 15:05
Before anyone suggests it might be the logstash code shown in the traceback above, it ain't.

The bug is easily replicated using entirely python built in code. See test.py attached.

If you're having trouble triggering it (seems to vary per machine), find:
        pl.map(do_work, range(10000))

And add a few zeros into the range()
msg379186 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-10-21 03:25
Have you found this in 3.8+?  Since last June, 3.7 only gets security fixes, and this might have been fixed in later versions.
msg379275 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-22 03:59
If you use os.fork() or any of the multiprocessing start methods that call os.fork() with a process involving threads, this is expected behavior.  os.fork() cannot be used in processes that have threads without potential for deadlock.

Specifically, make sure you explicitly call:

 multiprocessing.set_start_method('spawn')

before using multiprocessing in any application where threads exist.

It is possible to use 'forkserver' as well, but only if you ensure the multiprocessing forkserver process is started before your application has launched any threads.
msg379660 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-26 11:23
Hi, sorry, I was ill on Friday, apologies for the delayed reply.

We're not using os.fork() directly, and I'm afraid the multiprocessing thing I mentioned was a bit of a red herring.

We're using a capped pool of threads to spawn subprocesses (using subprocess.*) and wait for them to complete. So we'll spawn at most 6 threads, each of which kicks off a subprocess, and stuff them in a list. Then in the main loop, we wait for one of the threads to complete and start a new one when it does. I don't _believe_ this should trigger the os.fork()/logging bug?

This all worked fine in 2.* and in 3.5. It's only in 3.7 that we have started to have this hanging problem.

Unfortunately the (production) environment that has the problem is currently running 3.7. We do have plans to upgrade to 3.8, but I can't just switch it.
msg379668 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-26 15:49
(Restored test.py attachment)

The issue happens due to an incorrect usage of `multiprocessing.Pool`.

```
    # Set up multiprocessing pool, initialising logging in each subprocess
    with multiprocessing.Pool(initializer=process_setup, initargs=(get_queue(),)) as pl:
        # 100 seems to work fine, 500 fails most of the time.
        # If you're having trouble reproducing the error, try bumping this number up to 1000
        pl.map(do_work, range(10000))

    if _globalListener is not None:
        # Stop the listener and join the thread it runs on.
        # If we don't do this, we may lose log messages when we exit.
        _globalListener.stop()
```

Leaving the `with` statement calls `pl.terminate()` [1, 2].
Since multiprocessing simply sends SIGTERM to all workers, a worker might be killed while it holds the cross-process lock guarding `_globalQueue`. In this case, `_globalListener.stop()` blocks forever trying to acquire that lock (to add a sentinel to `_globalQueue` to make a background thread stop monitoring it).

Consider using `Pool.close()` and `Pool.join()` to properly wait for task completion.

[1] https://docs.python.org/3.9/library/multiprocessing.html#multiprocessing.pool.Pool.terminate
[2] https://docs.python.org/3.9/library/multiprocessing.html#programming-guidelines
msg379669 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-26 15:58
By the way, I don't see a direct relation between `test.py` (which doesn't use `subprocess` directly) and your comment describing `subprocess` usage with threads. So if you think that the bug in `test.py` is unrelated to the problem you face, feel free to reopen.
History
Date User Action Args
2020-10-26 15:58:44izbyshevsetmessages: + msg379669
2020-10-26 15:49:38izbyshevsetstatus: open -> closed

nosy: + izbyshev
messages: + msg379668

resolution: not a bug
stage: resolved
2020-10-26 15:47:16izbyshevsetfiles: + test.py
2020-10-26 11:23:10Adqsetmessages: + msg379660
2020-10-22 03:59:17gregory.p.smithsetmessages: + msg379275
2020-10-21 17:16:18pitrousetnosy: + gregory.p.smith
2020-10-21 03:25:55terry.reedysetnosy: + terry.reedy
messages: + msg379186
2020-10-20 16:08:53xtreaksetnosy: + vinay.sajip
2020-10-20 15:31:54Adqsetfiles: - test.py
2020-10-20 15:05:18Adqsetmessages: + msg379139
2020-10-20 14:51:41Adqsetmessages: + msg379135
2020-10-20 14:40:35Adqsetmessages: + msg379133
2020-10-20 14:36:31Adqcreate