classification
Title: Python 3.7.9 logging/threading/fork hang
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Adq, gregory.p.smith, izbyshev, terry.reedy, vinay.sajip
Priority: normal Keywords:

Created on 2020-10-20 14:36 by Adq, last changed 2020-10-26 15:58 by izbyshev. This issue is now closed.

Files
File name Uploaded Description Edit
test.py Adq, 2020-10-20 14:36 Test case which hangs.
Messages (9)
msg379132 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 14:36
We've been having some mysterious hangs in our threaded code. Debugging with GDB shows it ends up deadlocking during logging, inside logging._acquireLock().

I googled and found someone else with the same issue: https://stackoverflow.com/questions/62059975/python-multiprocessing-logging-hangs

I'm able to replicate this on python 3.7.9 on two entirely separate machines (one ubuntu, one arch linux), though I had to bump up the number of iterations to 1000.
msg379133 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 14:40
Debugging a bit more shows our code is deadlocked at:
  File "/usr/lib/python3.7/logging/__init__.py", line 219, in _acquireLock
    _lock.acquire()
  File "/usr/lib/python3.7/logging/config.py", line 501, in configure
    logging._acquireLock()
  File "/usr/lib/python3.7/logging/config.py", line 800, in dictConfig
    dictConfigClass(config).configure()
  File "/home/backend/backend/lib/request.py", line 591, in init
    logging.config.dictConfig(getLoggingDict(loglevel=self.loglevel))



Looking at other threads, I found this:
  File "/usr/lib/python3.7/logging/__init__.py", line 219, in _acquireLock
    _lock.acquire()
  File "/usr/lib/python3.7/logging/__init__.py", line 776, in _removeHandlerRef
    acquire()
  Garbage-collecting
  File "/home/backend/backend/lib/common/log/logstash.py", line 13, in fix_log_message
    for k, v in logmessage.items():
  File "/home/backend/backend/lib/common/log/logstash.py", line 14, in fix_log_message
    logmessage[k] = self.fix_log_message(v)
  File "/home/backend/backend/lib/common/log/logstash.py", line 14, in fix_log_message
    logmessage[k] = self.fix_log_message(v)
  File "/home/backend/backend/lib/common/log/logstash.py", line 14, in fix_log_message
    logmessage[k] = self.fix_log_message(v)
  File "/home/backend/backend/lib/common/log/logstash.py", line 66, in format
    return self.serialize(self.fix_log_message(message))
  File "/home/backend/venv/lib/python3.7/site-packages/logstash/handler_tcp.py", line 25, in makePickle
    return self.formatter.format(record) + b'\n'
  File "/usr/lib/python3.7/logging/handlers.py", line 631, in emit
    s = self.makePickle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1586, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 1378, in info
    self._log(INFO, msg, args, **kwargs)



It looks like it's trying to acquire a lock that is already held, during garbage collection?
msg379135 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 14:51
My colleague reports it does happen on their Mac as well, but that it took about 50 tries for it to trigger.
msg379139 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-20 15:05
Before anyone suggests it might be the logstash code shown in the traceback above, it ain't.

The bug is easily replicated using entirely python built in code. See test.py attached.

If you're having trouble triggering it (seems to vary per machine), find:
        pl.map(do_work, range(10000))

And add a few zeros into the range()
msg379186 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-10-21 03:25
Have you found this in 3.8+?  Since last June, 3.7 only gets security fixes, and this might have been fixed in later versions.
msg379275 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-22 03:59
If you use os.fork() or any of the multiprocessing start methods that call os.fork() with a process involving threads, this is expected behavior.  os.fork() cannot be used in processes that have threads without potential for deadlock.

Specifically, make sure you explicitly call:

 multiprocessing.set_start_method('spawn')

before using multiprocessing in any application where threads exist.

It is possible to use 'forkserver' as well, but only if you ensure the multiprocessing forkserver process is started before your application has launched any threads.
msg379660 - (view) Author: Andrew de Quincey (Adq) Date: 2020-10-26 11:23
Hi, sorry, I was ill on Friday, apologies for the delayed reply.

We're not using os.fork() directly, and I'm afraid the multiprocessing thing I mentioned was a bit of a red herring.

We're using a capped pool of threads to spawn subprocesses (using subprocess.*) and wait for them to complete. So we'll spawn at most 6 threads, each of which kicks off a subprocess, and stuff them in a list. Then in the main loop, we wait for one of the threads to complete and start a new one when it does. I don't _believe_ this should trigger the os.fork()/logging bug?

This all worked fine in 2.* and in 3.5. It's only in 3.7 that we have started to have this hanging problem.

Unfortunately the (production) environment that has the problem is currently running 3.7. We do have plans to upgrade to 3.8, but I can't just switch it.
msg379668 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-26 15:49
(Restored test.py attachment)

The issue happens due to an incorrect usage of `multiprocessing.Pool`.

```
    # Set up multiprocessing pool, initialising logging in each subprocess
    with multiprocessing.Pool(initializer=process_setup, initargs=(get_queue(),)) as pl:
        # 100 seems to work fine, 500 fails most of the time.
        # If you're having trouble reproducing the error, try bumping this number up to 1000
        pl.map(do_work, range(10000))

    if _globalListener is not None:
        # Stop the listener and join the thread it runs on.
        # If we don't do this, we may lose log messages when we exit.
        _globalListener.stop()
```

Leaving the `with` statement calls `pl.terminate()` [1, 2].
Since multiprocessing simply sends SIGTERM to all workers, a worker might be killed while it holds the cross-process lock guarding `_globalQueue`. In this case, `_globalListener.stop()` blocks forever trying to acquire that lock (to add a sentinel to `_globalQueue` to make a background thread stop monitoring it).

Consider using `Pool.close()` and `Pool.join()` to properly wait for task completion.

[1] https://docs.python.org/3.9/library/multiprocessing.html#multiprocessing.pool.Pool.terminate
[2] https://docs.python.org/3.9/library/multiprocessing.html#programming-guidelines
msg379669 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-10-26 15:58
By the way, I don't see a direct relation between `test.py` (which doesn't use `subprocess` directly) and your comment describing `subprocess` usage with threads. So if you think that the bug in `test.py` is unrelated to the problem you face, feel free to reopen.
History
Date User Action Args
2020-10-26 15:58:44izbyshevsetmessages: + msg379669
2020-10-26 15:49:38izbyshevsetstatus: open -> closed

nosy: + izbyshev
messages: + msg379668

resolution: not a bug
stage: resolved
2020-10-26 15:47:16izbyshevsetfiles: + test.py
2020-10-26 11:23:10Adqsetmessages: + msg379660
2020-10-22 03:59:17gregory.p.smithsetmessages: + msg379275
2020-10-21 17:16:18pitrousetnosy: + gregory.p.smith
2020-10-21 03:25:55terry.reedysetnosy: + terry.reedy
messages: + msg379186
2020-10-20 16:08:53xtreaksetnosy: + vinay.sajip
2020-10-20 15:31:54Adqsetfiles: - test.py
2020-10-20 15:05:18Adqsetmessages: + msg379139
2020-10-20 14:51:41Adqsetmessages: + msg379135
2020-10-20 14:40:35Adqsetmessages: + msg379133
2020-10-20 14:36:31Adqcreate