This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: multiprocessing.Queue deadlock
Type: crash
Stage:
Components: Library (Lib)
Versions: Python 3.6

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: davin, pitrou, rpurdie
Priority: normal
Keywords:

Created on 2020-09-04 10:19 by rpurdie, last changed 2022-04-11 14:59 by admin.

Files
File name      Uploaded                    Description
simplified.py  rpurdie, 2020-09-04 10:19
Messages (3)
msg376350 - Author: Richard Purdie (rpurdie) Date: 2020-09-04 10:19
We're having some problems with multiprocessing.Queue where the parent process ends up hanging with zombie children. The code is part of bitbake, the task execution engine behind OpenEmbedded/Yocto Project.

I've cut down our code to the pieces in question in the attached file. Unfortunately it doesn't give a runnable test case, but it does at least show what we're doing. Basically, we have a set of items to parse, and we create a set of multiprocessing.Process() workers to handle the parsing in parallel. Jobs are queued in one queue and results are fed back to the parent via another. There is a quit queue that takes sentinels to tell the subprocesses to quit.
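As a rough illustration of that pattern, here is a minimal self-contained sketch (the names and the trivial "parsing" are illustrative only, not bitbake's actual code):

    import multiprocessing
    import queue

    def parser(jobs, results, quit):
        # Each worker pulls jobs until a sentinel shows up on the quit queue.
        while quit.empty():
            try:
                job = jobs.get(timeout=0.25)
            except queue.Empty:
                continue
            results.put(job.upper())  # stand-in for the real parsing work

    if __name__ == "__main__":
        jobs = multiprocessing.Queue()
        results = multiprocessing.Queue()
        quit = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=parser,
                                         args=(jobs, results, quit))
                 for _ in range(4)]
        for p in procs:
            p.start()
        for item in ["a", "b", "c"]:
            jobs.put(item)
        for _ in range(3):
            print(results.get())
        for _ in procs:      # one quit sentinel per worker
            quit.put(None)
        for p in procs:
            p.join()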

If something fails to parse, shutdown() is called with clean=False and the sentinels are sent, and each Parser() process calls results.cancel_join_thread() on the results queue. We do this since we don't care about the results any more; we just want to ensure everything exits cleanly. This is where things go wrong: the Parser processes and their queues all turn into zombies, and the parent process ends up stuck in self.result_queue.get(timeout=0.25) inside shutdown().

strace shows it has acquired the locks and is doing a read() on the os.pipe() it created. Unfortunately, since the parent still has a write end open to the same pipe, that read() never sees EOF and hangs indefinitely.
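The underlying pipe semantics can be demonstrated in a few lines of plain os.pipe(), separate from any queue code (a minimal illustration):

    import os

    r, w = os.pipe()
    # While any write end is still open -- including our own -- a read()
    # on the empty pipe blocks instead of returning EOF:
    #     os.read(r, 1)   # would hang here
    os.close(w)           # close the last write end...
    print(os.read(r, 1))  # ...and read() now returns b'' (EOF) immediately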

If I change the code to do:

        # Close the parent's copy of the write end so that reads can see
        # EOF once the children are gone, then drain any remaining results.
        self.result_queue._writer.close()
        while True:
            try:
                self.result_queue.get(timeout=0.25)
            except (queue.Empty, EOFError):
                break

i.e. close the write side of the pipe by poking at the queue internals, we don't see the hang. The public .close() method is no help here since it would close both sides.

We create our own process pool since this code dates from the Python 2.x days, and multiprocessing pools had issues back when we started using this. I'm sure they would be much better now, but we're reluctant to change what has basically been working. We drain the queues because in some cases we have clean shutdowns where cancel_join_thread() hasn't been used, and we don't want those cases to block.

My question is whether this is a known issue, and whether there is some kind of API to close just the write side of the Queue to avoid problems like this.
msg376351 - Author: Richard Purdie (rpurdie) Date: 2020-09-04 10:27
I should also add that if we don't use cancel_join_thread() in the parser processes, things all work out OK. There is therefore seemingly something odd about the state that call leaves things in.
This issue doesn't occur every time; it happens in maybe 1 in 40 runs where we throw parsing errors, but I can brute-force reproduce it.
msg376357 - Author: Richard Purdie (rpurdie) Date: 2020-09-04 11:43
Even my hack of calling _writer.close() doesn't seem to be enough; it makes the problem rarer, but there is still an issue.
Basically, if you call cancel_join_thread() in one process, the queue is potentially totally broken in every other process that may be using it. If, for example, one process has called join_thread() as it was exiting and has queued data, while at the same time another process exits via cancel_join_thread() and exits holding the write lock, you'll deadlock: the process stuck in join_thread() is waiting for a lock it will never get.
I suspect the answer is "don't use cancel_join_thread()", but perhaps the docs need a note to point out that if anything else is potentially exiting at the same time, it can deadlock. I'm not sure you can actually use the API safely unless you stop all users from exiting and synchronise that by other means.
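One way to "synchronise that by other means" might look like the sketch below: skip cancel_join_thread() entirely and have the parent keep draining the result queue until every worker has actually exited, so that each worker's queue feeder thread can always flush its buffered data and be joined. (Illustrative only, reusing the names from the sketch in msg376350; not a guaranteed fix.)

    import queue

    def shutdown(procs, quit, results):
        for _ in procs:                    # ask every worker to stop
            quit.put(None)
        # Keep draining until no worker is left alive, so no worker can
        # block forever trying to flush results into a full pipe.
        while any(p.is_alive() for p in procs):
            try:
                results.get(timeout=0.25)  # discard; we only drain
            except queue.Empty:
                pass
        for p in procs:
            p.join()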
History
Date                 User       Action  Args
2022-04-11 14:59:35  admin      set     github: 85880
2020-09-07 23:44:27  ned.deily  set     nosy: + pitrou, davin
2020-09-04 11:43:44  rpurdie    set     messages: + msg376357
2020-09-04 10:27:31  rpurdie    set     messages: + msg376351
2020-09-04 10:19:47  rpurdie    create