We're having some problems with multiprocessing.Queue where the parent process ends up hanging with zombie children. The code is part of bitbake, the task execution engine behind OpenEmbedded/Yocto Project.
I've cut down our code to the pieces in question in the attached file. It doesn't give a runnable test case unfortunately but does at least show what we're doing. Basically, we have a set of items to parse, we create a set of multiprocessing.Process() processes to handle the parsing in parallel. Jobs are queued in one queue and results are fed back to the parent via another. There is a quit queue that takes sentinels to cause the subprocesses to quit.
If something fails to parse, shutdown with clean=False is called, the sentinels are sent. the Parser() process calls results.cancel_join_thread() on the results queue. We do this since we don't care about the results any more, we just want to ensure everyting exits cleanly. This is where things go wrong. The Parser processes and their queues all turn into zombies. The parent process ends up stuck in self.result_queue.get(timeout=0.25) inside shutdown().
strace shows its acquired the locks and is doing a read() on the os.pipe() it created. Unfortunately since the parent still has a write channel open to the same pipe, it hangs indefinitely.
If I change the code to do:
self.result_queue._writer.close()
while True:
try:
self.result_queue.get(timeout=0.25)
except (queue.Empty, EOFError):
break
i.e. close the writer side of the pipe by poking at the queue internals, we don't see the hang. The .close() method would close both sides.
We create our own process pool since this code dates from python 2.x days and multiprocessing pools had issues back when we started using this. I'm sure it would be much better now but we're reluctant to change what has basically been working. We drain the queues since in some cases we have clean shutdowns where cancel_join_thread() hasn't been used and we don't want those cases to block.
My question is whether this is a known issue and whether there is some kind of API to close just the write side of the Queue to avoid problems like this?
|