
classification
Title: multiprocessing deadlock
Type: crash
Stage: resolved
Components: Library (Lib)
Versions: Python 3.6
process
Status: closed
Resolution: third party
Dependencies:
Superseder:
Assigned To:
Nosy List: Windson Yang, gobbedy, vstinner
Priority: normal
Keywords:

Created on 2018-07-06 03:42 by gobbedy, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name: multiprocess_torch.py
Uploaded: gobbedy, 2018-07-06 03:42
Messages (5)
msg321146 - Author: Guillaume Perrault-Archambault (gobbedy) Date: 2018-07-06 03:42
The simple code attached causes a deadlock on Linux.

The problem is that I have to tweak it slightly depending on the distro and Python version to get it to deadlock.

On the cluster I use the most (Python 3.6.3, CentOS Linux release 7.4.1708, PyTorch 0.4.0 with no CUDA), the attached code causes a deadlock.
msg321175 - Author: Windson Yang (Windson Yang) Date: 2018-07-06 15:18
I can't reproduce the deadlock; maybe it's related to the torch package? Can you try without torch to see whether it still happens?
msg321176 - Author: STINNER Victor (vstinner) (Python committer) Date: 2018-07-06 15:21
IMHO it's an issue with your usage of the torch module, which is not part of the Python stdlib, so I suggest closing this issue as "third party" or "not a bug".
msg321185 - Author: Guillaume Perrault-Archambault (gobbedy) Date: 2018-07-06 17:24
Hi Victor and Yang,

Thanks for your fast replies.

I did initially think it could be a torch issue. Indeed, I have an
equivalent numpy testcase that does not deadlock. However, the fact that it
gets stuck inside a multiprocessing wait statement makes me think it's
still a multiprocessing issue.

I've spent two weeks full time on this issue. Over at the torch forums I've
had no replies
(https://discuss.pytorch.org/t/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch/20473).

On Stack Overflow I only got a workaround suggestion that works sporadically
(https://stackoverflow.com/questions/51093970/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch).
Basically I can get rid of the deadlock (sometimes) if I impose only one
thread per process. But this is not a solution anyway.
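
For reference, a "one thread per process" restriction is usually applied by capping the thread pools of the numerical backends. The sketch below is only an illustration, not the attached testcase: it assumes the backend in use honours the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables and that they are set before the numerical library is imported. The function name threads_seen_by_worker is hypothetical and only reports what each worker inherits.

```python
import os

# Assumption: the numerical backend reads these variables when it is first
# imported, so they are set before any such import happens.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import multiprocessing as mp

# A heavy import such as numpy or torch would go here, after the variables
# above have been set.


def threads_seen_by_worker(_):
    # Child processes inherit the parent's environment, so each worker
    # sees the same thread cap as the parent.
    return os.environ.get("OMP_NUM_THREADS")


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(pool.map(threads_seen_by_worker, range(2)))
```

Setting the variables in the parent before the import is a deliberate choice here: a backend that has already started its thread pool may not re-read them later.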

I have tried stepping through the code, but because it is multiprocessed,
you cannot step through it (at least not in the conventional way, since the
main thread is not doing the heavy lifting).

I've tried adding print statements in the multiprocessing library and
mucking around with it a bit, but debugging multi-processed code in this way
is an absolute nightmare because you can't even trust the order in which
print statements display on the screen. And probably more relevant, I'm out
of my league here.
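
One way to see where a stuck worker is blocked, without a debugger or print statements, is the standard library's faulthandler module, which can dump the tracebacks of all threads after a timeout. The following is a sketch only: install_hang_diagnostics and work are hypothetical names, and the 30 second timeout is arbitrary.

```python
import faulthandler
import multiprocessing as mp
import sys


def install_hang_diagnostics(timeout=30.0, file=None):
    """Pool initializer: report where this process is if it runs too long."""
    target = file if file is not None else sys.stderr
    # After `timeout` seconds, and every `timeout` seconds after that, write
    # the traceback of every thread in this process to `target`.
    faulthandler.dump_traceback_later(timeout, repeat=True, file=target)


def work(x):
    # Placeholder for the real per-item computation.
    return x + 1


if __name__ == "__main__":
    # Each worker arms its own watchdog when it starts.
    with mp.Pool(2, initializer=install_hang_diagnostics, initargs=(30.0,)) as pool:
        print(pool.map(work, range(4)))
```

Because each process writes its own dump with its own stack, the output shows which call a hung worker is sitting in, independently of the order in which prints from different processes appear.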

I'm really at a complete dead end. I'm blocked and my work cannot progress
without fixing this issue. I'd be very grateful if you could try to
reproduce and rule out the multiprocessing library. If you need help
reproducing I can send a different testcase that deadlocked on my friend's
Mac (for him, the original testcase did not deadlock).

The testcase I attached in my original post sometimes deadlocks and
sometimes doesn't, depending on the machine I run it on. So I'm not
surprised you got no deadlock when you tried to reproduce.

I can always get it deadlocking on Linux/Mac though, by tweaking the code.

To give you a sense of how unreliably it deadlocks, just removing the for
loop in the code (which is outside the multiprocessing portion of the
code!) somehow gets rid of the deadlock. Also, it never deadlocks on
Windows.
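
One difference between the platforms that may be relevant: on Windows, multiprocessing only supports the "spawn" start method, while on Linux with Python 3.6 the default is "fork". Forking a parent that has already started threads is a well-known source of deadlocks in general, but whether that is the cause here is a guess and is not confirmed anywhere in this thread. A minimal sketch of requesting "spawn" explicitly follows; sqrt_all is a hypothetical helper, and math.sqrt stands in for the real work.

```python
import math
import multiprocessing as mp


def sqrt_all(values, method="spawn"):
    # get_context returns an object with the same API as the multiprocessing
    # module, bound to one start method, without changing the global default.
    ctx = mp.get_context(method)
    with ctx.Pool(2) as pool:
        return pool.map(math.sqrt, values)


if __name__ == "__main__":
    print(mp.get_all_start_methods())
    print(sqrt_all([1.0, 4.0, 9.0]))
```

With "spawn", each worker starts a fresh interpreter instead of inheriting the parent's memory, so the entry point must stay under the `if __name__ == "__main__":` guard.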

If you could provide any help on this issue I'd be very grateful.

Regards,
Guillaume.

msg321188 - Author: Guillaume Perrault-Archambault (gobbedy) Date: 2018-07-06 18:51
A friend of mine has suggested a fix that seems to work for now (upgrading numpy from 1.14.3 to 1.14.5). This makes no sense at all, but it does seem to work. I have a strong suspicion that this is just masking the problem and that it will reappear.

However, since it works I would not want you to waste any time on this. I will reopen if the deadlock reappears!

I do apologize if you already spent a lot of time on this.

Regards,
Guillaume
History
Date                 User          Action  Args
2022-04-11 14:59:02  admin         set     github: 78240
2018-07-06 18:51:56  gobbedy       set     status: open -> closed; resolution: third party; messages: + msg321188; stage: resolved
2018-07-06 17:24:58  gobbedy       set     messages: + msg321185
2018-07-06 15:21:47  vstinner      set     nosy: + vstinner; messages: + msg321176
2018-07-06 15:18:04  Windson Yang  set     nosy: + Windson Yang; messages: + msg321175
2018-07-06 03:42:18  gobbedy       create