classification
Title: multiprocessing: passing file descriptor using reduction breaks duplex pipes on darwin
Type: behavior Stage:
Components: Library (Lib), macOS Versions: Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: davin, frickenate, ned.deily, pitrou, ronaldoussoren
Priority: normal Keywords:

Created on 2017-12-05 08:56 by frickenate, last changed 2017-12-06 12:06 by ronaldoussoren.

Messages (5)
msg307649 - Author: Nate (frickenate) Date: 2017-12-05 08:56
In multiprocessing/reduction.py, the sendfds() and recvfds() functions contain a darwin-specific workaround, enabled by the "ACKNOWLEDGE" constant. The code references issue #14669 as the reason it was added in the first place. This bug exists in both 3.6.3 and the latest 3.7.0a2.

When a file descriptor is received, this workaround/hack sends an acknowledgement message to the sender. The problem is that this completely breaks Duplex pipes depending on the timing of the acknowledgement messages, as your "sock.send(b'A')" and "sock.recv(1) != b'A'" calls are being interwoven with my own messages.

Specifically, I have a parent process with child processes. I send socket file descriptors from the parent to the children, and am also duplexing messages from the child processes to the parent. If I am in the process of sending/receiving a message around the same time as your workaround is performing this acknowledge step, then your workaround corrupts the pipe. 

In a multi-process program, each end of a pipe must only be read or written to by a single process, but this workaround breaks this requirement. A different workaround must be found for the original bug that prompted this "acknowledge" step to be added, because library code must not be interfering with the duplex pipe.
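
To make the failure mode concrete, here is a minimal single-process sketch of the interleaving. It is not the code from multiprocessing/reduction.py: send_with_ack() is a hypothetical stand-in that only models "send something, then wait for a one-byte b'A' acknowledgement on the same socket", and no real file descriptor is passed. The application message is written first, so the collision that is timing-dependent in a real program happens every time here.

```python
import socket


def send_with_ack(sock, payload):
    """Hypothetical model of a send followed by a one-byte ack wait."""
    sock.sendall(payload)
    ack = sock.recv(1)  # reads whatever byte arrives next on the duplex channel
    if ack != b'A':
        raise RuntimeError(f'expected ack, got {ack!r}')


def demo():
    # Both ends live in one process purely to keep the sketch deterministic.
    parent, child = socket.socketpair()
    try:
        # The "child" sends an application message on the same duplex
        # channel before it gets around to acknowledging anything.
        child.sendall(b'hello')
        try:
            send_with_ack(parent, b'F')
        except RuntimeError as exc:
            return str(exc)
        return 'no corruption'
    finally:
        parent.close()
        child.close()


result = demo()
print(result)
```

The sender consumes the first byte of the application message in place of the acknowledgement, which leaves the rest of that message truncated for whoever reads it next.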
msg307650 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-05 09:20
See https://bugs.python.org/issue6560 for the original issue delineating the problems we had with fd passing on macOS.

I don't know whether Apple finally fixed the underlying issue.  If that was the case, I assume we might be seeing "unexpected successes" in test_socket?  Ned, Ronald, is that right?

Nate, if you want to investigate the underlying issue and see whether the workaround is still needed and/or another workaround is possible, your help is welcome.
msg307700 - Author: Nate (frickenate) Date: 2017-12-06 04:08
According to https://developer.apple.com/library/content/qa/qa1541/_index.html some bugs were fixed in 10.5. Not sure if the original attempt to patch the problem was happening on < 10.5, or if this was still a problem in 10.5+.

I can't for the life of me find it again, but I had found another source that claimed the true fixes for OS X came out with 10.7.

In any case, because this code is specifically part of the multiprocessing package, where it should be *expected* that multiple processes access the pipe, it's disastrous for this code to be reading/writing an acknowledge packet in this manner.

This is a hard case to test for, as timing matters. The duplex pipe doesn't get confused/corrupted unless one process is sending/receiving a message over the pipe at the same moment that another process is executing your acknowledge logic. It's reproducible, but not 100%.

Personally, I've restructured to using one pipe exclusively for file descriptor passing, and using a separate Queue (or Pipe pair) for custom message passing. If a better fix cannot be established, at a minimum the documentation for multiprocessing and the Pipe class should be updated with a big red warning about passing file descriptors on OS X/macOS/darwin.
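
A rough sketch of that restructuring, under some assumptions: it uses raw socket.socketpair() channels and SCM_RIGHTS via sendmsg()/recvmsg() rather than multiprocessing.Pipe, it runs both ends in one process for brevity, and it is Unix-only. The names send_fd(), recv_fd(), demo() and the channel variables are hypothetical. The point is only that an acknowledgement byte confined to a dedicated fd channel cannot collide with application messages on the other channel.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array('i', [fd])
    sock.sendmsg([b'F'], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent by send_fd()."""
    fds = array.array('i')
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial item before decoding.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError('no file descriptor received')


def demo():
    fd_parent, fd_child = socket.socketpair()    # carries fds and acks only
    msg_parent, msg_child = socket.socketpair()  # carries application messages only
    r, w = os.pipe()
    received = None
    try:
        msg_child.sendall(b'hello')   # application traffic already in flight
        send_fd(fd_parent, w)
        received = recv_fd(fd_child)
        fd_child.sendall(b'A')        # the ack stays on the fd channel
        ack = fd_parent.recv(1)
        app = msg_parent.recv(5)
        os.write(received, b'ok')     # the received fd refers to the same pipe
        data = os.read(r, 2)
        return ack, app, data
    finally:
        for s in (fd_parent, fd_child, msg_parent, msg_child):
            s.close()
        for fd in (r, w, received):
            if fd is not None:
                os.close(fd)


ack, app, data = demo()
print(ack, app, data)
```

In a real program the two ends would live in different processes, and the same separation applies: nothing except fd transfers and their acknowledgements is ever written to the fd channel.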
msg307714 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-12-06 10:17
On 06/12/2017 at 05:08, Nate wrote:
>
> This is a hard case to test for, as timing matters. The duplex pipe doesn't get confused/corrupted unless one process is sending/receiving a message over the pipe at the same moment that another process is executing your acknowledge logic. It's reproducible, but not 100%.

Our test runner has support for running a test in a loop until it fails.
For example `./python -m test -m "*FDPass*" -F -v test_socket`

Combined with perhaps a new test case, this could help you diagnose if
indeed the workaround is obsolete.

(perhaps our resident macOS experts can help too :-))
msg307733 - Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2017-12-06 12:06
I don't know if the issue has been fixed on macOS, and I'd be surprised if Ned would know this without testing.

Anyway, I think it is worthwhile to perform the testing that Antoine mentioned on a recent version of macOS (I'd start on 10.13, then work backward if the issue isn't present there).

A big question is how far back we want to support. The binary installers still support macOS 10.6, even though that's long out of support.
History
Date User Action Args
2017-12-06 12:06:38 ronaldoussoren set messages: + msg307733
2017-12-06 10:17:53 pitrou set messages: + msg307714
2017-12-06 04:08:14 frickenate set messages: + msg307700
2017-12-05 09:20:16 pitrou set nosy: + davin
2017-12-05 09:20:06 pitrou set nosy: + ronaldoussoren, pitrou, ned.deily; messages: + msg307650; components: + macOS
2017-12-05 08:56:06 frickenate create