classification
Title: multiprocessing: serialization must ensure that contexts are compatible (the same)
Type: crash Stage: needs patch
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: arcivanov, augustogoulart, davin, pitrou, taleinat, vstinner
Priority: normal Keywords:

Created on 2018-04-01 05:58 by arcivanov, last changed 2018-11-14 17:28 by davin.

Files
File name Uploaded Description
test_lock_sigsegv.py arcivanov, 2018-04-01 05:58
testing_on_fedora.png augustogoulart, 2018-11-13 14:22
coredump arcivanov, 2018-11-14 09:31 coredump (Fedora 29)
Messages (11)
msg314762 - Author: Arcadiy Ivanov (arcivanov) Date: 2018-04-01 05:58
While working on GH gevent/gevent#993 I encountered a stall when trying to read from an mp.Queue passed to mp.Process's target as an argument. While trying to print out the lock state in the child process, I encountered a SEGV in Lock's __repr__. I originally thought it was due to gevent/greenlet stack magic, but it wasn't.

This happens when a `fork` context Queue (the default) is used with a `spawn` context Process (obvious stupidity on my part, but it still shouldn't crash).

Python 3.6.4 from PyEnv
Fedora 27

```
$ python test_lock_sigsegv.py 
Parent r_q: <Lock(owner=None)>, <Lock(owner=None)>, <BoundedSemaphore(value=2147483647, maxvalue=2147483647)>
-11
```

```
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __new_sem_getvalue (sem=0x7fc877f54000, sval=sval@entry=0x7fffb130db9c) at sem_getvalue.c:38
38        *sval = atomic_load_relaxed (&isem->data) & SEM_VALUE_MASK;
...
#0  __new_sem_getvalue (sem=0x7fc877f54000, sval=sval@entry=0x7fffb130db9c) at sem_getvalue.c:38
#1  0x00007f1116aeb202 in semlock_getvalue (self=<optimized out>) at /tmp/python-build.20171219170845.6548/Python-3.6.4/Modules/_multiprocessing/semaphore.c:531
```

At a minimum I think there should be a check trying to reduce arguments via incompatible context's process to prevent a SEGV.

Test attached.
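
For contrast with the mismatch described above, here is a minimal sketch of the supported pattern, in which the Queue and the Process are both created from the same context object. The names `_child` and `roundtrip` are hypothetical and are not taken from the attached test script.

```python
import multiprocessing as mp


def _child(q):
    # Runs in the child process; the queue was created by the same context
    # that started this process.
    q.put("hello from child")


def roundtrip(start_method):
    """Create the Queue and the Process from one context and exchange a message."""
    ctx = mp.get_context(start_method)  # e.g. "fork" or "spawn"
    q = ctx.Queue()                     # not mp.Queue(), which uses the default context
    p = ctx.Process(target=_child, args=(q,))
    p.start()
    value = q.get(timeout=30)
    p.join()
    return value, p.exitcode
```

With the "spawn" start method, `roundtrip` should be called from under an `if __name__ == "__main__":` guard, because the child re-imports the main module.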
msg314792 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-04-01 23:45
Thanks for the report.  Indeed I think it would be worth preventing this programmer error.
msg329491 - Author: Augusto Goulart (augustogoulart) * Date: 2018-11-09 01:12
I couldn't reproduce the error on Debian 9 or OSX, although I tried tweaking the test script a little bit to force the error. Arcadiy, did you try reproducing the same issue on a different platform? Did someone report something similar in recent issues on gevent?
msg329719 - Author: Tal Einat (taleinat) * (Python committer) Date: 2018-11-12 06:49
On Win10 I've also failed to reproduce the reported issue with the supplied script.  I tried with Python versions 3.6.3, 3.7.0, and a recent build of the master branch (to be 3.8).

Can someone try to reproduce this on Fedora?
msg329845 - Author: Augusto Goulart (augustogoulart) * Date: 2018-11-13 14:22
I've tested on Fedora 29 server and also failed to reproduce the error.
msg329892 - Author: Arcadiy Ivanov (arcivanov) Date: 2018-11-14 09:20
@gus.goulart you have reproduced it. The screenshot showing `-11` means the process dumped core. Because it's the child that dumps core, it's masked by abrt.

Observe:

$ python3 --version
Python 3.7.1
$ python3 ~/Downloads/test_lock_sigsegv.py 
Parent r_q: <Lock(owner=None)>, <Lock(owner=None)>, <BoundedSemaphore(value=2147483647, maxvalue=2147483647)>
-11
$ abrt
61bdd28 1x /usr/bin/python3.7 2018-11-14 04:18:06
$ uname -a
Linux myhost 4.18.17-300.fc29.x86_64 #1 SMP Mon Nov 5 17:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
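
The `-11` in the output above can be decoded mechanically. This is a minimal sketch, assuming the printed number is the child's `multiprocessing.Process.exitcode`, where a negative value -N means the child was terminated by signal N. The helper name `describe_exitcode` is hypothetical.

```python
import signal


def describe_exitcode(exitcode):
    """Turn a Process.exitcode value into a readable string."""
    if exitcode is None:
        return "still running"
    if exitcode < 0:
        # A negative value -N means the child was terminated by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal %d" % -exitcode
        return "killed by " + name
    return "exited with status %d" % exitcode
```

On Linux, signal 11 is SIGSEGV, so `describe_exitcode(-11)` reports a segmentation fault in the child even though the parent itself exits normally.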
msg329893 - Author: Arcadiy Ivanov (arcivanov) Date: 2018-11-14 09:23
@taleinat The above has been reproduced on Fedora 29.
msg329898 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-11-14 10:35
> At a minimum I think there should be a check trying to reduce arguments via incompatible context's process to prevent a SEGV.

I'm not sure that I understand the bug. The reproducer script passes a multiprocessing.Queue to a child process and then the child crashes when attempting to call multiprocessing.synchronize.Lock.__repr__().

Does the child reuse a copy of the lock of the parent process? Or does the child create a new SemLock?


I reproduced the bug on Fedora 26. I attached gdb to the child process. The crash occurs in sem_getvalue() in the child process.

Program received signal SIGSEGV, Segmentation fault.
0x00007f29a5156610 in sem_getvalue@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007f29a5156610 in sem_getvalue@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#1  0x00007f299c60e7bb in semlock_getvalue (self=0x7f299a95e2b0, _unused_ignored=0x0)
    at /home/haypo/prog/python/master/Modules/_multiprocessing/semaphore.c:541
#2  0x0000000000434537 in _PyMethodDef_RawFastCallKeywords (method=0x7f299c8102e0 <semlock_methods+192>, 
    self=<_multiprocessing.SemLock at remote 0x7f299a95e2b0>, args=0x7f299c5f47e8, nargs=0, kwnames=0x0) at Objects/call.c:629
#3  0x0000000000607aff in _PyMethodDescr_FastCallKeywords (descrobj=<method_descriptor at remote 0x7f299ca42520>, args=0x7f299c5f47e0, nargs=1, 
    kwnames=0x0) at Objects/descrobject.c:288
#4  0x0000000000512f92 in call_function (pp_stack=0x7ffd3591f730, oparg=1, kwnames=0x0) at Python/ceval.c:4595
(...)

(gdb) py-bt
Traceback (most recent call first):
  File "/home/haypo/prog/python/master/Lib/multiprocessing/synchronize.py", line 170, in __repr__
    elif self._semlock._get_value() == 1:
  File "/home/haypo/prog/python/master/test_lock_sigsegv.py", line 20, in child
    print("Child r_q: %r, %r, %r" % (r_q._rlock, r_q._wlock, r_q._sem), flush=True)
  File "/home/haypo/prog/python/master/Lib/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/haypo/prog/python/master/Lib/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/haypo/prog/python/master/Lib/multiprocessing/spawn.py", line 130, in _main
    return self._bootstrap()
  File "/home/haypo/prog/python/master/Lib/multiprocessing/spawn.py", line 629, in spawn_main
  File "<string>", line 1, in <module>
msg329908 - Author: Augusto Goulart (augustogoulart) * Date: 2018-11-14 13:53
@vstinner, on Debian 9 I can see the problem as well, but I wasn't able to debug it at the level of detail you did. Could you please share the process you followed?

What I found was:

./python -X dev test_lock_sigsegv.py
Parent r_q: <Lock(owner=None)>, <Lock(owner=None)>, <BoundedSemaphore(value=2147483647, maxvalue=2147483647)>
Fatal Python error: Segmentation fault

Current thread 0x00007fab36124480 (most recent call first):
  File "/home/gus/Workspace/cpython/Lib/multiprocessing/synchronize.py", line 170 in __repr__
  File "/home/gus/Workspace/cpython/test_lock_sigsegv.py", line 17 in child
  File "/home/gus/Workspace/cpython/Lib/multiprocessing/process.py", line 99 in run
  File "/home/gus/Workspace/cpython/Lib/multiprocessing/process.py", line 297 in _bootstrap
  File "/home/gus/Workspace/cpython/Lib/multiprocessing/spawn.py", line 130 in _main
  File "/home/gus/Workspace/cpython/Lib/multiprocessing/spawn.py", line 117 in spawn_main
  File "<string>", line 1 in <module>
-11
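
The "Fatal Python error" traceback above comes from the faulthandler module, which `-X dev` enables. A minimal sketch, assuming one wants a similar traceback from a child process without `-X dev`, is to enable faulthandler explicitly and write to a per-process file. The helper name `enable_crash_log` and the log file naming are hypothetical.

```python
import faulthandler
import os


def enable_crash_log(directory):
    """Enable faulthandler for this process, writing to a per-PID file."""
    path = os.path.join(directory, "crash-%d.log" % os.getpid())
    log = open(path, "w")
    # faulthandler keeps using the underlying file descriptor, so the file
    # must stay open until faulthandler.disable() is called.
    faulthandler.enable(file=log, all_threads=True)
    return log
```

Calling this at the top of the child's target function would record the Python traceback on SIGSEGV in that process's own file.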

Using GDB:

(gdb) set follow-fork-mode child
(gdb) run test_lock_sigsegv.py 
Starting program: /home/gus/Workspace/cpython/python test_lock_sigsegv.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Parent r_q: <Lock(owner=None)>, <Lock(owner=None)>, <BoundedSemaphore(value=2147483647, maxvalue=2147483647)>
[New process 4941]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
process 4941 is executing new program: /home/gus/Workspace/cpython/python
-11
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Inferior 2 (process 4941) exited normally]
(gdb) where
No stack.
(gdb) py-bt
Unable to locate python frame
(gdb)
msg329909 - Author: Arcadiy Ivanov (arcivanov) Date: 2018-11-14 14:08
@vstinner

> I'm not sure that I understand the bug.

The bug is, if a user makes an error and passes a Queue from context 'fork' to a child that is spawned using 'spawn', the passed Queue is, for obvious reasons, broken. 

The 'print("Child r_q: %r, %r, %r" % (r_q._rlock, r_q._wlock, r_q._sem), flush=True)' is simply a demonstration of a broken state of the SemLock observed in the child. 

The expected fix would be to stop the mixed context use of MP objects on the API level (ValueError?) or at least prevent a segfault.
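
The API-level check suggested above could look roughly like the following. This is only a sketch in user code, not how multiprocessing implements or would implement it. It assumes the caller still has both context objects at hand, and the helper name `ensure_same_start_method` is hypothetical.

```python
import multiprocessing as mp


def ensure_same_start_method(object_ctx, process_ctx):
    """Raise ValueError if the two contexts use different start methods."""
    made_with = object_ctx.get_start_method()
    started_with = process_ctx.get_start_method()
    if made_with != started_with:
        raise ValueError(
            "object created with %r context cannot be passed to a %r process"
            % (made_with, started_with)
        )


# Intended use, before handing a queue to a process:
#   queue_ctx = mp.get_context("fork")
#   proc_ctx = mp.get_context("spawn")
#   ensure_same_start_method(queue_ctx, proc_ctx)   # raises ValueError
```

Comparing start method names rather than context identity is a design choice of this sketch: two separately obtained "spawn" contexts are treated as compatible.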
msg329913 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-11-14 16:02
> The bug is, if a user makes an error and passes a Queue from context 'fork' to a child that is spawned using 'spawn', the passed Queue is, for obvious reasons, broken. 

Ok. I rewrote the issue title.
History
Date User Action Args
2018-11-14 17:28:52 davin set nosy: + davin
2018-11-14 16:02:18 vstinner set messages: + msg329913
2018-11-14 16:01:51 vstinner set title: SEGV in mp.synchronize.Lock.__repr__ in spawn'ed proc if ctx mismatched -> multiprocessing: serialization must ensure that contexts are compatible (the same)
2018-11-14 14:08:05 arcivanov set messages: + msg329909
2018-11-14 13:53:16 augustogoulart set messages: + msg329908
2018-11-14 10:35:40 vstinner set messages: + msg329898
2018-11-14 09:31:13 arcivanov set files: + coredump
2018-11-14 09:23:49 arcivanov set messages: + msg329893
2018-11-14 09:20:44 arcivanov set messages: + msg329892
2018-11-13 14:22:01 augustogoulart set files: + testing_on_fedora.png; messages: + msg329845
2018-11-12 06:49:01 taleinat set messages: + msg329719
2018-11-09 01:13:18 augustogoulart set nosy: + taleinat
2018-11-09 01:12:39 augustogoulart set nosy: + vstinner, augustogoulart; messages: + msg329491
2018-04-01 23:45:19 pitrou set versions: + Python 3.7, Python 3.8; nosy: + pitrou; messages: + msg314792; stage: needs patch
2018-04-01 05:58:11 arcivanov create