classification
Title: Segfault in test_multiprocessing
Type: Stage:
Components: Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, neologix, pitrou, python-dev
Priority: normal Keywords:

Created on 2011-06-10 09:11 by haypo, last changed 2011-06-17 14:10 by haypo. This issue is now closed.

Messages (13)
msg138060 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 09:11
test_multiprocessing segfaults in a loop. The crash occurs in _Condition.release() on waiter.release(), called from Queue._finalize_close(). Possible related changes:

 - a5c8b6ebe895: new sigwait() test using thread (issue #8407)
 - 6d6099f7fe89: add sentinels to multiprocessing (issue #9205)

Example of a crash:

[333/356/1] test_multiprocessing
Fatal Python error: Segmentation fault

Thread 0x01cde800:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/threading.py", line 237 in wait
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/queues.py", line 252 in _feed
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/threading.py", line 690 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/threading.py", line 737 in _bootstrap_inner
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/threading.py", line 710 in _bootstrap

Current thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/threading.py", line 298 in notify
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/queues.py", line 227 in _finalize_close
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/util.py", line 202 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/process.py", line 270 in _bootstrap
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/forking.py", line 133 in __init__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/process.py", line 134 in start
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/pool.py", line 220 in _repopulate_pool
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/pool.py", line 157 in __init__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/__init__.py", line 231 in Pool
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_multiprocessing.py", line 2186 in test_main
  File "./Lib/test/regrtest.py", line 1043 in runtest_inner
  File "./Lib/test/regrtest.py", line 841 in runtest
  File "./Lib/test/regrtest.py", line 668 in main
  File "./Lib/test/regrtest.py", line 1618 in <module>

There are approximatively 698 crashes in last tests on "x86 Tiger 3.x"!
http://www.python.org/dev/buildbot/all/builders/x86%20Tiger%203.x/builds/2722/steps/test/logs/stdio

Most occured on Queue._finalize_close() -> _Condition.release() -> waiter.release().
msg138064 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 09:26
> a5c8b6ebe895: new sigwait() test using thread (issue #8407)
> 6d6099f7fe89: add sentinels to multiprocessing (issue #9205)


The first segfaults occured in build #2719 (cedceeb45030):
http://www.python.org/dev/buildbot/all/builders/x86%20Tiger%203.x/builds/2719

In the timeline, 6d6099f7fe89 < cedceeb45030 < a5c8b6ebe895, so the change is more likely coming from 6d6099f7fe89.
msg138067 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-06-10 09:40
Victor, how can there be hundreds of crashes? Isn't the process supposed to terminate when a crash occurs?

There are several crashes in test_signal, so it's not only test_multiprocessing:


Thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 172 in test_main
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 407 in _executeTestPart
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 462 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 514 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1166 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1254 in _run_suite
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1280 in run_unittest
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 687 in test_main
  File "./Lib/test/regrtest.py", line 1043 in runtest_inner
  File "./Lib/test/regrtest.py", line 841 in runtest
  File "./Lib/test/regrtest.py", line 668 in main
  File "./Lib/test/regrtest.py", line 1618 in <module>
Fatal Python error: Segmentation fault


Thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 248 in test_wakeup_fd_early
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 407 in _executeTestPart
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 462 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 514 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1166 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1254 in _run_suite
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1280 in run_unittest
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 687 in test_main
  File "./Lib/test/regrtest.py", line 1043 in runtest_inner
  File "./Lib/test/regrtest.py", line 841 in runtest
  File "./Lib/test/regrtest.py", line 668 in main
  File "./Lib/test/regrtest.py", line 1618 in <module>
Fatal Python error: Segmentation fault



Thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 372 in readpipe_interrupted
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 394 in test_siginterrupt_on
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 407 in _executeTestPart
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 462 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 514 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1166 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1254 in _run_suite
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1280 in run_unittest
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 687 in test_main
  File "./Lib/test/regrtest.py", line 1043 in runtest_inner
  File "./Lib/test/regrtest.py", line 841 in runtest
  File "./Lib/test/regrtest.py", line 668 in main
  File "./Lib/test/regrtest.py", line 1618 in <module>
Fatal Python error: Segmentation fault



Thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 495 in test_itimer_prof
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 407 in _executeTestPart
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 462 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 514 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1166 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1254 in _run_suite
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1280 in run_unittest
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 687 in test_main
  File "./Lib/test/regrtest.py", line 1043 in runtest_inner
  File "./Lib/test/regrtest.py", line 841 in runtest
  File "./Lib/test/regrtest.py", line 668 in main
  File "./Lib/test/regrtest.py", line 1618 in <module>
Fatal Python error: Segmentation fault



Thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 472 in test_itimer_virtual
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 407 in _executeTestPart
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 462 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 514 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1166 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1254 in _run_suite
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1280 in run_unittest
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_signal.py", line 687 in test_main
  File "./Lib/test/regrtest.py", line 1043 in runtest_inner
  File "./Lib/test/regrtest.py", line 841 in runtest
  File "./Lib/test/regrtest.py", line 668 in main
  File "./Lib/test/regrtest.py", line 1618 in <module>
Fatal Python error: Segmentation fault
msg138069 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 09:42
> Victor, how can there be hundreds of crashes?
> Isn't the process supposed to terminate when a crash occurs?

Yes, a process does terminate on SIGSEGV, but multiprocessing creates subprocesses: I suppose that crashes occur in child processes.

For test_signal, I have to investigate this.
msg138074 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 11:02
Le vendredi 10 juin 2011 à 09:40 +0000, Antoine Pitrou a écrit :
> There are several crashes in test_signal

Commit a17710e27ea2 should fix some (all?) test_signal crashes.
msg138132 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-10 21:10
It looks like the sentinel doesn't handle fatal death of the child process:

test test_multiprocessing crashed -- Traceback (most recent call last):
  File "./Lib/test/regrtest.py", line 1043, in runtest_inner
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_multiprocessing.py", line 2189, in test_main
    ManagerMixin.manager.start()
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/managers.py", line 531, in start
    self._address = reader.recv()
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/connection.py", line 273, in recv
    buf = self._recv_bytes(sentinels=sentinels)
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4, sentinels)
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/multiprocessing/connection.py", line 413, in _recv
    raise EOFError
EOFError
msg138154 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-11 09:59
I think it might be related to Issue #6721.

Using a mutex/condition variable after fork (from the child process) is unsafe: it can lead to deadlocks, and on OS-X, it seems like it can lead to segfaults.

Normally, Queue's synchronization primitives are reinitialized after fork, see Queue._after_fork() method.

But here, what happens is the following (well, that's an hypothesis):

Lib/multiprocessing/process.py", line 270 in _bootstrap
:
        _current_process = self
        util._finalizer_registry.clear()
        util._run_after_forkers()

- the old _current_process' reference count drops to zero
- it's deallocated, and since it holds the last reference to a Queue, the queue is finalized
- as a consequence, the Queue._finalize_close() callback (registered through a Finalize object) is called
- Queue._finalize_close() tries to acquire, signal and release the _notempty Condition, but the Condition hasn't been reinitialized yet, because util._run_after_forkers() is called 2 lines below
- segfault

It's probably been triggered by Antoine's patches, but I'm pretty sure this bug has always been there.

I think that moving util._run_after_forkers() up 2 lines should fix the segfaults, but with that change test_number_of_objects fails (I didn't investigate why).
msg138163 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-06-11 16:12
Less disruptive approach:

        old_process = _current_process
        _current_process = self
        try:
            util._finalizer_registry.clear()
            util._run_after_forkers()
        finally:
            del old_process

This will delay finalization of the old process object until after _run_after_forkers() is executed, without (hopefully) messing with semantics.
msg138168 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-06-11 16:44
Yes, I also tried this.
This fixed the test_multiprocessing failure, but I think it triggered a failure in test_concurrent_futures (didin't look in more detail).
Would it be possible to try this on the buildbot to see if it fixes the segfaults?
msg138190 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-11 23:34
> Would it be possible to try this on the buildbot to see if it fixes
> the segfaults?

You can fork cpython, modify the code, and run a custom buildbot on your
clone.
msg138494 - (view) Author: Roundup Robot (python-dev) Date: 2011-06-17 10:38
New changeset e6e7e42efdc2 by Victor Stinner in branch '3.2':
Issue #12310: finalize the old process after _run_after_forkers()
http://hg.python.org/cpython/rev/e6e7e42efdc2

New changeset a73e5c1f57d7 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #12310: finalize the old process after _run_after_forkers()
http://hg.python.org/cpython/rev/a73e5c1f57d7
msg138495 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-17 10:39
Let's try on "real" buildbots. If the commit fixes the issue on 3.x, I will port the fix to Python 2.7.
msg138511 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-17 14:10
test_multiprocessing pass with success on PPC Tiger 3.x (and x86 Tiger 3.x, but the segfaults only occurred on PPC), but this issue is a sporadic issue. I close the issue because I hope that it is closed, but reopen it if you still see segfaults on PPC Tiger 3.x.
History
Date User Action Args
2011-06-17 14:10:27hayposetstatus: open -> closed
resolution: fixed
messages: + msg138511
2011-06-17 10:39:25hayposetmessages: + msg138495
2011-06-17 10:38:34python-devsetnosy: + python-dev
messages: + msg138494
2011-06-11 23:34:21hayposetmessages: + msg138190
2011-06-11 16:44:45neologixsetmessages: + msg138168
2011-06-11 16:12:38pitrousetmessages: + msg138163
2011-06-11 09:59:03neologixsetnosy: + neologix
messages: + msg138154
2011-06-10 21:10:43hayposetmessages: + msg138132
2011-06-10 11:02:09hayposetmessages: + msg138074
2011-06-10 09:42:49hayposetmessages: + msg138069
2011-06-10 09:40:29pitrousetmessages: + msg138067
2011-06-10 09:26:48hayposetmessages: + msg138064
2011-06-10 09:11:53haypocreate