classification
Title: multiprocessing cleanup occasionally throws exception
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Jorge Herskovic, Marandil, davin, ned.deily, pitrou, sbt
Priority: normal Keywords: patch

Created on 2015-06-21 15:08 by Jorge Herskovic, last changed 2017-06-13 15:54 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
run_finalizers.patch pitrou, 2017-06-13 09:32 review
finalizestress.py pitrou, 2017-06-13 10:03
Pull Requests
URL Status Linked Edit
PR 2159 merged pitrou, 2017-06-13 10:31
PR 2166 merged pitrou, 2017-06-13 15:15
PR 2167 merged pitrou, 2017-06-13 15:23
PR 2168 merged pitrou, 2017-06-13 15:32
Messages (22)
msg245597 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-21 15:08
I'm writing a library on top of multiprocessing. As part of the test suite, I create and destroy dozens of processes repeatedly. About once in 50 runs, the tests complete successfully but the program crashes with:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 286, in _exit_function
    _run_finalizers(0)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 246, in _run_finalizers
    items = [x for x in list(_finalizer_registry.items()) if f(x)]
RuntimeError: dictionary changed size during iteration

(this is on OS X 10.10.3, Python 3.4.3; I've been unable to replicate the behavior on Python 2.7.9, 3.5.0b2, or pypy despite hundreds of runs)

The problematic code is available at https://github.com/jherskovic/mpetl and the tests should be run with nosetests
msg245629 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-22 13:47
This happens reliably between 1-5% of the time on my home Mac (a 4.0 GHz i7). My work Mac Pro, a lot slower, doesn't exhibit this behavior on the same 3.4.3 interpreter.

Could it be related to a concurrency issue in the interpreter?

We're attempting to replicate on other systems.
msg245687 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-23 13:01
More data:

I've been unable to replicate this behavior on Ubuntu 15.04 on Python 3.4.3 over ~20 thousand test runs. (Same machine, running in a VM)

An overnight repeated run on the original machine on OS X, 3.4.3, official distribution gave an actual frequency of 0.1% (7 crashes in 6980 runs).

Approximately 3,300 runs on Python 2.7.9 on the same machine have not resulted in any crashes.

I now built 3.4.3 from source using the latest clang and am trying it again, on the same machine, instead of using the official distribution. So far, 0 crashes in approximately 3,300 runs so far.

At the moment, then, the problem seems localized to the official 3.4.3 binaries for OS X.
msg245704 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-06-24 02:14
Can you show the version info from the Python 3.4 where you see the failures?

/usr/local/bin/python3.4 -c 'import sys;print(sys.version)'
msg245705 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-06-24 02:54
Exactly how are you building the Python 3.4 that does not fail for you?  FWIW, I am able to reproduce the failure easily on a 10.10.3 system using python.org 64-bit/32-bit 3.4.3, using a current MacPorts source-built 3.4.3 (which is built with clang), and repeatedly using the current head of the 3.4 branch (post-3.4.3) built with clang, both with and without pydebug enabled.
msg245707 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-24 03:27
The failing Python:
3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

The non-failing Python:
3.4.3 (default, Jun 23 2015, 06:33:02)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]

The non-failing python, I just downloaded the tarball and ran ./configure --prefix=/opt/python343, make, sudo make install, and then created a virtualenv using that binary.

Given what you're telling me, I'll run some more loops with the "non-failing" python. I'm grateful you've been able to reproduce the problem, FWIW.
msg245713 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-06-24 03:52
FWIW, building 3.4.3 from source as you described results in failures about 30% of the time.  With some of the earlier tests, it failed even more frequently.  This is running on a 2.2GHz I7.
msg245731 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-24 11:45
20,000+ iterations of the test suite with my homebuilt 3.4.3 throw no exceptions. 

Weirder and weirder. 

Any suggestions? Without any actual knowledge ("unencumbered by the thought process"), a concurrency issue in the interpreter itself is my guess. Especially since I can't reproduce it at all on other versions. 

I'll see if I can come up with a shorter, self-contained test that can illustrate the issue too.
msg245738 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-24 13:17
Finally able to repro on my old Mac Pro at work (Universal 64/32bit 3.4.3 from python.org, OS X 10.10.3). For some reason I can't quite fathom, there were two exceptions this time.

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 286, in _exit_function
    _run_finalizers(0)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 246, in _run_finalizers
    items = [x for x in list(_finalizer_registry.items()) if f(x)]
RuntimeError: dictionary changed size during iteration
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/util.py", line 185, in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/multiprocessing/queues.py", line 207, in _finalize_close
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/threading.py", line 342, in notify
TypeError: 'NoneType' object is not callable
msg245739 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-24 13:28
Ok, sorry, one more important thing I omitted in the previous comment.

^^^ The second exception above seems to be related to a new test I added, one that includes a daemonic process which itself relies on daemonic threads. 

The cleanup of that seems to be FUBARed, and it makes the original exception much more likely to pop up as well.
msg245810 - (view) Author: Jorge Herskovic (Jorge Herskovic) Date: 2015-06-25 15:06
Ok, I *think* I tracked it down to Connections getting cleaned up in a non-orderly fashion, sometimes while _run_finalizers is executing. The following patch to lib/python3.4/multiprocessing/connection.py cures the problem on my machine.

@@ -124,10 +124,10 @@
         self._handle = handle
         self._readable = readable
         self._writable = writable
-
+        self._shutdown = util.Finalize(self, self._finalizer)
     # XXX should we use util.Finalize instead of a __del__?
-
-    def __del__(self):
+
+    def _finalizer(self):
         if self._handle is not None:
             self._close()
msg250137 - (view) Author: Davin Potts (davin) * (Python committer) Date: 2015-09-07 23:09
The 'crash' type is reserved for core dump type situations; 'behavior' is more appropriate for the misbehavior your describe.
msg294546 - (view) Author: Marcin Słowik (Marandil) Date: 2017-05-26 11:09
I can confirm, I have managed to accidentally reproduce this issue on a custom build of 3.6.0 on RHEL.
sys.version:
'3.6.0 (default, Jan 18 2017, 11:23:11)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)]'

The issue popped out of nowhere, on a code that was working w/out any hiccups on multiple systems; after an insignificant change was made to an unrelated part of code.

I managed to "patch it up" by forcingly closing and joining all Queues before letting the main process to stop.

Unfortunately, I cannot share the repro code (it's a) too large, b) closed), but I can confirm the issue was still open in 3.6.0.
msg295876 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 09:25
The simple answer here is _run_finalizers() is not thread-safe, but finalizers can be registered or cancelled from any thread.

I could not write a simple synthetic script to reproduce the issue, but it should nevertheless be fixable.
msg295880 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 09:32
Marcin, Jorge, Ned, could one of you apply the following simple patch and see if that seems to solve the issue?
msg295886 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 10:03
Ok, I got to write a reproducer, which fails reliably here.
msg295927 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 15:11
New changeset 1eb6c0074d17f4fd425cacfdda893d65f5f77f0a by Antoine Pitrou in branch 'master':
bpo-24484: Avoid race condition in multiprocessing cleanup (#2159)
https://github.com/python/cpython/commit/1eb6c0074d17f4fd425cacfdda893d65f5f77f0a
msg295929 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 15:32
Python 2.7 doesn't suffer from this bug, so no actual fix is needed. Still, I'll backport the additional test.
msg295931 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 15:51
New changeset 6fd03459957ee53941183212457bba19f977679f by Antoine Pitrou in branch '3.6':
[3.6] bpo-24484: Avoid race condition in multiprocessing cleanup (GH-2159) (#2166)
https://github.com/python/cpython/commit/6fd03459957ee53941183212457bba19f977679f
msg295933 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 15:52
New changeset d09f1674d70dd84c5be37b5fe11bca4913146c6b by Antoine Pitrou in branch '2.7':
[2.7] bpo-24484: Avoid race condition in multiprocessing cleanup (GH-2159) (#2168)
https://github.com/python/cpython/commit/d09f1674d70dd84c5be37b5fe11bca4913146c6b
msg295934 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 15:54
New changeset a0ecaab006849ade08fbcf8413fe1e92d9c8d71c by Antoine Pitrou in branch '3.5':
[3.5] bpo-24484: Avoid race condition in multiprocessing cleanup (GH-2159) (#2167)
https://github.com/python/cpython/commit/a0ecaab006849ade08fbcf8413fe1e92d9c8d71c
msg295935 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-13 15:54
This should be fixed now.  Please reopen if not.
History
Date User Action Args
2017-06-13 15:54:48pitrousetstatus: open -> closed
resolution: fixed
messages: + msg295935

stage: commit review -> resolved
2017-06-13 15:54:12pitrousetmessages: + msg295934
2017-06-13 15:52:31pitrousetmessages: + msg295933
2017-06-13 15:51:30pitrousetmessages: + msg295931
2017-06-13 15:32:48pitrousetstage: needs patch -> commit review
messages: + msg295929
versions: - Python 2.7
2017-06-13 15:32:20pitrousetpull_requests: + pull_request2219
2017-06-13 15:23:00pitrousetpull_requests: + pull_request2218
2017-06-13 15:15:15pitrousetpull_requests: + pull_request2217
2017-06-13 15:11:25pitrousetmessages: + msg295927
2017-06-13 10:31:20pitrousetpull_requests: + pull_request2210
2017-06-13 10:03:26pitrousetfiles: + finalizestress.py

messages: + msg295886
2017-06-13 09:32:05pitrousetfiles: + run_finalizers.patch
keywords: + patch
messages: + msg295880
2017-06-13 09:25:11pitrousetversions: + Python 2.7, Python 3.5, Python 3.6, Python 3.7, - Python 3.4
nosy: + pitrou

messages: + msg295876

stage: needs patch
2017-05-26 11:09:31Marandilsetnosy: + Marandil
messages: + msg294546
2015-09-07 23:09:25davinsettype: crash -> behavior
messages: + msg250137
2015-06-25 15:06:21Jorge Herskovicsetmessages: + msg245810
2015-06-24 13:28:58Jorge Herskovicsetmessages: + msg245739
2015-06-24 13:17:03Jorge Herskovicsetmessages: + msg245738
2015-06-24 11:45:46Jorge Herskovicsetmessages: + msg245731
2015-06-24 03:52:42ned.deilysetmessages: + msg245713
2015-06-24 03:27:28Jorge Herskovicsetmessages: + msg245707
2015-06-24 02:54:33ned.deilysetmessages: + msg245705
2015-06-24 02:14:05ned.deilysetnosy: + ned.deily
messages: + msg245704
2015-06-23 13:01:52Jorge Herskovicsetmessages: + msg245687
2015-06-22 13:47:57Jorge Herskovicsetmessages: + msg245629
2015-06-21 22:08:37ned.deilysetnosy: + sbt, davin
2015-06-21 15:08:26Jorge Herskoviccreate