classification
Title: signal_handler() is not reentrant: deadlock in Py_AddPendingCall()
Type: Stage:
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, pitrou, python-dev
Priority: normal Keywords: patch

Created on 2011-04-05 00:19 by haypo, last changed 2011-04-18 14:48 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
signal_reentrant.patch haypo, 2011-04-18 10:40 review
signal_versionadded.patch haypo, 2011-04-18 10:57
Messages (17)
msg132993 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-05 00:19
test_threadsignals hangs on "x86 Tiger 3.x" and "PPC Tiger 3.x":
---------------------
[279/354] test_threadsignals
Thread 0xa000d000:
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_threadsignals.py", line 46 in test_signals
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 387 in _executeTestPart
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 442 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/case.py", line 494 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1078 in run
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1166 in _run_suite
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/support.py", line 1192 in run_unittest
  File "/Users/db3l/buildarea/3.x.bolen-tiger/build/Lib/test/test_threadsignals.py", line 210 in test_main
  File "./Lib/test/regrtest.py", line 1032 in runtest_inner
  File "./Lib/test/regrtest.py", line 826 in runtest
  File "./Lib/test/regrtest.py", line 650 in main
  File "./Lib/test/regrtest.py", line 1607 in <module>
make: *** [buildbottest] Error 1
program finished with exit code 2
elapsedTime=3365.320090
---------------------
http://www.python.org/dev/buildbot/all/builders/x86%20Tiger%203.x/builds/2289/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/PPC%20Tiger%203.x/builds/1700/steps/test/logs/stdio

test_threadsignals hangs at:
----------------
class ThreadSignals(unittest.TestCase):
    def test_signals(self):
        signalled_all.acquire()
        self.spawnSignallingThread()
        signalled_all.acquire() <~~~~ here
----------------

self.spawnSignallingThread() calls:
----------------
def send_signals():
    os.kill(process_pid, signal.SIGUSR1)
    os.kill(process_pid, signal.SIGUSR2)
    signalled_all.release()
----------------

The hang may be related to regrtest timeout: faulthandler had a bug, but the bug is supposed to be fixed (#11753, #11755).

See also issue #11223.
msg132995 - (view) Author: Roundup Robot (python-dev) Date: 2011-04-05 00:30
New changeset d14eac872a46 by Victor Stinner in branch 'default':
Issue #11768: add debug messages in test_threadsignals.test_signals
http://hg.python.org/cpython/rev/d14eac872a46
msg133257 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-07 20:59
The failure occurs also on Leopard:
----------------------
[159/354] test_threadsignals
Thread 0x00007fff7080f700:
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/test/test_threadsignals.py", line 53 in test_signals
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/case.py", line 387 in _executeTestPart
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/case.py", line 442 in run
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/case.py", line 494 in __call__
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/suite.py", line 105 in run
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/test/support.py", line 1078 in run
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/test/support.py", line 1166 in _run_suite
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/test/support.py", line 1192 in run_unittest
  File "/Users/pythonbuildbot/buildarea/3.x.hansen-osx-x86-2/build/Lib/test/test_threadsignals.py", line 218 in test_main
  File "./Lib/test/regrtest.py", line 1032 in runtest_inner
  File "./Lib/test/regrtest.py", line 826 in runtest
  File "./Lib/test/regrtest.py", line 650 in main
  File "./Lib/test/regrtest.py", line 1607 in <module>
make: *** [buildbottest] Error 1
program finished with exit code 2
elapsedTime=3089.949146
----------------------
http://www.python.org/dev/buildbot/all/builders/AMD64%20Leopard%203.x/builds/1132/steps/test/logs/stdio

Even with d14eac872a46, this trace doesn't give more information :-/ Should I flush manually sys.stderr?
msg133267 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-07 22:14
jseutter reproduced it after 112 and 49 runs on Snow Leopard using "regrtest.py -F -v --timeout=60 test_threadsignals" command:
-----------
[ 49] test_threadsignals
test_interrupted_timed_acquire (test.test_threadsignals.ThreadSignals) ... ok
test_lock_acquire_interruption (test.test_threadsignals.ThreadSignals) ... ok
test_lock_acquire_retries_on_intr (test.test_threadsignals.ThreadSignals) ... ok
test_rlock_acquire_interruption (test.test_threadsignals.ThreadSignals) ... ok
test_rlock_acquire_retries_on_intr (test.test_threadsignals.ThreadSignals) ... ok
test_signals (test.test_threadsignals.ThreadSignals) ... test_signals: acquire lock (thread 140735082572960)
test_signals: wait lock (thread 140735082572960)
Thread 0x00007fff709aaca0:
  File "/Users/jseutter/code/python/cpython/Lib/test/test_threadsignals.py", line 55 in test_signals
  File "/Users/jseutter/code/python/cpython/Lib/unittest/case.py", line 387 in _executeTestPart
  File "/Users/jseutter/code/python/cpython/Lib/unittest/case.py", line 442 in run
  File "/Users/jseutter/code/python/cpython/Lib/unittest/case.py", line 494 in __call__
  File "/Users/jseutter/code/python/cpython/Lib/unittest/suite.py", line 105 in run
  File "/Users/jseutter/code/python/cpython/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/jseutter/code/python/cpython/Lib/unittest/suite.py", line 105 in run
  File "/Users/jseutter/code/python/cpython/Lib/unittest/suite.py", line 67 in __call__
  File "/Users/jseutter/code/python/cpython/Lib/unittest/runner.py", line 168 in run
  File "/Users/jseutter/code/python/cpython/Lib/test/support.py", line 1166 in _run_suite
  File "/Users/jseutter/code/python/cpython/Lib/test/support.py", line 1192 in run_unittest
  File "/Users/jseutter/code/python/cpython/Lib/test/test_threadsignals.py", line 221 in test_main
  File "Lib/test/regrtest.py", line 1032 in runtest_inner
  File "Lib/test/regrtest.py", line 826 in runtest
  File "Lib/test/regrtest.py", line 650 in main
  File "Lib/test/regrtest.py", line 1607 in <module>
-----------

So for an unknown reason, send_signals() thread was not created or doesn't start.

Example of the output of a success:
----------
test_signals: acquire lock (thread 140735082572960)
test_signals: wait lock (thread 140735082572960)
send_signals: enter (thread 4326952960)
send_signals: raise SIGUSR1
send_signals: raise SIGUSR2
send_signals: release signalled_all
send_signals: exit (thread 4326952960)
test_signals: lock acquired
----------
msg133268 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-07 22:19
Oh, the traceback only contains one thread, so send_signals() thread was not created or failed very quickly (before its first print instruction).
msg133941 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-17 23:12
Oh, I reproduced the bug.
------------------------
[3021] test_threadsignals
test_signals (test.test_threadsignals.ThreadSignals) ... test_signals: acquire lock (thread -1610559488)
test_signals: wait lock (thread -1610559488)
send_signals: enter (thread 27159552)
send_signals: raise SIGUSR1
send_signals: raise SIGUSR2
send_signals: release signalled_all
send_signals: exit (thread 27159552)

^C
------------------------

There are 2 threads: faulthandler thread and the mainthread. Trace of the main thread:
-------
(gdb) where
#0  0xffff027a in ___spin_lock () at /System/Library/Frameworks/System.framework/PrivateHeaders/i386/cpu_capabilities.h:216
#1  0x90001433 in pthread_mutex_lock ()
#2  0x0011b34c in PyThread_acquire_lock_timed (lock=0x1124f10, microseconds=0, intr_flag=0) at Python/thread_pthread.h:450
#3  0x0011b73e in PyThread_acquire_lock (lock=0x1124f10, waitflag=0) at Python/thread_pthread.h:531
#4  0x000775b2 in Py_AddPendingCall (func=0x713dd <checksignals_witharg>, arg=0x0) at Python/ceval.c:509
#5  0x00071445 in signal_handler (sig_num=31) at ./Modules/signalmodule.c:189
#6  <signal handler called>
#7  0x9010d494 in spin_unlock ()
#8  0x90001620 in pthread_mutex_lock ()
#9  0x0011b34c in PyThread_acquire_lock_timed (lock=0x1124f10, microseconds=0, intr_flag=0) at Python/thread_pthread.h:450
#10 0x0011b73e in PyThread_acquire_lock (lock=0x1124f10, waitflag=0) at Python/thread_pthread.h:531
#11 0x000775b2 in Py_AddPendingCall (func=0x713dd <checksignals_witharg>, arg=0x0) at Python/ceval.c:509
#12 0x00071445 in signal_handler (sig_num=30) at ./Modules/signalmodule.c:189
#13 <signal handler called>
#14 0x900248c7 in semaphore_wait_signal_trap ()
#15 0x900288b4 in pthread_cond_wait ()
#16 0x0011b52c in PyThread_acquire_lock_timed (lock=0x1171180, microseconds=-1, intr_flag=1) at Python/thread_pthread.h:476
#17 0x00180746 in acquire_timed (lock=0x1171180, microseconds=-1) at ./Modules/_threadmodule.c:66
#18 0x001809aa in lock_PyThread_acquire_lock (self=0x14ff820, args=0x512038, kwds=0x0) at ./Modules/_threadmodule.c:133
#19 0x0005f58e in PyCFunction_Call (func=0x166c5f8, arg=0x512038, kw=0x0) at Objects/methodobject.c:84
#20 0x0008c1cd in call_function (pp_stack=0xbfff5f64, oparg=0) at Python/ceval.c:3859
#21 0x000869dd in PyEval_EvalFrameEx (f=0x1193798, throwflag=0) at Python/ceval.c:2657
...
-------

The main thread was waiting test_signals() lock (signalled_all) while it is was interrupted by a signal. The signal handler calls Py_AddPendingCall() which blocks on acquiring "pending_lock".
msg133942 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-17 23:27
> The main thread was waiting test_signals() lock (signalled_all) 
> while it is was interrupted by a signal. The signal handler calls
> Py_AddPendingCall() which blocks on acquiring "pending_lock".

Oh, the main thread receives SIGUSR1: the signal handler calls Py_AddPendingCall() to process it later. But while calling Py_AddPendingCall(), it receives a SIGUSR2: the same signal handler is called, and it calls Py_AddPendingCall().

And we have a deadlock because we try to acquire the lock twice, or try to acquire the lock before the lock was released.
msg133943 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-17 23:59
> The signal handler calls Py_AddPendingCall() which blocks on acquiring 
> "pending_lock".

It blocks in taking the mutex, not on waiting for the condition variable. Otherwise it wouldn't block (microseconds = 0).

I think the solution is to protect signal_handler() against reentrancy: 

diff --git a/Modules/signalmodule.c b/Modules/signalmodule.c
--- a/Modules/signalmodule.c
+++ b/Modules/signalmodule.c
@@ -185,10 +185,12 @@ signal_handler(int sig_num)
         Handlers[sig_num].tripped = 1;
         /* Set is_tripped after setting .tripped, as it gets
            cleared in PyErr_CheckSignals() before .tripped. */
-        is_tripped = 1;
-        Py_AddPendingCall(checksignals_witharg, NULL);
-        if (wakeup_fd != -1)
-            write(wakeup_fd, "\0", 1);
+        if (!is_tripped) {
+            is_tripped = 1;
+            Py_AddPendingCall(checksignals_witharg, NULL);
+            if (wakeup_fd != -1)
+                write(wakeup_fd, "\0", 1);
+        }
     }
 
 #ifndef HAVE_SIGACTION
msg133957 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-18 10:02
It looks like pthread_mutex_lock() and pthread_mutex_unlock() are not reentrant on some platforms (in some implementations of the pthread API).

Antoine: if I understand correctly your patch, if we have a pending signal, all next signals will be simply ignored.

I think that this issue is not specific to Mac OS X: signal_handler() can be called twice at the same time in two different threads or in the same thread (reentrant call).

pthread_sigmask() can be used to avoid reentrant call, but it has no effect on the second case: "signal_handler() called twice at the same time in two different threads".

Note: SA_NODEFER flag of sigaction() has no effect on this issue because signal_handler() is called twice to handle two different signals (SIGUSR1 and SIGUSR2).
msg133958 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-18 10:17
> Antoine: if I understand correctly your patch, if we have a pending
> signal, all next signals will be simply ignored.

I don't think so, please re-read.
msg133959 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-18 10:19
> pthread_sigmask() can be used to avoid reentrant call, 
> but it has no effect on the second case: "signal_handler() 
> called twice at the same time in two different threads".

Same problem if we use sa_mask field of sigaction() (e.g. use sigfillset(&context.sa_mask); in PyOS_setsig()).
msg133960 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-18 10:40
> I don't think so, please re-read.

Oh, I thought that Py_AddPendingCall() was used to store the pending signal. But no, it asks Python main loop to check which signal has been trigerred, and we can only do it once for all signals.

Attached patch should fix this issue:

 - signal_handler() and PyErr_SetInterrupt() only call Py_AddPendingCall() on the first signal (if is_tripped is zero): it should fix the deadlock and it is a micro optimization (win-win !)
 - PyErr_SetInterrupt() uses the wake up API (write "\0" into wakeup_fd, API introduced by issue #1583)
 - Remove debug code in test_threadsignals

Python 2.7 (and 2.6?) should also be patched: signal.set_wakeup_fd() and PySignal_SetWakeupFd() were introduced in 2.6, but there is no versionadded tag => signal_versionadded.patch
msg133961 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-18 10:52
Le lundi 18 avril 2011 à 10:40 +0000, STINNER Victor a écrit :
> Attached patch should fix this issue:
> 
>  - signal_handler() and PyErr_SetInterrupt() only call Py_AddPendingCall() on the first signal (if is_tripped is zero): it should fix the deadlock and it is a micro optimization (win-win !)

Tested on OSX: yes, it does fix this issue.

Should I apply the fix on 2.7, 3.1, 3.2 and 3.3? (or only 3.3)
msg133964 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-18 12:13
Well, since it is a bugfix, you should apply it to all versions.
msg133970 - (view) Author: Roundup Robot (python-dev) Date: 2011-04-18 14:30
New changeset 319f7af9ee5e by Victor Stinner in branch '3.1':
Issue #11768: The signal handler of the signal module only calls
http://hg.python.org/cpython/rev/319f7af9ee5e

New changeset 28ab8c6ad8f9 by Victor Stinner in branch '3.2':
(Merge 3.1) Issue #11768: The signal handler of the signal module only calls
http://hg.python.org/cpython/rev/28ab8c6ad8f9

New changeset fdcbc8453304 by Victor Stinner in branch 'default':
(Merge 3.2) Issue #11768: The signal handler of the signal module only calls
http://hg.python.org/cpython/rev/fdcbc8453304
msg133971 - (view) Author: Roundup Robot (python-dev) Date: 2011-04-18 14:34
New changeset 7d355c2ff3d4 by Victor Stinner in branch '2.7':
(Merge 3.1) Issue #11768: The signal handler of the signal module only calls
http://hg.python.org/cpython/rev/7d355c2ff3d4

New changeset ebd080593351 by Victor Stinner in branch '2.7':
Issue #11768: signal.set_wakeup_fd() and PySignal_SetWakeupFd() added in 2.6
http://hg.python.org/cpython/rev/ebd080593351
msg133973 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-18 14:48
The deadlock is fixed. Reopen the issue if you still see test_threadsignals hang on a buildbot.
History
Date User Action Args
2011-04-18 14:48:12hayposetstatus: open -> closed
resolution: fixed
messages: + msg133973
2011-04-18 14:34:34python-devsetmessages: + msg133971
2011-04-18 14:30:25python-devsetmessages: + msg133970
2011-04-18 12:13:06pitrousetmessages: + msg133964
2011-04-18 10:59:13hayposettitle: test_signals() of test_threadsignals failure on Mac OS X -> signal_handler() is not reentrant: deadlock in Py_AddPendingCall()
2011-04-18 10:57:11hayposetfiles: + signal_versionadded.patch
2011-04-18 10:52:48hayposetmessages: + msg133961
title: signal_handler() is not reentrant: deadlock in Py_AddPendingCall() -> test_signals() of test_threadsignals failure on Mac OS X
2011-04-18 10:42:02hayposettitle: test_signals() of test_threadsignals failure on Mac OS X -> signal_handler() is not reentrant: deadlock in Py_AddPendingCall()
2011-04-18 10:40:30hayposetfiles: + signal_reentrant.patch
keywords: + patch
messages: + msg133960
2011-04-18 10:19:00hayposetmessages: + msg133959
2011-04-18 10:17:14pitrousetmessages: + msg133958
2011-04-18 10:02:10hayposetmessages: + msg133957
2011-04-17 23:59:23pitrousetnosy: + pitrou
messages: + msg133943
2011-04-17 23:27:36hayposetmessages: + msg133942
2011-04-17 23:12:23hayposetmessages: + msg133941
2011-04-07 22:19:30hayposetmessages: + msg133268
2011-04-07 22:14:42hayposetmessages: + msg133267
2011-04-07 20:59:28hayposetmessages: + msg133257
title: test_signals() of test_threadsignals failure on Mac OS X Tiger -> test_signals() of test_threadsignals failure on Mac OS X
2011-04-05 00:30:08python-devsetnosy: + python-dev
messages: + msg132995
2011-04-05 00:19:17haypocreate