classification
Title: signal.pause() and signal handlers don't react to SIGCHLD in non-main thread
Type: Stage:
Components: Interpreter Core Versions: Python 3.7, Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: bkabrda, neologix, njs, pitrou, vstinner
Priority: normal Keywords:

Created on 2014-07-01 10:01 by bkabrda, last changed 2017-03-23 14:59 by gvanrossum.

Files
File name Uploaded Description Edit
signal_pause_doesnt_wake_up.py bkabrda, 2014-07-01 10:01
thread_signal.py vstinner, 2017-03-22 08:29
Pull Requests
URL Status Linked Edit
PR 768 closed vstinner, 2017-03-22 09:14
Messages (17)
msg222021 - (view) Author: Bohuslav "Slavek" Kabrda (bkabrda) * Date: 2014-07-01 10:01
Reproducer attached. To describe the problem in words, one needs to do this to reproduce this issue:
- Create a signal handler and register it using signal.signal to SIGCHLD.
- Create a thread that invokes a subprocess, e.g. "sleep 1" and then continues to do something for time period longer than the subprocess is running.
- Run this thread from main thread and call signal.pause() in the main thread.
- The pause() call is never interrupted and the signal handler is never run.

This happens because:
- Python adds signal_handler as a handler to the specified signal and stores the passed Python function in structure Handlers.
- When signal.pause() is run, the main thread releases the GIL, so that other threads can run.
- The non-main thread gets to lease and actually invokes the subprocess and then continues to do something.
- When subprocess finishes, it sends the signal *to the thread that invoked it* (assuming this thread is still running). This means that the signal.pause() call is not interrupted and main thread continues to sleep.
- The non-main thread adds handler call to the list of pending calls using Py_AddPendingCall.
- Pending calls are checked in Py_MakePendingCalls, which is called in every iteration of the bytecode evaluation loop in PyEval_EvalFrameEx.
- The problem is that since pause() isn't un-paused and hangs forever, the evaluation loop never gets to another iteration, hence can't do any pending call.

This happens on all Python versions I've tried, using pthread threading.
I think this could *possibly* be solved by issuing a pthread_kill from the non-main thread to the main thread to wake it up, but I'm not sure what all the implications of such a change would be.
msg222022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-01 10:31
First of all, signals and threads usually don't play together. Most signal functions are *not* thread safe. The behaviour of signal functions are not well defined for threads.

Example with pause:

"pause()  causes  the  calling  process  (or  thread) to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function."

What does it mean "or thread"? Sometimes the function waits for a signal from any thread, something only from the caller thread? :-p

I understood that pause() only waits for signals received in the caller thread, main thread in your case.

Depending on the platform, a signal may be delivered to a different thread :-/ Especially when a signal is send to the process, ex: "kill -USR1 pid" command on UNIX.

This issue is more a documentation issue: we should mention that pause() is limited to a thread.

Python signal handlers are only called from the main thread, even if signals can be received from any thread.

For your example, you can workaround pause() issue by calling "signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGCHLD])" in your thread.
msg222028 - (view) Author: Bohuslav "Slavek" Kabrda (bkabrda) * Date: 2014-07-01 13:33
I think that the problem here is that the documentation sets an expectation that's not met, at least from my POV. Running the reproducer, I was expecting the main thread to unpause and execute the signal handler (citing: "Python signal handlers are always executed in the main Python thread"). I understand that this may be insanely hard to fix in code, so documentation should explicitly point out that there are some (perhaps unfixable) problems with Python's approach to threading+signals.
msg289913 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-03-21 04:53
It turns out that this bug is more general than signal.pause, and has caused problems for a few different people:
  https://github.com/dabeaz/curio/issues/118#issuecomment-287735781
  https://github.com/dabeaz/curio/issues/118#issuecomment-287798241
  https://github.com/dabeaz/curio/issues/118#issuecomment-287887843

The basic problem is that CPython's signal handling strategy on Unix-likes assumes that if a signal is delivered to the process at a time when the main thread is blocked in a syscall, then that syscall will promptly exit with EINTR. So for example, time.sleep has some clever code on Windows to make it interruptible with control-C, but on Unix we assume that the kernel will break the sleep for us.

The problem with this is that POSIX only guarantees that *some* thread will receive the signal -- not necessarily the main thread. In practice it seems like most implementations do deliver most signals to the main thread or CPython wouldn't have gotten away with this for as long as it has, but in particular it seems like Linux is happy to deliver SIGCHLD to random other threads. So the C-level signal handler runs in whatever thread, it sets the flag for the main thread to run the Python-level signal handler... and then the main thread sits there blocked in sleep() or select() or pause() or whatever, and never checks the flag.

A simple solution would be to make sure signals are always delivered to the main thread, by adjusting the C-level signal handler to do something like:

if (current_thread != main_thread) {
    pthread_kill(main_thread, sig_num);
    return;
}
msg289920 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-21 06:57
If the main thread waits on select() and uses a pipe with
signal.set_wakeup_fd(), it should be fine. Is that your case?

Asyncio requires to have an event loop running in the main thread to spawn
child processes, to get SIGCHLD signals to get notified of child
completions. Or you can ignore these signals and use the "safe" child
waiter using waitall() and then iterate on all children.
msg289952 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-03-22 00:55
I don't really have a specific use case personally -- for trio, I haven't found a way to make use of set_wakeup_fd because of various issues[1], but I'm also not planning to use SIGCHLD, so this isn't very urgent.

In general set_wakeup_fd can be a workaround, but I wanted to note this upstream because it's a bug in Python's signal handler logic. Note that Python already goes to great lengths to make e.g. signal handlers run during time.sleep on Windows; they ought to just work on Unix too.

--

[1] (Off-topic but in case you're curious: I register actual signal handlers anyway because I follow the philosophy that the wakeup pipe should only be used for wakeups rather than transmitting information, so as long as signal handlers work I can do my own wakeups, + for some reason set_wakeup_fd doesn't work for me on Windows (no idea why, can't reproduce in simplified tests, might be my fault, need to debug), + set_wakeup_fd's handling of buffer overflow is broken for my purposes.)
msg289953 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-22 01:10
2017-03-22 1:55 GMT+01:00 Nathaniel Smith <report@bugs.python.org>:
> + for some reason set_wakeup_fd doesn't work for me on Windows (no idea why, can't reproduce in simplified tests, might be my fault, need to debug),

I modified signal.set_wakeup_fd() to accept a socket handle on
Windows. Are you using socket.socketpair() on Windows? Or do you pass
a pipe?
msg289957 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-03-22 01:43
@haypo: It's a socketpair. It works fine when I set up a toy test case using set_wakeup_fd + select, and it works fine in my real code when I use CFFI cleverness to register a signal handler that manually writes a byte to my wakeup socket, but when I pass that exact same socket to set_wakeup_fd in my real code, it doesn't work. It's pretty mysterious, and I have no particular reason to think that the problem is in CPython as opposed to some stupid mistake I'm making.
msg289964 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-03-22 03:37
@haypo: okay, looked over things over for a third time and this time I found my very silly error :-). So I'm now able to use set_wakeup_fd on Windows (https://github.com/python-trio/trio/pull/108), but not on Unix (https://github.com/python-trio/trio/issues/109).

In any case, the issue here remains that one shouldn't have to use set_wakeup_fd for a signal to interrupt time.sleep etc.
msg289976 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-22 08:15
Nathaniel Smith added the comment:
> In any case, the issue here remains that one shouldn't have to use set_wakeup_fd for a signal to interrupt time.sleep etc.

CPython implements a C signal handler which sets a flag and then exits
immediately. The thread getting the signal will probably interrupts
the current blocking syscall with EINTR. CPython detects that a signal
flag was set, calls the *Python* signal handler. If the Python signal
handler raises an exception, the syscall is abandonned. Otherwise, the
syscall is restarted.

So "interrupting sleep" is only reliable if:

* there is only one thread (the "main" thread)
* the syscall is interrupted with EINTR
* the signal handler raises an exception

When you implement an event loop, raising an exception may be the best
design. In asyncio, the Python signal handler calls loop.call_soon()
to execute the final callback (3rd signal handler of the same
signal!). The rule in asynico is not call blocking syscalls, or at
least ensure that they don't block too long (ex: use aiofiles to
access the filesystem).
msg289977 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-22 08:29
Attached signal_pause_doesnt_wake_up.py is a little bit complex and not sure that it's reliable.

I wrote thread_signal.py which should have a more deterministic behaviour. Output on Linux:
---
main: main thread 140392674538560
thread: wait
main: spawn thread 140392542746368
main: sleep
main: send signal to thread
main: sleep again
Python signal handler, thread 140392674538560
main: wait thread
thread: done
main: done
---

As expected, the Python signal handler is called in the main thread.

time.sleep() in the thread *is* interrupted by SIGUSR1 (select() fails with EINTR), but PyErr_CheckSignals() does nothing since it's not the main thread. Then the sleep is automatically restarted (PEP 475).

The code works as expected, but I understand that the behaviour is surprising.

To be honest, I never understood the rationale behind "only execute signal handlers in the main thread". I was also surprised to no see a pthread_kill() in the C signal handler.

I dislike the idea of transfering the signal to the main thread from another thread in the C signal handler using pthread_kill(). Most code behave very badly when getting a signal, so getting a signal twice is likely to double the pain.

Moreover, pthread_sigmask() can block signals in the main thread for deliberate reasons.

If we should change something, I simply suggest to remove the arbitrary limitation from the C signal handler. I don't know the consequences yet.
msg289978 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-22 08:59
Hum, maybe I found the root issue: the C signal handler calls Py_AddPendingCall() which uses a lock to protect a global static pendingcalls array (of 32 items). The function tries 100 times in a row to acquire to lock, or does nothing (returns -1) if it fails to acquire the lock.

If we start to allow signals from any thread, this shared pendingcalls array can quickly become a source of race conditions like deadlocks or ignored callbacks.

To avoid deadlocks, IMHO the best is to have a per-thread array which consumes 512 bytes (on 64-bit, 32 items made of 2 pointers).

--

The _thread module has a strange _thread.interrupt_main() function.

--

From the point of view of the Python signal handler, the current "if (PyThread_get_thread_ident() != main_thread) return 0;" code in the C signal handler is somehow an implicit pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG)) on all threads except of the main thread, whereas Unix gives a fine control on these masks with the pthread_sigmask() function.

--

The Windows part is more tricky. A Windows Event object (created by CreateEvent() and retrieved by _PyOS_SigintEvent()) is used to interrupt a few blocking functions:

* my_fgets() used by PyOS_StdioReadline() to implemented "readline" (especially for the REPL)
* _PyOS_WindowsConsoleReadline()
* read_console_w() of Modules/_io/winconsole.c
* time.sleep() -- only if it's the main thread and delay != 0 seconds
* _multiprocessing.SemLock.acquire() -- only if called from the main thread
* _winapi.WaitForMultipleObjects()

The event is set by the SIGINT signal handler set by Python.

Extract of pysleep() comment:

/* Allow sleep(0) to maintain win32 semantics, and as decreed
 * by Guido, only the main thread can be interrupted.
 */
msg289980 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-22 09:13
thread_signal.py: Oops, I forgot to remove the "raise Exception" line from sighandler() (I used it for a quick test.)
msg289981 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-22 09:16
I created a WIP patch https://github.com/python/cpython/pull/768 to experiment allowing signal handlers from any threads. I expect many race conditions since I only removed sanity checks without adding new code to handle parallelism.
msg290024 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-03-23 04:57
Letting Python-level signal handlers run in arbitrary threads is an interesting idea, but it's a non-trivial change to Python semantics that may well break some programs (previously it was guaranteed that signal handlers couldn't race with main thread code), and it doesn't fully fix the bug here (because it's common for Python programs to contain threads which don't run Python code, e.g. if they use zmq). I can imagine some potential benefits, but I feel like this needs some substantial rationale?

OTOH the pthread_kill-based fix I suggested is like 3 lines and AFAICT fixes the problem exactly.

> I dislike the idea of transfering the signal to the main thread from another thread in the C signal handler using pthread_kill(). Most code behave very badly when getting a signal, so getting a signal twice is likely to double the pain.

This doesn't make much sense to me... CPython's pretty good at handling EINTR these days, and, well, the only case that's different is when you have a signal handler that needs to be run and the main thread is blocked in a syscall, which is exactly the case that's currently broken? 

> Moreover, pthread_sigmask() can block signals in the main thread for deliberate reasons.

Sure, and with pthread_kill() this would have the effect that the signal would be queued by the kernel and delivered after signal is unmasked. That seems like pretty sensible semantics to me; in fact it's exactly what you get in single-threaded code.

Compare that to right now where AFAICT pthread_sigmask() is pretty useless in the presence of threads, because of the thing where if one thread has a signal blocked then the kernel will pick another thread to deliver it to.

> From the point of view of the Python signal handler, the current "if (PyThread_get_thread_ident() != main_thread) return 0;" code in the C signal handler is somehow an implicit pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG)) on all threads except of the main thread, whereas Unix gives a fine control on these masks with the pthread_sigmask() function.

That code isn't in the C signal handler, it's in the code that the interpreter runs occasionally to check if the C signal handler has been called, right?
msg290030 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-03-23 08:24
> Using set_wakeup_fd would fix them... except that it generates spurious warning messages when the wakeup fd buffer is full, and there's no way to stop it

Are you using a pipe or a socket to set_wakeup_fd?  Pipes have rather small buffers.  In any case, since, as you mention, Tornado and Twisted use it, and they never complained, perhaps your example is too contrived or artificial.
msg290031 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-03-23 08:32
That said, the pthread_kill() solution deserves testing as well.
History
Date User Action Args
2017-03-23 14:59:56gvanrossumsetnosy: - gvanrossum
2017-03-23 08:32:58pitrousetmessages: + msg290031
2017-03-23 08:24:37pitrousetnosy: + gvanrossum
messages: + msg290030
2017-03-23 04:57:13njssetmessages: + msg290024
2017-03-22 09:16:11vstinnersetmessages: + msg289981
2017-03-22 09:14:57vstinnersetpull_requests: + pull_request675
2017-03-22 09:13:08vstinnersetmessages: + msg289980
2017-03-22 08:59:54vstinnersetnosy: + pitrou
messages: + msg289978
2017-03-22 08:29:48vstinnersetfiles: + thread_signal.py

messages: + msg289977
2017-03-22 08:15:39vstinnersetmessages: + msg289976
2017-03-22 03:37:48njssetmessages: + msg289964
2017-03-22 01:43:27njssetmessages: + msg289957
2017-03-22 01:10:04vstinnersetmessages: + msg289953
2017-03-22 00:55:05njssetmessages: + msg289952
2017-03-21 06:57:51vstinnersetmessages: + msg289920
2017-03-21 04:53:36njssetversions: + Python 3.6, Python 3.7
nosy: + njs
title: signal.pause() doesn't wake up on SIGCHLD in non-main thread -> signal.pause() and signal handlers don't react to SIGCHLD in non-main thread
messages: + msg289913

2014-07-01 13:33:01bkabrdasetmessages: + msg222028
2014-07-01 10:31:16vstinnersetnosy: + neologix
2014-07-01 10:31:10vstinnersetmessages: + msg222022
2014-07-01 10:15:35vstinnersetnosy: + vstinner
2014-07-01 10:01:35bkabrdacreate