Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

signal.pause() and signal handlers don't react to SIGCHLD in non-main thread #66094

Open
bkabrda mannequin opened this issue Jul 1, 2014 · 18 comments
Open

signal.pause() and signal handlers don't react to SIGCHLD in non-main thread #66094

bkabrda mannequin opened this issue Jul 1, 2014 · 18 comments
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@bkabrda
Copy link
Mannequin

bkabrda mannequin commented Jul 1, 2014

BPO 21895
Nosy @gvanrossum, @pitrou, @vstinner, @njsmith
PRs
  • WIP (don't merge!): bpo-21895: Experiment signal handlers in any thread #768
  • Files
  • signal_pause_doesnt_wake_up.py
  • thread_signal.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2014-07-01.10:01:35.002>
    labels = ['interpreter-core', '3.7']
    title = "signal.pause() and signal handlers don't react to SIGCHLD in non-main thread"
    updated_at = <Date 2021-01-13.06:14:20.968>
    user = 'https://bugs.python.org/bkabrda'

    bugs.python.org fields:

    activity = <Date 2021-01-13.06:14:20.968>
    actor = 'gvanrossum'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2014-07-01.10:01:35.002>
    creator = 'bkabrda'
    dependencies = []
    files = ['35815', '46752']
    hgrepos = []
    issue_num = 21895
    keywords = []
    message_count = 18.0
    messages = ['222021', '222022', '222028', '289913', '289920', '289952', '289953', '289957', '289964', '289976', '289977', '289978', '289980', '289981', '290024', '290030', '290031', '384997']
    nosy_count = 6.0
    nosy_names = ['gvanrossum', 'pitrou', 'vstinner', 'njs', 'neologix', 'bkabrda']
    pr_nums = ['768']
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue21895'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7']

    @bkabrda
    Copy link
    Mannequin Author

    bkabrda mannequin commented Jul 1, 2014

    Reproducer attached. To describe the problem in words, one needs to do this to reproduce this issue:

    • Create a signal handler and register it using signal.signal to SIGCHLD.
    • Create a thread that invokes a subprocess, e.g. "sleep 1" and then continues to do something for time period longer than the subprocess is running.
    • Run this thread from main thread and call signal.pause() in the main thread.
    • The pause() call is never interrupted and the signal handler is never run.

    This happens because:

    • Python adds signal_handler as a handler to the specified signal and stores the passed Python function in structure Handlers.
    • When signal.pause() is run, the main thread releases the GIL, so that other threads can run.
    • The non-main thread gets to lease and actually invokes the subprocess and then continues to do something.
    • When subprocess finishes, it sends the signal *to the thread that invoked it* (assuming this thread is still running). This means that the signal.pause() call is not interrupted and main thread continues to sleep.
    • The non-main thread adds handler call to the list of pending calls using Py_AddPendingCall.
    • Pending calls are checked in Py_MakePendingCalls, which is called in every iteration of the bytecode evaluation loop in PyEval_EvalFrameEx.
    • The problem is that since pause() isn't un-paused and hangs forever, the evaluation loop never gets to another iteration, hence can't do any pending call.

    This happens on all Python versions I've tried, using pthread threading.
    I think this could *possibly* be solved by issuing a pthread_kill from the non-main thread to the main thread to wake it up, but I'm not sure what all the implications of such a change would be.

    @bkabrda bkabrda mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jul 1, 2014
    @vstinner
    Copy link
    Member

    vstinner commented Jul 1, 2014

    First of all, signals and threads usually don't play together. Most signal functions are *not* thread safe. The behaviour of signal functions are not well defined for threads.

    Example with pause:

    "pause() causes the calling process (or thread) to sleep until a signal is delivered that either terminates the process or causes the invocation of a signal-catching function."

    What does it mean "or thread"? Sometimes the function waits for a signal from any thread, something only from the caller thread? :-p

    I understood that pause() only waits for signals received in the caller thread, main thread in your case.

    Depending on the platform, a signal may be delivered to a different thread :-/ Especially when a signal is send to the process, ex: "kill -USR1 pid" command on UNIX.

    This issue is more a documentation issue: we should mention that pause() is limited to a thread.

    Python signal handlers are only called from the main thread, even if signals can be received from any thread.

    For your example, you can workaround pause() issue by calling "signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGCHLD])" in your thread.

    @bkabrda
    Copy link
    Mannequin Author

    bkabrda mannequin commented Jul 1, 2014

    I think that the problem here is that the documentation sets an expectation that's not met, at least from my POV. Running the reproducer, I was expecting the main thread to unpause and execute the signal handler (citing: "Python signal handlers are always executed in the main Python thread"). I understand that this may be insanely hard to fix in code, so documentation should explicitly point out that there are some (perhaps unfixable) problems with Python's approach to threading+signals.

    @njsmith
    Copy link
    Contributor

    njsmith commented Mar 21, 2017

    It turns out that this bug is more general than signal.pause, and has caused problems for a few different people:
    dabeaz/curio#118 (comment)
    dabeaz/curio#118 (comment)
    dabeaz/curio#118 (comment)

    The basic problem is that CPython's signal handling strategy on Unix-likes assumes that if a signal is delivered to the process at a time when the main thread is blocked in a syscall, then that syscall will promptly exit with EINTR. So for example, time.sleep has some clever code on Windows to make it interruptible with control-C, but on Unix we assume that the kernel will break the sleep for us.

    The problem with this is that POSIX only guarantees that *some* thread will receive the signal -- not necessarily the main thread. In practice it seems like most implementations do deliver most signals to the main thread or CPython wouldn't have gotten away with this for as long as it has, but in particular it seems like Linux is happy to deliver SIGCHLD to random other threads. So the C-level signal handler runs in whatever thread, it sets the flag for the main thread to run the Python-level signal handler... and then the main thread sits there blocked in sleep() or select() or pause() or whatever, and never checks the flag.

    A simple solution would be to make sure signals are always delivered to the main thread, by adjusting the C-level signal handler to do something like:

    if (current_thread != main_thread) {
        pthread_kill(main_thread, sig_num);
        return;
    }

    @njsmith njsmith added the 3.7 (EOL) end of life label Mar 21, 2017
    @njsmith njsmith changed the title signal.pause() doesn't wake up on SIGCHLD in non-main thread signal.pause() and signal handlers don't react to SIGCHLD in non-main thread Mar 21, 2017
    @vstinner
    Copy link
    Member

    If the main thread waits on select() and uses a pipe with
    signal.set_wakeup_fd(), it should be fine. Is that your case?

    Asyncio requires to have an event loop running in the main thread to spawn
    child processes, to get SIGCHLD signals to get notified of child
    completions. Or you can ignore these signals and use the "safe" child
    waiter using waitall() and then iterate on all children.

    @njsmith
    Copy link
    Contributor

    njsmith commented Mar 22, 2017

    I don't really have a specific use case personally -- for trio, I haven't found a way to make use of set_wakeup_fd because of various issues[1], but I'm also not planning to use SIGCHLD, so this isn't very urgent.

    In general set_wakeup_fd can be a workaround, but I wanted to note this upstream because it's a bug in Python's signal handler logic. Note that Python already goes to great lengths to make e.g. signal handlers run during time.sleep on Windows; they ought to just work on Unix too.

    --

    [1] (Off-topic but in case you're curious: I register actual signal handlers anyway because I follow the philosophy that the wakeup pipe should only be used for wakeups rather than transmitting information, so as long as signal handlers work I can do my own wakeups, + for some reason set_wakeup_fd doesn't work for me on Windows (no idea why, can't reproduce in simplified tests, might be my fault, need to debug), + set_wakeup_fd's handling of buffer overflow is broken for my purposes.)

    @vstinner
    Copy link
    Member

    2017-03-22 1:55 GMT+01:00 Nathaniel Smith <report@bugs.python.org>:

    • for some reason set_wakeup_fd doesn't work for me on Windows (no idea why, can't reproduce in simplified tests, might be my fault, need to debug),

    I modified signal.set_wakeup_fd() to accept a socket handle on
    Windows. Are you using socket.socketpair() on Windows? Or do you pass
    a pipe?

    @njsmith
    Copy link
    Contributor

    njsmith commented Mar 22, 2017

    @Haypo: It's a socketpair. It works fine when I set up a toy test case using set_wakeup_fd + select, and it works fine in my real code when I use CFFI cleverness to register a signal handler that manually writes a byte to my wakeup socket, but when I pass that exact same socket to set_wakeup_fd in my real code, it doesn't work. It's pretty mysterious, and I have no particular reason to think that the problem is in CPython as opposed to some stupid mistake I'm making.

    @njsmith
    Copy link
    Contributor

    njsmith commented Mar 22, 2017

    @Haypo: okay, looked over things over for a third time and this time I found my very silly error :-). So I'm now able to use set_wakeup_fd on Windows (python-trio/trio#108), but not on Unix (python-trio/trio#109).

    In any case, the issue here remains that one shouldn't have to use set_wakeup_fd for a signal to interrupt time.sleep etc.

    @vstinner
    Copy link
    Member

    Nathaniel Smith added the comment:

    In any case, the issue here remains that one shouldn't have to use set_wakeup_fd for a signal to interrupt time.sleep etc.

    CPython implements a C signal handler which sets a flag and then exits
    immediately. The thread getting the signal will probably interrupts
    the current blocking syscall with EINTR. CPython detects that a signal
    flag was set, calls the *Python* signal handler. If the Python signal
    handler raises an exception, the syscall is abandonned. Otherwise, the
    syscall is restarted.

    So "interrupting sleep" is only reliable if:

    • there is only one thread (the "main" thread)
    • the syscall is interrupted with EINTR
    • the signal handler raises an exception

    When you implement an event loop, raising an exception may be the best
    design. In asyncio, the Python signal handler calls loop.call_soon()
    to execute the final callback (3rd signal handler of the same
    signal!). The rule in asynico is not call blocking syscalls, or at
    least ensure that they don't block too long (ex: use aiofiles to
    access the filesystem).

    @vstinner
    Copy link
    Member

    Attached signal_pause_doesnt_wake_up.py is a little bit complex and not sure that it's reliable.

    I wrote thread_signal.py which should have a more deterministic behaviour. Output on Linux:
    ---
    main: main thread 140392674538560
    thread: wait
    main: spawn thread 140392542746368
    main: sleep
    main: send signal to thread
    main: sleep again
    Python signal handler, thread 140392674538560
    main: wait thread
    thread: done
    main: done
    ---

    As expected, the Python signal handler is called in the main thread.

    time.sleep() in the thread *is* interrupted by SIGUSR1 (select() fails with EINTR), but PyErr_CheckSignals() does nothing since it's not the main thread. Then the sleep is automatically restarted (PEP-475).

    The code works as expected, but I understand that the behaviour is surprising.

    To be honest, I never understood the rationale behind "only execute signal handlers in the main thread". I was also surprised to no see a pthread_kill() in the C signal handler.

    I dislike the idea of transfering the signal to the main thread from another thread in the C signal handler using pthread_kill(). Most code behave very badly when getting a signal, so getting a signal twice is likely to double the pain.

    Moreover, pthread_sigmask() can block signals in the main thread for deliberate reasons.

    If we should change something, I simply suggest to remove the arbitrary limitation from the C signal handler. I don't know the consequences yet.

    @vstinner
    Copy link
    Member

    Hum, maybe I found the root issue: the C signal handler calls Py_AddPendingCall() which uses a lock to protect a global static pendingcalls array (of 32 items). The function tries 100 times in a row to acquire to lock, or does nothing (returns -1) if it fails to acquire the lock.

    If we start to allow signals from any thread, this shared pendingcalls array can quickly become a source of race conditions like deadlocks or ignored callbacks.

    To avoid deadlocks, IMHO the best is to have a per-thread array which consumes 512 bytes (on 64-bit, 32 items made of 2 pointers).

    --

    The _thread module has a strange _thread.interrupt_main() function.

    --

    From the point of view of the Python signal handler, the current "if (PyThread_get_thread_ident() != main_thread) return 0;" code in the C signal handler is somehow an implicit pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG)) on all threads except of the main thread, whereas Unix gives a fine control on these masks with the pthread_sigmask() function.

    --

    The Windows part is more tricky. A Windows Event object (created by CreateEvent() and retrieved by _PyOS_SigintEvent()) is used to interrupt a few blocking functions:

    • my_fgets() used by PyOS_StdioReadline() to implemented "readline" (especially for the REPL)
    • _PyOS_WindowsConsoleReadline()
    • read_console_w() of Modules/_io/winconsole.c
    • time.sleep() -- only if it's the main thread and delay != 0 seconds
    • _multiprocessing.SemLock.acquire() -- only if called from the main thread
    • _winapi.WaitForMultipleObjects()

    The event is set by the SIGINT signal handler set by Python.

    Extract of pysleep() comment:

    /* Allow sleep(0) to maintain win32 semantics, and as decreed

    • by Guido, only the main thread can be interrupted.
      */

    @vstinner
    Copy link
    Member

    thread_signal.py: Oops, I forgot to remove the "raise Exception" line from sighandler() (I used it for a quick test.)

    @vstinner
    Copy link
    Member

    I created a WIP patch #768 to experiment allowing signal handlers from any threads. I expect many race conditions since I only removed sanity checks without adding new code to handle parallelism.

    @njsmith
    Copy link
    Contributor

    njsmith commented Mar 23, 2017

    Letting Python-level signal handlers run in arbitrary threads is an interesting idea, but it's a non-trivial change to Python semantics that may well break some programs (previously it was guaranteed that signal handlers couldn't race with main thread code), and it doesn't fully fix the bug here (because it's common for Python programs to contain threads which don't run Python code, e.g. if they use zmq). I can imagine some potential benefits, but I feel like this needs some substantial rationale?

    OTOH the pthread_kill-based fix I suggested is like 3 lines and AFAICT fixes the problem exactly.

    I dislike the idea of transfering the signal to the main thread from another thread in the C signal handler using pthread_kill(). Most code behave very badly when getting a signal, so getting a signal twice is likely to double the pain.

    This doesn't make much sense to me... CPython's pretty good at handling EINTR these days, and, well, the only case that's different is when you have a signal handler that needs to be run and the main thread is blocked in a syscall, which is exactly the case that's currently broken?

    Moreover, pthread_sigmask() can block signals in the main thread for deliberate reasons.

    Sure, and with pthread_kill() this would have the effect that the signal would be queued by the kernel and delivered after signal is unmasked. That seems like pretty sensible semantics to me; in fact it's exactly what you get in single-threaded code.

    Compare that to right now where AFAICT pthread_sigmask() is pretty useless in the presence of threads, because of the thing where if one thread has a signal blocked then the kernel will pick another thread to deliver it to.

    From the point of view of the Python signal handler, the current "if (PyThread_get_thread_ident() != main_thread) return 0;" code in the C signal handler is somehow an implicit pthread_sigmask(signal.SIG_BLOCK, range(1, signal.NSIG)) on all threads except of the main thread, whereas Unix gives a fine control on these masks with the pthread_sigmask() function.

    That code isn't in the C signal handler, it's in the code that the interpreter runs occasionally to check if the C signal handler has been called, right?

    @pitrou
    Copy link
    Member

    pitrou commented Mar 23, 2017

    Using set_wakeup_fd would fix them... except that it generates spurious warning messages when the wakeup fd buffer is full, and there's no way to stop it

    Are you using a pipe or a socket to set_wakeup_fd? Pipes have rather small buffers. In any case, since, as you mention, Tornado and Twisted use it, and they never complained, perhaps your example is too contrived or artificial.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 23, 2017

    That said, the pthread_kill() solution deserves testing as well.

    @gvanrossum
    Copy link
    Member

    When I removed myself from the nosy list I did mean to stop working on this bug, just that I am unlikely to be of any help. :-)

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs)
    Projects
    Status: No status
    Development

    No branches or pull requests

    4 participants