Title: Signal tripped flags need memory barriers
Type: behavior Stage:
Components: Versions: Python 3.7
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: arigo, njs, pitrou, vstinner
Priority: normal Keywords:

Created on 2017-08-04 08:36 by njs, last changed 2017-08-06 15:08 by pitrou.

Messages (7)
msg299736 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-08-04 08:36
Sometimes, CPython's signal handler (signalmodule.c:signal_handler) runs in a different thread from the main thread. On Unix this is rare but it does happen (esp. for SIGCHLD), and on Windows it happens 100% of the time. It turns out that there is a subtle race condition that means such signals can be lost.

To notify the main thread that a signal has been received, and which signal it is, CPython's signal handler sets two global variables:

  Handlers[sig_num].tripped = 1;
  is_tripped = 1

And then PyErr_CheckSignals does:

  if (is_tripped) {
      is_tripped = 0;
      for (i = 1; i < NSIG; i++) {
          if (Handlers[i].is_tripped) {
              ... run signal handler ...

As some comments in the source note, this logic depends strongly on the assumption that when the signal handler sets the two flags, they are immediately visible to PyErr_CheckSignals, and that the two assignments happen in the given order. For example, suppose that is_tripped were set before Handlers[i].is_tripped. Then we could have the following sequence:

1. Thread A sets is_tripped = 1
2. Thread B runs PyErr_CheckSignals
3. Thread B notices is_tripped == 1, and sets it to 0
4. Thread B scans through all the Handlers, sees that none are set, and carries on
5. Thread A sets Handlers[i].is_tripped = 1

In this case the signal is lost until another signal arrives, which may not happen until an arbitrary amount of time has passed.

Also, the fix for bpo-30038 (making sure that we set the flags before writing to the wakeup fd) assumes that setting the flags before writing to the wakeup fd means that, well, the flags will be written before the fd.

However, when you have code running in separate threads, none of this is actually guaranteed. In general, the compiler is free to re-order writes, and more perniciously, the hardware memory hierarchy is also free to arbitrarily delay when writes to RAM made by one CPU become visible to another CPU, which can also create arbitrary reorderings.

In an attempt to avoid these issues, signalmodule.c declares the signal flags as 'volatile sig_atomic_t'. However, the best one can hope for from this is that it'll stop the *compiler* from reordering writes. It does nothing to stop the *hardware* from reordering writes. Using volatile like this might be sufficient if the signal handler runs in the main thread, but not if it runs in another thread.

The correct way to handle this is to use the primitives in pyatomic.h to get/set the tripped flags.


I noticed this because of the same test case that was failing due to bpo-30038... this thing has been frustrating me for months because it's intermittent and never seems to happen on my local VM. But specifically what I'm seeing at this point is:

Thread A (the thread where the signal handler runs):
A1. takes a lock.
A2. with the lock held, calls the C function raise(SIGINT), which directly calls the CPython C-level signal handler. (Verified by consulting the Windows C runtime source.)
A3. writes to the wakeup fd & sets both tripped flags, in whatever order
A4. releases the lock

Simultaneously, in thread B (the main thread):
B1. observes that the wakeup fd has been written to
B2. takes the lock
B3. releases the lock
B4. calls repr(), which unconditionally calls PyObject_Repr, which unconditionally calls PyErr_CheckSignals

I expect the call in B4 to run the Python-level signal handler, but this does not happen. Instead it runs some time later, causing an unexpected KeyboardInterrupt.

My reasoning is: the lock guarantees that no matter what the internal details of the C-level signal handler actually are, they have to all be complete before my code checks for signals. In excruciating detail:

- the write to the wakeup fd must happen-after the lock is acquired (A1)
- therefore, step B1 happens-after step A1
- B2 happens-after B1, so B2 happens-after A1.
- B2 takes the same lock as A1, so if B2 happens-after A1, it must also happen-after A4, when the lock is released.
- step B4 happens-after B2, therefore it happens-after A4 and A3. Therefore the tripped flags should be visible and the Python-level signal handler should run.

Yet this does not happen, so *something* really weird is going on. The hypothesis that it's the lack of memory barriers both explains how this could happen -- specifically I think thread B may be running PyErr_CheckSignals as part of the wakeup-fd reading logic, and it's clearing is_tripped before Handlers[i].tripped becomes visible, like in my code above. And it also explains why it happens in like 1/50 runs in Appveyor, but never on my local Windows 10 VM that only has 1 virtual CPU.

But regardless of whether this is the cause of my particular weird bug, it clearly *is* a bug and should be fixed.
msg299756 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-08-04 18:27
On further investigation (= a few hours staring at the ceiling last night), it looks like there's another explanation for my particular bug... which is good, because on further investigation (= a few hours squinting at google results) it looks like this probably can't happen on x86/x86-64 (I think? maybe?).

It's still true though that you can't just throw in a 'volatile' and expect cross-thread synchronization to work -- that's not C's semantics, and it's not true on popular architectures like ARM.
msg299760 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2017-08-04 20:30
For reference, no, it can't happen on x86 or x86-64.  I've found that this simplified model actually works for reasoning about ordering at the hardware level: think about each core as a cache-less machine that always *reads* from the central RAM, but that has a delay for writes---i.e. writes issued by a core are queued internally, and actually sent to the central RAM at some unspecified later time, but in order.

(Of course that model fails on other machines like ARM.)
msg299761 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2017-08-04 20:42
@arigo: Technically we also need that the writes to memory are observed to happen-before the write() call on the wakeup fd, which is not something that Intel is going to make any guarantees about. But *probably* this is also safe because the kernel has to use some kind of cross-CPU synchronization to wake up the read()ing thread, and that imposes a memory barrier.
msg299795 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-08-06 11:29
I've recently merged which might address this.  If you know of a way of reproducing the (theoretical) issue outlined here, it would be nice to know if it's fixed on git master.
msg299796 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-08-06 12:42
Since it's described as a bug, should the change be backported to 3.6? I
don't think that 2.7 has required tools to backport this change (C99,
atomic types).
msg299802 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-08-06 15:08
Backporting would be reasonable IMHO.
Date User Action Args
2017-08-06 15:08:09pitrousetmessages: + msg299802
2017-08-06 12:42:09vstinnersetmessages: + msg299796
2017-08-06 11:29:18pitrousetnosy: + pitrou
messages: + msg299795
2017-08-04 20:42:54njssetmessages: + msg299761
2017-08-04 20:30:33arigosetnosy: + arigo
messages: + msg299760
2017-08-04 18:27:04njssetmessages: + msg299756
2017-08-04 08:36:33njscreate