This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author njs
Recipients njs, vstinner
Date 2017-04-11.09:11:22
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1491901883.05.0.966298195873.issue30038@psf.upfronthosting.co.za>
In-reply-to
Content
In trip_signal [1], the logic goes:

1) set the flag saying that this particular signal was tripped
2) write to the wakeup fd
3) set the global is_tripped flag saying "at least one signal was tripped", and do Py_AddPendingCall (which sets some global flags that the bytecode interpreter checks on every pass through the loop)

So the problem here is that it's step (2) that wakes up the main thread to check for signals, but it's step (3) that actually arranges for the Python-level signal handler to run. (Step (1) turns out to be irrelevant, because no-one looks at the per-signal flags unless the global is_tripped flag is set. This might be why no-one noticed this bug through code inspection though – I certainly missed it, despite explicitly checking for it several times!)

The result is that the following sequence of events is possible:

- signal arrives (e.g. SIGINT)
- trip_signal writes to the wakeup fd
- the main thread blocked in IO wait sees this, and wakes up
- the main thread checks for signals, and doesn't find any
- the main thread empties the wakeup fd
- the main thread goes back to sleep
- trip_signal sets the flags to request the Python-level signal handler be run
- the main thread doesn't notice, because it's asleep

It turns out that this is a real thing that can actually happen; it's causing an annoying intermittent failure in the trio testsuite on appveyor; and under the correct conditions I can reproduce it very reliably in my local Windows VM. See [2].

I think the fix is just to swap the order of steps (2) and (3), so we write to the wakeup fd last. Unfortunately I can't easily test this because I don't have a way to build CPython on Windows. But [2] has some IMHO pretty compelling evidence that this is what's happening.

[1] https://github.com/python/cpython/blob/6fab78e9027f9ebd6414995580781b480433e595/Modules/signalmodule.c#L238-L291
[2] https://github.com/python-trio/trio/issues/119
History
Date User Action Args
2017-04-11 09:11:23njssetrecipients: + njs, vstinner
2017-04-11 09:11:23njssetmessageid: <1491901883.05.0.966298195873.issue30038@psf.upfronthosting.co.za>
2017-04-11 09:11:22njslinkissue30038 messages
2017-04-11 09:11:22njscreate