This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: ctypes interfers with signal handling
Type: behavior Stage: resolved
Components: ctypes Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Andre Merzky, eryksun, gslavin
Priority: normal Keywords:

Created on 2016-08-29 22:47 by Andre Merzky, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (15)
msg273889 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-29 22:47
Summary:  A thread uses signals to the MainThread to signal an abnormal condition.  The main thread is expected to get that signal, invoke the signal handler, and raise an exception.  In combination with 'ctypes' calls, that does not happen.


Consider the following code:

------------------------------------------------
#!/usr/bin/env python

import multiprocessing as mp
import threading       as mt
import signal
import time
import os

# from uuid.py line 400
import ctypes, ctypes.util
lib = ctypes.CDLL(ctypes.util.find_library('uuid'))

def sigusr2_handler(signum, frame):
    raise RuntimeError('caught sigusr2')
signal.signal(signal.SIGUSR2, sigusr2_handler)

def sub(pid):
    time.sleep(1)
    os.kill(pid, signal.SIGUSR2)

try:
  # p = mp.Process(target=sub, args=[os.getpid()])
  # p.start()
    t = mt.Thread(target=sub, args=[os.getpid()])
    t.start()
    time.sleep(3)
except Exception as e:
    print 'except: %s' % e
else:
    print 'unexcepted'
finally:
  # p.join()
    t.join()
------------------------------------------------

With Python 2.7 I would expect the output:
------------------------------------------------
except: caught sigusr2
------------------------------------------------

but I get:
------------------------------------------------
Traceback (most recent call last):
  File "./bug.py", line 29, in <module>
    print 'unexcepted'
  File "./bug.py", line 13, in sigusr2_handler
    raise RuntimeError('caught sigusr2')
  File "./bug.py", line 29, in <module>
    print 'unexcepted'
  File "./bug.py", line 13, in sigusr2_handler
    raise RuntimeError('caught sigusr2')
RuntimeError: caught sigusr2
------------------------------------------------


most of the time.  The problem only happens when the 'ctypes.CDLL' line is enabled -- commenting it out results in the expected behavior.  That line is copied from uuid.py -- importing uuid.py triggers the same unexpected behavior, which is ultimately why I am stuck.

Note that the problem only occurs when a *thread* sends the signal -- it does *not* happen if the signal is sent by the main thread or by a different process (switch to the multiprocessing code path for confirmation).

Interestingly, the problem also disappears when a 'print' statement is added after the 'time.sleep(3)', which may (or may not) indicate a timing issue.

I would welcome any suggestion on how to further triage this.
msg273891 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-08-30 00:36
I can't reproduce this issue in 2.7.11 on Linux. I also don't see how it could be related to ctypes. It doesn't meddle with signal handlers, and neither does Linux libuuid -- not as far I as I can tell with breakpoints set on signal() and sigaction(). Please provide the Python version and platform that you're using.
msg273894 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-30 02:04
Thanks for checking!  I use:

$ uname -a
Linux thinkie 3.11-2-amd64 #1 SMP Debian 3.11.8-1 (2013-11-13) x86_64 GNU/Linux

$ python -V
Python 2.7.5+


Note that the problem does not occur on every run -- but in more than 50% of the cases, for me.  I am on a somewhat old machine right now (M620), not sure if that matters, and will try to reproduce on some other boxes tomorrow.
msg273913 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-30 12:40
I can confirm the same unexpected behavior for 

Python 2.7.6 / Linux radical 4.2.0-25-generic #30~14.04.1-Ubuntu (2nd attempt)
Python 2.7.9 / Linux login1.stampede.tacc.utexas.edu 2.6.32-431.17.1.el6.x86_64 (32nd attempt)
Python 2.7.10 / Linux gordon-ln2.sdsc.edu 2.6.32-431.29.2.el6.x86_64 (5th attempt)

Any suggestion for further testing?
msg273953 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-08-30 19:07
I also see the problem on 2.7.11, on MacOS, but with significantly lower frequency.  I can't tell if that is due to changes in Python, the different architecture, or whatever...

$ python -V
Python 2.7.11

$ uname -a
Darwin cameo.local 14.4.0 Darwin Kernel Version 14.4.0: Thu May 28 11:35:04 PDT 2015; root:xnu-2782.30.5~1/RELEASE_X86_64 x86_64

$ while true; do i=$((i+1)); echo -n "$i: "; python t.py || break; done
1: except: caught sigusr2
2: except: caught sigusr2
3: except: caught sigusr2
...
134: except: caught sigusr2
Traceback (most recent call last):
  File "t.py", line 30, in <module>
    print 'unexcepted'
  File "t.py", line 14, in sigusr2_handler
    raise RuntimeError('caught sigusr2')
RuntimeError: caught sigusr2
msg274488 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 01:00
I can reproduce this issue with Python 2.7.12 and Python 3.5.2 on a Arch linux VM using the python script provided, but I think this is an error in the code. 

The repro is tied to the time.sleep call in the try block.  If I do time.sleep(1), I can reproduce the issue almost every time (with or without ctypes).  When the time.sleep() call is too short, you reach the finally block before the main thread receives the signal.  Because you're in the finally block, the exception isn't caught.
msg274524 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 05:29
> The repro is tied to the time.sleep call in the try block.  If I do time.sleep(1)

Yes - if both sleeps, the one in the `try` block and the one in the thread's routine (`sub`) are equal, then you'll have the typical race, and you can well be in the `finally` clause when the exception arises.

The problem though is different: please feel free to set the sleep in the `try` block to `sleep(100)` -- and unless the thread creation and startup takes 99 seconds, you should *not* run into that race, the problem however persists:

-------------------------------------------------------
merzky@thinkie:~ $ grep -C 2 sleep bug.py

def sub(pid):
    time.sleep(1)
    os.kill(pid, signal.SIGUSR2)

--
    t = mt.Thread(target=sub, args=[os.getpid()])
    t.start()
    time.sleep(100)
except Exception as e:
    print 'except: %s' % e

merzky@thinkie:~ $ while true; do i=$((i+1)); echo -n "$i: "; python bug.py || break; done
1: except: caught sigusr2
2: except: caught sigusr2
3: except: caught sigusr2
4: Traceback (most recent call last):
  File "bug.py", line 30, in <module>
    print 'unexcepted'
  File "bug.py", line 14, in sigusr2_handler
    raise RuntimeError('caught sigusr2')
RuntimeError: caught sigusr2
------------------------------------------------------

In this case, the begaviour does depend on `ctypes` -- or at least I have not seen that problem without `ctypes` being used.

Thanks, Andre.
msg274530 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 06:47
Thanks for the reply!  I've managed to reproduce the issue without using ctypes, so we can exclude ctypes as the cause of the problem :)

The bug only occurs for me after hundreds of iterations of the script, so it is very intermittent.

From my results, it appears the signal handler is called, but the exception isn't caught in the try block (this matches your result).

From the call stack it appears the exception handler is called at
"print 'unexcepted'", which is after the try block.

Interestingly, if I place a print after time.sleep, I cannot repro this issue.  Maybe there is a race between the signal handler execution and execution of the try block in the main thread?  Adding the print might slow down the main thread enough that the race disappears.

I don't know the source that well, so I'll need to do some digging to get any more information.

I hope this helps!

-George Slavin
msg274531 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 06:56
Also thanks for the reply! :)

Interesting that our results differ: as said, my code stumbles over it rather frequently,  and I can't reproduce w/o ctypes activities.  But indeed, the latter might just reinforce a race condition which is present either way...  

Yes, I also observed that a print statement (or other activities) makes the problem appear less frequently -- I would also guess that it influences a potential race somewhere.

Thanks for looking into it!

Andre.
msg274533 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 07:20
In case anyone else sees this thread,  here's my trimmed down script to repro the issue:

#!/usr/bin/env python2

import threading       as mt
import signal
import time
import os

def sigusr2_handler(signum, frame):
    raise RuntimeError('caught sigusr2')
signal.signal(signal.SIGUSR2, sigusr2_handler)

def sub(pid):
    time.sleep(1)
    os.kill(pid, signal.SIGUSR2)

try:
    t = mt.Thread(target=sub, args=[os.getpid()])
    t.start()
    time.sleep(500)
except Exception as e:
    print('except: %s' % e)
else:
    print('unexcepted')
finally:
    t.join()


When I run the above, I get the repro after hundreds of iterations.  From these results, it appears there is no guarentee that the signal handler will run before the main thread continues execution at the time.sleep(500) line.  This would explain why we advance to the else clause before the exception is raised.
msg274534 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 07:35
Hi George, 

> From these results, it appears there is no guarentee that the signal handler will run before the main thread continues execution at the time.sleep(500) line.  This would explain why we advance to the else clause before the exception is raised.

To me it looks like the problem pops up *way* before the `sleep(100)` (or whatever) finishes, in fact it looks consistently like the sleep is indeed interrupted after one second.  I would it thus interpret differently, as the code should not be able to advance to the `else` clause at that time.

Is that different for you?
msg274537 - (view) Author: George Slavin (gslavin) * Date: 2016-09-06 07:46
The docs say the sleep call will end if a signal is caught, so once the
main thread wakes, it won't go back to sleep.

On Sep 6, 2016 12:35 AM, "Andre Merzky" <report@bugs.python.org> wrote:

>
> Andre Merzky added the comment:
>
> Hi George,
>
> > From these results, it appears there is no guarentee that the signal
> handler will run before the main thread continues execution at the
> time.sleep(500) line.  This would explain why we advance to the else clause
> before the exception is raised.
>
> To me it looks like the problem pops up *way* before the `sleep(100)` (or
> whatever) finishes, in fact it looks consistently like the sleep is indeed
> interrupted after one second.  I would it thus interpret differently, as
> the code should not be able to advance to the `else` clause at that time.
>
> Is that different for you?
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue27889>
> _______________________________________
>
msg274541 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-06 08:51
I think we are on the same page then, thanks.  

AFAIU, the C-level signal handler results in a flag being set, which is evaluated at some later point in time[1], after a certain number of opcodes have been executed.  Could it be that those opcodes blindly continue to walk into the `else` clause despite the sleep interruption?

[1] https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers
msg275294 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-09 09:29
I would appreciate any suggestion on how to avoid this problem -- otherwise it seems that I can't reliably use signals in threaded contexts at all :(

Thanks, Andre.
msg276203 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2016-09-13 07:44
FWIW, a workaround seems to be a nested try/except clause:

try:
  try:
    do_lots_of_work()
  except RuntimeError as e:
    print 'signal handling worked'
except RuntimeError:
  print 'signal handling delayed'

I did a stress test over 100k runs, and got no unexpected behavior that way.  I am not sure if the underlying race condition is just hidden, or if the inner except clause triggers the signal handling code internally -- either way, while cumbersome to use, it seems to work.

I'd still appreciate feedback on a cleaner solution.

Thanks, Andre.
History
Date User Action Args
2022-04-11 14:58:35adminsetgithub: 72076
2021-02-25 12:35:33eryksunsetstatus: open -> closed
resolution: out of date
stage: resolved
2016-09-13 07:44:48Andre Merzkysetmessages: + msg276203
2016-09-09 09:29:07Andre Merzkysetmessages: + msg275294
2016-09-06 08:51:41Andre Merzkysetmessages: + msg274541
2016-09-06 07:46:28gslavinsetmessages: + msg274537
2016-09-06 07:35:07Andre Merzkysetmessages: + msg274534
2016-09-06 07:20:51gslavinsetmessages: + msg274533
2016-09-06 06:56:29Andre Merzkysetmessages: + msg274531
2016-09-06 06:47:04gslavinsetmessages: + msg274530
2016-09-06 05:29:59Andre Merzkysetmessages: + msg274524
2016-09-06 01:00:04gslavinsetnosy: + gslavin
messages: + msg274488
2016-08-30 19:07:31Andre Merzkysetmessages: + msg273953
2016-08-30 12:40:55Andre Merzkysetmessages: + msg273913
2016-08-30 02:04:31Andre Merzkysetmessages: + msg273894
2016-08-30 00:36:43eryksunsetnosy: + eryksun
messages: + msg273891
2016-08-29 22:47:38Andre Merzkycreate