Issue 756924
This issue tracker has been migrated to GitHub and is currently read-only. For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2003-06-18 23:28 by morngnstar, last changed 2022-04-10 16:09 by admin. This issue is now closed.
Files

| File name | Uploaded | Description |
|---|---|---|
| zombie.tar | morngnstar, 2003-06-18 23:28 | test case |
Messages (41)

msg16475 | Author: Greg Jones (morngnstar) | Date: 2003-06-18 23:28

When a segmentation fault happens on Linux in any thread but the main thread, the program exits, but zombie threads remain behind.

Steps to reproduce:

1. Download the attached tar and extract the files zombie.py and zombieCmodule.c.
2. Compile and link zombieCmodule.c as a shared library (or use whatever other method you prefer for making a Python extension module).
3. Put the output from step 2 (zombieC.so) in your lib/python directory.
4. Run python2.2 zombie.py.
5. After the program exits, run ps.

zombie.py launches several threads that just loop forever, and one that calls a C function in zombieC. The latter prints "NULL!" and then segfaults intentionally, printing "Segmentation fault". The program then exits, returning control to the shell.

Expected (and Python 2.1) behavior: no Python threads appear in the output of ps.

Actual Python 2.2 behavior: 5 Python threads appear in the output of ps. To kill them, you have to apply kill -9 to each one individually.

Not only does this bug leave messy zombie threads around, but the threads left behind also hold on to program resources. For example, if the program binds a socket, that port cannot be bound again until someone kills the threads. Of course, programs should not generate segfaults, but if they do, they should fail gracefully.

I have identified the cause of this bug. The old Python 2.1 behavior can be restored by removing these lines of Python/thread_pthread.h:

`sigfillset(&newmask); SET_THREAD_SIGMASK(SIG_BLOCK, &newmask, &oldmask);`

... and ...

`SET_THREAD_SIGMASK(SIG_SETMASK, &oldmask, NULL);`

I guess even SIGSEGV gets blocked by this code, and somehow that prevents the default behavior of segfaults from working correctly. I'm not suggesting that removing this code is a good way to fix this bug. This is just an example to show that it seems to be the blocking of signals that causes this bug.
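Here is a minimal Python 3 sketch of the mechanism described above. It assumes a POSIX system and Python 3.8+ (for `signal.valid_signals()`). `start_thread_like_python22` and `current_mask` are hypothetical helper names, not CPython APIs. The sketch brackets thread creation with a block-everything/restore pair, as the quoted `thread_pthread.h` lines do. It then checks that the new thread keeps every signal blocked while the creator's own mask returns to normal.

```python
import signal
import threading


def current_mask() -> set:
    # SIG_BLOCK with an empty set changes nothing and returns the current mask.
    return signal.pthread_sigmask(signal.SIG_BLOCK, [])


def start_thread_like_python22(target) -> threading.Thread:
    """Start `target` the way the quoted thread_pthread.h code creates threads:
    block every signal, create the thread, then restore the creator's mask.
    The new thread inherits the all-blocked mask and keeps it."""
    # sigfillset(&newmask); SET_THREAD_SIGMASK(SIG_BLOCK, &newmask, &oldmask);
    old = signal.pthread_sigmask(signal.SIG_BLOCK, signal.valid_signals())
    try:
        t = threading.Thread(target=target)
        t.start()  # pthread_create() copies the mask in force right now
    finally:
        # SET_THREAD_SIGMASK(SIG_SETMASK, &oldmask, NULL);
        signal.pthread_sigmask(signal.SIG_SETMASK, old)
    return t
```

A new pthread inherits its creator's mask at creation time, so restoring the creator's mask afterwards does nothing for the new thread. That is why SIGSEGV stays blocked in every thread spawned this way.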

msg16476 | Author: Greg Jones (morngnstar) | Date: 2003-06-18 23:54

Logged In: YES user_id=554883

Related to Bug #756940.

msg16477 | Author: Andrew Langmead (langmead) | Date: 2004-05-04 14:00

Logged In: YES user_id=119306

The issue is that the threading implementation in Linux kernels before 2.6 diverged from the pthreads standard for signal handling. Normally, signals are sent to the process and can be handled by any thread. In the LinuxThreads implementation of pthreads, signals are sent to a specific thread. If that thread blocks signals (which is what happens to all threads spawned in Python 2.2), those signals do not get routed to a thread that has them unblocked (what Python calls the "main thread").

The new threading facility in Linux 2.6, NPTL, does not have this signal handling bug.

A simple Python script that shows the problem is included below. It will hang on Linux kernels before 2.6, or on RedHat customized kernels before RH9.

```python
#!/usr/bin/python
import signal
import thread
import os

def handle_signals(sig, frame):
    pass

def send_signals():
    os.kill(os.getpid(), signal.SIGSEGV)

signal.signal(signal.SIGSEGV, handle_signals)
thread.start_new_thread(send_signals, ())
signal.pause()
```

msg16478 | Author: Tim Peters (tim.peters) * | Date: 2004-05-04 14:44

Logged In: YES user_id=31435

Noting that this has become a semi-frequent topic on the zope-dev mailing list, most recently in the "Segfault and Deadlock" thread starting here: <http://mail.zope.org/pipermail/zope-dev/2004-May/022813.html>

msg16479 | Author: Kjetil Jacobsen (kjetilja) | Date: 2004-05-05 08:28

Logged In: YES user_id=5685

I've experienced similar behaviour with hung threads on other platforms, such as HP/UX, so we should consider letting some signals through to all threads on all platforms. For instance, very few apps use signal handlers for SIGILL, SIGFPE, SIGSEGV, SIGBUS and SIGABRT, so unblocking those signals should cause much less breakage than blocking all signals does.
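That policy can be sketched in Python 3, assuming a POSIX system and Python 3.8+. The sketch builds a thread mask that blocks everything except the five signals listed above and installs it only around thread creation. `FAULT_SIGNALS`, `mask_for_new_threads` and `start_thread_with_mask` are hypothetical names. The actual signal list used by patch 949332 is not reproduced here.

```python
import signal
import threading

# The five signals named above. They are usually generated synchronously for the
# thread that faulted (or, for SIGABRT, the thread that called abort()).
FAULT_SIGNALS = frozenset({signal.SIGILL, signal.SIGFPE, signal.SIGSEGV,
                           signal.SIGBUS, signal.SIGABRT})


def mask_for_new_threads() -> set:
    """Block everything in spawned threads except the fault signals."""
    return set(signal.valid_signals()) - FAULT_SIGNALS


def start_thread_with_mask(target, mask) -> threading.Thread:
    """Start `target` in a thread whose initial signal mask is exactly `mask`."""
    old = signal.pthread_sigmask(signal.SIG_SETMASK, mask)
    try:
        t = threading.Thread(target=target)
        t.start()  # the new thread inherits the mask in force right now
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old)  # creator unaffected
    return t
```

SIGKILL and SIGSTOP appear in `valid_signals()` but cannot be blocked; the kernel silently ignores them in the mask.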

msg16480 | Author: Tim Peters (tim.peters) * | Date: 2004-05-06 20:05

Logged In: YES user_id=31435

Boosting priority, hoping to attract interest before 2.3.4. Patch 949332 looks relevant.

msg16481 | Author: Anthony Baxter (anthonybaxter) | Date: 2004-05-07 12:39

Logged In: YES user_id=29957

We're a week out from the release candidate, and this seems (to me) to be an area that's fraught with risk. The terms "HP/UX" and "threads" have also cropped up, which, for me, is a marker of "here be sodding great big dragons". I don't mind delaying the release if it's necessary and there's a definite path to getting a nice clean fix in that won't break things for some other class of platform. This stuff looks like it will be a beast to test, though.

msg16482 | Author: Michael Hudson (mwh) | Date: 2004-05-07 12:56

Logged In: YES user_id=6656

Note that there is an attempt at a configure test in 948614, but it seems very LinuxThreads-specific. I agree with Anthony that this area is very scary. The last thing we want to do a fortnight before release is break things somewhere they currently work. On the gripping hand, when there's a modern, actually working implementation of pthreads, I don't think we actually need to block signals at all. I certainly don't have the threads-fu to come up with appropriate configure/pyport.h magic, though. I'm not sure I have the energy to test a patch on all the testdrive, snake farm and SF compile farm machines either.

msg16483 | Author: Anthony Baxter (anthonybaxter) | Date: 2004-05-07 13:06

Logged In: YES user_id=29957

I'd prefer to see any patches in this area on the trunk, along with tests to exercise them (and confirm that they're not breaking something else). We can then give them a solid testing during the 2.4 release cycle. I don't want to have to stretch the bugfix release cycle out to have alphas, betas and the like. This seems like huge piles of no-fun.

msg16484 | Author: Andrew Langmead (langmead) | Date: 2004-05-07 13:48

Logged In: YES user_id=119306

I submitted two different thread-related patches. I agree that <http://sourceforge.net/tracker/?func=detail&aid=948614&group_id=5470&atid=305470> is pretty radical. (It's the one that tests at configure time for LinuxThreads peculiarities and alters the thread spawning and signal-related activities accordingly.) A different, related signal patch, <http://sourceforge.net/tracker/?func=detail&aid=949332&group_id=5470&atid=305470>, might be more appealing to you. It only unblocks signals, like segmentation faults, that are generated synchronously and that a pthreads implementation will always send to the faulting thread (whether that thread blocks them or not).

msg16485 | Author: Andrew Langmead (langmead) | Date: 2004-05-07 13:59

Logged In: YES user_id=119306

mwh wrote: "when there's a modern, actually working implementation of pthreads, I don't think we actually need to block signals at all."

The bug report that caused the patch to be created was originally reported on Solaris, which has a more correct pthreads implementation. I'm now wondering whether that problem was caused not by signals being handled by the spawned threads, but by the signal handler checking `if (getpid() == main_pid)` rather than `(PyThread_get_thread_ident() == main_thread)`. On a standards-compliant pthreads implementation, and even on Solaris, getpid() will always equal main_pid.

For the Linux case, we may have a more modern, working threads implementation now, but since the old LinuxThreads-style behavior was out and deployed for 8 years or so, it will probably be around for a while.

msg16486 | Author: Tim Peters (tim.peters) * | Date: 2004-05-07 16:50

Logged In: YES user_id=31435

Assigned to Guido to get an answer to one of the questions here. Guido, signal_handler() checks getpid() against main_pid, and has done so ever since revision 2.3 (when you first taught signalmodule.c about threads). But on every pthreads box except for Linux, getpid() should always equal main_pid (even after a fork). What was the intent? I read the comments the same way Andrew does here: the intent was to check thread identity, not process identity.
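The two "main thread" tests being compared can be illustrated with a small Python 3 sketch. It assumes a POSIX-conforming threads implementation such as Linux NPTL, where every thread shares one process ID. `MAIN_PID`, `MAIN_THREAD` and `identity_checks_in_new_thread` are hypothetical names modelled on `main_pid` and `main_thread`. Run from a spawned thread, the pid test still passes while the thread-identity test fails, so on such systems the pid test cannot single out the main thread.

```python
import os
import threading

# Recorded once at startup, analogous to main_pid / main_thread in signalmodule.c.
MAIN_PID = os.getpid()
MAIN_THREAD = threading.get_ident()


def identity_checks_in_new_thread() -> dict:
    """Run both 'is this the main thread?' tests from a freshly spawned thread."""
    result = {}

    def worker():
        result["pid_test"] = os.getpid() == MAIN_PID                  # process identity
        result["thread_test"] = threading.get_ident() == MAIN_THREAD  # thread identity

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result
```

On NPTL this returns `{'pid_test': True, 'thread_test': False}`. Under LinuxThreads or IRIX, where getpid() differed per thread (as described in this discussion), the pid test would fail in the spawned thread as well.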

msg16487 | Author: Guido van Rossum (gvanrossum) * | Date: 2004-05-07 21:26

Logged In: YES user_id=6380

(You had me confused for a bit -- I thought you meant Python 2.3, but you meant file revision 2.3, which was in 1994...)

It can't be as simple as that; the 1994 code (rev 2.3) initializes both main_pid and main_thread, and checks one or the other in different places. The NOTES in that version don't shed much light on the issue, except to claim that checking getpid() is a hack that works on three named platforms: SGI, Solaris, and POSIX threads. The code that re-initializes main_pid in PyOS_AfterFork() was added much later (rev 2.30, in 1997).

Here's my theory. On SGI IRIX, like on pre-2.6-kernel Linux, getpid() differs per thread, and SIGINT is sent to each thread. The getpid() test here does the right thing: it only sets the flag once (in the handler in the main thread). On Solaris, the getpid() test is a no-op, which is fine since only one thread gets the signal handler. Those were the only two cases that the getpid() test really cared about; the NOTES section speculated that it would also work with POSIX threads if the signal was only delivered to the main thread.

Conclusion: the getpid() test was *not* a mistake, and replacing it with a get_thread_ident() test is not the right answer. But the getpid() test is probably not correct for all pthreads implementations, and some fix may be necessary. I also agree that blocking all signals is too aggressive, but am not sure what to do about this either. (It has caused some problems in my own code where I was spawning a subprocess in a thread, and the subprocess inherited the blocked signals, causing it to be unkillable except through SIGKILL.)

Am I off the hook now?

msg16488 | Author: Tim Peters (tim.peters) * | Date: 2004-05-10 01:13

Logged In: YES user_id=31435

Whether you're off the hook depends on whether you're determined to be <wink>. I don't run on Unixish systems, so I'm not a lot of use here.

The problem you've had with unkillable subprocesses also affects Zope, and you'll recall that the zdaemon business tries to keep Zope sites running via signal cruft. As people have tried to move from Python 2.1 to Python 2.3, they're discovering that Zope sites fail hard because of the signal-blocking added after 2.1: "when an external python module segfaults during a zope request ... the remaining worker threads are deadlocked", from http://tinyurl.com/2qslw, and zdaemon doesn't do its job then.

Andrew has in mind a scheme for not blocking "synchronous" signals, which makes sense to me, but I'm out of touch with this stuff. If you can't review it, who can? It would sure be nice to get a resolution into 2.3.4, although I understand that may be perceived as too risky. The alternative, from my immediate POV, is that people running Zope-based apps on some Unixish systems stay away from Python 2.3, which is a real shame. For that matter, unless this is resolved, I suppose they'll stay away from Python 2.4 too.

msg16489 | Author: Anthony Baxter (anthonybaxter) | Date: 2004-05-10 16:08

Logged In: YES user_id=29957

I'd strongly prefer that this go into the trunk, and sooner rather than later. I'd even more strongly prefer that this not go anywhere near the release23-maint branch, at least until _after_ 2.3.4 is done. If there ends up being a nice easy way to do this, great! We can cut a 2.3.5 around the same time as 2.4 final. Putting this into 2.3.4 seems, to me, to be a hell of a risk.

msg16490 | Author: Andrew Langmead (langmead) | Date: 2004-05-10 18:45

Logged In: YES user_id=119306

Unfortunately, in pthreads "synchronous" doesn't apply to a signal number but to its method of delivery. You can deliver a SIGSEGV asynchronously with the kill command, and you can send normally asynchronous signals to a specific thread with pthread_kill. What <http://sourceforge.net/tracker/?func=detail&aid=949332&group_id=5470&atid=305470> does is unblock signals like SIGSEGV, which are likely to be sent synchronously by the OS and are unlikely to have asynchronous handlers installed by normal processes.

msg16491 | Author: Guido van Rossum (gvanrossum) * | Date: 2004-05-10 19:47

Logged In: YES user_id=6380

I agree with Anthony: too much risk for 2.3.4. I don't claim to understand this code any more; in particular, the signal blocking code that's currently there wasn't written by me, and if I checked it in, I did it hoping for the best... Langmead is right about signal asynchrony.

msg16492 | Author: Tim Peters (tim.peters) * | Date: 2004-05-10 20:06

Logged In: YES user_id=31435

Unassigned (was assigned to Guido, but it doesn't sound like he's going to do more with it).

msg16493 | Author: Andrew Langmead (langmead) | Date: 2004-05-10 22:46

Logged In: YES user_id=119306

The original bug that added the signal blocking, #465673, seems to be exposing itself via a combination of threads and readline. Is it possible that the problem is there, and not within the signal handling code itself? (Especially since readline installs and removes a SIGINT handler, possibly causing a race condition with the code within the signal handler when it re-installs itself. On systems that have sigaction, should Python need to re-install handlers at all?)

I'm tempted to try the following and, if it works, submit a patch. Does this seem like it would be the right direction?

* Remove the section of thread creation that blocks signals.
* Look for sections of code that may have reentrancy issues, like:
  * On machines with reliable signals, keep the signal handler installed, rather than reinstalling it within the handler.
  * Change Py_AddPendingCall to use a real semaphore, if available, rather than a busy flag.
  * Change readline.c to use more thread-safe constructs where available (arrange so that the longjmp out of the signal handler is only executed for the thread that is using readline, and use siglongjmp if available)

and then see if issues like this one are solved without reintroducing issues from 465673.

msg16494 | Author: Michael Hudson (mwh) | Date: 2004-05-11 08:52

Logged In: YES user_id=6656

That does indeed sound reasonable, but not for 2.3.4 (professional cowardice, I'm afraid). Good luck!

msg16495 | Author: Guido van Rossum (gvanrossum) * | Date: 2004-05-11 13:59

Logged In: YES user_id=6380

I'm beginning to think that langmead may be on to something: that blocking all signals in all threads is Just Plain Wrong (tm). The Zope SIGSEGV problem is just an example; I have my own beef with SIGTERM, which ends up blocked (together with all other signals) in child processes started from a thread.

I would love to see langmead's patch! (For Python 2.4.) Make sure to test the behavior from 465673 (and possibly 219772?) after removing the signal blocking but before adding the new fixes, and again after applying those, to make sure 465673 is really gone.

Also, I'd like to hear from jasonlowe, who submitted bug 465673 and the patch that caused all the problems, 468347. Maybe his signal-fu has increased since 2001.

It would be a miracle to get this into 2.3.4 though...

msg16496 | Author: Nobody/Anonymous (nobody) | Date: 2004-05-11 19:52

Logged In: NO

I agree, the original patch I submitted is horribly ham-fisted because it blocks all signals. I'm kicking myself for not foreseeing the problems with SIGSEGV, SIGTERM, etc. as reported in 756924.

The original problem I was trying to fix was that the wrong thread (i.e., any thread but the main thread) would receive the SIGINT and end up doing the longjmp() to the context saved by the main thread. Then we have two threads executing on the main thread's stack, which is a Bad Thing. With the way the readline support currently handles SIGINT via setjmp()/longjmp(), you really want the main thread and only the main thread to get the SIGINT and perform that longjmp().

Would it be reasonable to block only SIGINT (and not other signals) for all threads but the main thread? That would force SIGINT to be handled by the main thread and eliminate the worry that the wrong thread will do the longjmp() into the main thread's context in the readline code.

I agree with large parts of langmead's proposed approach to fixing this, but I do have concerns about the combination of these two parts:

* Remove the section of thread creation that blocks signals.
* Change readline.c to use more thread safe constructs where available (arrange so that the longjmp out of the signal handler is only executed for the thread that is using readline, and use siglongjmp if available)

According to the book "Programming with Threads" by Kleiman, Shah, and Smaalders: "Asynchronously generated signals are sent to the process as a whole where they may be serviced by any thread that has the signal unmasked. If more than one thread is able to receive a signal sent to the process, only one is chosen." If we leave SIGINT unmasked on all threads, then the signal handler will need to check the thread ID and, if it is not the main thread, use pthread_kill(main_thread, SIGINT) to defer the work to the main thread. In that sense, it'd be simpler to block SIGINT in all threads and force the system to route the SIGINT to the main thread directly. Of course, if a particular threads implementation doesn't have the desired asynchronous signal routing behavior, maybe leaving SIGINT unmasked and using the pthread_kill(main_thread, SIGINT) technique could work around that.

So, to sum up, I'm in complete agreement with unblocking most if not all signals in other threads, and with langmead's proposals to leverage the benefits provided by sigaction() and siglongjmp() when possible. I have one question, though. Would it be reasonable to force SIGINT to be handled only by the main thread, or is there a need for Python threads other than the main thread to handle/receive SIGINT? If the latter, then the setjmp()/longjmp() mechanism currently used in the readline module is going to be problematic.

msg16497 | Author: Jason Lowe (jasonlowe) | Date: 2004-05-11 19:57

Logged In: YES user_id=56897

Ack. I thought I was logged in for that previous comment, which was from me (jasonlowe).

msg16498 | Author: Andrew Langmead (langmead) | Date: 2004-05-11 22:46

Logged In: YES user_id=119306

I was handwaving a bit over the "arrangements" to make with the siglongjmp. It is probable that blocking SIGINT in all spawned threads will be the easiest. It will also work in both the pthreads and LWP case (signal sent to one unblocked thread in the process) and the LinuxThreads and SGI threads case (signal broadcast to the process group, which includes each thread individually).

The only thing I wanted to double-check was whether readline could be executed by any thread other than the main thread. If so, the SIGINT handler needs to check not whether it is the main thread, but rather whether it is the (or *a*?) thread that is currently in the middle of a readline call.

msg16499 | Author: Guido van Rossum (gvanrossum) * | Date: 2004-05-11 22:54

Logged In: YES user_id=6380

But if you still block SIGINT (why is SIGINT special?) in all threads, processes forked from threads will be started with SIGINT blocked, and that's still wrong.

msg16500 | Author: Guido van Rossum (gvanrossum) * | Date: 2004-05-11 23:04

Logged In: YES user_id=6380

And I think it is possible to call readline() from any thread. (Though it would be a problem if multiple threads were doing this simultaneously :-)

msg16501 | Author: Andrew Langmead (langmead) | Date: 2004-05-11 23:41

Logged In: YES user_id=119306

The only thing special about SIGINT is that the readline module uses PyOS_setsig to set it, and when readline's special SIGINT handler is set, it throws all of the careful thread handling in Modules/signalmodule.c: signal_handler out the window. Now that I say it out loud, PyOS_setsig deserves some consideration on its own.

msg16502 | Author: Jason Lowe (jasonlowe) | Date: 2004-05-12 13:54

Logged In: YES user_id=56897

SIGINT is 'special' because that's the signal behind the problems reported in bug 465673. Given the readline module's setjmp/longjmp mechanism for processing SIGINT, we simply cannot allow one thread to do the setjmp() and another thread to do the longjmp() when it receives SIGINT. Without the setjmp/longjmp stuff, SIGINT is no more special than any other asynchronous signal like SIGTERM, SIGUSR1, etc.

It'd be great if we could get the desired behavior for SIGINT out of the readline module without setjmp/longjmp, but without help from the readline library I don't see an easy way to do this. The readline library insists on continuing the readline() call after a SIGINT is handled, and there doesn't appear to be any way to get it to abort the current readline() call short of modifying the readline library.

If we're stuck with the setjmp/longjmp mechanism, I think we can solve the issues regarding readline() being called from another thread and exec'd processes from threads by using the pthread_kill() technique mentioned earlier. The steps would look something like this:

- Do not block any signals (including SIGINT) in any threads.
- When we initialize the readline module's jmp_buf via setjmp(), save off the current thread ID. We probably want to check for existing ownership of jmp_buf and flag an error if it is detected.
- When the readline module's SIGINT handler is invoked, check whether the current thread owns jmp_buf. If we are the owning thread, then execute the longjmp (or siglongjmp). If we're not the owning thread, then have the current thread execute pthread_kill(jmp_buf_owner_thread, SIGINT) and little else. This will defer the SIGINT to the only thread that can really process it correctly.
- Since SIGINT isn't blocked in any thread, processes exec'd from threads should get the default behavior for SIGINT rather than having it blocked.

The above algorithm has a race condition on thread implementations where all threads receive SIGINT. The race can cause SIGINT to be processed more than once: the jmp_buf owning thread might finish processing SIGINT before another thread starts its processing and re-sends SIGINT to the jmp_buf owning thread. If there's a way to know via configure that we're on a thread implementation that broadcasts SIGINT, we could #ifdef the code to use something like the getpid() hack in signalmodule.c to do the right thing.
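The bookkeeping half of the plan above can be sketched as a hypothetical Python model; this is not readline.c. `ReadlineJmpOwner`, `claim()`, `release()` and `route_sigint()` are invented names. `claim()` stands in for filling the jmp_buf with setjmp(), and `route_sigint()` returns the action a C-level SIGINT handler would take. The sketch deliberately stops short of calling pthread_kill(), which msg16505 points out is not guaranteed to be async-signal-safe.

```python
import threading


class ReadlineJmpOwner:
    """Hypothetical model of the jmp_buf ownership rules described above."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.owner = None  # ident of the thread that filled jmp_buf, or None

    def claim(self) -> None:
        """Called where readline.c would run setjmp(); flags double ownership."""
        with self._lock:
            if self.owner is not None:
                raise RuntimeError("jmp_buf is already owned by another thread")
            self.owner = threading.get_ident()

    def release(self) -> None:
        """Called when the owning thread's readline() call returns."""
        with self._lock:
            if self.owner == threading.get_ident():
                self.owner = None

    def route_sigint(self, current: int) -> str:
        """Decision the SIGINT handler would make for thread `current`.

        No lock is taken here: a real signal handler may only use
        async-signal-safe operations.
        """
        owner = self.owner
        if owner is None:
            return "none"      # no readline() call in progress
        if current == owner:
            return "longjmp"   # only the owner may unwind its own stack
        return "redirect"      # the plan's pthread_kill(owner, SIGINT)
```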

msg16503 | Author: Guido van Rossum (gvanrossum) * | Date: 2004-05-12 14:11

Logged In: YES user_id=6380

Sounds good. This solves the problem in the readline module, where it originates.

BTW, if we can simplify things by only allowing readline() to be called from the main thread, that's fine with me. Doing console I/O from threads is insane anyway.

We can start by assuming the signal broadcast problem is restricted to IRIX, and configure appropriately: define a test symbol for this and, in configure, set it when IRIX is detected.

msg16504 | Author: Anthony Baxter (anthonybaxter) | Date: 2004-05-12 14:39

Logged In: YES user_id=29957

This seems like a pragmatic and sensible approach to take, to me. It should probably be tested on the HP/UX boxes (google for 'HP/UX testdrive').

I particularly like the idea of just putting a test in to block readline in the non-main thread. It seems the Pythonic approach: since we can't guarantee sane behaviour there, it seems like a plan. Or at least make it issue a warning saying "don't do this" when readline is invoked from a non-main thread.
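The "main thread only" rule could take the form of the hypothetical Python guard below. `call_readline_from_main_only` is an invented name, and `read_line` is whatever function performs the actual read. This version raises an error. Replacing the `raise` with `warnings.warn(...)` would give the softer warn-only variant.

```python
import threading


def call_readline_from_main_only(read_line):
    """Hypothetical guard enforcing 'readline() only from the main thread'.

    `read_line` is the function that actually reads a line (for example,
    the built-in input).
    """
    if threading.current_thread() is not threading.main_thread():
        raise RuntimeError("readline() may only be called from the main thread")
    return read_line()
```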

msg16505 | Author: Jason Lowe (jasonlowe) | Date: 2004-05-13 16:20

Logged In: YES user_id=56897

Argh! I thought we had a relatively clean solution, but it appears there's a stumbling block with the pthread_kill() approach. pthread_kill() is not guaranteed to be async-signal-safe, and nuggets found on the net indicate there's no portable way to redirect a process signal from one thread to another:

http://groups.google.com/groups?q=pthread_kill+async+safe&start=30&hl=en&lr=&selm=3662B6A8.861984D5%40zko.dec.com&rnum=32
http://www.redhat.com/archives/phil-list/2003-December/msg00049.html

Given that we can't safely call pthread_kill() from the SIGINT handler directly, there might be another way to solve our problems with pthread_atfork(). Here's my thinking:

- Block SIGINT in all threads except the main (readline) thread.
- Register a child-process handler via pthread_atfork() that sets the SIGINT action for the child process back to the default.

Unfortunately, this fix isn't localized to the readline module as desired, but it may solve the problems. SIGINT routing will be forced to the readline thread, and child processes won't have SIGINT blocked, solving bug 756940.

The IRIX thread signal delivery model (i.e., broadcast) may cause problems, since SIGINT may be pending when we attempt to set the action to default. Having SIGINT pending when the handler is changed to default would kill the child process. Maybe having the child process set the disposition to ignore and then to default would safely clear any pending SIGINT signal?

I'll try to run some experiments with the pthread_atfork() approach soon, but work and home life are particularly busy for me lately. Apologies in advance if it takes me a while to respond or submit patches.

If we're interested in a timely fix, would it be useful to break the fix up into two stages? I think we can all agree that the current approach of blocking ALL signals in created threads is a Bad Thing. What if we implement a quick, partial fix by simply changing the existing code to block only SIGINT? This should be a two-line change to thread_pthread.h, where `sigemptyset(&newmask); sigaddset(&newmask, SIGINT);` is used instead of `sigfillset(&newmask);`. I see this partial fix having a number of benefits:

- Easy change to make. No extra stuff to check for in configure, or calls to things that may not exist or work properly.
- Much less risky than trying to fix all the problems at once. The change only opens up signals to threads that Python 2.1 is already allowing through.
- Should solve the SIGSEGV zombie problem and Guido's SIGTERM annoyance, although it would still have the problem reported in bug 756940.
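Here is a Python 3 sketch of the proposed two-line change, assuming a POSIX system: only SIGINT is blocked around thread creation instead of every signal. `start_thread_blocking_sigint` is a hypothetical name. `SIG_BLOCK` adds SIGINT to whatever the creator already blocks, matching `SET_THREAD_SIGMASK(SIG_BLOCK, &newmask, &oldmask)`.

```python
import signal
import threading


def start_thread_blocking_sigint(target) -> threading.Thread:
    """Start `target` with SIGINT blocked and every other signal left alone:
    the sigemptyset(&newmask); sigaddset(&newmask, SIGINT); variant."""
    old = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        t = threading.Thread(target=target)
        t.start()  # inherits "creator's mask + SIGINT"
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old)  # restore the creator
    return t
```

SIGSEGV and SIGTERM stay deliverable in the new thread, while SIGINT still has to be handled by a thread that never blocked it. A child exec'd from such a thread would still inherit the SIGINT block, which is the bug 756940 caveat noted above.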

msg16506 | Author: Michael Hudson (mwh) | Date: 2004-05-13 16:25

Logged In: YES user_id=6656

Just to make life more entertaining, pthread_atfork isn't what you want, either.

http://mail.python.org/pipermail/python-dev/2003-December/041309.html

msg16507 | Author: Andrew Langmead (langmead) | Date: 2004-05-13 16:48

Logged In: YES user_id=119306

pthread_kill(): that is annoying; I have something nearly done that uses it. I didn't double-check the safety of pthread_kill. I saw that POSIX says kill is safe to call from signal handlers and went from there.

Can we note that we need a pthread_kill in a call to Py_AddPendingCall, and then handle it later?

msg16508 | Author: Jason Lowe (jasonlowe) | Date: 2004-05-13 17:25

Logged In: YES user_id=56897

There didn't seem to be an outcome from the python-dev discussion regarding system() and pthread_atfork(). The thread indicates that system() is indeed supposed to call atfork handlers, so RedHat 9 is violating the pthread standard in that sense. (Whether or not they'll fix it is another issue.) There's also mention that os.system() may be changed to not call system() because of the atfork() problem. If the changes to avoid system() are implemented, would the pthread_atfork() approach still be problematic?

As Martin Loewis points out, we could always implement the signal fixup in the child directly after the fork() if Python routines are being used to do the fork() in the first place. However, if we're concerned about native modules that call fork() directly, then it seems our choices are a pthread_atfork() approach or an approach where SIGINT isn't blocked. Without an async-signal-safe way to route a signal from one thread to another, I don't see how we can leave SIGINT unblocked in all threads.

Re: Py_AddPendingCall. That approach might work in many cases, but I assume it doesn't work well when all threads are currently busy in native modules that are not well-behaved. For example, say I have two threads: one in readline() and the other blocked in a native call that, like readline(), doesn't return control on EINTR. If the SIGINT is sent to the readline thread, the signal handler could check the thread ID and do the longjmp(), since we're the proper thread to do so. If the SIGINT is sent to the other thread, the callback added by Py_AddPendingCall() won't necessarily be processed any time soon, because no threads are going to return control (in a timely manner) to Python.

To make matters worse, apparently even something as simple as pthread_self(), which we'd use to get the thread ID, isn't async-signal-safe on all platforms. From what I've read, none of the pthread functions are guaranteed to be async-signal-safe. :-(
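For forks made through Python itself, the "fixup in the child directly after the fork()" idea corresponds to `os.register_at_fork()` (Python 3.7+). The sketch below assumes SIGINT was blocked in the forking thread, and `unblock_sigint_after_fork` is a hypothetical name. It unblocks SIGINT rather than only resetting its disposition, because the blocked mask is what a forked child inherits from the thread that called fork(). Per the `os` documentation, these hooks only run when control is expected to return to the interpreter, so a typical subprocess launch does not trigger them. Like pthread_atfork() with system(), this does not cover every path that creates a child process.

```python
import os
import signal


def unblock_sigint_after_fork() -> None:
    """Child-side fork hook: clear the SIGINT block inherited from the forking thread."""
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})


# Runs in the child after os.fork(); never runs in the parent.
os.register_at_fork(after_in_child=unblock_sigint_after_fork)
```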

msg16509 | Author: Andrew Langmead (langmead) | Date: 2004-05-20 12:16

Logged In: YES user_id=119306

I have an approach that makes readline work well with threads while still acknowledging KeyboardInterrupt. Using the alternate readline interface of rl_callback_handler_install() and rl_callback_read_char() along with a select(), we can recognize the interrupt signal when select() fails with EINTR, and not need the signal handler at all. I just need to try my patch on a few more systems and try to put Anthony at ease.
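The shape of that approach can be sketched in Python. The real change uses GNU readline's C callbacks, rl_callback_handler_install() and rl_callback_read_char(), which are not part of Python's `readline` module API. This sketch therefore substitutes plain byte reads for rl_callback_read_char(), and `read_line_callback_style` is a hypothetical name. The point is the control flow: the program waits in select() and consumes input a piece at a time, so no signal handler ever has to longjmp() out of a blocking library call. On Python 3.5+, select() is retried automatically after EINTR (PEP 475) unless a signal handler raises, so Ctrl-C surfaces here as an ordinary KeyboardInterrupt from the select() call.

```python
import os
import select


def read_line_callback_style(fd: int) -> str:
    """Read one line from `fd` by waiting in select() and consuming one byte at a time."""
    line = bytearray()
    while True:
        select.select([fd], [], [])   # block until readable (or a handler raises)
        ch = os.read(fd, 1)           # stand-in for rl_callback_read_char()
        if not ch or ch == b"\n":     # EOF or end of line
            return line.decode()
        line += ch
```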

msg16510 | Author: Michael Hudson (mwh) | Date: 2004-05-20 17:02

Logged In: YES user_id=6656

This sounds cool! The only thing to be aware of is readline versioning... are these alternate interfaces a recent thing?

msg16511 | Author: Andrew Langmead (langmead) | Date: 2004-05-20 18:38

Logged In: YES user_id=119306

The callback interface seems to have been added in readline 2.1, from 1997. There seem to be configure tests in the current Modules/readline.c code to search for features new to readline 2.1, so my current approach would raise the minimum readline version from 2.0 to 2.1. If needed, I could test for the callback interface and use it if available, but fall back to the readline() interface otherwise (and leave the thread and signal handling issues in place when used with an older readline).

msg16512 | Author: Dieter Maurer (dmaurer) | Date: 2004-05-26 11:28

Logged In: YES user_id=265829

The Python documentation currently asserts that signals are delivered only to the main thread. I think we can deviate from this assertion for signals that are not normally used by applications but are used by the OS to indicate abnormal execution conditions (like SIGSEGV, SIGBUS and friends). We should at least make sure that such abnormal conditions lead to a proper process shutdown -- as early as possible. I doubt that we should change the assertion for signals usually used by applications.

Patch 949332 seems to be an appropriate short-term solution until we come up with something better. I would really like it to land already in Python 2.3.4. I will apply it to our production Python environments because I am convinced that it will improve behaviour compared to the current state. I can report back should we see unexpected behaviour.

msg16513 | Author: Anthony Baxter (anthonybaxter) | Date: 2004-05-26 15:43

Logged In: YES user_id=29957

This will not be in 2.3.4, as I've already stated:

a) We've already cut the release candidate, and the final release is less than 12 hours away.
b) This is a high-risk patch -- anything in the area of threads and signals is very risky.

Here's how it should go forward. First, the trunk should be patched; this fix can then be in 2.4a1. Once we're through to 2.4 final, we'll know whether the fix is good enough for the 2.3 series. Finally, after 2.4 final comes out, there will be a 2.3.5. This fix can be in that, assuming it's fine in 2.4.

msg16514 | Author: Andrew Langmead (langmead) | Date: 2004-05-26 17:19

Logged In: YES user_id=119306

The patch in <https://sourceforge.net/tracker/?func=detail&aid=960406&group_id=5470&atid=305470> is probably a better change to consider than the LinuxThreads-specific workarounds of 948614 or the false synchronous/asynchronous dichotomy of 949332. (948614 was just looking at things in terms of LinuxThreads-specific quirks, and when comments started popping up about similar problems on HP/UX and others, it became apparent that a more general approach was needed. 949332 was a hope that a minimally intrusive variation might be more palatable for 2.3.4, but I can perfectly understand your reluctance.)

The new patch, 960406, seems to solve the problems Jason Lowe reported in <https://sourceforge.net/tracker/index.php?func=detail&aid=468347&group_id=5470&atid=305470> (at least in my testing under Linux, Solaris, Irix, HP/UX, and Tru64 Unix) but does not exhibit the problems described here.

msg16515 | Author: Georg Brandl (georg.brandl) * | Date: 2006-06-14 09:18

Logged In: YES user_id=849994

It seems that the patch at 960406 was applied and fixed this problem.

History

| Date | User | Action | Args |
|---|---|---|---|
| 2022-04-10 16:09:19 | admin | set | github: 38677 |
| 2003-06-18 23:28:33 | morngnstar | create | |