New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SocketServer doesn't handle syscall interruption #52226
Comments
SocketServer's handle_request function uses "select" call to handle io, but sending POSIX signal will result in 'Interrupted system call' exception raised. After that Paste (http://pythonpaste.org/) http server will crash. |
Wrapping select in (taken from twisted sources) can help: |
A patch to make select calls EINTR-safe is attached, but:
Thoughts ? |
What kind of signals can be received in real-life? Also, does your patch still allow Python to handle the signal properly (invoke a registered signal handler, or perhaps raise KeyboardInterrupt)? You could raise the issue on the python-dev mailing-list for more advice. |
In real life, you can receive for example SIGSTOP (strace, gdb, shell), but mostly SIGCHLD (any process spawning children), etc. The attached patch just restarts calls when EINTR is received, as is done in subprocess module. The mailing list is a good idea. |
There are lots of possible answers. Here's one. You launch a child process with os.fork() (and perhaps os.exec*() as well). Eventually you'll get a SIGCHLD in the parent when the child exits. If you've also installed a Python-level SIGCHLD handler with signal.signal (and not flagged the handler with SA_RESTART using signal.siginterrupt) then if you're in select.select() when the SIGCHLD is received, select.select() will fail with select.error EINTR. If your Python SIGCHLD handler deals with the SIGCHLD completely before returning, then this lets you support child processes and use SocketServer at the same time. Changing the behavior of select.select() ("a fix at select level would be better") should be *very* carefully considered, and probably rejected after such consideration. There are completely legitimate use cases which require select.select() to fail with EINTR, and these should remain supported (not to mention that existing code relying on this behavior shouldn't be broken by a Python upgrade in a quite insidious way). |
We use SIGUSR1 to reopen log files after rotation. sighandler works just fine, but after that Paste crashes. I suppose that implementing silent syscall restart at select.select() level is a bad idea (explanation is given already), but some helper like untilConcludes in socket module might be useful. |
First, let me cast my vote in favour of leaving select() itself alone, it's excessive to fix this issue with changing select(). Second, while exarkun's comments are accurate, I understand where pitrou is coming from. EINTR is not a pathological case, but it's not commonplace, either. I believe that on modern Unices (I tried Linux and OpenSolaris), the default handler for SIGCHLD will restart interrupted calls. By default, SIGUSR1 and most sane others, will kill the process anyway. So I can understand the purity of the argument that if a process sets a signal handler, it might just as well set SA_RESTART and be done with it. That said, Imagine I write a forking server with SocketServer which sets a signal handler for SIGCHLD, and I don't set SA_RESTART because I don't think I need to. I believe this is a valid real-world scenario where (a) I don't expect an EINTER-related exception to be raised from SocketServer, (b) SocketServer can actually do something about it, and (c) I doubt the old EINTR behaviour is relied upon by anyone (now I'm /certain/ someone relying on this will pop-up...). I think its a safer bet on behalf of SocketServer (and subprocess, etc) to use something like untilConcludes. If we agree so far, I believe that an implementation of untilConcludes *should* be added to stdlib ("signal.restartable_call", anyone?). If people agree, I'd be happy to produce the patch (trunk+py3k, doc+test). Other than SocketServer and subprocess, anything else that should use "untilConcludes"? Finally, to answer pitrou's second question, I'm pretty sure untilConcludes as presented will not hurt other signal handlers. The interrupt handler will be executed before returning to untilConcludes' loop and manually restarting the call, so KeyboardInterrupt (or anything else signalistic) will behave as expected. I'm attaching a toy 3.x program to play with select() and EINTR, I used it in writing this comment. Gee, it even has an OptionParser CLI to save you some commenting-out. |
wrapping select in an eintr handler that loops like SocketServer_eintr.diff is fine. |
According to "man 7 signal" select must be explicitely restarted, regardless of the SA_RESTART flag. |
Definitely, it's better to have this handler written once and correct than having various implementations scattered around libraries.
Go ahead :-)
man sigaction states this:
Provide behaviour compatible with BSD signal semantics by making _certain_ system calls restartable across signals. Indeed: you can't really rely on SA_RESTART, since certain syscalls are not restartable, and this flag isn't really portable. That's why it's definitely better to take the EINTR-wrapper approach. |
Ok, I think the wrapper approach is a good one. Now remains to decide whether it should stay private to socketserver or be exposed as a generic method (for example in the "os" module as proposed). By the way, adding a test would be nice, too. I'm not sure it what form it should be done; perhaps register a signal handler and trigger the signal from a sub-thread while the main thread is iterating in a select loop? |
I vote for making a global wrapper. As neologix pointed - many other modules can be (or are) affected. Futher found bugs could be fixed using that global wrapper. |
Note that a trivial untilConcludes isn't correct for select if a timeout was passed. If a select(..., 60) was interrupted after 59 seconds, you probably want to restart it with a timeout of 1 second, not 60. The SocketServer_eintr.diff patch has this flaw. |
pitrou, re. test code: I actually started with the test code, so that part is done. I opted to use a forked SocketServer rather than threads. I'm not an expert on the low-level details of a multi-threaded process receiving threads, but it seems to me like there's a chicken and egg issue here: suppose we have threadA (blocking on select()) and threadB (managing the test and doing something like os.kill(os.getpid(), SIGFOO). I think by definition the thread running when the signal is received is threadB (since it sent the signal), and I'm not sure threadA will be interrupted properly that way (threads are sneaky, now aren't they). So I cleaned up test_socketserver.py a bit, split it to a base test case and concrete test cases, and added a separate test case which uses os.fork() rather than threads and tests that way. The attached patch is against recent py3k and includes both neologix's patch (adapted to py3k) and the added test code. If this looks good, I'll easily backport it to 2.x/trunk. spiv, re. handling timeout in select: You're very right, and I'm not sure how we should handle it if we want to take the "untilConcludes" code that is currently shared between subprocess.py and socketserver.py and move it outside. For subprocess' and socketserver's specific cases, the naive implementation which doesn't handle timeouts is OK, because they don't use the timeout parameter anyway. Moving it outside warrants more discussion. Remember this isn't just about select() - it's a general user-space-restarter for any of the myriad of "slow" system calls that might be interrupted. There is some tension between elegance and even feasibility of correct implementation vs. the wrapper hiding functionality of some of these system calls. You could argue that if the user of the wrapper is unhappy about the lost functionality, they can make their own wrapper. Unless we can reach a resolution quickly, my gut feeling is that the current patch should be applied as-is and this issue be closed. A new issue along the lines of "Extract duplicate interruption-restart code to a relevant module and review stdlib for unprotected select() calls" should be opened instead. Your thoughts? |
I encountered this issue when trying to exit cleanly on SIGTERM, which I use to terminate background daemons running serve_forever. In BaseServer, a threading.Event is used in shutdown, so it can block until server_forever is finished (after checking __serving). Since the SIGTERM interrupts the select system call, the event set is never reached, and shutdown hangs waiting on the event. I've attached an example of the pattern I was trying to use in my server. There are several ways around the issue, but looking at the API it seems like this _should_ work, and in my experience all servers have clean-up code so it's a very common case. |
This precise use case is already fixed (in SVN trunk and in the 2.6 |
While bda's usecase is indeed fixed is recent versions, I want to point out the issue still exists in a recent py3k checkout. I've updated my patch to apply cleanly against recent py3k (and made less structural changes to test_socketserver.py; I still think it has an archaic unittest structure but I want this patch to remain relevant a bit longer, maybe fix the structure at a later time). Again, I argue that resolving the general case (a generic untilConcludes to be exposed in stdlib) is hard and that we have a patch review worthy (IMO) solution at hand. Let's apply it (and possibly open a separate ticket for the generic case, if that's deemed interesting). |
Any news on this? Could we possibly apply patch as is? If I'm not mistaken timeout issue is the only one left unresolved. About timeout. The most elegant way, would be to use select's syscall timeout parameter, but man 2 select says: "On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1-2001 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns." I'm ready to adapt Yaniv's patch to 2.7 |
I've added some code comments on http://bugs.python.org/review/7978/show overall I think the patch is right, I pointed out one thing to clean up and I think the unittest can be greatly simplified by using stubbed out mock select.select() instead of fork+signals+sleeping. |
Using signalfd() (require Linux 2.6.22+), specified signals are written in a file and don't interrupt system calls (select). Bonus: we can wait for a signal using select! Using pthread_sigmask(), you can also block signals before calling select, but it's not atomic. Or you can use pselect() which is pthread_sigmask()+select() in an atomic fashion. pthread_sigmask() is already in Python 3.3, I plan to add signalfd() in a few weeks into Python 3.3. For pselect(), I don't know if we need it. pthread_sigmask() is available on most POSIX systems, not on Windows. Can select() be interrupted by CTRL+c on Windows? I don't say that we should not restart manually select() on EINTR, just that better solutions do exist today. |
I think that bpo-12224 is a duplicate of this issue. |
I don't think we should block signals - we can sleep on select for about forever. |
Jerzy's latest patch looks ok to me. |
New changeset 5493d44b56d8 by Antoine Pitrou in branch '3.2': New changeset 97a0c6230ece by Antoine Pitrou in branch 'default': |
New changeset d941d1fcc6e6 by Antoine Pitrou in branch '2.7': |
You must recompute the timeout when select() is interrupted: see issue bpo-12338. |
Ah, right. It's a minor issue, though. I would suggest opening a separate issue for it. The patch is now committed to all 3 branches (I'm finally convinced this is a bug worth fixing), so I'm closing this issue. |
I forgot to mention my patch is 3.3-only, sorry - it depends on changes from bpo-12555 (http://hg.python.org/cpython/rev/41a1de81ef2b#l18.21 to be precise). To support 3.2 and 2.7: Regarding not updating the timeout, it was already mentioned above. Though as an afterthought, it might be worrying that if the process is receiving repeated signals with interval between them less than timeout, we might fall into infinite loop of select() when it should timeout, but that is probably very obscure case. |
New changeset f8e7fcd581ff by Antoine Pitrou in branch '3.2': New changeset 4298d6e79ecb by Antoine Pitrou in branch '2.7': |
It shouldn't be necessary, I've just committed fixes. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: