Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SocketServer doesn't handle syscall interruption #52226

Closed
nvetoshkin mannequin opened this issue Feb 21, 2010 · 32 comments
Closed

SocketServer doesn't handle syscall interruption #52226

nvetoshkin mannequin opened this issue Feb 21, 2010 · 32 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@nvetoshkin
Copy link
Mannequin

nvetoshkin mannequin commented Feb 21, 2010

BPO 7978
Nosy @gpshead, @pitrou, @vstinner, @giampaolo
Files
  • SocketServer_eintr.diff: handle EINTR patch
  • eintr_safety.py: Toy 3.x program to play with select() and EINTR
  • sockServe.py: example
  • socketserver_eintr_py3k_updated.diff
  • socketserver_eintr_20120406.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-04-08.23:25:19.590>
    created_at = <Date 2010-02-21.19:14:53.658>
    labels = ['type-bug', 'library']
    title = "SocketServer doesn't handle syscall interruption"
    updated_at = <Date 2012-04-08.23:48:11.811>
    user = 'https://bugs.python.org/nvetoshkin'

    bugs.python.org fields:

    activity = <Date 2012-04-08.23:48:11.811>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-04-08.23:25:19.590>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2010-02-21.19:14:53.658>
    creator = 'nvetoshkin'
    dependencies = []
    files = ['16755', '16813', '17120', '17314', '25144']
    hgrepos = []
    issue_num = 7978
    keywords = ['patch']
    message_count = 32.0
    messages = ['99676', '100704', '102331', '102333', '102336', '102337', '102363', '102591', '102592', '102594', '102596', '102660', '102680', '102690', '102704', '104455', '104459', '105617', '120898', '125234', '137361', '137362', '137414', '157695', '157725', '157814', '157815', '157817', '157818', '157822', '157824', '157825']
    nosy_count = 14.0
    nosy_names = ['gregory.p.smith', 'spiv', 'exarkun', 'pitrou', 'vstinner', 'giampaolo.rodola', 'Christophe Simonis', 'nvetoshkin', 'neologix', 'Yaniv.Aknin', 'bda', 'kchen', 'Jerzy.Kozera', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue7978'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Feb 21, 2010

    SocketServer's handle_request function uses "select" call to handle io, but sending POSIX signal will result in 'Interrupted system call' exception raised. After that Paste (http://pythonpaste.org/) http server will crash.
    I suppose EINTR must be handled properly (i.e. syscall must be restarted silently) on SocketServer's side. That must be pretty easy task.

    @nvetoshkin nvetoshkin mannequin added type-crash A hard crash of the interpreter, possibly with a core dump stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Feb 21, 2010
    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Mar 9, 2010

    Wrapping select in (taken from twisted sources) can help:
    def untilConcludes(f, *a, **kw):
    while True:
    try:
    return f(*a, **kw)
    except (IOError, OSError), e:
    if e.args[0] == errno.EINTR:
    continue
    raise

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 4, 2010

    A patch to make select calls EINTR-safe is attached, but:

    • there are many modules that suffer from the same problem, so maybe a fix at select level would be better
    • if not, it could be a good idea to add this EINTR-retry handler to a given module (signal, os?) so that every module doesn't have to reimplement it (I know that at least subprocess does have one).

    Thoughts ?

    @pitrou
    Copy link
    Member

    pitrou commented Apr 4, 2010

    What kind of signals can be received in real-life?
    (I'm assuming that most signals are of the kind that you only receive if someone else deliberately sends it to you, in which case you are supposed to be prepared to handle it)

    Also, does your patch still allow Python to handle the signal properly (invoke a registered signal handler, or perhaps raise KeyboardInterrupt)?

    You could raise the issue on the python-dev mailing-list for more advice.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 4, 2010

    In real life, you can receive for example SIGSTOP (strace, gdb, shell), but mostly SIGCHLD (any process spawning children), etc. The attached patch just restarts calls when EINTR is received, as is done in subprocess module. The mailing list is a good idea.

    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Apr 4, 2010

    What kind of signals can be received in real-life?

    There are lots of possible answers. Here's one. You launch a child process with os.fork() (and perhaps os.exec*() as well). Eventually you'll get a SIGCHLD in the parent when the child exits. If you've also installed a Python-level SIGCHLD handler with signal.signal (and not flagged the handler with SA_RESTART using signal.siginterrupt) then if you're in select.select() when the SIGCHLD is received, select.select() will fail with select.error EINTR.

    If your Python SIGCHLD handler deals with the SIGCHLD completely before returning, then this lets you support child processes and use SocketServer at the same time.

    Changing the behavior of select.select() ("a fix at select level would be better") should be *very* carefully considered, and probably rejected after such consideration. There are completely legitimate use cases which require select.select() to fail with EINTR, and these should remain supported (not to mention that existing code relying on this behavior shouldn't be broken by a Python upgrade in a quite insidious way).

    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Apr 5, 2010

    What kind of signals can be received in real-life?

    We use SIGUSR1 to reopen log files after rotation. sighandler works just fine, but after that Paste crashes.

    I suppose that implementing silent syscall restart at select.select() level is a bad idea (explanation is given already), but some helper like untilConcludes in socket module might be useful.

    @YanivAknin
    Copy link
    Mannequin

    YanivAknin mannequin commented Apr 8, 2010

    First, let me cast my vote in favour of leaving select() itself alone, it's excessive to fix this issue with changing select().

    Second, while exarkun's comments are accurate, I understand where pitrou is coming from. EINTR is not a pathological case, but it's not commonplace, either. I believe that on modern Unices (I tried Linux and OpenSolaris), the default handler for SIGCHLD will restart interrupted calls. By default, SIGUSR1 and most sane others, will kill the process anyway. So I can understand the purity of the argument that if a process sets a signal handler, it might just as well set SA_RESTART and be done with it.

    That said, Imagine I write a forking server with SocketServer which sets a signal handler for SIGCHLD, and I don't set SA_RESTART because I don't think I need to. I believe this is a valid real-world scenario where (a) I don't expect an EINTER-related exception to be raised from SocketServer, (b) SocketServer can actually do something about it, and (c) I doubt the old EINTR behaviour is relied upon by anyone (now I'm /certain/ someone relying on this will pop-up...). I think its a safer bet on behalf of SocketServer (and subprocess, etc) to use something like untilConcludes.

    If we agree so far, I believe that an implementation of untilConcludes *should* be added to stdlib ("signal.restartable_call", anyone?). If people agree, I'd be happy to produce the patch (trunk+py3k, doc+test). Other than SocketServer and subprocess, anything else that should use "untilConcludes"?

    Finally, to answer pitrou's second question, I'm pretty sure untilConcludes as presented will not hurt other signal handlers. The interrupt handler will be executed before returning to untilConcludes' loop and manually restarting the call, so KeyboardInterrupt (or anything else signalistic) will behave as expected.

    I'm attaching a toy 3.x program to play with select() and EINTR, I used it in writing this comment. Gee, it even has an OptionParser CLI to save you some commenting-out.

    @gpshead
    Copy link
    Member

    gpshead commented Apr 8, 2010

    wrapping select in an eintr handler that loops like SocketServer_eintr.diff is fine.

    @gpshead gpshead self-assigned this Apr 8, 2010
    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Apr 8, 2010

    According to "man 7 signal" select must be explicitely restarted, regardless of the SA_RESTART flag.

    @neologix
    Copy link
    Mannequin

    neologix mannequin commented Apr 8, 2010

    If we agree so far, I believe that an implementation of untilConcludes *should* be added to stdlib ("signal.restartable_call", anyone?).

    Definitely, it's better to have this handler written once and correct than having various implementations scattered around libraries.

    If people agree, I'd be happy to produce the patch (trunk+py3k, doc+test).

    Go ahead :-)

    According to "man 7 signal" select must be explicitely restarted, regardless of the SA_RESTART flag.

    man sigaction states this:

    SA_RESTART 
    

    Provide behaviour compatible with BSD signal semantics by making _certain_ system calls restartable across signals.

    Indeed: you can't really rely on SA_RESTART, since certain syscalls are not restartable, and this flag isn't really portable.
    select is even specific since the file descriptor sets can be modified, and the timeout too.

    That's why it's definitely better to take the EINTR-wrapper approach.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 9, 2010

    Ok, I think the wrapper approach is a good one. Now remains to decide whether it should stay private to socketserver or be exposed as a generic method (for example in the "os" module as proposed).

    By the way, adding a test would be nice, too. I'm not sure it what form it should be done; perhaps register a signal handler and trigger the signal from a sub-thread while the main thread is iterating in a select loop?

    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Apr 9, 2010

    I vote for making a global wrapper. As neologix pointed - many other modules can be (or are) affected. Futher found bugs could be fixed using that global wrapper.

    @spiv
    Copy link
    Mannequin

    spiv mannequin commented Apr 9, 2010

    Note that a trivial untilConcludes isn't correct for select if a timeout was passed. If a select(..., 60) was interrupted after 59 seconds, you probably want to restart it with a timeout of 1 second, not 60.

    The SocketServer_eintr.diff patch has this flaw.

    @YanivAknin
    Copy link
    Mannequin

    YanivAknin mannequin commented Apr 9, 2010

    pitrou, re. test code: I actually started with the test code, so that part is done.

    I opted to use a forked SocketServer rather than threads. I'm not an expert on the low-level details of a multi-threaded process receiving threads, but it seems to me like there's a chicken and egg issue here: suppose we have threadA (blocking on select()) and threadB (managing the test and doing something like os.kill(os.getpid(), SIGFOO). I think by definition the thread running when the signal is received is threadB (since it sent the signal), and I'm not sure threadA will be interrupted properly that way (threads are sneaky, now aren't they). So I cleaned up test_socketserver.py a bit, split it to a base test case and concrete test cases, and added a separate test case which uses os.fork() rather than threads and tests that way.

    The attached patch is against recent py3k and includes both neologix's patch (adapted to py3k) and the added test code. If this looks good, I'll easily backport it to 2.x/trunk.

    spiv, re. handling timeout in select: You're very right, and I'm not sure how we should handle it if we want to take the "untilConcludes" code that is currently shared between subprocess.py and socketserver.py and move it outside. For subprocess' and socketserver's specific cases, the naive implementation which doesn't handle timeouts is OK, because they don't use the timeout parameter anyway.

    Moving it outside warrants more discussion. Remember this isn't just about select() - it's a general user-space-restarter for any of the myriad of "slow" system calls that might be interrupted. There is some tension between elegance and even feasibility of correct implementation vs. the wrapper hiding functionality of some of these system calls. You could argue that if the user of the wrapper is unhappy about the lost functionality, they can make their own wrapper.

    Unless we can reach a resolution quickly, my gut feeling is that the current patch should be applied as-is and this issue be closed. A new issue along the lines of "Extract duplicate interruption-restart code to a relevant module and review stdlib for unprotected select() calls" should be opened instead. Your thoughts?

    @bda
    Copy link
    Mannequin

    bda mannequin commented Apr 28, 2010

    I encountered this issue when trying to exit cleanly on SIGTERM, which I use to terminate background daemons running serve_forever.

    In BaseServer, a threading.Event is used in shutdown, so it can block until server_forever is finished (after checking __serving). Since the SIGTERM interrupts the select system call, the event set is never reached, and shutdown hangs waiting on the event.

    I've attached an example of the pattern I was trying to use in my server. There are several ways around the issue, but looking at the API it seems like this _should_ work, and in my experience all servers have clean-up code so it's a very common case.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 28, 2010

    In BaseServer, a threading.Event is used in shutdown, so it can block
    until server_forever is finished (after checking __serving). Since the
    SIGTERM interrupts the select system call, the event set is never
    reached, and shutdown hangs waiting on the event.

    This precise use case is already fixed (in SVN trunk and in the 2.6
    branch) since the select() loop is now wrapped in a try..finally.
    I just ran your test case and killing -TERM works ok.

    @YanivAknin
    Copy link
    Mannequin

    YanivAknin mannequin commented May 13, 2010

    While bda's usecase is indeed fixed is recent versions, I want to point out the issue still exists in a recent py3k checkout. I've updated my patch to apply cleanly against recent py3k (and made less structural changes to test_socketserver.py; I still think it has an archaic unittest structure but I want this patch to remain relevant a bit longer, maybe fix the structure at a later time).

    Again, I argue that resolving the general case (a generic untilConcludes to be exposed in stdlib) is hard and that we have a patch review worthy (IMO) solution at hand. Let's apply it (and possibly open a separate ticket for the generic case, if that's deemed interesting).

    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Nov 9, 2010

    Any news on this? Could we possibly apply patch as is? If I'm not mistaken timeout issue is the only one left unresolved.

    About timeout. The most elegant way, would be to use select's syscall timeout parameter, but man 2 select says: "On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1-2001 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. Consider timeout to be undefined after select() returns."

    I'm ready to adapt Yaniv's patch to 2.7

    @gpshead
    Copy link
    Member

    gpshead commented Jan 3, 2011

    I've added some code comments on http://bugs.python.org/review/7978/show

    overall I think the patch is right, I pointed out one thing to clean up and I think the unittest can be greatly simplified by using stubbed out mock select.select() instead of fork+signals+sleeping.

    @vstinner
    Copy link
    Member

    Using signalfd() (require Linux 2.6.22+), specified signals are written in a file and don't interrupt system calls (select). Bonus: we can wait for a signal using select!

    Using pthread_sigmask(), you can also block signals before calling select, but it's not atomic. Or you can use pselect() which is pthread_sigmask()+select() in an atomic fashion.

    pthread_sigmask() is already in Python 3.3, I plan to add signalfd() in a few weeks into Python 3.3. For pselect(), I don't know if we need it.

    pthread_sigmask() is available on most POSIX systems, not on Windows. Can select() be interrupted by CTRL+c on Windows?

    I don't say that we should not restart manually select() on EINTR, just that better solutions do exist today.

    @vstinner
    Copy link
    Member

    I think that bpo-12224 is a duplicate of this issue.

    @nvetoshkin
    Copy link
    Mannequin Author

    nvetoshkin mannequin commented Jun 1, 2011

    I don't think we should block signals - we can sleep on select for about forever.

    @JerzyKozera
    Copy link
    Mannequin

    JerzyKozera mannequin commented Apr 6, 2012

    I've updated the patch according to suggestions from Gregory P. Smith. Thanks to a change from bpo-12555 (PEP-3151) now just checking for OSError is enough.

    (I've decided to use mocked select() instead of calling alarm() to avoid depending on timing.)

    @pitrou
    Copy link
    Member

    pitrou commented Apr 7, 2012

    Jerzy's latest patch looks ok to me.
    This is a slight behaviour change so I'm not sure it should go in 3.2/2.7.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 8, 2012

    New changeset 5493d44b56d8 by Antoine Pitrou in branch '3.2':
    Issue bpo-7978: socketserver now restarts the select() call when EINTR is returned.
    http://hg.python.org/cpython/rev/5493d44b56d8

    New changeset 97a0c6230ece by Antoine Pitrou in branch 'default':
    Issue bpo-7978: socketserver now restarts the select() call when EINTR is returned.
    http://hg.python.org/cpython/rev/97a0c6230ece

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 8, 2012

    New changeset d941d1fcc6e6 by Antoine Pitrou in branch '2.7':
    Issue bpo-7978: socketserver now restarts the select() call when EINTR is returned.
    http://hg.python.org/cpython/rev/d941d1fcc6e6

    @vstinner
    Copy link
    Member

    vstinner commented Apr 8, 2012

    You must recompute the timeout when select() is interrupted: see issue bpo-12338.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 8, 2012

    You must recompute the timeout when select() is interrupted: see issue
    bpo-12338.

    Ah, right. It's a minor issue, though. I would suggest opening a separate issue for it. The patch is now committed to all 3 branches (I'm finally convinced this is a bug worth fixing), so I'm closing this issue.

    @pitrou pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 8, 2012
    @pitrou pitrou closed this as completed Apr 8, 2012
    @pitrou pitrou added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Apr 8, 2012
    @JerzyKozera
    Copy link
    Mannequin

    JerzyKozera mannequin commented Apr 8, 2012

    I forgot to mention my patch is 3.3-only, sorry - it depends on changes from bpo-12555 (http://hg.python.org/cpython/rev/41a1de81ef2b#l18.21 to be precise). To support 3.2 and 2.7:
    (1) select.error must be caught as in the original patch,
    (2) e.args[0] must be used - select.error doesn't have 'errno' attribute.
    Should I prepare the patch for 3.2 and 2.7?

    Regarding not updating the timeout, it was already mentioned above. Though as an afterthought, it might be worrying that if the process is receiving repeated signals with interval between them less than timeout, we might fall into infinite loop of select() when it should timeout, but that is probably very obscure case.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Apr 8, 2012

    New changeset f8e7fcd581ff by Antoine Pitrou in branch '3.2':
    Fix the patch for issue bpo-7978: select() raises select.error before 3.3, not OSError.
    http://hg.python.org/cpython/rev/f8e7fcd581ff

    New changeset 4298d6e79ecb by Antoine Pitrou in branch '2.7':
    Fix the patch for issue bpo-7978: select() raises select.error before 3.3, not OSError.
    http://hg.python.org/cpython/rev/4298d6e79ecb

    @pitrou
    Copy link
    Member

    pitrou commented Apr 8, 2012

    To support 3.2 and 2.7:
    (1) select.error must be caught as in the original patch,
    (2) e.args[0] must be used - select.error doesn't have 'errno'
    attribute.
    Should I prepare the patch for 3.2 and 2.7?

    It shouldn't be necessary, I've just committed fixes.
    Thanks for noticing!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants