classification
Title: Fix EINTR Socket Module issues in 2.7
Type: behavior Stage: resolved
Components: Versions: Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, gregory.p.smith, koobs, mcjeff, r.david.murray, rpcope1, vstinner
Priority: normal Keywords: needs review, patch

Created on 2015-04-04 02:51 by mcjeff, last changed 2017-09-06 22:45 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
socket_intr.py mcjeff, 2015-04-04 02:51
socket_eintr.patch mcjeff, 2015-04-04 02:58 review
socket_eintr.1.patch mcjeff, 2015-04-04 16:37 review
socket_eintr.2.patch mcjeff, 2015-04-08 15:24 review
connect_eintr-py27.patch vstinner, 2015-04-09 08:32 review
test_connect_eintr4.py vstinner, 2015-04-09 08:32
socket_eintr.3.patch mcjeff, 2015-04-09 16:19 review
socket_eintr.4.patch mcjeff, 2015-04-10 18:08 review
socket_eintr.5.patch mcjeff, 2015-04-13 18:12 review
Messages (30)
msg240044 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-04 02:51
There are a collection of places in the socket module that do not correctly retry on EINTR.  Updated to wrap those calls in a retry loop.  However, when fixing connect calls, I noticed that when EINTR is retried on a socket with a timeout specified, the retry fails with EALREADY.. so I fixed that.

I was going to shy away from primitive calls on sockets as one expects these things when working at a lower level, however, due to the way socket timeouts were implemented, I handled it differently in internal_connect.  The create_connection calls probably ought to shield users from retry.

Python 2.7.6.
msg240045 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-04 02:58
mcjeff@mcjeff:~/cpython_clean$ hg summary
parent: 95416:fe34dfea16b0 
 Escaped backslashes in docstrings.
branch: 2.7
commit: 3 modified, 3 unknown
update: (current)
msg240083 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-04 16:37
Whoops. Accidentally attached the wrong patch that I generated during testing.
msg240093 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-04 21:54
I have a very good news for you: this issue and more generally all EINTR
issues will be solved in Python 3.5. See the PEP 475.

I'm not really interested to fix Python 2.7.
msg240094 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-04-04 22:04
You may not be, but I am. :). Jeff is aware of PEP 475.

Thanks for the awesome work on the real cleanup of this stuff in 3.5.
Sanity at last.
msg240234 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-04-07 22:48
I like the socketmodule.c part of socket_eintr.1.patch, but it appears to still have the issue haypo describes in https://bugs.python.org/issue20611#msg240194 where connect() cannot be called more than once.  The kernel carries on with the connect without our process waiting for it, we have to monitor it to know when it has succeeded or failed.

See https://hg.python.org/cpython/file/85a5265909cb/Modules/socketmodule.c#l2610 and issue23618 from Python 3.5.

[code review done at 35,000ft, thanks chromebook gogo free wifi pass!]
msg240272 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-08 15:24
Missed check on _ex func.
msg240273 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-08 15:26
So, yeah, that's right. In the attached patch, I'm closing the file descriptor if the timeout/error happens on a non-blocking call.  It fails with an EBADF on reconnect at that point, but it doesn't potentially leave an FD in the proc's file table.

Should be no more EINTR's coming out of the select call.
msg240310 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-09 08:32
Ok, since you look to want to fix Python 2.7, I can help you to handle EINTR in socket.connect() since I fixed Python 3.5.

If Python 2.7 is fixed, Python 3.4 should also be fixed.

connect_eintr-py27.patch: Patch for Python 2.7 to handle EINTR in socket.connect() for blocking socket or socket with a timeout. The patch calls select() to wait until the connection completes or fails.

Use attached test_connect_eintr4.py to test it.

I only tested my patch on Linux yet.
msg240313 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-09 08:44
socket_eintr.2.patch has an issue with timeout.

socket_eintr.2.patch retries a socket method when it is interrupted, but it doesn't recompute the timeout. If a program is interrupted every second by a signal, the signal handler doesn't raise an exception and the socket has a timeout longer than 1 second, the socket method will "hang".

Internally, socket method uses select() using the socket timeout. I had to modify deeply the socket module to handle correctly EINTR and respect the timeout in Python 3.5. By the way, I changed the behaviour of socket.sendall(), the timeout is now more strict (it's now the maximum total duration, the timeout is no reset each time some bytes are sent).

Since Python applications of the last 20 years already had to handle EINTR theirself, it's maybe safer to leave Python 2.7 with its bugs and let application developers workaround them? Anyway, if you have to handle EINTR in your application, you will have to handle EINTR explicitly to support Python 2.7.9 and older, no? It's hard to require an exact minor version on all platforms.

The PEP 475 is wider than just the socket module:
https://docs.python.org/dev/whatsnew/3.5.html#changes-in-the-python-api

Impacted Python modules: io, os, select, socket, time. (Python 2.7 don't have faulthandler and don't have signal.sigtimedwait() nor signal.sigwaitinfo() functions, so faulthandler and signal are not impacted in Python 2.7.)

Do you want to fix all these modules?
msg240347 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-09 16:19
Updated to recalculate timeout at Python level.  The current module already works this way on recv() calls. See attached.

I'd be happy to churn through and fix the other modules (using the 3.5 work as a guide), though I think only addressing the higher level abstractions makes sense (I think that's been noted elsewhere). For example, the _fileobject wrappers, but not the recv from sock_recv_guts.
msg240382 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-09 20:34
> I'd be happy to churn through and fix the other modules (using the 3.5 work as a guide),

It is risky to modify so much code. The PEP 475 also has an impact on backward compatibility:
https://www.python.org/dev/peps/pep-0475/#backward-compatibility

IMO it must be discussed on python-dev.
msg240407 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-10 03:04
Sure, to be clear --- the intention wouldn't be to create any backwards incompatible changes.  Rather, as a means to identify what needs to be addressed.
msg240441 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-10 18:08
I went and added the timeout modification per loop and the docs. The thought crossed my mind initially (as did adding a new field to the socket structure), but I wasn't sure what the implications might be for use cases I' not aware of.
msg240597 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-13 07:44
With socket_eintr.4.patch, socket methods become non-thread safe. I'm not sure that it's a serious issue, but you have to consider this side effect of your change.

(Modify socketmodule.c instead would allow to handle EINTR, recompute timeout and keep the thread safety, but it requires a larger patch.)
msg240692 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-13 18:12
Added a flag to allow for at least one run -- I know nothing of non-Linux clock resolution.  That should handle that.

As for the thread safety of the socket timeouts, yeah, that was why I didn't do that initially, I assumed the suggestion to take that approach took the risk into account; you'll know far more about potential impact than I will.

Since this is at a higher abstraction than socket primitives, another option would be to track remaining time in thread local data so that we don't mutate the timeout on the object (which I don't really like doing anyway). 

Thoughts on approach before I put it together?
msg240698 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-13 18:25
Actually, never mind that suggestion. Now that I think a bit more about it, that actually doesn't do anything since I'd still need to set the updated timeout on the current socket object anyway. Whoops. 

I'll leave it up to you as to whether we go with an approach like this as is or not. I'm happy to change the approach if there's a better one.
msg240860 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-14 08:50
@Benjamin: since you are the maintainer of Python 2.7, I would like your opinion on this issue.

Python older than 3.5 does not handle EINTR for you: if you register a signal handler for a signal, the signal handler does not raise an exception, and a syscall is interrupted by a signal: Python raises an exception with the EINTR error code (it may be an OSError, IOError, socket.error, etc.). Well, the longer rationale is in the PEP 475.

Basically, Jeff McNeil wants to implement the PEP 475 in Python 2.7. I'm opposed to that because I contributed to the implementation in Python 3.5 and it changed a *lot* of code, the implementation was only made possible with previous changes done in Python 3.3 and 3.4 like the introduction of time.monotonic() in Python, a monotonic clock in C (the new _PyTime_t API and the _PyTime_GetMonotonicClock()) and refactoring of code.

Retrying socket methods doesn't simply mean writing a loop: you must respect the timeout, which is more complex. Socket methods in Python are implemented internally with select(). select() can fail with EINTR, as socket functions.

Jeff wrote a function to retry on EINTR which recomputes the timeout: socket_eintr.5.patch. Problem: his patch calls socket.settimeout() which is not thread safe. IMO the correct fix is to implement the PEP 475 in socketmodule.c directly, because it makes possible to use a variable for the timeout (no need to call settimeout()). (Reimplementing select() in socket.py to wait until the socket becomes readable/writable doesn't sound right to me; and it would add a new dependency from the socket module to the select module.)

Modifying the C code is more complex: see for example my patch connect_eintr-py27.patch to handle EINTR in socket.connect() and recompute the timeout. IMO socket.connect() requires the most complex code to handle EINTR and recompute the timeout, you have to combine connect(), select() and getsockopt(). (So it's maybe not the best example.) In Python 3.5, it took me ~10 commits to refactor socketmodule.c to be able to add a new sock_call() function (and a few more commits to fix bugs :-))

Since Python didn't handle EINTR before Python 3.5 and only a very few people complained, I don't think that it's worth to take the risk on introduction regressions. Anyway, Python 2 is dead, Python 3 is the place where large bugfixes are implemented (and where new features are implemented).

The PEP 475 is much larger than the socket module. In Python 2.7, io, os, select, socket and time are also impacted for example.

I suggest to close this issue as WONTFIX.

I'm not sure that only fixing socket.connect() is useful. (Apply connect_eintr-py27.patch which is based on my fix for Python 3.5.)
msg240879 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-14 13:18
I'm not a big fan of the settimeout approach myself (and only did it because it was mentioned as a possible approach).  I think the existing implementations of EINTR retry also suffer from the same issue in which the timeout isn't adjusted per iteration (but that's okay, see below).

The _fileobject explicitly states only to use a blocking socket (i.e. one without a timeout set), so in practice, that shouldn't be a problem.  I'd like to ensure the rest of the calls in that class take the same approach (thus the retryable call function, originally without the settimeout code) as they're a higher level abstraction above recv/send. 

The only other call in socket.py that also qualifies as a higher abstraction is create_connection.   If we could apply the 2.7 patch you created, connect ought to be correct at that point.  All that remains after that would be isolating _retryable_call to _fileobject calls -- sans the settimeout -- which requires a blocking socket anyway. In retrospect, I probably should have just placed that call in _fileobject anyway. 

I think that addresses most of what I'd like to fix.  Of course, I'm happy to go through and weave PEP 475 into the socketmodule.c code entirely, but if the code churn creates too much worry, I think the above is a good medium.
msg241646 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-20 13:39
Is there a bug being fixed here? I mean other than socket not handling EINTR, where I think we agree that handling it is a feature, given the PEP.
msg241658 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-20 15:16
See also issue #20611 "socket.create_connection() doesn't handle EINTR properly" which was closed as duplicated of this issue.

I'm not sure that #20611 is a duplicate.
msg241742 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-04-21 20:22
Fixing the underlying connect call should also address EINTR failing with a "operation already in progress" when a connect+timeout fails due to a signal. 

I can understand not addressing EINTR generically (though it is already partially addressed in 2.7's socket.py - this just completes it), but IMO, not handling it on connect & responding with a seemingly unrelated error is the wrong thing to do.
msg241766 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-04-22 03:25
... Code review:

In socket_eintr.5.patch I don't think the thread safety concerns about the s.settimeout() calls being done from Python code should be an issue but I'll ponder that for a bit.  When you've got a socket s, why would code ever be calling s.recv() in one thread while calling another method on the same socket s in another thread?  That seems unwise.  (For UDP it might be a valid thing to do but it still has an odd smell to it)

If you're worried about timer resolution, just check that the new timeout value "deadline - now" is less than the previous timeout value and if not, ensure the value decreases by some small amount. So long as it continually goes down, it will eventually timeout. Even if the duration isn't accurate in the face of a bunch of EINTRs, this is a degenerate case; it should still be reasonable behavior.

On POSIX systems using os.times()[4] rather than an absolute time.time(), which can be changed out from underneath the process (even though that is rare on functioning systems), could be useful. But conditionally using that seems overly complex.  I wouldn't bother.

... Side discussion

Some things in 2.7 related to EINTR have already been fixed in the past few years such as data loss in readline() and IIRC writelines().  Correctness isn't a feature.

Do you consider it an API change to prevent an error that nobody relies on or even expects (as PEP 475 notes) and virtually no code is written to properly handle when it _does_ happen?  I don't.  It is a bug fix.  It's mostly a matter of if we can do it sanely and in a maintainable manner.

Please don't WONTFIX this issue, I intend to get safe fixes in.

... Motivation

The underlying motivation here is to enable the use of a signal based sampling profiler on a program that is using Python 2.7 (entirely, or embedded as one part of a larger C/C++ application).  Signals (SIGPROF) used in that scenario cause EINTR returns from syscalls that otherwise rarely (read: never) have them in most environments.  Result: Bugs show up making such a sampling profiler impossible to deploy in practice until those issues are fixed.  Fixing the Python standard library is high value as it is a higher up place where these can be handled properly as you cannot even use some standard library modules at all otherwise because the unhandled EINTR surfaces too late or causes other unrecoverable errors internally before you see it.
msg241854 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-23 09:02
> On POSIX systems using os.times()[4] rather than an absolute time.time(),

This is just wrong. os.times() doesn't use seconds, but an "CPU time" which is unrelated. socket.settimeout() uses seconds.

By the way, as I wrote before: using the system clock here may introduce bugs when the system clock goes forward to backward, but Python 2.7 doesn't have a monotonic clock :-( That's another reason why I didn't want to fix Python 2.7 or 3.4.


> (For UDP it might be a valid thing to do but it still has an odd smell to it)

I'm concerned by backward compatibility. Trivial changes was made in Python 2.7 and then came back as regressions because it broke several applications. See recent discussions on this topic on python-dev. Handling EINTR and recompute the timeout is not a trivial changes, some applications rely heavily on the current behaviour of sockets. Network servers using Twisted, eventlet and other libraries.

When we say "threads", also think about greenthreads which are used in eventlet to handle sockets.


> Some things in 2.7 related to EINTR have already been fixed in the past few years such as data loss in readline() and IIRC writelines().  Correctness isn't a feature.

readline()/writelines() don't have a timeout, so it was easier to handle EINTR there.


> Do you consider it an API change to prevent an error that nobody relies on or even expects (as PEP 475 notes) and virtually no code is written to properly handle when it _does_ happen?

For your information, some people already contacted me because their application behaves differently on Python 3.5 (with the PEP 475). Well, it was easy to modify their code to not rely on EINTR, but it means that "Applications relying on the fact that system calls are interrupted with InterruptedError will hang"  exists in the wild.

While it's acceptable to change such thing in Python 3.5, a new major version, I asked if it's acceptable for a minor version.

By the way, Python 3.5 is not released yet, so we can expect more feedback on PEP 475 changes when Python 3.5 will be released (other regressions in applications?).


> Please don't WONTFIX this issue, I intend to get safe fixes in.

I suggest WONTFIX for Python 2.7, this issue is *already* fixed in Python 3.5. Come on, in 2015 it's time to upgrade to Python 3!
msg241905 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-04-24 00:54
diverging discussion: Go re-read the documentation on os.times().  It is plural, it isn't just CPU time.  (on POSIX os.times()[4] is likely to be the system uptime in seconds as a float... it cannot be changed like the absolute clock can, it is a relative monotonic clock).
msg241986 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-24 22:31
> diverging discussion: Go re-read the documentation on os.times().

We don't have the same definition of the unit "seconds" :-)

>>> print(os.times()); time.sleep(1); print(os.times())
(0.04, 0.01, 0.0, 0.0, 4731691.68)
(0.04, 0.01, 0.0, 0.0, 4731692.68)

My watch doesn't stop when I'm sleeping.

See https://docs.python.org/dev/library/time.html#time.process_time

See also "CPU Time" in https://www.python.org/dev/peps/pep-0418/#glossary
msg241987 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-24 22:37
> We don't have the same definition of the unit "seconds" :-)

Or, please forget my comment, I watched the first 4 items of os.times(). I didn't notice the difference of the 5th item, os.times()[4].

The bad news is that only the first two items of os.times() are filled on Windows :-( I don't think that Python 2.7 provides a monotonic clock on Windows.

static PyObject *
os_times_impl(PyModuleDef *module)
{
#ifdef MS_WINDOWS
    FILETIME create, exit, kernel, user;
    HANDLE hProc;
    hProc = GetCurrentProcess();
    GetProcessTimes(hProc, &create, &exit, &kernel, &user);
    /* The fields of a FILETIME structure are the hi and lo part
       of a 64-bit value expressed in 100 nanosecond units.
       1e7 is one second in such units; 1e-7 the inverse.
       429.4967296 is 2**32 / 1e7 or 2**32 * 1e-7.
    */
    return build_times_result(
        (double)(user.dwHighDateTime*429.4967296 +
                 user.dwLowDateTime*1e-7),
        (double)(kernel.dwHighDateTime*429.4967296 +
                 kernel.dwLowDateTime*1e-7),
        (double)0,
        (double)0,
        (double)0);
...
msg243760 - (view) Author: Jeff McNeil (mcjeff) * Date: 2015-05-21 16:06
Do we have a decision on this yet?  I'm willing to rework bits that may need it, but I'd like to know whether this is going to be a fruitful effort or not. Thanks!
msg301517 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2017-09-06 20:43
Unassigning because i'm unlikely to get to this in 2.7, it is better to concentrate on ensuring that 3.x stays signal safe.
msg301534 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-06 22:45
I am closing the issue as WONTFIX for all the reasons listen in my previous comments.
History
Date User Action Args
2017-09-06 22:45:57vstinnersetstatus: open -> closed
resolution: wont fix
messages: + msg301534

stage: patch review -> resolved
2017-09-06 20:43:50gregory.p.smithsetassignee: gregory.p.smith ->
messages: + msg301517
2016-04-04 06:41:24rpcope1setnosy: + rpcope1
2015-05-21 16:06:21mcjeffsetmessages: + msg243760
2015-04-24 22:37:21vstinnersetmessages: + msg241987
2015-04-24 22:31:57vstinnersetmessages: + msg241986
2015-04-24 00:54:01gregory.p.smithsetmessages: + msg241905
2015-04-23 09:02:54vstinnersetmessages: + msg241854
2015-04-22 03:25:14gregory.p.smithsetmessages: + msg241766
2015-04-21 20:22:45mcjeffsetmessages: + msg241742
2015-04-20 15:16:26vstinnersetmessages: + msg241658
2015-04-20 13:39:07r.david.murraysetnosy: + r.david.murray
messages: + msg241646
2015-04-20 09:59:29koobssetnosy: + koobs
2015-04-14 13:18:41mcjeffsetmessages: + msg240879
2015-04-14 08:50:51vstinnersetnosy: + benjamin.peterson
messages: + msg240860
2015-04-13 18:25:27mcjeffsetmessages: + msg240698
2015-04-13 18:12:16mcjeffsetfiles: + socket_eintr.5.patch

messages: + msg240692
2015-04-13 07:44:33vstinnersetmessages: + msg240597
2015-04-10 18:08:51mcjeffsetfiles: + socket_eintr.4.patch

messages: + msg240441
2015-04-10 03:04:36mcjeffsetmessages: + msg240407
2015-04-09 20:34:05vstinnersetmessages: + msg240382
2015-04-09 16:20:01mcjeffsetfiles: + socket_eintr.3.patch

messages: + msg240347
2015-04-09 08:44:08vstinnersetmessages: + msg240313
2015-04-09 08:32:31vstinnersetfiles: + test_connect_eintr4.py
2015-04-09 08:32:20vstinnersetfiles: + connect_eintr-py27.patch

messages: + msg240310
2015-04-08 15:26:23mcjeffsetmessages: + msg240273
2015-04-08 15:24:47mcjeffsetfiles: + socket_eintr.2.patch

messages: + msg240272
2015-04-07 23:00:46gregory.p.smithsetkeywords: + needs review
assignee: gregory.p.smith
type: behavior
stage: patch review
2015-04-07 22:48:17gregory.p.smithsetmessages: + msg240234
2015-04-07 22:13:36gregory.p.smithlinkissue20611 superseder
2015-04-04 22:04:31gregory.p.smithsetmessages: + msg240094
2015-04-04 21:54:23vstinnersetmessages: + msg240093
2015-04-04 16:37:41mcjeffsetfiles: + socket_eintr.1.patch

messages: + msg240083
2015-04-04 03:55:44ned.deilysetnosy: + vstinner
2015-04-04 02:58:32mcjeffsetfiles: + socket_eintr.patch

nosy: + gregory.p.smith
messages: + msg240045

keywords: + patch
2015-04-04 02:51:25mcjeffcreate