Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 475 - EINTR handling #67474

Closed
neologix mannequin opened this issue Jan 20, 2015 · 33 comments
Closed

PEP 475 - EINTR handling #67474

neologix mannequin opened this issue Jan 20, 2015 · 33 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@neologix
Copy link
Mannequin

neologix mannequin commented Jan 20, 2015

BPO 23285
Nosy @pitrou, @vstinner, @ned-deily, @vadmium
Files
  • ff1274594739.diff
  • eintr.diff
  • eintr-1.diff
  • eintr-2.diff
  • eintr-3.diff
  • eintr-4.diff
  • fileio.patch
  • setitimer.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-03-20.11:44:35.815>
    created_at = <Date 2015-01-20.23:03:28.161>
    labels = ['type-feature', 'library']
    title = 'PEP 475 - EINTR handling'
    updated_at = <Date 2016-07-10.09:23:45.050>
    user = 'https://bugs.python.org/neologix'

    bugs.python.org fields:

    activity = <Date 2016-07-10.09:23:45.050>
    actor = 'berker.peksag'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-03-20.11:44:35.815>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2015-01-20.23:03:28.161>
    creator = 'neologix'
    dependencies = []
    files = ['37797', '37802', '37905', '37909', '37910', '37935', '38309', '38403']
    hgrepos = ['293']
    issue_num = 23285
    keywords = ['patch']
    message_count = 33.0
    messages = ['234400', '234402', '234403', '234412', '234425', '234874', '234879', '234931', '234950', '234952', '234953', '234955', '234989', '234993', '234997', '235010', '235091', '235092', '235285', '235424', '235435', '235451', '235525', '235546', '237100', '237101', '237159', '237193', '237194', '237611', '237612', '237613', '238650']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'ned.deily', 'neologix', 'python-dev', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue23285'
    versions = ['Python 3.5']

    @neologix neologix mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jan 20, 2015
    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 20, 2015

    The test runs fine on Linux, but hangs in test_send() on OS-X and *BSD.
    I don't know what's wrong, so if someone with access to one of these OS could have a look, it would be great.

    @neologix neologix mannequin changed the title PEP 475 - EINTR hanndling PEP 475 - EINTR handling Jan 20, 2015
    @pitrou
    Copy link
    Member

    pitrou commented Jan 20, 2015

    Perhaps Ned can help on the OS X side of things.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 20, 2015

    The review diff is weird: it seems it contains changes that aren't EINTR-related (see e.g. argparse.rst).

    @ned-deily
    Copy link
    Member

    (It may be several days before I can spend much time on it but I will take a look.)

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 21, 2015

    The review diff is weird: it seems it contains changes that aren't EINTR-related (see e.g. argparse.rst).

    Here's a manually generated diff.

    @ned-deily
    Copy link
    Member

    I added a few prints to the send and receive loops of _test_send. When running on a reasonably current Debian testing Linux:

    ./python Lib/test/eintrdata/eintr_tester.py
    test_read (main.OSEINTRTest) ... ok
    test_wait (main.OSEINTRTest) ... ok
    test_wait3 (main.OSEINTRTest) ... ok
    test_wait4 (main.OSEINTRTest) ... ok
    test_waitpid (main.OSEINTRTest) ... ok
    test_write (main.OSEINTRTest) ... ok
    test_accept (main.SocketEINTRTest) ... ok
    test_recv (main.SocketEINTRTest) ... ok
    test_recvmsg (main.SocketEINTRTest) ... ok
    test_send (main.SocketEINTRTest) ... len(data) = 50331651
    sent = 183360, written = 183360
    sent = 50148291, written = 50331651
    chunk = 50331651, total read = 50331651
    ok
    test_sendall (main.SocketEINTRTest) ... len(data) = 50331651
    chunk = 49323840, total read = 49323840
    sent = None, written = 50331651
    chunk = 1007811, total read = 50331651
    ok
    test_sendmsg (main.SocketEINTRTest) ... len(data) = 50331651
    sent = 183360, written = 183360
    sent = 50148291, written = 50331651
    chunk = 50331651, total read = 50331651
    ok

    ----------------------------------------------------------------------
    Ran 12 tests in 10.139s

    OK

    When run on OS X (10.10.1):

    test_read (main.OSEINTRTest) ... ok
    test_wait (main.OSEINTRTest) ... ok
    test_wait3 (main.OSEINTRTest) ... ok
    test_wait4 (main.OSEINTRTest) ... ok
    test_waitpid (main.OSEINTRTest) ... ok
    test_write (main.OSEINTRTest) ... ok
    test_accept (main.SocketEINTRTest) ... ok
    test_recv (main.SocketEINTRTest) ... ok
    test_recvmsg (main.SocketEINTRTest) ... ok
    test_send (main.SocketEINTRTest) ... len(data) = 50331651
    sent = 8192, written = 8192
    chunk = 8192, total read = 8192
    chunk = 8192, total read = 16384
    sent = 16384, written = 24576
    chunk = 8192, total read = 24576
    chunk = 8192, total read = 32768
    chunk = 8192, total read = 40960
    chunk = 8192, total read = 49152
    sent = 32768, written = 57344
    chunk = 8192, total read = 57344
    chunk = 8192, total read = 65536
    chunk = 8192, total read = 73728
    chunk = 8192, total read = 81920
    chunk = 8192, total read = 90112
    sent = 40960, written = 98304
    chunk = 8192, total read = 98304
    chunk = 8192, total read = 106496
    chunk = 8192, total read = 114688
    chunk = 8192, total read = 122880
    sent = 32768, written = 131072
    chunk = 8192, total read = 131072
    chunk = 8192, total read = 139264
    chunk = 8192, total read = 147456
    chunk = 8192, total read = 155648
    chunk = 8192, total read = 163840
    sent = 40960, written = 172032
    chunk = 8192, total read = 172032
    chunk = 8192, total read = 180224
    chunk = 8192, total read = 188416
    chunk = 8192, total read = 196608
    sent = 32768, written = 204800
    chunk = 8192, total read = 204800
    chunk = 8192, total read = 212992
    chunk = 8192, total read = 221184
    chunk = 8192, total read = 229376
    chunk = 8192, total read = 237568
    sent = 40960, written = 245760
    chunk = 8192, total read = 245760
    chunk = 8192, total read = 253952
    chunk = 8192, total read = 262144
    chunk = 8192, total read = 270336
    sent = 32768, written = 278528
    chunk = 8192, total read = 278528
    chunk = 8192, total read = 286720
    chunk = 8192, total read = 294912
    chunk = 8192, total read = 303104
    chunk = 8192, total read = 311296
    sent = 40960, written = 319488
    [... and so on]

    When run standalone, the tests do eventually finish without error but take a *long* time to do so.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 28, 2015

    I added a few prints to the send and receive loops of _test_send. When running on a reasonably current Debian testing Linux:

    Thanks, that's what I was suspecting, but I really don't understand
    why 200ms isn't enough for a socket write to actually do something:
    maybe OS-X and *BSD schedulers are large timeslice...

    Could you try by increasing signal_period to e.g. 0.5, and sleep_time to 1?

    @ned-deily
    Copy link
    Member

    It turns out the times are not important; the hangup is the default size of the socket buffers on OS X and possibly BSD in general. In my case, the send and receive buffers are 8192, which explains why the chunks written are so small. I somewhat arbitrarily changed the sizes of the buffers in _test_send with:

    rd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, support.SOCK_MAX_SIZE // 3)
    wr.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, support.SOCK_MAX_SIZE // 3)

    The results were:

    test_send (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
    len(data) = 50331651
    sent = 5592405, written = 5592405
    chunk = 5592405, total read = 5592405
    chunk = 5592405, total read = 11184810
    chunk = 5592405, total read = 16777215
    sent = 16777215, written = 22369620
    chunk = 5592405, total read = 22369620
    chunk = 5592405, total read = 27962025
    chunk = 5592405, total read = 33554430
    chunk = 5592405, total read = 39146835
    sent = 22369620, written = 44739240
    chunk = 5592405, total read = 44739240
    sent = 5592411, written = 50331651
    chunk = 5592405, total read = 50331645
    chunk = 6, total read = 50331651
    ok
    test_sendall (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
    len(data) = 50331651
    chunk = 5592405, total read = 5592405
    chunk = 5592405, total read = 11184810
    chunk = 5592405, total read = 16777215
    chunk = 5592405, total read = 22369620
    chunk = 5592405, total read = 27962025
    chunk = 5592405, total read = 33554430
    chunk = 5592405, total read = 39146835
    chunk = 5592405, total read = 44739240
    sent = None, written = 50331651
    chunk = 5592405, total read = 50331645
    chunk = 6, total read = 50331651
    ok
    test_sendmsg (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
    len(data) = 50331651
    sent = 5592405, written = 5592405
    chunk = 5592405, total read = 5592405
    chunk = 5592405, total read = 11184810
    chunk = 5592405, total read = 16777215
    chunk = 5592405, total read = 22369620
    chunk = 5592405, total read = 27962025
    sent = 27962025, written = 33554430
    chunk = 5592405, total read = 33554430
    chunk = 5592405, total read = 39146835
    chunk = 5592405, total read = 44739240
    sent = 16777221, written = 50331651
    chunk = 5592405, total read = 50331645
    chunk = 6, total read = 50331651
    ok

    I dunno if a value that large will work in all environments, so the code to call setsockopt might need to be smarter.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 29, 2015

    It turns out the times are not important; the hangup is the default size of the socket buffers on OS X and possibly BSD in general. In my case, the send and receive buffers are 8192, which explains why the chunks written are so small.

    Hmmm...
    Basically, with a much smaller socket buffer, we get much more context
    switches, which increases drastically the test runtime.
    But I must admit I'm still really surprised by the time it takes on
    OS-X: with a SOCK_MAX_SIZE of 16MB and a socket buffer size of 8kb,
    that's 2000 calls to send with context switches between the sender and
    receiver - and thie overhead should be much less on a two core
    machine.

    I'll try to increase the socket buffer size and see what the buildbots
    do: most of them will probably cap it at around 100k, but that'll
    hopefully be enough.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 29, 2015

    Charles-François Natali added the comment:

    Hmmm...
    Basically, with a much smaller socket buffer, we get much more context
    switches, which increases drastically the test runtime.
    But I must admit I'm still really surprised by the time it takes on
    OS-X: with a SOCK_MAX_SIZE of 16MB and a socket buffer size of 8kb,
    that's 2000 calls to send with context switches between the sender and
    receiver - and thie overhead should be much less on a two core
    machine.

    OK, actually the receiver is completely CPU-bound, because of memory
    allocation for the socket.recv() buffer I'll play with recv_into() and
    profile a bit more.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 29, 2015

    OK, it turns out the culprit was repeated calls to BytesIO.getvalue(),
    which forced large allocations upon reception of every message.
    The patch attached fixes this (without changing the socket buffer size).

    @ned-deily
    Copy link
    Member

    eintr-1.diff doesn't seem to make any significant difference from eintr.diff on my system. It's still pegging a CPU at 100% and takes 7 minutes wall time to complete.

    $ time ./python ~/Projects/PyDev/active/dev/3x/source/Lib/test/eintrdata/eintr_tester.py
    test_read (__main__.OSEINTRTest) ... ok
    test_wait (__main__.OSEINTRTest) ... ok
    test_wait3 (__main__.OSEINTRTest) ... ok
    test_wait4 (__main__.OSEINTRTest) ... ok
    test_waitpid (__main__.OSEINTRTest) ... ok
    test_write (__main__.OSEINTRTest) ... ok
    test_accept (__main__.SocketEINTRTest) ... ok
    test_recv (__main__.SocketEINTRTest) ... ok
    test_recvmsg (__main__.SocketEINTRTest) ... ok
    test_send (__main__.SocketEINTRTest) ... ok
    test_sendall (__main__.SocketEINTRTest) ... ok
    test_sendmsg (__main__.SocketEINTRTest) ... ok

    Ran 12 tests in 439.966s

    OK

    real 7m20.276s
    user 5m7.575s
    sys 2m9.846s

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 29, 2015

    eintr-1.diff doesn't seem to make any significant difference from eintr.diff on my system. It's still pegging a CPU at 100% and takes 7 minutes wall time to complete.

    Alright, enough played: the patch attached uses a memoryview and
    socket.recv_into() to remove all memory allocations: if this doesn't
    fix your problem, I don't know what's going on...

    @ned-deily
    Copy link
    Member

    With eintr-2.diff, fast!:

    $ time ./python ~/Projects/PyDev/active/dev/3x/source/Lib/test/eintrdata/eintr_tester.py
    test_read (__main__.OSEINTRTest) ... ok
    test_wait (__main__.OSEINTRTest) ... ok
    test_wait3 (__main__.OSEINTRTest) ... ok
    test_wait4 (__main__.OSEINTRTest) ... ok
    test_waitpid (__main__.OSEINTRTest) ... ok
    test_write (__main__.OSEINTRTest) ... ok
    test_accept (__main__.SocketEINTRTest) ... ok
    test_recv (__main__.SocketEINTRTest) ... ok
    test_recvmsg (__main__.SocketEINTRTest) ... ok
    test_send (__main__.SocketEINTRTest) ... ok
    test_sendall (__main__.SocketEINTRTest) ... ok
    test_sendmsg (__main__.SocketEINTRTest) ... ok

    Ran 12 tests in 7.652s

    OK

    real 0m7.959s
    user 0m2.573s
    sys 0m1.604s

    Instrumented test_send, 3 socket.send calls, many socket.recv_into calls:

    test_send (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
    len(data) = 16777215

    sent = 8192, total written = 8192
    sent = 4440064, total written = 4448256
    sent = 12328959, total written = 16777215

    received = 8192, total read = 8192
    received = 8192, total read = 16384
    received = 8192, total read = 24576
    [...]
    received = 8192, total read = 16760832
    received = 8192, total read = 16769024
    received = 8191, total read = 16777215

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 29, 2015

    With eintr-2.diff, fast!:

    Victory \°/.

    Instrumented test_send, 3 socket.send calls, many socket.recv_into calls:

    Yep, that's expected.
    I think we should keep the default socket buffer size: it increases
    the test coverage, and it's probably not worth trying to increase it
    to make the test a bit faster, especially since it spends so much time
    sleeping.

    Antoine, I'm now happy with the patch, so we'll be waiting for your
    decision on the PEP with Victor :-).

    @pitrou
    Copy link
    Member

    pitrou commented Jan 30, 2015

    Victor, do you think there's anything left to do?

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jan 31, 2015

    I just realized I didn't retry upon EINTR for open(): eintr-4.diff
    adds this, along with tests (using a fifo).

    Also, I added comments explaining why we don't retry upon close() (see
    e.g. http://lwn.net/Articles/576478/ and
    http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-09/3000.html).

    @vstinner
    Copy link
    Member

    Charles-François Natali added the comment:

    I just realized I didn't retry upon EINTR for open(): eintr-4.diff
    adds this, along with tests (using a fifo).

    Also, I added comments explaining why we don't retry upon close() (see
    e.g. http://lwn.net/Articles/576478/ and
    http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-09/3000.html).

    I didn't read these articles yet, but I will. IMO the PEP must be
    updated if we don't retry in close(), since close() is used as an
    example where "EINTR is unexpected"!

    @pitrou
    Copy link
    Member

    pitrou commented Feb 2, 2015

    PEP is now updated.

    @vstinner
    Copy link
    Member

    vstinner commented Feb 5, 2015

    Would it be possible to push the latest patch right now, and fix remaining issues (if there are known issues?), before Python 3.4 alpha 1? According to the PEP-478, the alpha 1 is scheduled for this sunday (February 8, 2015).

    @pitrou
    Copy link
    Member

    pitrou commented Feb 5, 2015

    Would it be possible to push the latest patch right now

    It's ok for me. Please watch the buildbots :)

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Feb 5, 2015

    Antoine Pitrou added the comment:

    > Would it be possible to push the latest patch right now

    It's ok for me. Please watch the buildbots :)

    Cool, I'll push on Friday evening or Saturday.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 7, 2015

    New changeset 5b63010be19e by Charles-François Natali in branch 'default':
    Issue bpo-23285: PEP-475 -- Retry system calls failing with EINTR.
    https://hg.python.org/cpython/rev/5b63010be19e

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 8, 2015

    New changeset 000bbdf0ea76 by Ned Deily in branch 'default':
    Issue bpo-23285: Install new test directory.
    https://hg.python.org/cpython/rev/000bbdf0ea76

    @vstinner
    Copy link
    Member

    vstinner commented Mar 3, 2015

    The change on Modules/_io/fileio.c is wrong: functions may return None with an exception set. It is wrong because a function must return a result with no exception set, or NULL and an exception set.

    Attached patch fixes this issue.

    @vstinner
    Copy link
    Member

    vstinner commented Mar 3, 2015

    Note: I found the bug while working on a patch for bpo-22181.

    My test is this shell script:

    $ while true; do ./python -c 'import os, signal; signal.setitimer(signal.ITIMER_REAL, 0.001, 0.0001); signal.signal(signal.SIGALRM, lambda *args: print(".", end="")); print("urandom"); x=os.urandom(5000); print("ok"); signal.alarm(0)'; if [ $? -ne 142 -a $? -ne 0 ]; then break; fi done  

    The test calls print() in a signal handler which can likely create a reentrant call, which is forbidden. But Python handles this case, it's fine.

    After removing all debug prints and reverting the fix on fileio.c, the test doesn't crash with the assertion error anymore. Before, an assertion failed because fileio_write() returned Py_None with an exception set.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Mar 3, 2015

    @victor: please commit.
    Would be nice to have a test for it;

    @vstinner
    Copy link
    Member

    vstinner commented Mar 4, 2015

    Would be nice to have a test for it;

    There are already some tests on EINTR in test_file_eintr and test_subprocess (test_communicate_eintr()).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 4, 2015

    New changeset cad6eac598ec by Victor Stinner in branch 'default':
    Issue bpo-23285: Fix handling of EINTR in fileio.c
    https://hg.python.org/cpython/rev/cad6eac598ec

    @vstinner
    Copy link
    Member

    vstinner commented Mar 9, 2015

    test_setitimer.patch: hack unittest.TestCase.run() to inject signals every 1 ms, send the first signal in 100 ms.

    This patch helped me to find that socket.connect() doesn't handle EINTR yet: see issue bpo-23618 "PEP-475: handle EINTR in the socket module".

    @pitrou
    Copy link
    Member

    pitrou commented Mar 9, 2015

    For the record, it seems test_eintr sometimes left zombie processes in my machine where I run reference leak tests every night. I didn't investigate and just disabled the tests.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 9, 2015

    (I'll add that that machine is hosted on an OpenVZ-based VPS, so perhaps there are issues with the old patched kernel and whatnot?)

    @vstinner
    Copy link
    Member

    For the record, it seems test_eintr sometimes left zombie processes in my machine where I run reference leak tests every night. I didn't investigate and just disabled the tests.

    (I'll add that that machine is hosted on an OpenVZ-based VPS, so perhaps there are issues with the old patched kernel and whatnot?)

    On my Fedora 21 (on a physical PC, not virtualized), I ran "./python -m test -R 3:3: test_eintr" 3 times. After that, I didn't see any zombi Python process. If I cannot reproduce the issue, I cannot fix it. I bet that it's related to OpenVZ.

    IMO this issue can be closed. It already has a long history. I prefer to open new issues. See the issue bpo-23648 "PEP-475 meta issue" which lists all issues related to the PEP-475. I opened new issues for each module which didn't handle completly the PEP-475 yet.

    By the way: great job Charles-François! I like your changeset, it works well.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants