PEP 475 - EINTR handling #67474

neologix · 2015-01-20T23:03:28Z

BPO	23285
Nosy	@pitrou, @vstinner, @ned-deily, @vadmium
Files	ff1274594739.diff eintr.diff eintr-1.diff eintr-2.diff eintr-3.diff eintr-4.diff fileio.patch setitimer.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2015-03-20.11:44:35.815>
created_at = <Date 2015-01-20.23:03:28.161>
labels = ['type-feature', 'library']
title = 'PEP 475 - EINTR handling'
updated_at = <Date 2016-07-10.09:23:45.050>
user = 'https://bugs.python.org/neologix'

bugs.python.org fields:

activity = <Date 2016-07-10.09:23:45.050>
actor = 'berker.peksag'
assignee = 'none'
closed = True
closed_date = <Date 2015-03-20.11:44:35.815>
closer = 'vstinner'
components = ['Library (Lib)']
creation = <Date 2015-01-20.23:03:28.161>
creator = 'neologix'
dependencies = []
files = ['37797', '37802', '37905', '37909', '37910', '37935', '38309', '38403']
hgrepos = ['293']
issue_num = 23285
keywords = ['patch']
message_count = 33.0
messages = ['234400', '234402', '234403', '234412', '234425', '234874', '234879', '234931', '234950', '234952', '234953', '234955', '234989', '234993', '234997', '235010', '235091', '235092', '235285', '235424', '235435', '235451', '235525', '235546', '237100', '237101', '237159', '237193', '237194', '237611', '237612', '237613', '238650']
nosy_count = 6.0
nosy_names = ['pitrou', 'vstinner', 'ned.deily', 'neologix', 'python-dev', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue23285'
versions = ['Python 3.5']

neologix · 2015-01-20T23:04:34Z

The test runs fine on Linux, but hangs in test_send() on OS-X and *BSD.
I don't know what's wrong, so if someone with access to one of these OS could have a look, it would be great.

pitrou · 2015-01-20T23:14:05Z

Perhaps Ned can help on the OS X side of things.

pitrou · 2015-01-20T23:15:08Z

The review diff is weird: it seems it contains changes that aren't EINTR-related (see e.g. argparse.rst).

ned-deily · 2015-01-20T23:55:34Z

(It may be several days before I can spend much time on it but I will take a look.)

neologix · 2015-01-21T05:56:39Z

The review diff is weird: it seems it contains changes that aren't EINTR-related (see e.g. argparse.rst).

Here's a manually generated diff.

ned-deily · 2015-01-28T07:53:48Z

I added a few prints to the send and receive loops of _test_send. When running on a reasonably current Debian testing Linux:

./python Lib/test/eintrdata/eintr_tester.py
test_read (main.OSEINTRTest) ... ok
test_wait (main.OSEINTRTest) ... ok
test_wait3 (main.OSEINTRTest) ... ok
test_wait4 (main.OSEINTRTest) ... ok
test_waitpid (main.OSEINTRTest) ... ok
test_write (main.OSEINTRTest) ... ok
test_accept (main.SocketEINTRTest) ... ok
test_recv (main.SocketEINTRTest) ... ok
test_recvmsg (main.SocketEINTRTest) ... ok
test_send (main.SocketEINTRTest) ... len(data) = 50331651
sent = 183360, written = 183360
sent = 50148291, written = 50331651
chunk = 50331651, total read = 50331651
ok
test_sendall (main.SocketEINTRTest) ... len(data) = 50331651
chunk = 49323840, total read = 49323840
sent = None, written = 50331651
chunk = 1007811, total read = 50331651
ok
test_sendmsg (main.SocketEINTRTest) ... len(data) = 50331651
sent = 183360, written = 183360
sent = 50148291, written = 50331651
chunk = 50331651, total read = 50331651
ok

----------------------------------------------------------------------
Ran 12 tests in 10.139s

OK

When run on OS X (10.10.1):

test_read (main.OSEINTRTest) ... ok
test_wait (main.OSEINTRTest) ... ok
test_wait3 (main.OSEINTRTest) ... ok
test_wait4 (main.OSEINTRTest) ... ok
test_waitpid (main.OSEINTRTest) ... ok
test_write (main.OSEINTRTest) ... ok
test_accept (main.SocketEINTRTest) ... ok
test_recv (main.SocketEINTRTest) ... ok
test_recvmsg (main.SocketEINTRTest) ... ok
test_send (main.SocketEINTRTest) ... len(data) = 50331651
sent = 8192, written = 8192
chunk = 8192, total read = 8192
chunk = 8192, total read = 16384
sent = 16384, written = 24576
chunk = 8192, total read = 24576
chunk = 8192, total read = 32768
chunk = 8192, total read = 40960
chunk = 8192, total read = 49152
sent = 32768, written = 57344
chunk = 8192, total read = 57344
chunk = 8192, total read = 65536
chunk = 8192, total read = 73728
chunk = 8192, total read = 81920
chunk = 8192, total read = 90112
sent = 40960, written = 98304
chunk = 8192, total read = 98304
chunk = 8192, total read = 106496
chunk = 8192, total read = 114688
chunk = 8192, total read = 122880
sent = 32768, written = 131072
chunk = 8192, total read = 131072
chunk = 8192, total read = 139264
chunk = 8192, total read = 147456
chunk = 8192, total read = 155648
chunk = 8192, total read = 163840
sent = 40960, written = 172032
chunk = 8192, total read = 172032
chunk = 8192, total read = 180224
chunk = 8192, total read = 188416
chunk = 8192, total read = 196608
sent = 32768, written = 204800
chunk = 8192, total read = 204800
chunk = 8192, total read = 212992
chunk = 8192, total read = 221184
chunk = 8192, total read = 229376
chunk = 8192, total read = 237568
sent = 40960, written = 245760
chunk = 8192, total read = 245760
chunk = 8192, total read = 253952
chunk = 8192, total read = 262144
chunk = 8192, total read = 270336
sent = 32768, written = 278528
chunk = 8192, total read = 278528
chunk = 8192, total read = 286720
chunk = 8192, total read = 294912
chunk = 8192, total read = 303104
chunk = 8192, total read = 311296
sent = 40960, written = 319488
[... and so on]

When run standalone, the tests do eventually finish without error but take a *long* time to do so.

neologix · 2015-01-28T09:26:12Z

I added a few prints to the send and receive loops of _test_send. When running on a reasonably current Debian testing Linux:

Thanks, that's what I was suspecting, but I really don't understand
why 200ms isn't enough for a socket write to actually do something:
maybe OS-X and *BSD schedulers are large timeslice...

Could you try by increasing signal_period to e.g. 0.5, and sleep_time to 1?

ned-deily · 2015-01-29T00:58:32Z

It turns out the times are not important; the hangup is the default size of the socket buffers on OS X and possibly BSD in general. In my case, the send and receive buffers are 8192, which explains why the chunks written are so small. I somewhat arbitrarily changed the sizes of the buffers in _test_send with:

rd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, support.SOCK_MAX_SIZE // 3)
wr.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, support.SOCK_MAX_SIZE // 3)

The results were:

test_send (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
len(data) = 50331651
sent = 5592405, written = 5592405
chunk = 5592405, total read = 5592405
chunk = 5592405, total read = 11184810
chunk = 5592405, total read = 16777215
sent = 16777215, written = 22369620
chunk = 5592405, total read = 22369620
chunk = 5592405, total read = 27962025
chunk = 5592405, total read = 33554430
chunk = 5592405, total read = 39146835
sent = 22369620, written = 44739240
chunk = 5592405, total read = 44739240
sent = 5592411, written = 50331651
chunk = 5592405, total read = 50331645
chunk = 6, total read = 50331651
ok
test_sendall (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
len(data) = 50331651
chunk = 5592405, total read = 5592405
chunk = 5592405, total read = 11184810
chunk = 5592405, total read = 16777215
chunk = 5592405, total read = 22369620
chunk = 5592405, total read = 27962025
chunk = 5592405, total read = 33554430
chunk = 5592405, total read = 39146835
chunk = 5592405, total read = 44739240
sent = None, written = 50331651
chunk = 5592405, total read = 50331645
chunk = 6, total read = 50331651
ok
test_sendmsg (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
len(data) = 50331651
sent = 5592405, written = 5592405
chunk = 5592405, total read = 5592405
chunk = 5592405, total read = 11184810
chunk = 5592405, total read = 16777215
chunk = 5592405, total read = 22369620
chunk = 5592405, total read = 27962025
sent = 27962025, written = 33554430
chunk = 5592405, total read = 33554430
chunk = 5592405, total read = 39146835
chunk = 5592405, total read = 44739240
sent = 16777221, written = 50331651
chunk = 5592405, total read = 50331645
chunk = 6, total read = 50331651
ok

I dunno if a value that large will work in all environments, so the code to call setsockopt might need to be smarter.

neologix · 2015-01-29T07:28:23Z

It turns out the times are not important; the hangup is the default size of the socket buffers on OS X and possibly BSD in general. In my case, the send and receive buffers are 8192, which explains why the chunks written are so small.

Hmmm...
Basically, with a much smaller socket buffer, we get much more context
switches, which increases drastically the test runtime.
But I must admit I'm still really surprised by the time it takes on
OS-X: with a SOCK_MAX_SIZE of 16MB and a socket buffer size of 8kb,
that's 2000 calls to send with context switches between the sender and
receiver - and thie overhead should be much less on a two core
machine.

I'll try to increase the socket buffer size and see what the buildbots
do: most of them will probably cap it at around 100k, but that'll
hopefully be enough.

neologix · 2015-01-29T07:42:34Z

Charles-François Natali added the comment:

Hmmm...
Basically, with a much smaller socket buffer, we get much more context
switches, which increases drastically the test runtime.
But I must admit I'm still really surprised by the time it takes on
OS-X: with a SOCK_MAX_SIZE of 16MB and a socket buffer size of 8kb,
that's 2000 calls to send with context switches between the sender and
receiver - and thie overhead should be much less on a two core
machine.

OK, actually the receiver is completely CPU-bound, because of memory
allocation for the socket.recv() buffer I'll play with recv_into() and
profile a bit more.

neologix · 2015-01-29T08:12:29Z

OK, it turns out the culprit was repeated calls to BytesIO.getvalue(),
which forced large allocations upon reception of every message.
The patch attached fixes this (without changing the socket buffer size).

ned-deily · 2015-01-29T08:41:45Z

eintr-1.diff doesn't seem to make any significant difference from eintr.diff on my system. It's still pegging a CPU at 100% and takes 7 minutes wall time to complete.

$ time ./python ~/Projects/PyDev/active/dev/3x/source/Lib/test/eintrdata/eintr_tester.py
test_read (__main__.OSEINTRTest) ... ok
test_wait (__main__.OSEINTRTest) ... ok
test_wait3 (__main__.OSEINTRTest) ... ok
test_wait4 (__main__.OSEINTRTest) ... ok
test_waitpid (__main__.OSEINTRTest) ... ok
test_write (__main__.OSEINTRTest) ... ok
test_accept (__main__.SocketEINTRTest) ... ok
test_recv (__main__.SocketEINTRTest) ... ok
test_recvmsg (__main__.SocketEINTRTest) ... ok
test_send (__main__.SocketEINTRTest) ... ok
test_sendall (__main__.SocketEINTRTest) ... ok
test_sendmsg (__main__.SocketEINTRTest) ... ok

Ran 12 tests in 439.966s

OK

real 7m20.276s
user 5m7.575s
sys 2m9.846s

neologix · 2015-01-29T19:56:13Z

eintr-1.diff doesn't seem to make any significant difference from eintr.diff on my system. It's still pegging a CPU at 100% and takes 7 minutes wall time to complete.

Alright, enough played: the patch attached uses a memoryview and
socket.recv_into() to remove all memory allocations: if this doesn't
fix your problem, I don't know what's going on...

ned-deily · 2015-01-29T22:01:17Z

With eintr-2.diff, fast!:

$ time ./python ~/Projects/PyDev/active/dev/3x/source/Lib/test/eintrdata/eintr_tester.py
test_read (__main__.OSEINTRTest) ... ok
test_wait (__main__.OSEINTRTest) ... ok
test_wait3 (__main__.OSEINTRTest) ... ok
test_wait4 (__main__.OSEINTRTest) ... ok
test_waitpid (__main__.OSEINTRTest) ... ok
test_write (__main__.OSEINTRTest) ... ok
test_accept (__main__.SocketEINTRTest) ... ok
test_recv (__main__.SocketEINTRTest) ... ok
test_recvmsg (__main__.SocketEINTRTest) ... ok
test_send (__main__.SocketEINTRTest) ... ok
test_sendall (__main__.SocketEINTRTest) ... ok
test_sendmsg (__main__.SocketEINTRTest) ... ok

Ran 12 tests in 7.652s

OK

real 0m7.959s
user 0m2.573s
sys 0m1.604s

Instrumented test_send, 3 socket.send calls, many socket.recv_into calls:

test_send (main.SocketEINTRTest) ... rd SO_RCVBUF default was 8192, wr SO_SNDBUF default was 8192
len(data) = 16777215

sent = 8192, total written = 8192
sent = 4440064, total written = 4448256
sent = 12328959, total written = 16777215

received = 8192, total read = 8192
received = 8192, total read = 16384
received = 8192, total read = 24576
[...]
received = 8192, total read = 16760832
received = 8192, total read = 16769024
received = 8191, total read = 16777215

neologix · 2015-01-29T22:49:06Z

With eintr-2.diff, fast!:

Victory \°/.

Instrumented test_send, 3 socket.send calls, many socket.recv_into calls:

Yep, that's expected.
I think we should keep the default socket buffer size: it increases
the test coverage, and it's probably not worth trying to increase it
to make the test a bit faster, especially since it spends so much time
sleeping.

Antoine, I'm now happy with the patch, so we'll be waiting for your
decision on the PEP with Victor :-).

pitrou · 2015-01-30T00:39:28Z

Victor, do you think there's anything left to do?

neologix · 2015-01-31T08:58:14Z

I just realized I didn't retry upon EINTR for open(): eintr-4.diff
adds this, along with tests (using a fifo).

Also, I added comments explaining why we don't retry upon close() (see
e.g. http://lwn.net/Articles/576478/ and
http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-09/3000.html).

vstinner · 2015-01-31T09:35:20Z

Charles-François Natali added the comment:

I just realized I didn't retry upon EINTR for open(): eintr-4.diff
adds this, along with tests (using a fifo).

Also, I added comments explaining why we don't retry upon close() (see
e.g. http://lwn.net/Articles/576478/ and
http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-09/3000.html).

I didn't read these articles yet, but I will. IMO the PEP must be
updated if we don't retry in close(), since close() is used as an
example where "EINTR is unexpected"!

pitrou · 2015-02-02T20:39:05Z

PEP is now updated.

vstinner · 2015-02-05T08:36:11Z

Would it be possible to push the latest patch right now, and fix remaining issues (if there are known issues?), before Python 3.4 alpha 1? According to the PEP-478, the alpha 1 is scheduled for this sunday (February 8, 2015).

pitrou · 2015-02-05T12:55:45Z

Would it be possible to push the latest patch right now

It's ok for me. Please watch the buildbots :)

neologix · 2015-02-05T20:51:37Z

Antoine Pitrou added the comment:

> Would it be possible to push the latest patch right now

It's ok for me. Please watch the buildbots :)

Cool, I'll push on Friday evening or Saturday.

python-dev · 2015-02-07T13:29:32Z

New changeset 5b63010be19e by Charles-François Natali in branch 'default':
Issue bpo-23285: PEP-475 -- Retry system calls failing with EINTR.
https://hg.python.org/cpython/rev/5b63010be19e

python-dev · 2015-02-08T04:49:29Z

New changeset 000bbdf0ea76 by Ned Deily in branch 'default':
Issue bpo-23285: Install new test directory.
https://hg.python.org/cpython/rev/000bbdf0ea76

vstinner · 2015-03-03T01:46:04Z

The change on Modules/_io/fileio.c is wrong: functions may return None with an exception set. It is wrong because a function must return a result with no exception set, or NULL and an exception set.

Attached patch fixes this issue.

vstinner · 2015-03-03T01:49:14Z

Note: I found the bug while working on a patch for bpo-22181.

My test is this shell script:

$ while true; do ./python -c 'import os, signal; signal.setitimer(signal.ITIMER_REAL, 0.001, 0.0001); signal.signal(signal.SIGALRM, lambda *args: print(".", end="")); print("urandom"); x=os.urandom(5000); print("ok"); signal.alarm(0)'; if [ $? -ne 142 -a $? -ne 0 ]; then break; fi done

The test calls print() in a signal handler which can likely create a reentrant call, which is forbidden. But Python handles this case, it's fine.

After removing all debug prints and reverting the fix on fileio.c, the test doesn't crash with the assertion error anymore. Before, an assertion failed because fileio_write() returned Py_None with an exception set.

neologix · 2015-03-03T20:26:11Z

@victor: please commit.
Would be nice to have a test for it;

vstinner · 2015-03-04T17:43:18Z

Would be nice to have a test for it;

There are already some tests on EINTR in test_file_eintr and test_subprocess (test_communicate_eintr()).

python-dev · 2015-03-04T17:47:04Z

New changeset cad6eac598ec by Victor Stinner in branch 'default':
Issue bpo-23285: Fix handling of EINTR in fileio.c
https://hg.python.org/cpython/rev/cad6eac598ec

vstinner · 2015-03-09T09:11:53Z

test_setitimer.patch: hack unittest.TestCase.run() to inject signals every 1 ms, send the first signal in 100 ms.

This patch helped me to find that socket.connect() doesn't handle EINTR yet: see issue bpo-23618 "PEP-475: handle EINTR in the socket module".

pitrou · 2015-03-09T09:15:17Z

For the record, it seems test_eintr sometimes left zombie processes in my machine where I run reference leak tests every night. I didn't investigate and just disabled the tests.

pitrou · 2015-03-09T09:15:55Z

(I'll add that that machine is hosted on an OpenVZ-based VPS, so perhaps there are issues with the old patched kernel and whatnot?)

vstinner · 2015-03-20T11:44:17Z

For the record, it seems test_eintr sometimes left zombie processes in my machine where I run reference leak tests every night. I didn't investigate and just disabled the tests.

(I'll add that that machine is hosted on an OpenVZ-based VPS, so perhaps there are issues with the old patched kernel and whatnot?)

On my Fedora 21 (on a physical PC, not virtualized), I ran "./python -m test -R 3:3: test_eintr" 3 times. After that, I didn't see any zombi Python process. If I cannot reproduce the issue, I cannot fix it. I bet that it's related to OpenVZ.

IMO this issue can be closed. It already has a long history. I prefer to open new issues. See the issue bpo-23648 "PEP-475 meta issue" which lists all issues related to the PEP-475. I opened new issues for each module which didn't handle completly the PEP-475 yet.

By the way: great job Charles-François! I like your changeset, it works well.

neologix mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jan 20, 2015

neologix mannequin changed the title ~~PEP 475 - EINTR hanndling~~ PEP 475 - EINTR handling Jan 20, 2015

vstinner closed this as completed Mar 20, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PEP 475 - EINTR handling #67474

PEP 475 - EINTR handling #67474

neologix mannequin commented Jan 20, 2015

neologix mannequin commented Jan 20, 2015

pitrou commented Jan 20, 2015

pitrou commented Jan 20, 2015

ned-deily commented Jan 20, 2015

neologix mannequin commented Jan 21, 2015

ned-deily commented Jan 28, 2015

neologix mannequin commented Jan 28, 2015

ned-deily commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

ned-deily commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

ned-deily commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

pitrou commented Jan 30, 2015

neologix mannequin commented Jan 31, 2015

vstinner commented Jan 31, 2015

pitrou commented Feb 2, 2015

vstinner commented Feb 5, 2015

pitrou commented Feb 5, 2015

neologix mannequin commented Feb 5, 2015

python-dev mannequin commented Feb 7, 2015

python-dev mannequin commented Feb 8, 2015

vstinner commented Mar 3, 2015

vstinner commented Mar 3, 2015

neologix mannequin commented Mar 3, 2015

vstinner commented Mar 4, 2015

python-dev mannequin commented Mar 4, 2015

vstinner commented Mar 9, 2015

pitrou commented Mar 9, 2015

pitrou commented Mar 9, 2015

vstinner commented Mar 20, 2015

PEP 475 - EINTR handling #67474

PEP 475 - EINTR handling #67474

Comments

neologix mannequin commented Jan 20, 2015

neologix mannequin commented Jan 20, 2015

pitrou commented Jan 20, 2015

pitrou commented Jan 20, 2015

ned-deily commented Jan 20, 2015

neologix mannequin commented Jan 21, 2015

ned-deily commented Jan 28, 2015

neologix mannequin commented Jan 28, 2015

ned-deily commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

ned-deily commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

ned-deily commented Jan 29, 2015

neologix mannequin commented Jan 29, 2015

pitrou commented Jan 30, 2015

neologix mannequin commented Jan 31, 2015

vstinner commented Jan 31, 2015

pitrou commented Feb 2, 2015

vstinner commented Feb 5, 2015

pitrou commented Feb 5, 2015

neologix mannequin commented Feb 5, 2015

python-dev mannequin commented Feb 7, 2015

python-dev mannequin commented Feb 8, 2015

vstinner commented Mar 3, 2015

vstinner commented Mar 3, 2015

neologix mannequin commented Mar 3, 2015

vstinner commented Mar 4, 2015

python-dev mannequin commented Mar 4, 2015

vstinner commented Mar 9, 2015

pitrou commented Mar 9, 2015

pitrou commented Mar 9, 2015

vstinner commented Mar 20, 2015