classification
Title: test_multiprocessing_forkserver: TestIgnoreEINTR.test_ignore() fails on Travis CI
Type: Stage: resolved
Components: Tests Versions: Python 3.8, Python 3.7, Python 3.6, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: davin, miss-islington, ned.deily, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2018-05-16 06:30 by vstinner, last changed 2018-06-20 10:12 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 7247 closed vstinner, 2018-05-30 15:14
PR 7260 closed vstinner, 2018-05-31 00:17
PR 7261 closed vstinner, 2018-05-31 02:06
PR 7262 merged vstinner, 2018-05-31 02:53
PR 7264 merged miss-islington, 2018-05-31 03:09
PR 7265 merged vstinner, 2018-05-31 03:15
PR 7266 merged vstinner, 2018-05-31 03:19
PR 7318 closed vstinner, 2018-06-01 12:58
PR 7319 merged vstinner, 2018-06-01 14:05
PR 7320 merged miss-islington, 2018-06-01 14:48
PR 7322 merged vstinner, 2018-06-01 15:36
PR 7323 merged vstinner, 2018-06-01 15:38
Messages (30)
msg316771 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-16 06:30
https://travis-ci.org/python/cpython/jobs/379560387

======================================================================
ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
    os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
msg317985 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-05-29 06:19
FWIW, Issue29589 seems to be an older example of this failure.
msg318008 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-29 11:08
> FWIW, Issue29589 seems to be an older example of this failure.

It was a failure on Mageia 5 (Linux) at 2017-02-17:

== CPython 3.6.0 (default, Feb 17 2017, 15:26:31) [GCC 4.9.2]
==   Linux-4.4.13-desktop-1.mga5-x86_64-with-mageia-5-Official little-endian
==   hash algorithm: siphash24 64bit
==  cwd: /home/dima/bin/Python-3.6.0/build/test_python_8244
==  encodings: locale=UTF-8, FS=utf-8
Run tests sequentially
0:00:00 [1/1] test_multiprocessing_forkserver
...
======================================================================
ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dima/bin/Python-3.6.0/Lib/test/_test_multiprocessing.py", line 3728, in test_ignore
    os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
msg318010 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-29 11:28
I am unable to reproduce the issue on Fedora 28 (Linux kernel 4.16.11, glibc 2.27):

I ran 6 jobs in parallel during 10 minutes:

* 2: ./python -m test test_multiprocessing_forkserver -m test_ignore -F
* 2: ./python -m test test_multiprocessing_forkserver -F
* 2: ./python -m test -j16 -r -F

The system load was around 30 which is very high (my CPU has 8 logicial threads, 4 physical cores).
msg318215 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-30 22:13
> https://travis-ci.org/python/cpython/jobs/379560387

More info about this failure:

== Linux-4.4.0-112-generic-x86_64-with-debian-jessie-sid little-endian
== CPU count: 48
Run tests in parallel using 4 child processes
(...)
0:07:06 load avg: 24.76 [356/416/3] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (61 sec)
msg318222 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-30 23:22
Davin, Antoine: any idea on this bug?

I ran the full test suite and test_multiprocessing_forkserver in parallel in a Trust VM, but I failed to reproduce the bug.

I did a similar test in a Ubuntu Trusty docker container: again, I'm unable to reproduce the bug.

An Ubuntu Trusty docker container is supposed to be as close as possible to Travis CI, except that my laptop doesn't have 48 CPUs.
msg318225 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-30 23:37
test_ignore() starts to fail more and more often on Travis CI for an unknown reason.

https://travis-ci.org/python/cpython/jobs/385562187

== CPU count: 48
(...)
Run tests in parallel using 4 child processes
(...)
0:07:38 load avg: 128.04 [415/416/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (378 sec)
(...)
======================================================================
ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
    os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
msg318226 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-30 23:38
https://travis-ci.org/python/cpython/jobs/379560387

Re-running test 'test_multiprocessing_forkserver' in verbose mode
(...)
======================================================================
ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
    os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
msg318227 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-30 23:40
Python 3.7:

https://travis-ci.org/python/cpython/jobs/385474104

Re-running test 'test_multiprocessing_forkserver' in verbose mode

======================================================================
ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4324, in test_ignore
    os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
msg318228 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-30 23:40
Python 3.7:

https://travis-ci.org/python/cpython/jobs/385458840

Re-running test 'test_multiprocessing_forkserver' in verbose mode

======================================================================
ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4324, in test_ignore
    os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
msg318232 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 00:44
I added debug traces in PR 7260. test_ignore() failed, but it may be related to my debug traces, since the failure is different.


test_multiprocessing_fork.test_ignore() failure:

test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR) ...
parent: pid=20159
parent: start child
parent: p=<Process(Process-3, started daemon)>
parent: p.ident=20176
ERROR

======================================================================
ERROR: test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4389, in test_ignore
    self.assertEqual(conn.recv(), 'ready')
  File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError


test_multiprocessing_fork.test_ignore() success:

test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR) ...
parent: pid=11713
parent: start child
parent: p=<Process(Process-3, started daemon)>
parent: p.ident=28998
[INFO/Process-3] child process calling self.run()
child: pid=28998
child: register SIGUSR1 handler, send ready
child: wait 1234
parent: ready received
parent: first SIGUSR1
child: got SIGUSR1
parent: send 1234
parent: wait 1234
child: send back 1234
child: big blocking send
parent: second SIGUSR1
parent: wait bytes
child: got SIGUSR1
child: done
[INFO/Process-3] process shutting down
[INFO/Process-3] process shutting down
[INFO/Process-3] process exiting with exitcode 0
[INFO/Process-3] process exiting with exitcode 0
parent: join child
parent: done
ok
msg318233 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 00:45
> I added debug traces in PR 7260. test_ignore() failed, but it may be related to my debug traces, since the failure is different.

https://travis-ci.org/python/cpython/jobs/385958924
msg318234 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 02:04
Notes:

* test_ignore() creates a daemonic process (it cannot spawn new processes)
* the fork server process is shared between test methods of test_forkserver
* multiprocessing.util._exit_function() terminates all daemon processes
msg318236 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 02:36
test_multiprocessing_forkserver.test_ignore() failed on Travis CI using PR 7261: --randseed=7474929. The method failed once when run in parallel, and then failed again when re-reun in verbose mode.
msg318237 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 02:40
> test_multiprocessing_forkserver.test_ignore() failed on Travis CI using PR 7261: --randseed=7474929. The method failed once when run in parallel, and then failed again when re-reun in verbose mode.

https://travis-ci.org/python/cpython/jobs/385986803

0:04:57 load avg: 25.81 [342/415/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (72 sec)
msg318238 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 02:48
I added more traces to PR 7261 and the bug still occurred on Travis CI.

https://travis-ci.org/python/cpython/jobs/385990848

0:05:04 load avg: 42.62 [342/415/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (69 sec)
child: pid=19401 ppid=19348
child: send BIG
child: done
test_answer_challenge_auth_failure (test.test_multiprocessing_forkserver.OtherTest) ... ok
test_deliver_challenge_auth_failure (test.test_multiprocessing_forkserver.OtherTest) ... ok
test_closefd (test.test_multiprocessing_forkserver.TestCloseFds) ... ok
test_flags (test.test_multiprocessing_forkserver.TestFlags) ... ok
test_lock (test.test_multiprocessing_forkserver.TestForkAwareThreadLock) ... ok
test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR) ... parent: pid=19342
parent: second SIGUSR1
ERROR

Hum... maybe the child exited, before the parent sent SIGUSR1: the child didn't block on sending 1 MB?
msg318239 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 03:08
New changeset 5d6c7ed5e340b2311a15f34e968d4bef09c71922 by Victor Stinner in branch 'master':
bpo-33532: Fix multiprocessing test_ignore() (#7262)
https://github.com/python/cpython/commit/5d6c7ed5e340b2311a15f34e968d4bef09c71922
msg318240 - (view) Author: miss-islington (miss-islington) Date: 2018-05-31 03:28
New changeset 9d273344238330d634acb4a7d427a8fb732ae62e by Miss Islington (bot) in branch '3.7':
bpo-33532: Fix multiprocessing test_ignore() (GH-7262)
https://github.com/python/cpython/commit/9d273344238330d634acb4a7d427a8fb732ae62e
msg318254 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 05:35
New changeset c2870b699eb899f358b843c2d706ce8ca89b2bf4 by Victor Stinner in branch '3.6':
bpo-33532: Fix multiprocessing test_ignore() (GH-7262) (#7266)
https://github.com/python/cpython/commit/c2870b699eb899f358b843c2d706ce8ca89b2bf4
msg318255 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 05:35
New changeset 82c456fa37d4b87da0d64e574bd8adc665abc4c8 by Victor Stinner in branch '2.7':
bpo-33532: Fix multiprocessing test_ignore() (GH-7265)
https://github.com/python/cpython/commit/82c456fa37d4b87da0d64e574bd8adc665abc4c8
msg318265 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-05-31 08:07
Did your PR fix the issue?
msg318301 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-31 13:08
> Did your PR fix the issue?

The bug was that *sometimes* on Travis CI, and only on Travis CI (!?), writing 1 MiB into the multiprocessing pipe didn't block. The bug is really strange because it is only reproduced on the clang Linux job of Travis CI which runs tests in parallel. Not on the Linux gcc which runs tests sequentially in coverage. Moreover, the failure only occurs for a specific order of tests.

You can easily reproduce the issue if you reduce the size of the data written into the pipe at the end of _test_ignore(). If the write (send_bytes) doesn't block, you get the same error.

I'm confident that writing 4 MiB instead of 1 MiB will fix the issue. I saw the test passing with 4 MiB whereas it failed with 1 MiB, when I fixed the test order.
msg318409 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-01 13:05
I saw one new test_ignore() failure on Travis CI in my 3.6 PR, whereas 3.6 already uses PIPE_MAX_SIZE for test_ignore().

https://github.com/python/cpython/pull/7315/
https://travis-ci.org/python/cpython/jobs/386637729

Hum, test_ignore() uses PIPE_MAX_SIZE, whereas the test fails on Linux where we use a pair of sockets, no pipes. Maybe we should use SOCK_MAX_SIZE?

Notes on pipe size:

* /proc/sys/fs/pipe-max-size
* bpo-17835: add support.PIPE_MAX_SIZE (4 MiB +1 B)
* bpo-18643: add support.SOCK_MAX_SIZE (16 MiB)
msg318416 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-01 13:34
> https://travis-ci.org/python/cpython/jobs/386637729

I added some debug traces:

test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR) ...
socket[fd=4].SO_RCVBUF 212992
socket[fd=4].SO_SNDBUF 212992
socket[fd=7].SO_RCVBUF 212992
socket[fd=7].SO_SNDBUF 212992
test_ignore: PIPE_MAX_SIZE = 4194305

The socket pair uses a buffer of 208 KiB (212,992 B) in both directions and the test sends 4 MiB (4,194,305 B) for a blocking call... and sometimes the send doesn't block.
msg318420 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-01 14:48
New changeset 252f6abe0a9430f4ae7588c0cb50a6ff141bebe3 by Victor Stinner in branch 'master':
bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7319)
https://github.com/python/cpython/commit/252f6abe0a9430f4ae7588c0cb50a6ff141bebe3
msg318421 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-01 14:50
I modified test_ignore() to use support.SOCK_MAX_SIZE, but honestly, I'm not convinced that it will be the issue. I applied my change anyway, just to check if the issue comes from the size, or if it's something else.
msg318426 - (view) Author: miss-islington (miss-islington) Date: 2018-06-01 15:21
New changeset 63fa8db58c349f985d75139e6545cfddf067c07b by Miss Islington (bot) in branch '3.7':
bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7319)
https://github.com/python/cpython/commit/63fa8db58c349f985d75139e6545cfddf067c07b
msg318438 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-01 17:39
New changeset 1d4be0a65f66c2eecefb8b8bd4d115a82fa61067 by Victor Stinner in branch '2.7':
bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7323)
https://github.com/python/cpython/commit/1d4be0a65f66c2eecefb8b8bd4d115a82fa61067
msg318441 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-01 17:39
New changeset 64e538bc703e423a04ab435c4eab6b950b8aef7e by Victor Stinner in branch '3.6':
bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7322)
https://github.com/python/cpython/commit/64e538bc703e423a04ab435c4eab6b950b8aef7e
msg320051 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-20 10:12
It didn't see this super annoying failure recently, so it seems like it has been fixed for real. Great!
History
Date User Action Args
2018-06-20 10:12:32vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg320051

stage: patch review -> resolved
2018-06-01 17:39:58vstinnersetmessages: + msg318441
2018-06-01 17:39:19vstinnersetmessages: + msg318438
2018-06-01 15:38:24vstinnersetpull_requests: + pull_request6952
2018-06-01 15:36:49vstinnersetpull_requests: + pull_request6951
2018-06-01 15:21:25miss-islingtonsetmessages: + msg318426
2018-06-01 14:50:49vstinnersetmessages: + msg318421
2018-06-01 14:48:50miss-islingtonsetpull_requests: + pull_request6949
2018-06-01 14:48:36vstinnersetmessages: + msg318420
2018-06-01 14:05:05vstinnersetstage: resolved -> patch review
pull_requests: + pull_request6948
2018-06-01 13:34:45vstinnersetmessages: + msg318416
2018-06-01 13:05:41vstinnersetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg318409
2018-06-01 12:58:43vstinnersetpull_requests: + pull_request6947
2018-05-31 13:08:32vstinnersetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
versions: + Python 2.7, Python 3.6
2018-05-31 13:08:19vstinnersetmessages: + msg318301
2018-05-31 08:07:28pitrousetmessages: + msg318265
2018-05-31 05:35:36vstinnersetmessages: + msg318255
2018-05-31 05:35:05vstinnersetmessages: + msg318254
2018-05-31 03:28:08miss-islingtonsetnosy: + miss-islington
messages: + msg318240
2018-05-31 03:19:00vstinnersetpull_requests: + pull_request6894
2018-05-31 03:15:43vstinnersetpull_requests: + pull_request6893
2018-05-31 03:09:59miss-islingtonsetpull_requests: + pull_request6892
2018-05-31 03:08:45vstinnersetmessages: + msg318239
2018-05-31 02:53:38vstinnersetpull_requests: + pull_request6889
2018-05-31 02:48:30vstinnersetmessages: + msg318238
2018-05-31 02:40:59vstinnersetmessages: + msg318237
2018-05-31 02:36:03vstinnersetmessages: + msg318236
2018-05-31 02:06:19vstinnersetpull_requests: + pull_request6888
2018-05-31 02:04:03vstinnersetmessages: + msg318234
2018-05-31 00:45:43vstinnersetmessages: + msg318233
2018-05-31 00:44:32vstinnersetmessages: + msg318232
2018-05-31 00:17:22vstinnersetpull_requests: + pull_request6887
2018-05-30 23:40:54vstinnersetmessages: + msg318228
2018-05-30 23:40:17vstinnersetmessages: + msg318227
2018-05-30 23:38:17vstinnersetmessages: + msg318226
2018-05-30 23:37:14vstinnersetmessages: + msg318225
2018-05-30 23:22:33vstinnersetnosy: + pitrou, davin
messages: + msg318222
2018-05-30 22:13:58vstinnersetmessages: + msg318215
2018-05-30 15:14:52vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request6873
2018-05-29 11:28:19vstinnersetmessages: + msg318010
2018-05-29 11:08:53vstinnersetmessages: + msg318008
2018-05-29 06:19:01ned.deilysetnosy: + ned.deily

messages: + msg317985
versions: + Python 3.7
2018-05-16 06:30:24vstinnercreate