test_multiprocessing_forkserver: TestIgnoreEINTR.test_ignore() fails on Travis CI #77713
https://travis-ci.org/python/cpython/jobs/379560387

======================================================================
Traceback (most recent call last):
File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
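For context: ProcessLookupError is the OSError subclass for errno ESRCH, meaning that no process with that PID existed when os.kill() was called. The minimal sketch below reproduces the error by reaping a child before signalling its old PID. It assumes a POSIX system with os.fork(); the helper name `signal_reaped_child` is hypothetical and not part of the test suite. It uses signal 0 (existence check only) rather than SIGUSR1, so that a recycled PID never receives a real signal.

```python
import errno
import os


def signal_reaped_child() -> int:
    """Fork a child, reap it, then probe its old PID with os.kill().

    Returns the errno raised by os.kill(), or 0 if the PID still existed.
    In theory the kernel could reuse the PID in between; that is very unlikely.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    os.waitpid(pid, 0)  # parent: reap the child so its PID is released
    try:
        # Signal 0 performs only the existence/permission check.
        os.kill(pid, 0)
    except ProcessLookupError as exc:
        return exc.errno
    return 0
```

Calling `signal_reaped_child()` should return `errno.ESRCH` (3), the same errno as in the traceback.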
FWIW, bpo-29589 seems to be an older example of this failure.

It failed on Mageia 5 (Linux) on 2017-02-17:

== CPython 3.6.0 (default, Feb 17 2017, 15:26:31) [GCC 4.9.2]
Traceback (most recent call last):
File "/home/dima/bin/Python-3.6.0/Lib/test/_test_multiprocessing.py", line 3728, in test_ignore
os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
I am unable to reproduce the issue on Fedora 28 (Linux kernel 4.16.11, glibc 2.27). I ran 6 jobs in parallel for 10 minutes.
The system load was around 30, which is very high (my CPU has 8 logical threads, 4 physical cores).
More info about this failure:

== Linux-4.4.0-112-generic-x86_64-with-debian-jessie-sid little-endian
Davin, Antoine: any idea about this bug? I ran the full test suite and test_multiprocessing_forkserver in parallel in a Trusty VM, but I failed to reproduce the bug. I ran a similar test in an Ubuntu Trusty Docker container: again, I was unable to reproduce the bug. An Ubuntu Trusty Docker container is supposed to be as close as possible to Travis CI, except that my laptop doesn't have 48 CPUs.
test_ignore() has started to fail more and more often on Travis CI, for an unknown reason.

https://travis-ci.org/python/cpython/jobs/385562187

== CPU count: 48
Traceback (most recent call last):
File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
https://travis-ci.org/python/cpython/jobs/379560387

Re-running test 'test_multiprocessing_forkserver' in verbose mode
Traceback (most recent call last):
File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
Python 3.7: https://travis-ci.org/python/cpython/jobs/385474104

Re-running test 'test_multiprocessing_forkserver' in verbose mode
======================================================================
Traceback (most recent call last):
File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4324, in test_ignore
os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
Python 3.7: https://travis-ci.org/python/cpython/jobs/385458840

Re-running test 'test_multiprocessing_forkserver' in verbose mode
======================================================================
Traceback (most recent call last):
File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4324, in test_ignore
os.kill(p.pid, signal.SIGUSR1)
ProcessLookupError: [Errno 3] No such process
I added debug traces in PR 7260. test_ignore() failed, but the failure may be caused by my debug traces, since it is a different failure.

test_multiprocessing_fork.test_ignore() failure:

test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR) ...
======================================================================
Traceback (most recent call last):
File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4389, in test_ignore
self.assertEqual(conn.recv(), 'ready')
File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError

test_multiprocessing_fork.test_ignore() success:

test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR) ...
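The EOFError in that traceback is what Connection.recv() raises when the other end of the connection is closed before any data arrives, which suggests that the child went away before sending 'ready'. A minimal single-process illustration follows; the helper name `recv_after_peer_closed` is hypothetical.

```python
import multiprocessing


def recv_after_peer_closed() -> str:
    """Close one end of a Pipe(), then call recv() on the other end."""
    parent_conn, child_conn = multiprocessing.Pipe()
    child_conn.close()  # simulate the child closing its end without sending anything
    try:
        parent_conn.recv()
    except EOFError:
        return "EOFError"
    finally:
        parent_conn.close()
    return "received data"
```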
Notes:
test_multiprocessing_forkserver.test_ignore() failed on Travis CI using PR 7261 (--randseed=7474929). The method failed once when run in parallel, and then failed again when re-run in verbose mode.
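To replay the same randomized test order locally, the seed can be passed back to regrtest. The sketch below only builds the argument list. It assumes the `-r` (randomize), `--randseed`, `-j` (parallel jobs) and `-w` (re-run failed tests in verbose mode) options of `python -m test`. The job count used by the Travis job is not given in this thread, so the default of 6 (taken from the local stress run above) is an assumption, and `regrtest_argv` is a hypothetical helper.

```python
import sys
from typing import List, Optional


def regrtest_argv(seed: int, jobs: int = 6, tests: Optional[List[str]] = None) -> List[str]:
    """Build argv for `python -m test` that replays a randomized run.

    The failure depended on which *other* tests ran alongside it, so by
    default no test names are passed and the whole suite runs.
    """
    argv = [
        sys.executable, "-m", "test",
        "-r",                  # randomize the test order...
        f"--randseed={seed}",  # ...using this seed
        f"-j{jobs}",           # run tests in parallel worker processes
        "-w",                  # re-run failed tests in verbose mode
    ]
    argv.extend(tests or [])
    return argv
```

From a CPython checkout, `subprocess.run(regrtest_argv(7474929))` would start such a run.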
https://travis-ci.org/python/cpython/jobs/385986803

0:04:57 load avg: 25.81 [342/415/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (72 sec)
I added more traces to PR 7261 and the bug still occurred on Travis CI.

https://travis-ci.org/python/cpython/jobs/385990848

0:05:04 load avg: 42.62 [342/415/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (69 sec)

Hum... maybe the child exited before the parent sent SIGUSR1: the child didn't block on sending 1 MB?
Did your PR fix the issue?
The bug was that *sometimes* on Travis CI, and only on Travis CI (!?), writing 1 MiB into the multiprocessing pipe didn't block. The bug is really strange because it is only reproduced on the clang Linux job of Travis CI, which runs tests in parallel, and not on the Linux gcc job, which runs tests sequentially with coverage enabled. Moreover, the failure only occurs with a specific test order.

You can easily reproduce the issue by reducing the size of the data written into the pipe at the end of _test_ignore(): if the write (send_bytes) doesn't block, you get the same error. I'm confident that writing 4 MiB instead of 1 MiB will fix the issue: with the test order pinned, I saw the test pass with 4 MiB, whereas it failed with 1 MiB.
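The mechanism can be sketched with a cut-down version of the test's last step: the child calls send_bytes() and then exits, while the parent checks whether the child is still alive at the point where the test sends its second SIGUSR1. This is a simplified sketch, not the actual test. It assumes a POSIX system with the "fork" start method; the helper names and the 16 MiB + 1 payload size are illustrative.

```python
import multiprocessing

# Assumption: POSIX with the "fork" start method available.
CTX = multiprocessing.get_context("fork")


def _child(conn, size: int) -> None:
    # Like the end of _test_ignore(): one send_bytes() call, then exit.
    conn.send_bytes(b"x" * size)
    conn.close()


def child_alive_before_second_kill(size: int, wait: float = 2.0) -> bool:
    """Return True if the child is still blocked in send_bytes() after `wait` s."""
    parent_conn, child_conn = CTX.Pipe()
    p = CTX.Process(target=_child, args=(child_conn, size), daemon=True)
    p.start()
    child_conn.close()
    # The parent reads nothing yet, so a send larger than the buffers must block.
    p.join(timeout=wait)
    # If this is False, the child already exited (and join() reaped it), so
    # os.kill(p.pid, signal.SIGUSR1) at this point would hit ESRCH.
    alive = p.is_alive()
    parent_conn.recv_bytes()  # drain the data so a blocked child can finish
    p.join()
    parent_conn.close()
    return alive
```

With a small payload the child is already gone, reproducing the race. With a payload larger than the connection buffers, it stays blocked until the parent reads.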
I saw one new test_ignore() failure on Travis CI in my 3.6 PR, even though 3.6 already uses PIPE_MAX_SIZE for test_ignore().

#7315

Hum, test_ignore() uses PIPE_MAX_SIZE, whereas the test fails on Linux, where we use a pair of sockets, not pipes. Maybe we should use SOCK_MAX_SIZE?

Notes on pipe size:
I added some debug traces:

test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR) ...

The socket pair uses a buffer of 208 KiB (212,992 B) in both directions, and the test sends 4 MiB + 1 B (4,194,305 B), expecting a blocking call... and sometimes the send doesn't block.
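Whether the test talks through sockets or pipes can be checked directly. On Unix, multiprocessing.Pipe() with the default duplex=True is built on socket.socketpair(), while duplex=False uses os.pipe(). The sketch below inspects the file type behind a connection. It assumes a POSIX system, and the helper name `pipe_kind` is hypothetical.

```python
import multiprocessing
import os
import stat


def pipe_kind(duplex: bool) -> str:
    """Report the kind of file descriptor behind multiprocessing.Pipe()."""
    conn1, conn2 = multiprocessing.Pipe(duplex=duplex)
    try:
        mode = os.fstat(conn1.fileno()).st_mode
        if stat.S_ISSOCK(mode):
            return "socket"
        if stat.S_ISFIFO(mode):
            return "pipe"
        return "other"
    finally:
        conn1.close()
        conn2.close()
```

On Linux, `pipe_kind(True)` returns `"socket"`, so the default Pipe() used by test_ignore() is bounded by socket buffer sizes rather than pipe sizes.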
I modified test_ignore() to use support.SOCK_MAX_SIZE, but honestly, I'm not convinced that the size is the cause. I applied my change anyway, just to check whether the issue comes from the size or from something else.
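The buffer sizes discussed above can be read with getsockopt(), and a non-blocking send() shows how many bytes the kernel accepts before a blocking send would have to wait. This is a sketch assuming Linux defaults. The local `SOCK_MAX_SIZE` constant is assumed to match `test.support.SOCK_MAX_SIZE` (16 MiB + 1 B), exact numbers depend on kernel settings, and `probe_socketpair` is a hypothetical helper.

```python
import socket
from typing import Tuple

# Assumed to match test.support.SOCK_MAX_SIZE (16 MiB + 1 byte).
SOCK_MAX_SIZE = 16 * 1024 * 1024 + 1


def probe_socketpair(payload_size: int) -> Tuple[int, int, int]:
    """Return (SO_SNDBUF, SO_RCVBUF, bytes accepted by one non-blocking send)."""
    a, b = socket.socketpair()
    with a, b:
        sndbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcvbuf = b.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        a.setblocking(False)
        try:
            # A stream socket accepts what fits and returns the partial count.
            accepted = a.send(b"x" * payload_size)
        except BlockingIOError:
            accepted = 0  # nothing could be queued at all
        return sndbuf, rcvbuf, accepted
```

If the accepted count ever equals the payload size, a blocking send_bytes() of that size would not block either, which is the condition that lets the child exit early.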
I haven't seen this super annoying failure recently, so it seems like it has been fixed for real. Great!