Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_multiprocessing_forkserver: TestIgnoreEINTR.test_ignore() fails on Travis CI #77713

Closed
vstinner opened this issue May 16, 2018 · 30 comments
Closed
Labels
3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir

Comments

@vstinner
Copy link
Member

BPO 33532
Nosy @pitrou, @vstinner, @ned-deily, @applio, @miss-islington
PRs
  • [WIP DON'T MERGE] bpo-33532: Debug test_multiprocessing_forkserver.test_ignore() on Travis CI #7247
  • [WIP DON'T MERGE] bpo-33532: Debug test_multiprocessing_forkserver.test_ignore() #7260
  • [WIP DON'T MERGE] bpo-33532: Debug3 test_multiprocessing_forkserver.test_ignore() #7261
  • bpo-33532: Fix multiprocessing test_ignore() #7262
  • [3.7] bpo-33532: Fix multiprocessing test_ignore() (GH-7262) #7264
  • [2.7] bpo-33532: Fix multiprocessing test_ignore() #7265
  • [3.6] bpo-33532: Fix multiprocessing test_ignore() (GH-7262) #7266
  • bpo-33532: Debug test_ignore() on Travis CI #7318
  • bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() #7319
  • [3.7] bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7319) #7320
  • [3.6] bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() #7322
  • [2.7] bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() #7323
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-06-20.10:12:32.885>
    created_at = <Date 2018-05-16.06:30:24.631>
    labels = ['3.8', '3.7', 'tests']
    title = 'test_multiprocessing_forkserver: TestIgnoreEINTR.test_ignore() fails on Travis CI'
    updated_at = <Date 2018-06-20.10:12:32.883>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-06-20.10:12:32.883>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-06-20.10:12:32.885>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2018-05-16.06:30:24.631>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33532
    keywords = ['patch']
    message_count = 30.0
    messages = ['316771', '317985', '318008', '318010', '318215', '318222', '318225', '318226', '318227', '318228', '318232', '318233', '318234', '318236', '318237', '318238', '318239', '318240', '318254', '318255', '318265', '318301', '318409', '318416', '318420', '318421', '318426', '318438', '318441', '320051']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'vstinner', 'ned.deily', 'davin', 'miss-islington']
    pr_nums = ['7247', '7260', '7261', '7262', '7264', '7265', '7266', '7318', '7319', '7320', '7322', '7323']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue33532'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    @vstinner
    Copy link
    Member Author

    https://travis-ci.org/python/cpython/jobs/379560387

    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
        os.kill(p.pid, signal.SIGUSR1)
    ProcessLookupError: [Errno 3] No such process

    @vstinner vstinner added 3.8 only security fixes tests Tests in the Lib/test dir labels May 16, 2018
    @ned-deily
    Copy link
    Member

    FWIW, bpo-29589 seems to be an older example of this failure.

    @ned-deily ned-deily added the 3.7 (EOL) end of life label May 29, 2018
    @vstinner
    Copy link
    Member Author

    FWIW, bpo-29589 seems to be an older example of this failure.

    It was a failure on Mageia 5 (Linux) at 2017-02-17:

    == CPython 3.6.0 (default, Feb 17 2017, 15:26:31) [GCC 4.9.2]
    == Linux-4.4.13-desktop-1.mga5-x86_64-with-mageia-5-Official little-endian
    == hash algorithm: siphash24 64bit
    == cwd: /home/dima/bin/Python-3.6.0/build/test_python_8244
    == encodings: locale=UTF-8, FS=utf-8
    Run tests sequentially
    0:00:00 [1/1] test_multiprocessing_forkserver
    ...
    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/dima/bin/Python-3.6.0/Lib/test/_test_multiprocessing.py", line 3728, in test_ignore
        os.kill(p.pid, signal.SIGUSR1)
    ProcessLookupError: [Errno 3] No such process

    @vstinner
    Copy link
    Member Author

    I am unable to reproduce the issue on Fedora 28 (Linux kernel 4.16.11, glibc 2.27):

    I ran 6 jobs in parallel during 10 minutes:

    • 2: ./python -m test test_multiprocessing_forkserver -m test_ignore -F
    • 2: ./python -m test test_multiprocessing_forkserver -F
    • 2: ./python -m test -j16 -r -F

    The system load was around 30 which is very high (my CPU has 8 logicial threads, 4 physical cores).

    @vstinner
    Copy link
    Member Author

    https://travis-ci.org/python/cpython/jobs/379560387

    More info about this failure:

    == Linux-4.4.0-112-generic-x86_64-with-debian-jessie-sid little-endian
    == CPU count: 48
    Run tests in parallel using 4 child processes
    (...)
    0:07:06 load avg: 24.76 [356/416/3] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (61 sec)

    @vstinner
    Copy link
    Member Author

    Davin, Antoine: any idea on this bug?

    I ran the full test suite and test_multiprocessing_forkserver in parallel in a Trust VM, but I failed to reproduce the bug.

    I did a similar test in a Ubuntu Trusty docker container: again, I'm unable to reproduce the bug.

    An Ubuntu Trusty docker container is supposed to be as close as possible to Travis CI, except that my laptop doesn't have 48 CPUs.

    @vstinner
    Copy link
    Member Author

    test_ignore() starts to fail more and more often on Travis CI for an unknown reason.

    https://travis-ci.org/python/cpython/jobs/385562187

    == CPU count: 48
    (...)
    Run tests in parallel using 4 child processes
    (...)
    0:07:38 load avg: 128.04 [415/416/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (378 sec)
    (...)
    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
        os.kill(p.pid, signal.SIGUSR1)
    ProcessLookupError: [Errno 3] No such process

    @vstinner
    Copy link
    Member Author

    https://travis-ci.org/python/cpython/jobs/379560387

    Re-running test 'test_multiprocessing_forkserver' in verbose mode
    (...)
    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4359, in test_ignore
        os.kill(p.pid, signal.SIGUSR1)
    ProcessLookupError: [Errno 3] No such process

    @vstinner
    Copy link
    Member Author

    Python 3.7:

    https://travis-ci.org/python/cpython/jobs/385474104

    Re-running test 'test_multiprocessing_forkserver' in verbose mode

    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4324, in test_ignore
        os.kill(p.pid, signal.SIGUSR1)
    ProcessLookupError: [Errno 3] No such process

    @vstinner
    Copy link
    Member Author

    Python 3.7:

    https://travis-ci.org/python/cpython/jobs/385458840

    Re-running test 'test_multiprocessing_forkserver' in verbose mode

    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4324, in test_ignore
        os.kill(p.pid, signal.SIGUSR1)
    ProcessLookupError: [Errno 3] No such process

    @vstinner
    Copy link
    Member Author

    I added debug traces in PR 7260. test_ignore() failed, but it may be related to my debug traces, since the failure is different.

    test_multiprocessing_fork.test_ignore() failure:

    test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR) ...
    parent: pid=20159
    parent: start child
    parent: p=<Process(Process-3, started daemon)>
    parent: p.ident=20176
    ERROR

    ======================================================================
    ERROR: test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/travis/build/python/cpython/Lib/test/_test_multiprocessing.py", line 4389, in test_ignore
        self.assertEqual(conn.recv(), 'ready')
      File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 250, in recv
        buf = self._recv_bytes()
      File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 407, in _recv_bytes
        buf = self._recv(4)
      File "/home/travis/build/python/cpython/Lib/multiprocessing/connection.py", line 383, in _recv
        raise EOFError
    EOFError

    test_multiprocessing_fork.test_ignore() success:

    test_ignore (test.test_multiprocessing_fork.TestIgnoreEINTR) ...
    parent: pid=11713
    parent: start child
    parent: p=<Process(Process-3, started daemon)>
    parent: p.ident=28998
    [INFO/Process-3] child process calling self.run()
    child: pid=28998
    child: register SIGUSR1 handler, send ready
    child: wait 1234
    parent: ready received
    parent: first SIGUSR1
    child: got SIGUSR1
    parent: send 1234
    parent: wait 1234
    child: send back 1234
    child: big blocking send
    parent: second SIGUSR1
    parent: wait bytes
    child: got SIGUSR1
    child: done
    [INFO/Process-3] process shutting down
    [INFO/Process-3] process shutting down
    [INFO/Process-3] process exiting with exitcode 0
    [INFO/Process-3] process exiting with exitcode 0
    parent: join child
    parent: done
    ok

    @vstinner
    Copy link
    Member Author

    I added debug traces in PR 7260. test_ignore() failed, but it may be related to my debug traces, since the failure is different.

    https://travis-ci.org/python/cpython/jobs/385958924

    @vstinner
    Copy link
    Member Author

    Notes:

    • test_ignore() creates a daemonic process (it cannot spawn new processes)
    • the fork server process is shared between test methods of test_forkserver
    • multiprocessing.util._exit_function() terminates all daemon processes

    @vstinner
    Copy link
    Member Author

    test_multiprocessing_forkserver.test_ignore() failed on Travis CI using PR 7261: --randseed=7474929. The method failed once when run in parallel, and then failed again when re-reun in verbose mode.

    @vstinner
    Copy link
    Member Author

    test_multiprocessing_forkserver.test_ignore() failed on Travis CI using PR 7261: --randseed=7474929. The method failed once when run in parallel, and then failed again when re-reun in verbose mode.

    https://travis-ci.org/python/cpython/jobs/385986803

    0:04:57 load avg: 25.81 [342/415/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (72 sec)

    @vstinner
    Copy link
    Member Author

    I added more traces to PR 7261 and the bug still occurred on Travis CI.

    https://travis-ci.org/python/cpython/jobs/385990848

    0:05:04 load avg: 42.62 [342/415/1] test_multiprocessing_forkserver failed -- running: test_concurrent_futures (69 sec)
    child: pid=19401 ppid=19348
    child: send BIG
    child: done
    test_answer_challenge_auth_failure (test.test_multiprocessing_forkserver.OtherTest) ... ok
    test_deliver_challenge_auth_failure (test.test_multiprocessing_forkserver.OtherTest) ... ok
    test_closefd (test.test_multiprocessing_forkserver.TestCloseFds) ... ok
    test_flags (test.test_multiprocessing_forkserver.TestFlags) ... ok
    test_lock (test.test_multiprocessing_forkserver.TestForkAwareThreadLock) ... ok
    test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR) ... parent: pid=19342
    parent: second SIGUSR1
    ERROR

    Hum... maybe the child exited, before the parent sent SIGUSR1: the child didn't block on sending 1 MB?

    @vstinner
    Copy link
    Member Author

    New changeset 5d6c7ed by Victor Stinner in branch 'master':
    bpo-33532: Fix multiprocessing test_ignore() (bpo-7262)
    5d6c7ed

    @miss-islington
    Copy link
    Contributor

    New changeset 9d27334 by Miss Islington (bot) in branch '3.7':
    bpo-33532: Fix multiprocessing test_ignore() (GH-7262)
    9d27334

    @vstinner
    Copy link
    Member Author

    New changeset c2870b6 by Victor Stinner in branch '3.6':
    bpo-33532: Fix multiprocessing test_ignore() (GH-7262) (bpo-7266)
    c2870b6

    @vstinner
    Copy link
    Member Author

    New changeset 82c456f by Victor Stinner in branch '2.7':
    bpo-33532: Fix multiprocessing test_ignore() (GH-7265)
    82c456f

    @pitrou
    Copy link
    Member

    pitrou commented May 31, 2018

    Did your PR fix the issue?

    @vstinner
    Copy link
    Member Author

    Did your PR fix the issue?

    The bug was that *sometimes* on Travis CI, and only on Travis CI (!?), writing 1 MiB into the multiprocessing pipe didn't block. The bug is really strange because it is only reproduced on the clang Linux job of Travis CI which runs tests in parallel. Not on the Linux gcc which runs tests sequentially in coverage. Moreover, the failure only occurs for a specific order of tests.

    You can easily reproduce the issue if you reduce the size of the data written into the pipe at the end of _test_ignore(). If the write (send_bytes) doesn't block, you get the same error.

    I'm confident that writing 4 MiB instead of 1 MiB will fix the issue. I saw the test passing with 4 MiB whereas it failed with 1 MiB, when I fixed the test order.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    I saw one new test_ignore() failure on Travis CI in my 3.6 PR, whereas 3.6 already uses PIPE_MAX_SIZE for test_ignore().

    #7315
    https://travis-ci.org/python/cpython/jobs/386637729

    Hum, test_ignore() uses PIPE_MAX_SIZE, whereas the test fails on Linux where we use a pair of sockets, no pipes. Maybe we should use SOCK_MAX_SIZE?

    Notes on pipe size:

    • /proc/sys/fs/pipe-max-size
    • bpo-17835: add support.PIPE_MAX_SIZE (4 MiB +1 B)
    • bpo-18643: add support.SOCK_MAX_SIZE (16 MiB)

    @vstinner vstinner reopened this Jun 1, 2018
    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    https://travis-ci.org/python/cpython/jobs/386637729

    I added some debug traces:

    test_ignore (test.test_multiprocessing_forkserver.TestIgnoreEINTR) ...
    socket[fd=4].SO_RCVBUF 212992
    socket[fd=4].SO_SNDBUF 212992
    socket[fd=7].SO_RCVBUF 212992
    socket[fd=7].SO_SNDBUF 212992
    test_ignore: PIPE_MAX_SIZE = 4194305

    The socket pair uses a buffer of 208 KiB (212,992 B) in both directions and the test sends 4 MiB (4,194,305 B) for a blocking call... and sometimes the send doesn't block.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    New changeset 252f6ab by Victor Stinner in branch 'master':
    bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7319)
    252f6ab

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    I modified test_ignore() to use support.SOCK_MAX_SIZE, but honestly, I'm not convinced that it will be the issue. I applied my change anyway, just to check if the issue comes from the size, or if it's something else.

    @miss-islington
    Copy link
    Contributor

    New changeset 63fa8db by Miss Islington (bot) in branch '3.7':
    bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7319)
    63fa8db

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    New changeset 1d4be0a by Victor Stinner in branch '2.7':
    bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7323)
    1d4be0a

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 1, 2018

    New changeset 64e538b by Victor Stinner in branch '3.6':
    bpo-33532: Fix test_multiprocessing_forkserver.test_ignore() (GH-7322)
    64e538b

    @vstinner
    Copy link
    Member Author

    It didn't see this super annoying failure recently, so it seems like it has been fixed for real. Great!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants