Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_concurrent_futures: reap_children() warnings on RHEL7 and RHEL8 buildbots #82727

Closed
vstinner opened this issue Oct 21, 2019 · 19 comments
Closed
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes tests Tests in the Lib/test dir

Comments

@vstinner
Copy link
Member

BPO 38546
Nosy @vstinner, @pablogsal, @miss-islington
PRs
  • [WIP] bpo-38546: Add support.get_child_processes() #17640
  • bpo-38546: multiprocessing tests stop the resource tracker #17641
  • [3.8] bpo-38546: multiprocessing tests stop the resource tracker (GH-17641) #17647
  • bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() #17652
  • [3.8] bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652) #17653
  • [3.7] bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652) #17654
  • [3.8] bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652) #17655
  • [3.7] bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652) (GH-17655) #17656
  • [3.7] bpo-38546: Backport multiprocessing tests fixes from master #19689
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-12-18.17:05:09.007>
    created_at = <Date 2019-10-21.11:20:00.633>
    labels = ['3.8', '3.7', 'tests', '3.9']
    title = 'test_concurrent_futures: reap_children() warnings on RHEL7 and RHEL8 buildbots'
    updated_at = <Date 2020-04-23.22:44:12.154>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-04-23.22:44:12.154>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-12-18.17:05:09.007>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2019-10-21.11:20:00.633>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 38546
    keywords = ['patch']
    message_count = 19.0
    messages = ['355051', '355053', '355079', '355107', '355310', '355312', '357095', '357404', '357929', '358547', '358548', '358549', '358560', '358617', '358627', '358630', '358644', '358646', '367156']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'pablogsal', 'miss-islington']
    pr_nums = ['17640', '17641', '17647', '17652', '17653', '17654', '17655', '17656', '19689']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue38546'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @vstinner
    Copy link
    Member Author

    https://buildbot.python.org/all/#/builders/254/builds/162

    test_no_stale_references (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.56s ok
    test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 1.90s ok
    test_shutdown_race_issue12456 (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.54s ok
    test_submit (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.52s ok
    test_submit_keyword (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.57s ok
    test_traceback (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.48s ok
    test_context_manager_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 0.03s ok
    test_del_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 0.02s ok
    test_hang_issue12364 (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 1.45s ok
    test_interpreter_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 1.62s Warning -- reap_children() reaped child process 28936
    ok

    @vstinner vstinner added 3.9 only security fixes tests Tests in the Lib/test dir labels Oct 21, 2019
    @vstinner
    Copy link
    Member Author

    It looks similar to bpo-38448: "test_concurrent_futures: reap_children() reaped child process 26487 on AMD64 RHEL8 Refleaks 3.x"

    @vstinner
    Copy link
    Member Author

    I ran "./python -m test --fail-env-changed -w -j50 -F test_concurrent_futures" on my laptop for 5 minutes: system load of 220,57 with a peak of 403 processes... but I failed to reproduce the issue :-(

    @vstinner
    Copy link
    Member Author

    The system load was at 10.02 when the bug occurred:

    https://buildbot.python.org/all/#/builders/254/builds/162

    ...
    0:04:11 load avg: 10.02 [230/419] test_pkgutil passed -- running: test_io (2 min 46 sec), test_capi (1 min 12 sec), test_concurrent_futures (2 min 43 sec), test_mailbox (2 min 40 sec), test_tokenize (30.6 sec)
    0:04:12 load avg: 10.02 [231/419] test_getpass passed -- running: test_io (2 min 47 sec), test_capi (1 min 13 sec), test_concurrent_futures (2 min 43 sec), test_mailbox (2 min 41 sec), test_tokenize (31.1 sec)
    0:04:12 load avg: 10.02 [232/419] test_sys passed -- running: test_io (2 min 47 sec), test_capi (1 min 13 sec), test_concurrent_futures (2 min 43 sec), test_mailbox (2 min 41 sec), test_tokenize (31.3 sec)
    0:04:12 load avg: 10.02 [233/419] test_winconsoleio skipped -- running: test_io (2 min 48 sec), test_capi (1 min 13 sec), test_concurrent_futures (2 min 44 sec), test_mailbox (2 min 41 sec), test_tokenize (31.7 sec)
    test_winconsoleio skipped -- test only relevant on win32
    0:04:13 load avg: 10.02 [234/419/1] test_concurrent_futures failed (env changed) (2 min 44 sec) -- running: test_io (2 min 48 sec), test_capi (1 min 14 sec), test_mailbox (2 min 42 sec), test_tokenize (32.0 sec)
    ...
    test_crash (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) ... 4.22s ok
    test_shutdown_deadlock (test.test_concurrent_futures.ProcessPoolSpawnExecutorDeadlockTest) ... 1.03s ok
    test_initializer (test.test_concurrent_futures.ProcessPoolSpawnFailingInitializerTest) ... 0.51s ok
    test_initializer (test.test_concurrent_futures.ProcessPoolSpawnInitializerTest) ... 0.54s ok
    test_free_reference (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.67s ok
    test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.49s ok
    test_map (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.56s ok
    test_map_chunksize (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.46s ok
    test_map_exception (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.46s ok
    test_map_timeout (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 6.48s ok
    test_max_workers_negative (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.57s ok
    test_max_workers_too_large (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... skipped 'Windows-only process limit'
    test_no_stale_references (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.56s ok
    test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 1.90s ok
    test_shutdown_race_issue12456 (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.54s ok
    test_submit (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.52s ok
    test_submit_keyword (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.57s ok
    test_traceback (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.48s ok
    test_context_manager_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 0.03s ok
    test_del_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 0.02s ok
    test_hang_issue12364 (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 1.45s ok
    test_interpreter_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 1.62s
    Warning -- reap_children() reaped child process 28936
    ok
    test_processes_terminate (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 0.47s ok
    test_run_after_shutdown (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest) ... 0.00s ok

    @vstinner
    Copy link
    Member Author

    New warning on AMD64 RHEL8 LTO + PGO 3.x:
    https://buildbot.python.org/all/#/builders/284/builds/171

    ...
    test_killed_child (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.26s ok
    test_map (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.29s ok
    test_map_chunksize (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.31s ok
    test_map_exception (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.30s ok
    test_map_timeout (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 6.29s ok
    test_max_workers_negative (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.32s ok
    test_max_workers_too_large (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... skipped 'Windows-only process limit'
    test_no_stale_references (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.29s ok
    test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 1.58s ok
    test_shutdown_race_issue12456 (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.30s Warning -- reap_children() reaped child process 10340
    ok
    test_submit (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.29s ok
    test_submit_keyword (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.31s ok
    ...

    @vstinner
    Copy link
    Member Author

    I closed bpo-38448 as duplicate of this issue. Copy of its only message:

    AMD64 RHEL8 Refleaks 3.x:
    https://buildbot.python.org/all/#/builders/272/builds/23

    0:27:13 load avg: 4.88 [416/419/1] test_concurrent_futures failed (env changed) (17 min 11 sec) -- running: test_capi (7 min 28 sec), test_gdb (8 min 49 sec), test_asyncio (23 min 23 sec)
    beginning 6 repetitions
    123456
    .Warning -- reap_children() reaped child process 26487
    .....
    Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
    Before: set()
    After: {<weakref at 0x7fdc08f44e30; to 'SpawnProcess' at 0x7fdc0a467c30>}

    @vstinner vstinner changed the title test_concurrent_futures: test_interpreter_shutdown() warning "reap_children() reaped child process" on AMD64 RHEL7 3.x test_concurrent_futures: reap_children() warnings on RHEL7 and RHEL8 buildbots Oct 24, 2019
    @vstinner
    Copy link
    Member Author

    AMD64 Debian PGO 3.x:
    https://buildbot.python.org/all/#/builders/47/builds/3851

    test_submit_keyword (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 0.62s Warning -- reap_children() reaped child process 32627
    Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures

    @pablogsal
    Copy link
    Member

    Ran 168 tests in 160.415s
    OK (skipped=3)
    Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
    Before: set()
    After: {<weakref at 0x7fd136d66540; to 'ForkProcess' at 0x7fd136cd4a90>}

    Another failure in AMD64 RHEL8 LTO + PGO 3.x

    https://buildbot.python.org/all/#/builders/284/builds/300/steps/5/logs/stdio

    @vstinner
    Copy link
    Member Author

    vstinner commented Dec 6, 2019

    New failure on AMD64 RHEL7 LTO 3.x:
    https://buildbot.python.org/all/#/builders/103/builds/17

    test_interpreter_shutdown
    (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolShutdownTest)
    ... 1.61s Warning -- reap_children() reaped child process 9934

    @vstinner
    Copy link
    Member Author

    I managed to reproduce the issue on the RHEL8 worker using:

    • terminal 1: ./python -u -m test --fail-env-changed test_concurrent_futures -v -m ProcessPoolSpawnProcessPoolExecutorTest -F -j 15
    • terminal 2: ./python -m test -j4 -u all -F -r

    I can still reproduce the issue with PR 17641 fix, so it's not enough.

    I also applied PR 17640 and 2 extra changes:

    diff --git a/Lib/multiprocessing/managers.py b/Lib/multiprocessing/managers.py
    index 1f9c2daa25..dbbf84b3d2 100644
    --- a/Lib/multiprocessing/managers.py
    +++ b/Lib/multiprocessing/managers.py
    @@ -661,7 +661,7 @@ class BaseManager(object):
                 except Exception:
                     pass
     
    -            process.join(timeout=1.0)
    +            process.join(timeout=0.1)
                 if process.is_alive():
                     util.info('manager still alive')
                     if hasattr(process, 'terminate'):
    diff --git a/Lib/test/support/__init__.py b/Lib/test/support/__init__.py
    index 215bab8131..110c7f945e 100644
    --- a/Lib/test/support/__init__.py
    +++ b/Lib/test/support/__init__.py
    @@ -2399,6 +2399,7 @@ def reap_children():
     
         # Reap all our dead child processes so we don't leave zombies around.
         # These hog resources and might be causing some of the buildbots to die.
    +    time.sleep(0.5)
         while True:
             try:
                 # Read the exit status of any child process which already completed

    @vstinner
    Copy link
    Member Author

    Example of failure, with additional debug prints:

    0:01:02 load avg: 42.13 [ 38/1] test_concurrent_futures failed (env changed)
    test_no_stale_references (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 2.00s reap_children
    ok
    test_ressources_gced_in_workers (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 3.86s reap_children
    ok
    test_shutdown_race_issue12456 (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 2.22s reap_children
    ok
    test_submit (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 2.26s reap_children
    ok
    test_submit_keyword (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 1.59s reap_children
    ok
    test_traceback (test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest) ... 1.60s reap_children
    ok
    MP cleanup: cleanup process
    MP cleanup: stop forkserver
    MP cleanup: stop res track
    MP cleanup: run finalizers
    reap_children

    ----------------------------------------------------------------------

    Ran 6 tests in 17.210s

    OK
    reap_children
    Warning -- reap_children() reaped child process 4992
    Warning -- multiprocessing.process._dangling was modified by test_concurrent_futures
    Before: set()
    After: {<weakref at 0x7fedc87bc7d0; to 'SpawnProcess' at 0x7fedc7abcd70>}

    @vstinner
    Copy link
    Member Author

    I can reproduce the issue with the following match file:

    test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest.test_no_stale_references
    test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest.test_ressources_gced_in_workers
    test.test_concurrent_futures.ProcessPoolSpawnProcessPoolExecutorTest.test_shutdown_race_issue12456

    @vstinner
    Copy link
    Member Author

    New changeset 9707e8e by Victor Stinner in branch 'master':
    bpo-38546: multiprocessing tests stop the resource tracker (GH-17641)
    9707e8e

    @vstinner
    Copy link
    Member Author

    New changeset 35acb35 by Victor Stinner in branch '3.8':
    bpo-38546: multiprocessing tests stop the resource tracker (GH-17641) (GH-17647)
    35acb35

    @vstinner
    Copy link
    Member Author

    New changeset 673c393 by Victor Stinner in branch 'master':
    bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652)
    673c393

    @vstinner
    Copy link
    Member Author

    bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652)

    I tested manually on the RHEL8 worker and I confirm that this change prevents the reap_children() warning.

    I close the issue. The change will be shortly backported to 3.7 and 3.8.

    --

    I'm not sure if PR 17640 is useful, let's discuss it on the PR directly.

    @vstinner vstinner added 3.7 (EOL) end of life 3.8 only security fixes labels Dec 18, 2019
    @vstinner
    Copy link
    Member Author

    New changeset b0eb046 by Victor Stinner in branch '3.8':
    bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652) (GH-17655)
    b0eb046

    @miss-islington
    Copy link
    Contributor

    New changeset b8bbdf0 by Miss Islington (bot) in branch '3.7':
    bpo-38546: Fix concurrent.futures test_ressources_gced_in_workers() (GH-17652) (GH-17655)
    b8bbdf0

    @vstinner
    Copy link
    Member Author

    New changeset fd32a0e by Victor Stinner in branch '3.7':
    [3.7] bpo-38546: Backport multiprocessing tests fixes from master (GH-19689)
    fd32a0e

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes tests Tests in the Lib/test dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants