"dictionary changed size during iteration" error in _ExecutorManagerThread #87664

Closed
kulikjak mannequin opened this issue Mar 15, 2021 · 13 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes topic-asyncio topic-installation type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@kulikjak
Mannequin

kulikjak mannequin commented Mar 15, 2021

BPO 43498
Nosy @brianquinlan, @pitrou, @asvetlov, @1st1, @tpetazzoni, @colesbury, @miss-islington, @kulikjak, @sweeneyde, @kartiksubbarao, @whitslack
PRs
  • bpo-43498: Fix dictionary iteration error in _ExecutorManagerThread #24868
  • [3.10] bpo-43498: Fix dictionary iteration error in _ExecutorManagerThread (GH-24868) #29836
  • [3.9] bpo-43498: Fix dictionary iteration error in _ExecutorManagerThread (GH-24868) #29837
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/asvetlov'
    closed_at = <Date 2021-11-29.13:24:23.173>
    created_at = <Date 2021-03-15.08:23:36.240>
    labels = ['expert-installation', '3.9', '3.10', '3.11', 'type-crash', 'expert-asyncio']
    title = '"dictionary changed size during iteration" error in _ExecutorManagerThread'
    updated_at = <Date 2021-11-29.13:24:23.172>
    user = 'https://github.com/kulikjak'

    bugs.python.org fields:

    activity = <Date 2021-11-29.13:24:23.172>
    actor = 'asvetlov'
    assignee = 'asvetlov'
    closed = True
    closed_date = <Date 2021-11-29.13:24:23.173>
    closer = 'asvetlov'
    components = ['Installation', 'asyncio']
    creation = <Date 2021-03-15.08:23:36.240>
    creator = 'kulikjak'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43498
    keywords = ['patch']
    message_count = 13.0
    messages = ['388712', '389755', '391120', '395544', '398805', '398806', '398811', '398870', '407251', '407260', '407268', '407269', '407270']
    nosy_count = 12.0
    nosy_names = ['bquinlan', 'pitrou', 'asvetlov', 'yselivanov', 'thomas-petazzoni', 'colesbury', 'miss-islington', 'kulikjak', 'Dennis Sweeney', 'kartiksubbarao', 'whitslack', 'markao']
    pr_nums = ['24868', '29836', '29837']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue43498'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @kulikjak
    Mannequin Author

    kulikjak mannequin commented Mar 15, 2021

    Recently several of our Python 3.9 builds froze during make install with the following trace in logs:

    Listing .../components/python/python39/build/prototype/sparc/usr/lib/python3.9/lib2to3/tests/data/fixers/myfixes...
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
        self.run()
      File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
        result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
      File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
        worker_sentinels = [p.sentinel for p in self.processes.values()]
      File ".../components/python/python39/build/prototype/sparc/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
        worker_sentinels = [p.sentinel for p in self.processes.values()]
    RuntimeError: dictionary changed size during iteration

    After this, the build freezes and never ends (most likely waiting for the broken thread).

    We see this only in Python 3.9 (3.7 doesn't seem to be affected, and we don't deliver other versions) and only when doing full builds of the entire Userland, which suggests it might be related to heavy load on the build machine. That said, it has happened only three or four times, so this might be just a coincidence.

    A simple fix seems to be this (PR shortly):

    --- Python-3.9.1/Lib/concurrent/futures/process.py
    +++ Python-3.9.1/Lib/concurrent/futures/process.py
    @@ -373,7 +373,7 @@ class _ExecutorManagerThread(threading.T
             assert not self.thread_wakeup._closed
             wakeup_reader = self.thread_wakeup._reader
             readers = [result_reader, wakeup_reader]
    -        worker_sentinels = [p.sentinel for p in self.processes.values()]
    +        worker_sentinels = [p.sentinel for p in self.processes.copy().values()]
             ready = mp.connection.wait(readers + worker_sentinels)
     
             cause = None

    This is on Oracle Solaris and on both SPARC and Intel machines.
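
For readers following along, here is a minimal single-threaded sketch of the failure and of the copy-based workaround from the diff above. The `processes` dict and both helper functions are hypothetical stand-ins, not code from `concurrent.futures`; the insert inside the loop only simulates what another thread would do.

```python
# Hypothetical stand-in for the executor's pid -> process mapping.
processes = {1: "worker-1", 2: "worker-2"}


def collect_unsafe(mapping):
    """Iterate the live dict; a size change mid-loop raises RuntimeError."""
    seen = []
    for value in mapping.values():
        seen.append(value)
        # Simulates an insert made by another thread during the iteration.
        mapping[len(mapping) + 1] = "late-worker"
    return seen


def collect_from_copy(mapping):
    """Iterate a shallow copy, in the spirit of the proposed patch."""
    seen = []
    for value in mapping.copy().values():
        seen.append(value)
        mapping[len(mapping) + 1] = "late-worker"
    return seen


try:
    collect_unsafe(dict(processes))
    unsafe_error = None
except RuntimeError as exc:
    unsafe_error = str(exc)

safe_result = collect_from_copy(dict(processes))

print(unsafe_error)  # dictionary changed size during iteration
print(safe_result)   # ['worker-1', 'worker-2']
```

Entries added after the copy is taken simply do not appear in the result.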

    @kulikjak kulikjak mannequin added 3.10 only security fixes 3.9 only security fixes topic-installation topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump labels Mar 15, 2021
    @kartiksubbarao
    Mannequin

    kartiksubbarao mannequin commented Mar 29, 2021

    I'm seeing the same error with Python 3.9.2 on Fedora 33, with a script that uses ProcessPoolExecutor.

    @kulikjak
    Mannequin Author

    kulikjak mannequin commented Apr 15, 2021

    I investigated a little bit more and found out that this happens when ProcessPoolExecutor::_adjust_process_count() adds a new process during the iteration.

    With the following change, I can reproduce this reliably every time:

    --- Python-3.9.1/Lib/concurrent/futures/process.py
    +++ Python-3.9.1/Lib/concurrent/futures/process.py
    @@ -373,7 +373,14 @@ class _ExecutorManagerThread(threading.T
             assert not self.thread_wakeup._closed
             wakeup_reader = self.thread_wakeup._reader
             readers = [result_reader, wakeup_reader]
    -        worker_sentinels = [p.sentinel for p in self.processes.values()]
    +        worker_sentinels = []
    +        for p in self.processes.values():
    +            time.sleep(1)
    +            worker_sentinels.append(p.sentinel)
             ready = mp.connection.wait(readers + worker_sentinels)
     
             cause = None

    Since wait_result_broken_or_wakeup() is called periodically, and there is no issue if processes added during the iteration are omitted (if they were added just after that, they would be omitted anyway), the attached PR shouldn't break anything.
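
The race described here can also be sketched outside the standard library. The snippet below is an illustration only: the module-level `processes` dict, the `manager` and `adjust_process_count` functions, and the two events are assumptions made for this example. The events replace the `time.sleep(1)` from the diff so that the insert always lands in the middle of the iteration.

```python
import threading

# Hypothetical pid -> sentinel map shared by both threads.
processes = {1: "sentinel-1", 2: "sentinel-2"}
first_item_seen = threading.Event()
process_added = threading.Event()
errors = []


def manager():
    """Plays the role of wait_result_broken_or_wakeup(): walks the live dict."""
    try:
        for _ in processes.values():
            first_item_seen.set()
            # Pause mid-iteration until the other thread has inserted an entry.
            process_added.wait(timeout=5)
    except RuntimeError as exc:
        errors.append(str(exc))


def adjust_process_count():
    """Plays the role of _adjust_process_count(): registers a new worker."""
    first_item_seen.wait(timeout=5)
    processes[3] = "sentinel-3"
    process_added.set()


manager_thread = threading.Thread(target=manager)
adder_thread = threading.Thread(target=adjust_process_count)
manager_thread.start()
adder_thread.start()
manager_thread.join()
adder_thread.join()

print(errors)  # ['dictionary changed size during iteration']
```

Iterating over a snapshot instead of `processes.values()` in `manager` would leave `errors` empty, with the late entry omitted from that pass.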

    @whitslack
    Mannequin

    whitslack mannequin commented Jun 10, 2021

    Observed this same failure mode on a Raspberry Pi 1 while running 'make install' on Python 3.9.5 with 9 concurrent workers.

    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
        self.run()
      File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/concurrent/futures/process.py", line 317, in run
        result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
      File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
        worker_sentinels = [p.sentinel for p in self.processes.values()]
      File "/var/tmp/portage/dev-lang/python-3.9.5_p2/image/usr/lib/python3.9/concurrent/futures/process.py", line 376, in <listcomp>
        worker_sentinels = [p.sentinel for p in self.processes.values()]
    RuntimeError: dictionary changed size during iteration

    @tpetazzoni
    Mannequin

    tpetazzoni mannequin commented Aug 2, 2021

    I can confirm we are seeing the same issue when building Python 3.9 in the context of Buildroot. See http://autobuild.buildroot.net/results/ae6/ae6c4ab292589a4e4442dfb0a1286349a9bf4d29/build-end.log for an example build result. This has been happening since we added 48-core (96-thread) build machines to our build farm, which dramatically increased the build parallelism.

    @tpetazzoni
    Mannequin

    tpetazzoni mannequin commented Aug 2, 2021

    For the record: we're seeing this issue ~50 times a day on our build infrastructure.

    @sweeneyde
    Member

    It was mentioned in bpo-40327 that although copy() makes the situation much better, it doesn't solve the problem entirely, since the memory allocation of the copy() call can release the GIL. I don't know enough to know whether it would be worth it to add locking.
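
To make the locking option concrete, here is a rough sketch of what guarding the mapping with a lock could look like. `ProcessRegistry` is a hypothetical wrapper invented for this illustration; nothing in this thread proposes it as a patch, and it is not how `concurrent.futures` is structured.

```python
import threading


class ProcessRegistry:
    """Hypothetical wrapper: every mutation and every snapshot takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._processes = {}

    def add(self, pid, sentinel):
        with self._lock:
            self._processes[pid] = sentinel

    def sentinels(self):
        # The copy is made while holding the lock, so no writer can change
        # the dict's size while it is being read.
        with self._lock:
            return list(self._processes.values())


registry = ProcessRegistry()


def spawn(start):
    """Register 1000 fake workers with consecutive pids."""
    for pid in range(start, start + 1000):
        registry.add(pid, f"sentinel-{pid}")


writers = [threading.Thread(target=spawn, args=(i * 1000,)) for i in range(4)]
for writer in writers:
    writer.start()

# Take snapshots while the writers are still running.
snapshot_sizes = [len(registry.sentinels()) for _ in range(100)]

for writer in writers:
    writer.join()

final_count = len(registry.sentinels())
print(final_count)  # 4000
```

The cost is that every code path touching the dict has to go through the lock, which is a larger change than copying at the single read site.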

    @kulikjak
    Mannequin Author

    kulikjak mannequin commented Aug 4, 2021

    I think that even if copy() doesn't fix it entirely, it's still much better than nothing. I never encountered the issue mentioned in bpo-40327, but I saw this issue several times a week (before applying the proposed patch).

    @markao
    Mannequin

    markao mannequin commented Nov 29, 2021

    I'm experiencing the same issue on Python 3.10.0 when I execute the code that uses concurrent.futures.ProcessPoolExecutor.

    ========

    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
        self.run()
      File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 317, in run
        result_item, is_broken, cause = self.wait_result_broken_or_wakeup()
      File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
        worker_sentinels = [p.sentinel for p in self.processes.values()]
      File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 376, in <listcomp>
    PROCESSING DATAFRAME: AKAM
        worker_sentinels = [p.sentinel for p in self.processes.values()]
    RuntimeError: dictionary changed size during iteration

    ========

    I also tried to troubleshoot to find out which part causes this exception, but the most difficult part is that it does not happen every time I execute my code that uses concurrent.futures.ProcessPoolExecutor. (Much like what Jakub mentioned earlier, it looks like a coincidence.)

    At the same time, I am also testing whether the same thing happens on other versions like Python 3.8.8 (on Rocky Linux 8.5), but we would appreciate it if someone could tell us whether this is a bug, or whether there is anything I should improve in my own code. (If needed I can share the sample code, but honestly I do not think something is wrong with my code since, as I mentioned, the exception does not happen every time I execute it, so I suspect this might be a bug in Python 3.10.0.)

    (Since Jakub already reported that it happens on Python 3.9, I am not testing on 3.9.)

    I would appreciate it if there is any update or info that can be shared.

    Thank you!

    @asvetlov
    Contributor

    Thanks for the report.

    An atomic copy (list(self.processes.values())) should fix the bug, sure.

    I doubt that writing a reliable test for this situation is possible; multithreading is hard.

    I think we can accept a patch without a test but with an inline comment that describes why the copy is crucial.
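
As an illustration of the shape being suggested, here is a small sketch that takes the snapshot with `list(...)` and carries an explanatory inline comment. `FakeProcess` and the module-level `processes` dict are hypothetical stand-ins for `multiprocessing.Process` objects and the executor's mapping; the wording of the comment is an assumption, not the text of the merged patch.

```python
from dataclasses import dataclass


@dataclass
class FakeProcess:
    """Hypothetical stand-in for a worker process; only .sentinel is modelled."""

    sentinel: int


processes = {101: FakeProcess(sentinel=7), 102: FakeProcess(sentinel=8)}

# Snapshot the values in a single call before iterating: another thread may
# add workers to `processes` at any time, and iterating the live dict while
# its size changes raises RuntimeError.
snapshot = list(processes.values())

# A worker registered after the snapshot was taken.
processes[103] = FakeProcess(sentinel=9)

worker_sentinels = [p.sentinel for p in snapshot]
print(worker_sentinels)  # [7, 8]
```

The late worker is absent from this pass only; a later call that takes a fresh snapshot would include it.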

    @asvetlov asvetlov added 3.11 only security fixes labels Nov 29, 2021
    @asvetlov
    Contributor

    New changeset 7431448 by Jakub Kulík in branch 'main':
    bpo-43498: Fix dictionary iteration error in _ExecutorManagerThread (GH-24868)
    7431448

    @miss-islington
    Contributor

    New changeset 4b11d71 by Miss Islington (bot) in branch '3.10':
    bpo-43498: Fix dictionary iteration error in _ExecutorManagerThread (GH-24868)
    4b11d71

    @miss-islington
    Contributor

    New changeset 3b9d886 by Miss Islington (bot) in branch '3.9':
    bpo-43498: Fix dictionary iteration error in _ExecutorManagerThread (GH-24868)
    3b9d886
