
Pool dies with excessive workers, but does not cleanup #63874

Closed

dsoprea mannequin opened this issue Nov 21, 2013 · 12 comments
Labels: 3.7 (EOL) end of life, performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments

dsoprea (mannequin) commented Nov 21, 2013

BPO 19675
Nosy @dsoprea, @applio, @JulienPalard
PRs
  • bpo-19675: Fixes pool cleanup issue #57
  • bpo-19675: Terminate processes if construction of a pool is failing. #5614
Files
  • pool.py.patch_2.7.6_20131120-1959: Patch to pool.py for Pool fork-fail cleanup bug.
  • trace.txt: Trace of bug happening on my platform
  • bug19675git.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-11-05.08:37:42.887>
    created_at = <Date 2013-11-21.01:52:19.814>
    labels = ['3.7', 'library', 'performance']
    title = 'Pool dies with excessive workers, but does not cleanup'
    updated_at = <Date 2018-11-05.08:37:42.886>
    user = 'https://github.com/dsoprea'

    bugs.python.org fields:

    activity = <Date 2018-11-05.08:37:42.886>
    actor = 'mdk'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-11-05.08:37:42.887>
    closer = 'mdk'
    components = ['Library (Lib)']
    creation = <Date 2013-11-21.01:52:19.814>
    creator = 'dsoprea'
    dependencies = []
    files = ['32742', '46560', '46635']
    hgrepos = []
    issue_num = 19675
    keywords = ['patch']
    message_count = 12.0
    messages = ['203556', '287235', '287648', '287658', '287660', '287709', '287711', '287712', '287785', '288044', '289556', '329273']
    nosy_count = 6.0
    nosy_names = ['sbt', 'dsoprea', 'davin', 'Winterflower', 'mdk', 'Vladimir Feinberg']
    pr_nums = ['57', '5614']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue19675'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7']

    dsoprea (mannequin, author) commented Nov 21, 2013

    If you ask Pool for more processes than the OS can provide, Pool raises an OSError and dies, but it does not clean up any of the processes it has already forked.

    Here is a Python session in which I allocate a large but satisfiable number of processes (just to show what my current system can handle):

    >>> from multiprocessing import Pool
    >>> p = Pool(500)
    >>> p.close()
    >>> p.join()

    Next is a request that will fail. Worse, even after it fails, I can no longer allocate a single worker:

    >>> p = Pool(700)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
        return Pool(processes, initializer, initargs, maxtasksperchild)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 159, in __init__
        self._repopulate_pool()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 222, in _repopulate_pool
        w.start()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 130, in start
        self._popen = Popen(self)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
        self.pid = os.fork()
    OSError: [Errno 35] Resource temporarily unavailable
    
    >>> p = Pool(1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
        return Pool(processes, initializer, initargs, maxtasksperchild)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 159, in __init__
        self._repopulate_pool()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 222, in _repopulate_pool
        w.start()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 130, in start
        self._popen = Popen(self)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
        self.pid = os.fork()
    OSError: [Errno 35] Resource temporarily unavailable

    The only way to clean this up is to close the parent (the interpreter).
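
    Strictly speaking, the leaked workers can also be reaped by hand from the parent, since they are still its children. A rough user-side workaround (make_pool is just an illustrative helper; note that active_children() returns every live child, so this would also reap children unrelated to the pool):

        import multiprocessing
        from multiprocessing import Pool

        def make_pool(processes):
            # Build a Pool; on failure, reap any workers the constructor leaked.
            try:
                return Pool(processes)
            except OSError:
                # The half-built pool object is lost, but its forked workers
                # are still children of this process. Terminate and join them
                # so their process slots are released.
                for child in multiprocessing.active_children():
                    child.terminate()
                    child.join()
                raise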

    I'm submitting a patch against 2.7.6 that intercepts the exception, cleans up the workers, and then re-raises. The affected method is _repopulate_pool(), which appears to be identical in 2.7.6, 3.3.3, and probably every other recent version of Python.

    This is the old version:

            for i in range(self._processes - len(self._pool)):
                w = self.Process(target=worker,
                                 args=(self._inqueue, self._outqueue,
                                       self._initializer,
                                       self._initargs, self._maxtasksperchild)
                                )
                self._pool.append(w)
                w.name = w.name.replace('Process', 'PoolWorker')
                w.daemon = True
                w.start()
                debug('added worker')

    This is the new version:

        try:
            for i in range(self._processes - len(self._pool)):
                w = self.Process(target=worker,
                                 args=(self._inqueue, self._outqueue,
                                       self._initializer,
                                       self._initargs, self._maxtasksperchild)
                                )
                self._pool.append(w)
                w.name = w.name.replace('Process', 'PoolWorker')
                w.daemon = True
                w.start()
                debug('added worker')
        except:
            debug("Process creation error. Cleaning up (%d) workers." %
                  len(self._pool))

            # Terminate and reap every worker that started successfully
            # before the failure, so its process slot is released.
            for process in self._pool:
                if not process.is_alive():
                    continue

                process.terminate()
                process.join()

            debug("Worker cleanup done. Bubbling error.")
            raise

    Here is what happens now: I can go from requesting a number that is too high straight to requesting one that is also high but within limits, and there is no longer a problem, because all resources were freed:

    >>> from multiprocessing import Pool
    >>> p = Pool(700)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
        return Pool(processes, initializer, initargs, maxtasksperchild)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 159, in __init__
        self._repopulate_pool()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 224, in _repopulate_pool
        w.start()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 130, in start
        self._popen = Popen(self)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
        self.pid = os.fork()
    OSError: [Errno 35] Resource temporarily unavailable
    >>> p = Pool(500)
    >>> p.close()
    >>> p.join()

    dsoprea (mannequin) added the stdlib (Python modules in the Lib dir) and performance (Performance or resource usage) labels on Nov 21, 2013
    VladimirFeinberg (mannequin) commented Feb 7, 2017

    I'm still hitting this issue in Python 3.6.0 :: Anaconda 4.3.0 (64-bit) on CentOS release 6.5. Is there a reason this patch has been ignored?

    Winterflower (mannequin) commented Feb 12, 2017

    I've reformatted Dustin Oprea's original patch to be compatible with git, applied it to master, and run the _test_multiprocessing suite. All tests pass locally.

    Winterflower (mannequin) commented Feb 12, 2017

    @dsoprea: would you like to open a PR for this issue on GitHub? If not, are you happy for me to do it?

    dsoprea (mannequin, author) commented Feb 12, 2017

    #57



    applio (Member) commented Feb 13, 2017

    @Winterflower: Thank you for encouraging @dsoprea to create the new PR and working to convert the previous patch.

    @dsoprea: Thank you for taking the time to create the PR especially after this has been sitting unloved for so long.

    Though the new workflow using PRs is still in a bit of a state of flux, my understanding is that we will want one PR per feature branch we target (i.e. one each for 2.7, 3.6, and 3.7).

    Now that we have spawned two parallel discussion tracks (one here and one in PR #57), I'm not sure how best to resolve that. For the time being I'll offer code-related comments here, as they're much more likely to be preserved (and thus discoverable) for posterity: we do need some sort of tests around this to complete the patch -- something that exercises both the non-exception and exception paths (and thus would verify the intended call to util.debug()).

    applio added the 3.7 (EOL) end of life label on Feb 13, 2017
    dsoprea (mannequin, author) commented Feb 13, 2017

    Okay, thanks for weighing in.

    I'm trying to figure out how to write the tests. The existing set of tests for multiprocessing is a near nightmare. It seems I might have to use one of the existing "source code" definitions to test the no-cleanup (traditional) scenario, and then define a new one that builds the pool with an initializer that fails. After that, I would have to define three new TestCase-based classes to cover the various start methods.

    Any suggestions?



    applio (Member) commented Feb 13, 2017

    For triggering the exception, supplying a Process target that deliberately fails sounds right.

    As for tests for the various start methods (fork/forkserver/spawn): if you look at the 3.x branches, you'll find this has been consolidated so that one test can conceivably be written to handle multiple variants (see Lib/test/_test_multiprocessing.py).
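
    To illustrate the consolidated shape (names below are hypothetical, not the actual harness in Lib/test/_test_multiprocessing.py), one test body can be instantiated once per available start method:

        import multiprocessing
        import unittest

        class PoolCreationBase(object):
            # Overridden in the generated per-start-method subclasses below.
            START_METHOD = None

            def test_pool_round_trip(self):
                ctx = multiprocessing.get_context(self.START_METHOD)
                pool = ctx.Pool(2)
                try:
                    self.assertEqual(pool.map(abs, [-1, 2, -3]), [1, 2, 3])
                finally:
                    pool.close()
                    pool.join()

        def _make_cases(base):
            # One TestCase subclass per start method (fork/forkserver/spawn).
            cases = {}
            for method in multiprocessing.get_all_start_methods():
                name = '%s_%s' % (base.__name__, method)
                cases[name] = type(name, (base, unittest.TestCase),
                                   {'START_METHOD': method})
            return cases

        globals().update(_make_cases(PoolCreationBase))

        if __name__ == '__main__':
            unittest.main()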

    dsoprea (mannequin, author) commented Feb 14, 2017

    I don't think this can be tested. Throwing exceptions in the remote process causes failures that can't be caught in the same way (when the initializer fails, the pool just attempts to recreate the process over and over), and I don't think it would be acceptable to spawn too many processes just to induce the original problem. There isn't a lot of surface area to what we're doing here; we can't simulate failures any other way.

    Can I get an agreement on this from someone?

    Winterflower (mannequin) commented Feb 17, 2017

    Would it be possible to see a review of the existing attempts at testing this issue?

    pppery (mannequin) commented Mar 14, 2017

    Can't you just mock the Process class to have a start method that always raises an error?
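
    Something along those lines might look like this (a sketch only, not the test that eventually landed; it patches the shared BaseProcess.start and assumes the cleanup fix is in place):

        import errno
        import unittest
        import unittest.mock

        import multiprocessing
        from multiprocessing.process import BaseProcess

        class PoolStartFailureTest(unittest.TestCase):
            def test_workers_cleaned_up_when_start_fails(self):
                started = []
                original_start = BaseProcess.start

                def flaky_start(proc):
                    # Let two workers start normally, then simulate fork()
                    # running out of resources, as in the original report.
                    if len(started) >= 2:
                        raise OSError(errno.EAGAIN,
                                      'Resource temporarily unavailable')
                    started.append(proc)
                    return original_start(proc)

                with unittest.mock.patch.object(BaseProcess, 'start',
                                                flaky_start):
                    with self.assertRaises(OSError):
                        multiprocessing.Pool(4)

                # With the cleanup fix, the two workers that did start must
                # have been terminated and reaped by the failing constructor.
                for proc in started:
                    proc.join(timeout=10)
                    self.assertIsNotNone(proc.exitcode)

        if __name__ == '__main__':
            unittest.main()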

    JulienPalard (Member) commented Nov 5, 2018

    New changeset 5d236ca by Julien Palard in branch 'master':
    bpo-19675: Terminate processes if construction of a pool is failing. (GH-5614)

    ezio-melotti transferred this issue from another repository on Apr 10, 2022