Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiprocessing: MapResult shouldn't fail fast upon exception #68180

Closed
neologix mannequin opened this issue Apr 18, 2015 · 8 comments
Closed

multiprocessing: MapResult shouldn't fail fast upon exception #68180

neologix mannequin opened this issue Apr 18, 2015 · 8 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@neologix
Copy link
Mannequin

neologix mannequin commented Apr 18, 2015

BPO 23992
Nosy @pitrou, @vstinner, @applio
Files
  • mp_map_fail_fast_27.diff
  • mp_map_fail_fast_default.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-02-12.22:56:42.222>
    created_at = <Date 2015-04-18.09:00:21.528>
    labels = ['type-bug', 'library']
    title = "multiprocessing: MapResult shouldn't fail fast upon exception"
    updated_at = <Date 2016-02-12.22:56:42.221>
    user = 'https://bugs.python.org/neologix'

    bugs.python.org fields:

    activity = <Date 2016-02-12.22:56:42.221>
    actor = 'neologix'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-02-12.22:56:42.222>
    closer = 'neologix'
    components = ['Library (Lib)']
    creation = <Date 2015-04-18.09:00:21.528>
    creator = 'neologix'
    dependencies = []
    files = ['39172', '39173']
    hgrepos = []
    issue_num = 23992
    keywords = ['patch', 'needs review']
    message_count = 8.0
    messages = ['241404', '241459', '241823', '246045', '250138', '250523', '250537', '260055']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'neologix', 'python-dev', 'sbt', 'davin']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue23992'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Apr 18, 2015

    hanger.py
    """
    from time import sleep

    def hang(i):
        sleep(i)
        raise ValueError("x" * 1024**2)
    """

    The following code will deadlock on pool.close():
    """
    from multiprocessing import Pool
    from time import sleep

    from hanger import hang
    
    
    with Pool() as pool:
        try:
            pool.map(hang, [0,1])
        finally:
            sleep(0.5)
            pool.close()
            pool.join()
    """

    The problem is that when one of the tasks comprising a map result fails with an exception, the corresponding MapResult is removed from the result cache:

        def _set(self, i, success_result):
            success, result = success_result
            if success:
                [snip]
            else:
                self._success = False
                self._value = result
                if self._error_callback:
                    self._error_callback(self._value)
    <

    ===
    del self._cache[self._job]
    self._event.set()
    ===>

    Which means that when the pool is closed, the result handler thread terminates right away, because it doesn't see any task left to wait for.
    Which means that it doesn't drain the result queue, and if some worker process is trying to write a large result to it (hence the large valuerrror to fill the socket/pipe buffer), it will hang, and the pool won't shut down (unless you call terminate()).

    Although I can see the advantage of fail-fast behavior, I don't think it's correct because it breaks the invariant where results won't be deleted from the cache until they're actually done.

    Also, the current fail-fast behavior breaks the semantics that the call only returns when it has completed.
    Returning while some jobs part of the map are still running is potentially very bad, e.g. if the user call retries the same call, assuming that all the jobs are done. Retrying jobs that are idempotent but not parallel execution-safe would break with the current code.

    The fix is trivial, use the same logic as in case of success to only signal failure when all jobs are done.

    I'll provide a patch if it seems sensible :-)

    @neologix neologix mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 18, 2015
    @applio
    Copy link
    Member

    applio commented Apr 18, 2015

    This is a nice example demonstrating what I agree is a problem with the current implementation of close.

    A practical concern with what I believe is being proposed in your trivial fix: if the workers are engaged in very long-running tasks (and perhaps slowly writing their overly large results to the results queue) then we would have to wait for quite a long time for these other workers to reach their natural completion.

    That said, I believe close should in fact behave just that way and have us subsequently wait for the others to be completed. It is not close's job to attempt to address the general concern I bring up.

    This change could be felt by people who have written their code to expect the result handler's immediate shutdown if there are no other visible results -- it is difficult to imagine what the impact would be.

    This is my long-winded way of saying it seems very sensible and welcome to me if you took the time to prepare a patch.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Apr 22, 2015

    Patches for 2.7 and default.

    @neologix
    Copy link
    Mannequin Author

    neologix mannequin commented Jul 1, 2015

    Barring any objections, I'll commit within the next few days.

    @applio
    Copy link
    Member

    applio commented Sep 7, 2015

    @neologix: Budgeting time this week to have a proper look -- sorry I haven't gotten back to it sooner.

    @applio
    Copy link
    Member

    applio commented Sep 12, 2015

    The patches make good sense to me -- I have no comments to add in a review.

    I spent more time than I care to admit concerned with the idea that error_callback (exposed by map_async which map sits on top of) should perhaps be called not just once at the end but each time an exception occurs. Motivated by past jobs which failed overall to yield any results because one out of a million of the inputs triggered an error, I thought the idea very appealing and experimented with implementing it (with happy results). Googling for it though, I found plenty of examples of people asking questions about how callback and error_callback are intended to work -- though the documentation is not explicit on this particular point, most of those search results correctly document in the wild that error_callback is called only once at the end just like callback. I think it best to leave that functionality just as you have it now.

    Thanks for creating the patch -- looks great to me.

    @applio
    Copy link
    Member

    applio commented Sep 12, 2015

    As an aside: bpo-24948 seems to show there are others who would find the immediate-multiple-error_callback idea attractive.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 10, 2016

    New changeset 1ba0deb52223 by Charles-François Natali in branch 'default':
    Issue bpo-23992: multiprocessing: make MapResult not fail-fast upon exception.
    https://hg.python.org/cpython/rev/1ba0deb52223

    @neologix neologix mannequin closed this as completed Feb 12, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant