Message 241404 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	neologix
Recipients	neologix, pitrou
Date	2015-04-18.09:00:20
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1429347621.56.0.157795358647.issue23992@psf.upfronthosting.co.za>
In-reply-to

Content
hanger.py """ from time import sleep def hang(i): sleep(i) raise ValueError("x" * 1024**2) """ The following code will deadlock on pool.close(): """ from multiprocessing import Pool from time import sleep from hanger import hang with Pool() as pool: try: pool.map(hang, [0,1]) finally: sleep(0.5) pool.close() pool.join() """ The problem is that when one of the tasks comprising a map result fails with an exception, the corresponding MapResult is removed from the result cache: def _set(self, i, success_result): success, result = success_result if success: [snip] else: self._success = False self._value = result if self._error_callback: self._error_callback(self._value) <=== del self._cache[self._job] self._event.set() ===> Which means that when the pool is closed, the result handler thread terminates right away, because it doesn't see any task left to wait for. Which means that it doesn't drain the result queue, and if some worker process is trying to write a large result to it (hence the large valuerrror to fill the socket/pipe buffer), it will hang, and the pool won't shut down (unless you call terminate()). Although I can see the advantage of fail-fast behavior, I don't think it's correct because it breaks the invariant where results won't be deleted from the cache until they're actually done. Also, the current fail-fast behavior breaks the semantics that the call only returns when it has completed. Returning while some jobs part of the map are still running is potentially very bad, e.g. if the user call retries the same call, assuming that all the jobs are done. Retrying jobs that are idempotent but not parallel execution-safe would break with the current code. The fix is trivial, use the same logic as in case of success to only signal failure when all jobs are done. I'll provide a patch if it seems sensible :-)

hanger.py
"""
from time import sleep


def hang(i):
    sleep(i)
    raise ValueError("x" * 1024**2)
"""


The following code will deadlock on pool.close():
"""
from multiprocessing import Pool
from time import sleep

from hanger import hang


with Pool() as pool:
    try:
        pool.map(hang, [0,1])
    finally:
        sleep(0.5)
        pool.close()
        pool.join()
"""

The problem is that when one of the tasks comprising a map result fails with an exception, the corresponding MapResult is removed from the result cache:

    def _set(self, i, success_result):
        success, result = success_result
        if success:
            [snip]
        else:
            self._success = False
            self._value = result
            if self._error_callback:
                self._error_callback(self._value)
<===
            del self._cache[self._job]
            self._event.set()
===>

Which means that when the pool is closed, the result handler thread terminates right away, because it doesn't see any task left to wait for.
Which means that it doesn't drain the result queue, and if some worker process is trying to write a large result to it (hence the large valuerrror to fill the socket/pipe buffer), it will hang, and the pool won't shut down (unless you call terminate()).

Although I can see the advantage of fail-fast behavior, I don't think it's correct because it breaks the invariant where results won't be deleted from the cache until they're actually done.

Also, the current fail-fast behavior breaks the semantics that the call only returns when it has completed.
Returning while some jobs part of the map are still running is potentially very bad, e.g. if the user call retries the same call, assuming that all the jobs are done. Retrying jobs that are idempotent but not parallel execution-safe would break with the current code.


The fix is trivial, use the same logic as in case of success to only signal failure when all jobs are done.

I'll provide a patch if it seems sensible :-)

History
Date	User	Action	Args
2015-04-18 09:00:21	neologix	set	recipients: + neologix, pitrou
2015-04-18 09:00:21	neologix	set	messageid: <1429347621.56.0.157795358647.issue23992@psf.upfronthosting.co.za>
2015-04-18 09:00:21	neologix	link	issue23992 messages
2015-04-18 09:00:20	neologix	create