Issue 20319: concurrent.futures.wait() can block forever even if Futures have completed

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64518

classification

Title:	concurrent.futures.wait() can block forever even if Futures have completed
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.3, Python 3.4

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:	bquinlan	Nosy List:	Sebastian.Kreft.Deezer, bquinlan, mark.dickinson, pitrou, python-dev, sbt, tim.peters
Priority:	normal	Keywords:	patch

Created on 2014-01-21 00:18 by glangford, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
stress_wait.py	glangford, 2014-01-21 00:18	Stress test function which demonstrates wait blocking under 3.3.3 (OS X 10.9.1)
futures.patch	vstinner, 2014-01-23 13:31		review
issue20319.patch	glangford, 2014-01-26 13:07		review
issue20319-no-sort.patch	glangford, 2014-01-27 12:49		review

Messages (31)
msg208592 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-21 00:18
concurrent.futures.wait() can get into a state where it blocks forever on waiter.event.wait(), even when the underlying Futures have completed. This is demonstrated in a stress test where a large number of wait() calls are run in multiple threads, contending for the same Futures. The cause is believed to be waiter removal, which is done without locking the Future. A suggested fix which appears to work is to change the following code in wait(): for f in fs: f._waiters.remove(waiter) to: for f in fs: with f._condition: f._waiters.remove(waiter)
msg208840 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-22 18:54
The same bug also exists in concurrent.futures.as_completed(). The minimal fix suggested here also works, but the bigger fix suggested in issue #20297 is recommended for as_completed().
msg208915 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2014-01-23 12:31
Adding Tim Peters to the nosy, since I suspect he has a general interest in this kind of issue. As far as I know Brian Quinlan isn't actively maintaining concurrent.futures at the moment (Brian: please correct me if I'm wrong).
msg208917 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-01-23 13:13
@Glenn: Can you maybe suggest a patch fixing the issue?
msg208918 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-23 13:20
@Victor: Would like to give a patch but I am not a core developer, and I don't version control set up yet. The proposed fix is based on reading the distribution source code.
msg208920 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-01-23 13:31
futures.patch: reuse _AcquireFutures context manager to protect the list of futures against concurrent access. Without the patch, stress_wait.py hangs. With the patch, it works correctly.
msg208977 - (view)	Author: Brian Quinlan (bquinlan) *	Date: 2014-01-23 18:54
I'm not currently working on concurrent.futures but I do look at patches and bug reports. I'll take a look at this and Issue20297 sometime this week.
msg209271 - (view)	Author: Brian Quinlan (bquinlan) *	Date: 2014-01-26 03:18
I'm looking at "futures.patch". I don't understand why these blocks are helpful -"_create_and_install_waiters" has two call sites and both ("as_completed" and "wait") call "_create_and_install_waiters" from within an _AcquireFutures context manager: - pending_count = sum( - f._state not in [CANCELLED_AND_NOTIFIED, FINISHED] for f in fs) + with _AcquireFutures(fs): + pending_count = sum( + f._state not in [CANCELLED_AND_NOTIFIED, FINISHED] + for f in fs) - for f in fs: - f._waiters.append(waiter) + with _AcquireFutures(fs): + for f in fs: + f._waiters.append(waiter) It seems more plausible that the locks around the removals are fixing the bug but I don't see how. I'll look into it some more.
msg209272 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-26 03:35
> It seems more plausible that the locks around the removals are fixing the bug but I don't see how. I'll look into it some more. It is the locks around the waiter removals that matter; I think there are only formatting changes elsewhere in the patch. The reason the locks make a difference is that there can be a race condition if multiple wait() calls compete to modify the f._waiters list.
msg209300 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-26 12:22
@Brian - Ah, I see now what you are referring to. The patch has changes to _create_and_install_waiters() which should not be there. The only code that needs to change is waiter removal as I originally suggested. I am set up with a dev environment now and will submit a revised patch.
msg209306 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-26 13:07
This patch shows the minimal desired outcome. It is not elegant in its current form, but does only what is necessary. Ultimately I think as_completed() should go its own way and not lock all Futures at once (#20297).
msg209439 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-27 12:49
Revised patch; I don't think there is a need to sort the keys when waiters are being removed since only one lock is acquired at a time. Stress tests on both wait() and as_completed() work with this approach.
msg209741 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-01-30 22:26
An idea for a different possible fix - rather than cleaning up waiters in wait() and as_completed(), could they be removed in Future.set_result() and Futures.set_exception() ? I'm not certain if any waiter should ever be notified twice; if not, perhaps set_result() and set_exception() could just include self._waiters = [] after all waiters have been signalled.
msg209853 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-02-01 00:51
New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default': Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed http://hg.python.org/cpython/rev/0bcf23a52d55
msg209854 - (view)	Author: Brian Quinlan (bquinlan) *	Date: 2014-02-01 00:52
Thanks very much for the patch Glenn!
msg209855 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-02-01 01:03
> New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default': > Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed > http://hg.python.org/cpython/rev/0bcf23a52d55 Hum, the change also contains: +Fix warning message when `os.chdir()` fails inside + `test.support.temp_cwd()`. Patch by Chris Jerdonek. Is it related to this issue?
msg209856 - (view)	Author: Brian Quinlan (bquinlan) *	Date: 2014-02-01 01:06
Oops, no. That was junk due to my sloppiness. I’ll fix it in a minute. On Jan 31, 2014, at 5:03 PM, STINNER Victor <report@bugs.python.org> wrote: > > STINNER Victor added the comment: > >> New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default': >> Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed >> http://hg.python.org/cpython/rev/0bcf23a52d55 > > Hum, the change also contains: > > +Fix warning message when `os.chdir()` fails inside > + `test.support.temp_cwd()`. Patch by Chris Jerdonek. > > Is it related to this issue? > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue20319> > _______________________________________
msg209858 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-02-01 01:21
Shouldn't it be fixed in 3.3 too?
msg219674 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-03 09:35
I'm using the Python 3.4.1 compiled from source and I'm may be hitting this issue. My workload is launching two subprocess in parallel, and whenever one is ready, launches another one. In one of the runs, the whole process got stuck after launching about 3K subprocess, and the underlying processes had in fact finished. To wait for the finished subprocesses, I'm using FIRST_COMPLETED. Below is the core of my workload: for element in element_generator: while len(running) >= max_tasks: done, pending = concurrent.futures.wait(running, timeout=15.0, return_when=concurrent.futures.FIRST_COMPLETED) process_results(done) running = pending running.add(executor.submit(exe_subprocess, element)) I don't really know what's the best way to reproduce this, as I've run the same workload with different executables, more concurrency and faster response times, and I haven't seen the issue.
msg219676 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-06-03 10:12
> the whole process got stuck after launching about 3K subprocess How many processes are running at the same time when the whole process is stuck?
msg219689 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-06-03 14:06
> My workload is launching two subprocess in parallel, and whenever one is ready, launches another one. Since you have timeout=15.0, wait() should return at least every 15s. Can you determine if the wait is being repeatedly called in the while loop, and if so what Futures it is waiting on? In other words, is wait() being called continuously, or is wait() called and never returns?
msg219752 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-04 08:22
@haypo: I've reproduced the issue with both 2 and 3 processes in parallel. @glangford: the wait is actually returning after the 15 seconds, although nothing is reported as finished. So, it's getting stuck in the while loop. However, I imagine that without timeout, the call would block forever. What kind of debug information from the futures would be useful?
msg219769 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-06-04 18:47
> the wait is actually returning after the 15 seconds, although nothing is reported as finished...What kind of debug information from the futures would be useful? What is the state of the pending Futures that wait() is stuck on? (e.g. display f.running() and f.done() ). This could be logged any time the "done" set is empty after wait() returns. For each "stuck" Future, was it previously logged as completed by a prior call to wait()? What is the state of the ProcessPoolExecutor that the futures have been submitted to? Does it still function? (e.g. try submitting a trivial Future to the executor).
msg219786 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-05 08:24
The Executor is still working (but I'm using a ThreadPoolExcutor). I can dynamically change the number of max tasks allowed, which successfully fires the new tasks. After 2 days running, five tasks are in this weird state. I will change the code as suggested and post my results.
msg220150 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-10 08:28
I was able to recreate the issue again, and now i have some info about the offending futures: State: RUNNING, Result: None, Exception: None, Waiters: 0, Cancelled: False, Running: True, Done: False The information does not seem very relevant. However, I can attach a console and debug from there.
msg220704 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-16 09:24
Any ideas how to debug this further? In order to overcome this issue I have an awful workaround that tracks the maximum running time of a successful task, and if any task has been running more than x times that maximum I consider it defunct, and increase the number of concurrent allowed tasks. However, if the problem persist, I eventually will have lot of zombie tasks, which will expose some additional problems.
msg220712 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-06-16 12:00
> Any ideas how to debug this further? Wherever the cause of the problem might live, and to either work around it or gain additional information, here is one idea to consider. Do you need to submit your Futures just two at a time, and tightly loop every 15s? Why not submit a block of a larger number and wait for the block with as_completed(), logging for each completion. Then submit another block when they are all done. To control how many run at one time, create the Executor with max_workers=2 for example. (I had an app that ran > 1,000 futures in this way, which worked fine). In general I suggest to only timeout when there is really a problem, not as an expected event.
msg220715 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-16 12:12
I'm running actually millions of tasks, so sending them all at once will consume much more resources than needed. The issue happens no only with 2 tasks in parallel but with higher numbers as well. Also your proposed solution, has the problem that when you are waiting for the last tasks to finish you lose some parallelism. In any case, it seems to me that there's some kind of race condition preventing the task to finish, so if that's true the same could happen with as_completed. On Jun 16, 2014 2:00 PM, "Glenn Langford" <report@bugs.python.org> wrote: > > Glenn Langford added the comment: > > > Any ideas how to debug this further? > > Wherever the cause of the problem might live, and to either work around it > or gain additional information, here is one idea to consider. > > Do you need to submit your Futures just two at a time, and tightly loop > every 15s? Why not submit a block of a larger number and wait for the block > with as_completed(), logging for each completion. Then submit another block > when they are all done. To control how many run at one time, create the > Executor with max_workers=2 for example. (I had an app that ran > 1,000 > futures in this way, which worked fine). > > In general I suggest to only timeout when there is really a problem, not > as an expected event. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue20319> > _______________________________________ >
msg220716 - (view)	Author: Glenn Langford (glangford) *	Date: 2014-06-16 12:31
Under the hood, the behaviour of as_completed is quite different. So there is no guarantee it would behave the same. In any event, with millions of tasks you might consider Celery (I haven't used it myself): http://www.celeryproject.org
msg220934 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-06-18 12:31
@glangford: Is that really your recommendation, to switch to celery? Python 3.4.1 should be production quality and issues like this should be addressed. Note that I've successfully run millions of tasks using the same method, the only difference being that in that case the tasks weren't launching subprocesses. So I think that may be a starting point for debugging.
msg223219 - (view)	Author: Sebastian Kreft (Sebastian.Kreft.Deezer)	Date: 2014-07-16 14:57
Disregard the last messages, It seems to be a deadblocking due to subprocess.

History
Date	User	Action	Args
2022-04-11 14:57:57	admin	set	github: 64518
2014-07-16 14:57:48	Sebastian.Kreft.Deezer	set	nosy: tim.peters, bquinlan, mark.dickinson, pitrou, python-dev, sbt, Sebastian.Kreft.Deezer messages: + msg223219
2014-06-18 12:31:15	Sebastian.Kreft.Deezer	set	messages: + msg220934
2014-06-18 12:22:50	glangford	set	nosy: - glangford
2014-06-16 12:31:59	glangford	set	messages: + msg220716
2014-06-16 12:12:52	Sebastian.Kreft.Deezer	set	messages: + msg220715
2014-06-16 12:00:52	glangford	set	messages: + msg220712
2014-06-16 09:37:52	vstinner	set	nosy: - vstinner
2014-06-16 09:24:34	Sebastian.Kreft.Deezer	set	messages: + msg220704
2014-06-10 08:28:40	Sebastian.Kreft.Deezer	set	messages: + msg220150
2014-06-05 08:24:19	Sebastian.Kreft.Deezer	set	messages: + msg219786
2014-06-04 18:47:55	glangford	set	messages: + msg219769
2014-06-04 08:22:33	Sebastian.Kreft.Deezer	set	messages: + msg219752
2014-06-03 14:06:44	glangford	set	messages: + msg219689
2014-06-03 10:12:11	vstinner	set	messages: + msg219676
2014-06-03 09:35:41	Sebastian.Kreft.Deezer	set	nosy: + Sebastian.Kreft.Deezer messages: + msg219674
2014-02-01 01:21:03	pitrou	set	nosy: + pitrou messages: + msg209858
2014-02-01 01:06:27	bquinlan	set	messages: + msg209856
2014-02-01 01:03:46	vstinner	set	messages: + msg209855
2014-02-01 00:52:48	bquinlan	set	status: open -> closed messages: + msg209854
2014-02-01 00:51:39	python-dev	set	nosy: + python-dev messages: + msg209853
2014-01-30 22:26:00	glangford	set	messages: + msg209741
2014-01-29 14:41:00	bquinlan	set	assignee: bquinlan
2014-01-27 12:49:14	glangford	set	files: + issue20319-no-sort.patch messages: + msg209439
2014-01-26 13:07:44	glangford	set	files: + issue20319.patch messages: + msg209306
2014-01-26 12:22:29	glangford	set	messages: + msg209300
2014-01-26 03:35:39	glangford	set	messages: + msg209272
2014-01-26 03:18:54	bquinlan	set	messages: + msg209271
2014-01-23 18:54:15	bquinlan	set	messages: + msg208977
2014-01-23 13:31:37	vstinner	set	files: + futures.patch keywords: + patch messages: + msg208920
2014-01-23 13:20:39	glangford	set	messages: + msg208918
2014-01-23 13:13:15	vstinner	set	nosy: + vstinner messages: + msg208917
2014-01-23 13:12:50	vstinner	set	nosy: + sbt
2014-01-23 12:31:42	mark.dickinson	set	nosy: + tim.peters
2014-01-23 12:31:25	mark.dickinson	set	messages: + msg208915
2014-01-23 12:29:56	mark.dickinson	set	nosy: + mark.dickinson
2014-01-22 18:54:39	glangford	set	messages: + msg208840
2014-01-21 21:24:11	glangford	set	nosy: + bquinlan
2014-01-21 00:18:39	glangford	create