classification
Title: concurrent.futures.wait() can block forever even if Futures have completed
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.3, Python 3.4
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: bquinlan Nosy List: Sebastian.Kreft.Deezer, bquinlan, mark.dickinson, pitrou, python-dev, sbt, tim.peters
Priority: normal Keywords: patch

Created on 2014-01-21 00:18 by glangford, last changed 2014-07-16 14:57 by Sebastian.Kreft.Deezer. This issue is now closed.

Files
File name Uploaded Description Edit
stress_wait.py glangford, 2014-01-21 00:18 Stress test function which demonstrates wait blocking under 3.3.3 (OS X 10.9.1)
futures.patch vstinner, 2014-01-23 13:31 review
issue20319.patch glangford, 2014-01-26 13:07 review
issue20319-no-sort.patch glangford, 2014-01-27 12:49 review
Messages (31)
msg208592 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-21 00:18
concurrent.futures.wait() can get into a state where it blocks forever on waiter.event.wait(), even when the underlying Futures have completed.

This is demonstrated in a stress test where a large number of wait() calls are run in multiple threads, contending for the same Futures.

The cause is believed to be waiter removal, which is done without locking the Future. 

A suggested fix which appears to work is to change the following code in wait():

for f in fs:
    f._waiters.remove(waiter)

to:

for f in fs:
    with f._condition:
        f._waiters.remove(waiter)
msg208840 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-22 18:54
The same bug also exists in concurrent.futures.as_completed(). The minimal fix suggested here also works, but the bigger fix suggested in issue #20297 is recommended for as_completed().
msg208915 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2014-01-23 12:31
Adding Tim Peters to the nosy, since I suspect he has a general interest in this kind of issue.  As far as I know Brian Quinlan isn't actively maintaining concurrent.futures at the moment (Brian: please correct me if I'm wrong).
msg208917 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-23 13:13
@Glenn: Can you maybe suggest a patch fixing the issue?
msg208918 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-23 13:20
@Victor: Would like to give a patch but I am not a core developer, and I don't version control set up yet. The proposed fix is based on reading the distribution source code.
msg208920 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-23 13:31
futures.patch: reuse _AcquireFutures context manager to protect the list of futures against concurrent access.

Without the patch, stress_wait.py hangs. With the patch, it works correctly.
msg208977 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2014-01-23 18:54
I'm not currently working on concurrent.futures but I do look at patches and bug reports. I'll take a look at this and Issue20297 sometime this week.
msg209271 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2014-01-26 03:18
I'm looking at "futures.patch".

I don't understand why these blocks are helpful -"_create_and_install_waiters" has two call sites and both ("as_completed" and "wait") call "_create_and_install_waiters" from within an _AcquireFutures context manager:

-        pending_count = sum(
-                f._state not in [CANCELLED_AND_NOTIFIED, FINISHED] for f in fs)
+        with _AcquireFutures(fs):
+            pending_count = sum(
+                    f._state not in [CANCELLED_AND_NOTIFIED, FINISHED]
+                    for f in fs)

-    for f in fs:
-        f._waiters.append(waiter)
+    with _AcquireFutures(fs):
+        for f in fs:
+            f._waiters.append(waiter)

It seems more plausible that the locks around the removals are fixing the bug but I don't see how. I'll look into it some more.
msg209272 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-26 03:35
> It seems more plausible that the locks around the removals are fixing the bug but I don't see how. I'll look into it some more.

It is the locks around the waiter removals that matter; I think there are only formatting changes elsewhere in the patch. The reason the locks make a difference is that there can be a race condition if multiple wait() calls compete to modify the f._waiters list.
msg209300 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-26 12:22
@Brian - Ah, I see now what you are referring to. The patch has changes to _create_and_install_waiters() which should not be there. The only code that needs to change is waiter removal as I originally suggested. I am set up with a dev environment now and will submit a revised patch.
msg209306 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-26 13:07
This patch shows the minimal desired outcome. It is not elegant in its current form, but does only what is necessary. Ultimately I think as_completed() should go its own way and not lock all Futures at once (#20297).
msg209439 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-27 12:49
Revised patch; I don't think there is a need to sort the keys when waiters are being removed since only one lock is acquired at a time. Stress tests on both wait() and as_completed() work with this approach.
msg209741 - (view) Author: Glenn Langford (glangford) * Date: 2014-01-30 22:26
An idea for a different possible fix - rather than cleaning up waiters in wait() and as_completed(), could they be removed in Future.set_result() and Futures.set_exception() ? 

I'm not certain if any waiter should ever be notified twice; if not, perhaps set_result() and set_exception() could just include

self._waiters = []

after all waiters have been signalled.
msg209853 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-02-01 00:51
New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default':
Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed
http://hg.python.org/cpython/rev/0bcf23a52d55
msg209854 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2014-02-01 00:52
Thanks very much for the patch Glenn!
msg209855 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-01 01:03
> New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default':
> Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed
> http://hg.python.org/cpython/rev/0bcf23a52d55

Hum, the change also contains:

+Fix warning message when `os.chdir()` fails inside
+  `test.support.temp_cwd()`.  Patch by Chris Jerdonek.

Is it related to this issue?
msg209856 - (view) Author: Brian Quinlan (bquinlan) * (Python committer) Date: 2014-02-01 01:06
Oops, no. That was junk due to my sloppiness. I’ll fix it in a minute.

On Jan 31, 2014, at 5:03 PM, STINNER Victor <report@bugs.python.org> wrote:

> 
> STINNER Victor added the comment:
> 
>> New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default':
>> Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed
>> http://hg.python.org/cpython/rev/0bcf23a52d55
> 
> Hum, the change also contains:
> 
> +Fix warning message when `os.chdir()` fails inside
> +  `test.support.temp_cwd()`.  Patch by Chris Jerdonek.
> 
> Is it related to this issue?
> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue20319>
> _______________________________________
msg209858 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-02-01 01:21
Shouldn't it be fixed in 3.3 too?
msg219674 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-03 09:35
I'm using the Python 3.4.1 compiled from source and I'm may be hitting this issue.

My workload is launching two subprocess in parallel, and whenever one is ready, launches another one. In one of the runs, the whole process got stuck after launching about 3K subprocess, and the underlying processes had in fact finished.

To wait for the finished subprocesses, I'm using FIRST_COMPLETED. Below is the core of my workload:

for element in element_generator:
    while len(running) >= max_tasks:
        done, pending = concurrent.futures.wait(running, timeout=15.0, return_when=concurrent.futures.FIRST_COMPLETED)
        process_results(done)
        running = pending

    running.add(executor.submit(exe_subprocess, element)) 
 

I don't really know what's the best way to reproduce this, as I've run the same workload with different executables, more concurrency and faster response times, and I haven't seen the issue.
msg219676 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-06-03 10:12
> the whole process got stuck after launching about 3K subprocess

How many processes are running at the same time when the whole process is stuck?
msg219689 - (view) Author: Glenn Langford (glangford) * Date: 2014-06-03 14:06
> My workload is launching two subprocess in parallel, and whenever one is ready, launches another one.

Since you have timeout=15.0, wait() should return at least every 15s. Can you determine if the wait is being repeatedly called in the while loop, and if so what Futures it is waiting on? In other words, is wait() being called continuously, or is wait() called and never returns?
msg219752 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-04 08:22
@haypo: I've reproduced the issue with both 2 and 3 processes in parallel.

@glangford: the wait is actually returning after the 15 seconds, although nothing is reported as finished. So, it's getting stuck in the while loop. However, I imagine that without timeout, the call would block forever.

What kind of debug information from the futures would be useful?
msg219769 - (view) Author: Glenn Langford (glangford) * Date: 2014-06-04 18:47
> the wait is actually returning after the 15 seconds, although nothing is reported as finished...What kind of debug information from the futures would be useful?

What is the state of the pending Futures that wait() is stuck on? (e.g. display f.running() and f.done() ). This could be logged any time the "done" set is empty after wait() returns. For each "stuck" Future, was it previously logged as completed by a prior call to wait()?

What is the state of the ProcessPoolExecutor that the futures have been submitted to? Does it still function? (e.g. try submitting a trivial Future to the executor).
msg219786 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-05 08:24
The Executor is still working (but I'm using a ThreadPoolExcutor). I can dynamically change the number of max tasks allowed, which successfully fires the new tasks.

After 2 days running, five tasks are in this weird state.

I will change the code as suggested and post my results.
msg220150 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-10 08:28
I was able to recreate the issue again, and now i have some info about the offending futures:

State: RUNNING, Result: None, Exception: None, Waiters: 0, Cancelled: False, Running: True, Done: False

The information does not seem very relevant. However, I can attach a console and debug from there.
msg220704 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-16 09:24
Any ideas how to debug this further?

In order to overcome this issue I have an awful workaround that tracks the maximum running time of a successful task, and if any task has been running more than x times that maximum I consider it defunct, and increase the number of concurrent allowed tasks. However, if the problem persist, I eventually will have lot of zombie tasks, which will expose some additional problems.
msg220712 - (view) Author: Glenn Langford (glangford) * Date: 2014-06-16 12:00
> Any ideas how to debug this further?

Wherever the cause of the problem might live, and to either work around it or gain additional information, here is one idea to consider.

Do you need to submit your Futures just two at a time, and tightly loop every 15s? Why not submit a block of a larger number and wait for the block with as_completed(), logging for each completion. Then submit another block when they are all done. To control how many run at one time, create the Executor with max_workers=2 for example. (I had an app that ran > 1,000 futures in this way, which worked fine). 

In general I suggest to only timeout when there is really a problem, not as an expected event.
msg220715 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-16 12:12
I'm running actually millions of tasks, so sending them all at once will
consume much more resources than needed.

The issue happens no only with 2 tasks in parallel but with higher numbers
as well.

Also your proposed solution, has the problem that when you are waiting for
the last tasks to finish you lose some parallelism.

In any case, it seems to me that there's some kind of race condition
preventing the task to finish, so if that's true the same could happen with
as_completed.
On Jun 16, 2014 2:00 PM, "Glenn Langford" <report@bugs.python.org> wrote:

>
> Glenn Langford added the comment:
>
> > Any ideas how to debug this further?
>
> Wherever the cause of the problem might live, and to either work around it
> or gain additional information, here is one idea to consider.
>
> Do you need to submit your Futures just two at a time, and tightly loop
> every 15s? Why not submit a block of a larger number and wait for the block
> with as_completed(), logging for each completion. Then submit another block
> when they are all done. To control how many run at one time, create the
> Executor with max_workers=2 for example. (I had an app that ran > 1,000
> futures in this way, which worked fine).
>
> In general I suggest to only timeout when there is really a problem, not
> as an expected event.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue20319>
> _______________________________________
>
msg220716 - (view) Author: Glenn Langford (glangford) * Date: 2014-06-16 12:31
Under the hood, the behaviour of as_completed is quite different. So there is no guarantee it would behave the same.

In any event, with millions of tasks you might consider Celery (I haven't used it myself):
http://www.celeryproject.org
msg220934 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-06-18 12:31
@glangford: Is that really your recommendation, to switch to celery? Python 3.4.1 should be production quality and issues like this should be addressed.

Note that I've successfully run millions of tasks using the same method, the only difference being that in that case the tasks weren't launching subprocesses. So I think that may be a starting point for debugging.
msg223219 - (view) Author: Sebastian Kreft (Sebastian.Kreft.Deezer) Date: 2014-07-16 14:57
Disregard the last messages, It seems to be a deadblocking due to subprocess.
History
Date User Action Args
2014-07-16 14:57:48Sebastian.Kreft.Deezersetnosy: tim.peters, bquinlan, mark.dickinson, pitrou, python-dev, sbt, Sebastian.Kreft.Deezer
messages: + msg223219
2014-06-18 12:31:15Sebastian.Kreft.Deezersetmessages: + msg220934
2014-06-18 12:22:50glangfordsetnosy: - glangford
2014-06-16 12:31:59glangfordsetmessages: + msg220716
2014-06-16 12:12:52Sebastian.Kreft.Deezersetmessages: + msg220715
2014-06-16 12:00:52glangfordsetmessages: + msg220712
2014-06-16 09:37:52vstinnersetnosy: - vstinner
2014-06-16 09:24:34Sebastian.Kreft.Deezersetmessages: + msg220704
2014-06-10 08:28:40Sebastian.Kreft.Deezersetmessages: + msg220150
2014-06-05 08:24:19Sebastian.Kreft.Deezersetmessages: + msg219786
2014-06-04 18:47:55glangfordsetmessages: + msg219769
2014-06-04 08:22:33Sebastian.Kreft.Deezersetmessages: + msg219752
2014-06-03 14:06:44glangfordsetmessages: + msg219689
2014-06-03 10:12:11vstinnersetmessages: + msg219676
2014-06-03 09:35:41Sebastian.Kreft.Deezersetnosy: + Sebastian.Kreft.Deezer
messages: + msg219674
2014-02-01 01:21:03pitrousetnosy: + pitrou
messages: + msg209858
2014-02-01 01:06:27bquinlansetmessages: + msg209856
2014-02-01 01:03:46vstinnersetmessages: + msg209855
2014-02-01 00:52:48bquinlansetstatus: open -> closed

messages: + msg209854
2014-02-01 00:51:39python-devsetnosy: + python-dev
messages: + msg209853
2014-01-30 22:26:00glangfordsetmessages: + msg209741
2014-01-29 14:41:00bquinlansetassignee: bquinlan
2014-01-27 12:49:14glangfordsetfiles: + issue20319-no-sort.patch

messages: + msg209439
2014-01-26 13:07:44glangfordsetfiles: + issue20319.patch

messages: + msg209306
2014-01-26 12:22:29glangfordsetmessages: + msg209300
2014-01-26 03:35:39glangfordsetmessages: + msg209272
2014-01-26 03:18:54bquinlansetmessages: + msg209271
2014-01-23 18:54:15bquinlansetmessages: + msg208977
2014-01-23 13:31:37vstinnersetfiles: + futures.patch
keywords: + patch
messages: + msg208920
2014-01-23 13:20:39glangfordsetmessages: + msg208918
2014-01-23 13:13:15vstinnersetnosy: + vstinner
messages: + msg208917
2014-01-23 13:12:50vstinnersetnosy: + sbt
2014-01-23 12:31:42mark.dickinsonsetnosy: + tim.peters
2014-01-23 12:31:25mark.dickinsonsetmessages: + msg208915
2014-01-23 12:29:56mark.dickinsonsetnosy: + mark.dickinson
2014-01-22 18:54:39glangfordsetmessages: + msg208840
2014-01-21 21:24:11glangfordsetnosy: + bquinlan
2014-01-21 00:18:39glangfordcreate