msg208592 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-21 00:18 |
concurrent.futures.wait() can get into a state where it blocks forever on waiter.event.wait(), even when the underlying Futures have completed.
This is demonstrated in a stress test where a large number of wait() calls are run in multiple threads, contending for the same Futures.
The cause is believed to be waiter removal, which is done without locking the Future.
A suggested fix which appears to work is to change the following code in wait():
    for f in fs:
        f._waiters.remove(waiter)
to:
    for f in fs:
        with f._condition:
            f._waiters.remove(waiter)
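For context, this is roughly how the tail of wait() looks with that change applied. It is a simplified paraphrase of concurrent/futures/_base.py (early-return branches and other details omitted; _AcquireFutures, _create_and_install_waiters and DoneAndNotDoneFutures are the module's own helpers):

def wait(fs, timeout=None, return_when=ALL_COMPLETED):
    with _AcquireFutures(fs):
        # All futures are locked while waiters are installed.
        done = set(f for f in fs
                   if f._state in [CANCELLED_AND_NOTIFIED, FINISHED])
        waiter = _create_and_install_waiters(fs, return_when)

    waiter.event.wait(timeout)         # can block forever without the fix below
    for f in fs:
        with f._condition:             # suggested fix: lock the Future before
            f._waiters.remove(waiter)  # mutating its _waiters list

    done.update(waiter.finished_futures)
    return DoneAndNotDoneFutures(done, set(fs) - done)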
|
msg208840 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-22 18:54 |
The same bug also exists in concurrent.futures.as_completed(). The minimal fix suggested here also works, but the bigger fix suggested in issue #20297 is recommended for as_completed().
|
msg208915 - (view) |
Author: Mark Dickinson (mark.dickinson) * |
Date: 2014-01-23 12:31 |
Adding Tim Peters to the nosy, since I suspect he has a general interest in this kind of issue. As far as I know Brian Quinlan isn't actively maintaining concurrent.futures at the moment (Brian: please correct me if I'm wrong).
|
msg208917 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2014-01-23 13:13 |
@Glenn: Can you maybe suggest a patch fixing the issue?
|
msg208918 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-23 13:20 |
@Victor: I would like to provide a patch, but I am not a core developer and I don't have version control set up yet. The proposed fix is based on reading the distribution source code.
|
msg208920 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2014-01-23 13:31 |
futures.patch: reuse _AcquireFutures context manager to protect the list of futures against concurrent access.
Without the patch, stress_wait.py hangs. With the patch, it works correctly.
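For reference, the idea of the stress test is many threads calling wait() on the same Futures while those Futures complete. The attached file is not reproduced here; the following is only an illustrative approximation, not the attached stress_wait.py:

import concurrent.futures
import threading
import time

def stress(rounds=200, waiter_threads=8):
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        for _ in range(rounds):
            # A shared batch of short-lived Futures that complete while
            # several threads are wait()ing on them concurrently.
            batch = [executor.submit(time.sleep, 0.001) for _ in range(20)]
            threads = [threading.Thread(target=concurrent.futures.wait,
                                        args=(batch,))
                       for _ in range(waiter_threads)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()   # without the fix, one of these joins can hang

if __name__ == "__main__":
    stress()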
|
msg208977 - (view) |
Author: Brian Quinlan (bquinlan) * |
Date: 2014-01-23 18:54 |
I'm not currently working on concurrent.futures but I do look at patches and bug reports. I'll take a look at this and Issue20297 sometime this week.
|
msg209271 - (view) |
Author: Brian Quinlan (bquinlan) * |
Date: 2014-01-26 03:18 |
I'm looking at "futures.patch".
I don't understand why these blocks are helpful: "_create_and_install_waiters" has two call sites, and both ("as_completed" and "wait") call it from within an _AcquireFutures context manager:
-    pending_count = sum(
-        f._state not in [CANCELLED_AND_NOTIFIED, FINISHED] for f in fs)
+    with _AcquireFutures(fs):
+        pending_count = sum(
+            f._state not in [CANCELLED_AND_NOTIFIED, FINISHED]
+            for f in fs)

-    for f in fs:
-        f._waiters.append(waiter)
+    with _AcquireFutures(fs):
+        for f in fs:
+            f._waiters.append(waiter)
It seems more plausible that the locks around the removals are fixing the bug but I don't see how. I'll look into it some more.
|
msg209272 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-26 03:35 |
> It seems more plausible that the locks around the removals are fixing the bug but I don't see how. I'll look into it some more.
It is the locks around the waiter removals that matter; I think there are only formatting changes elsewhere in the patch. The reason the locks make a difference is that there can be a race condition if multiple wait() calls compete to modify the f._waiters list.
|
msg209300 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-26 12:22 |
@Brian - Ah, I see now what you are referring to. The patch has changes to _create_and_install_waiters() which should not be there. The only code that needs to change is waiter removal as I originally suggested. I am set up with a dev environment now and will submit a revised patch.
|
msg209306 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-26 13:07 |
This patch shows the minimal desired outcome. It is not elegant in its current form, but does only what is necessary. Ultimately I think as_completed() should go its own way and not lock all Futures at once (#20297).
|
msg209439 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-27 12:49 |
Revised patch; I don't think there is a need to sort the keys when waiters are being removed since only one lock is acquired at a time. Stress tests on both wait() and as_completed() work with this approach.
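For reference, this is roughly what the stdlib's _AcquireFutures helper looks like (a paraphrase, not the exact code); the sort by id() exists to take many locks in a consistent order and avoid deadlock, which is unnecessary when only one Future's lock is held at a time:

class _AcquireFutures(object):
    """Context manager that acquires the conditions of many Futures."""

    def __init__(self, futures):
        # Sorting by id() gives a consistent lock-acquisition order across
        # threads, which only matters when several locks are held at once.
        self.futures = sorted(futures, key=id)

    def __enter__(self):
        for future in self.futures:
            future._condition.acquire()

    def __exit__(self, *args):
        for future in self.futures:
            future._condition.release()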
|
msg209741 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-01-30 22:26 |
An idea for a different possible fix: rather than cleaning up waiters in wait() and as_completed(), could they be removed in Future.set_result() and Future.set_exception()?
I'm not certain if any waiter should ever be notified twice; if not, perhaps set_result() and set_exception() could just include
self._waiters = []
after all waiters have been signalled.
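Roughly, the idea would look like this (a sketch only, not an applied patch; it assumes a waiter never needs to be notified again once the Future reaches a final state):

def set_result(self, result):
    # Sketch of the idea; set_exception() would change in the same way.
    with self._condition:
        self._result = result
        self._state = FINISHED
        for waiter in self._waiters:
            waiter.add_result(self)
        self._waiters = []          # drop waiters once they have been signalled
        self._condition.notify_all()
    self._invoke_callbacks()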
|
msg209853 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2014-02-01 00:51 |
New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default':
Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed
http://hg.python.org/cpython/rev/0bcf23a52d55
|
msg209854 - (view) |
Author: Brian Quinlan (bquinlan) * |
Date: 2014-02-01 00:52 |
Thanks very much for the patch Glenn!
|
msg209855 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2014-02-01 01:03 |
> New changeset 0bcf23a52d55 by Brian Quinlan in branch 'default':
> Issue #20319: concurrent.futures.wait() can block forever even if Futures have completed
> http://hg.python.org/cpython/rev/0bcf23a52d55
Hum, the change also contains:
+Fix warning message when `os.chdir()` fails inside
+ `test.support.temp_cwd()`. Patch by Chris Jerdonek.
Is it related to this issue?
|
msg209856 - (view) |
Author: Brian Quinlan (bquinlan) * |
Date: 2014-02-01 01:06 |
Oops, no. That was junk due to my sloppiness. I’ll fix it in a minute.
|
msg209858 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2014-02-01 01:21 |
Shouldn't it be fixed in 3.3 too?
|
msg219674 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-03 09:35 |
I'm using Python 3.4.1 compiled from source and I may be hitting this issue.
My workload launches two subprocesses in parallel and, whenever one is ready, launches another one. In one of the runs, the whole process got stuck after launching about 3K subprocesses, even though the underlying processes had in fact finished.
To wait for the finished subprocesses, I'm using FIRST_COMPLETED. Below is the core of my workload:
    for element in element_generator:
        while len(running) >= max_tasks:
            done, pending = concurrent.futures.wait(running, timeout=15.0, return_when=concurrent.futures.FIRST_COMPLETED)
            process_results(done)
            running = pending
        running.add(executor.submit(exe_subprocess, element))
I don't really know the best way to reproduce this, as I've run the same workload with different executables, more concurrency and faster response times, and I haven't seen the issue.
|
msg219676 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2014-06-03 10:12 |
> the whole process got stuck after launching about 3K subprocess
How many processes are running at the same time when the whole process is stuck?
|
msg219689 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-06-03 14:06 |
> My workload is launching two subprocess in parallel, and whenever one is ready, launches another one.
Since you have timeout=15.0, wait() should return at least every 15s. Can you determine if the wait is being repeatedly called in the while loop, and if so what Futures it is waiting on? In other words, is wait() being called continuously, or is wait() called and never returns?
|
msg219752 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-04 08:22 |
@haypo: I've reproduced the issue with both 2 and 3 processes in parallel.
@glangford: the wait is actually returning after the 15 seconds, although nothing is reported as finished. So it's getting stuck in the while loop. However, I imagine that without a timeout, the call would block forever.
What kind of debug information from the futures would be useful?
|
msg219769 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-06-04 18:47 |
> the wait is actually returning after the 15 seconds, although nothing is reported as finished...What kind of debug information from the futures would be useful?
What is the state of the pending Futures that wait() is stuck on? (e.g. display f.running() and f.done() ). This could be logged any time the "done" set is empty after wait() returns. For each "stuck" Future, was it previously logged as completed by a prior call to wait()?
What is the state of the ProcessPoolExecutor that the futures have been submitted to? Does it still function? (e.g. try submitting a trivial Future to the executor).
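For example (illustrative only, reusing the names from your loop), something like:

import logging
import concurrent.futures

done, pending = concurrent.futures.wait(
    running, timeout=15.0,
    return_when=concurrent.futures.FIRST_COMPLETED)
if not done:
    # Log the state of every Future that wait() is still stuck on.
    for f in pending:
        logging.warning("pending future %r: running=%s done=%s",
                        f, f.running(), f.done())
    # Probe the executor with a trivial task to see if it still works.
    probe = executor.submit(lambda: "alive")
    logging.warning("executor probe returned: %s", probe.result(timeout=5.0))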
|
msg219786 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-05 08:24 |
The Executor is still working (but I'm using a ThreadPoolExecutor). I can dynamically change the number of max tasks allowed, which successfully fires the new tasks.
After 2 days running, five tasks are in this weird state.
I will change the code as suggested and post my results.
|
msg220150 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-10 08:28 |
I was able to recreate the issue again, and now I have some info about the offending futures:
State: RUNNING, Result: None, Exception: None, Waiters: 0, Cancelled: False, Running: True, Done: False
The information does not seem very relevant. However, I can attach a console and debug from there.
|
msg220704 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-16 09:24 |
Any ideas how to debug this further?
In order to overcome this issue I have an awful workaround that tracks the maximum running time of a successful task; if any task has been running for more than x times that maximum, I consider it defunct and increase the number of concurrent allowed tasks. However, if the problem persists, I will eventually have a lot of zombie tasks, which will expose some additional problems.
|
msg220712 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-06-16 12:00 |
> Any ideas how to debug this further?
Wherever the cause of the problem might lie, here is one idea to consider, either to work around it or to gain additional information.
Do you need to submit your Futures just two at a time and tightly loop every 15s? Why not submit a larger block, wait for the block with as_completed(), and log each completion? Then submit another block when they are all done. To control how many run at one time, create the Executor with max_workers=2, for example. (I had an app that ran > 1,000 futures in this way, which worked fine.)
In general I suggest timing out only when there is really a problem, not as an expected event.
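A rough sketch of that pattern, reusing the names from your loop (block size is arbitrary):

import logging
import concurrent.futures

BLOCK_SIZE = 500   # arbitrary

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    block = []
    for element in element_generator:
        block.append(executor.submit(exe_subprocess, element))
        if len(block) == BLOCK_SIZE:
            for future in concurrent.futures.as_completed(block):
                logging.info("completed: %r", future)   # log each completion
                process_results({future})
            block = []
    # Handle the final, partially filled block.
    for future in concurrent.futures.as_completed(block):
        process_results({future})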
|
msg220715 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-16 12:12 |
I'm actually running millions of tasks, so sending them all at once would consume much more resources than needed.
The issue happens not only with 2 tasks in parallel but with higher numbers as well.
Also, your proposed solution has the problem that when you are waiting for the last tasks to finish you lose some parallelism.
In any case, it seems to me that there's some kind of race condition preventing the task from finishing, so if that's true the same could happen with as_completed.
|
msg220716 - (view) |
Author: Glenn Langford (glangford) * |
Date: 2014-06-16 12:31 |
Under the hood, the behaviour of as_completed is quite different. So there is no guarantee it would behave the same.
In any event, with millions of tasks you might consider Celery (I haven't used it myself):
http://www.celeryproject.org
|
msg220934 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-06-18 12:31 |
@glangford: Is that really your recommendation, to switch to celery? Python 3.4.1 should be production quality and issues like this should be addressed.
Note that I've successfully run millions of tasks using the same method, the only difference being that in that case the tasks weren't launching subprocesses. So I think that may be a starting point for debugging.
|
msg223219 - (view) |
Author: Sebastian Kreft (Sebastian.Kreft.Deezer) |
Date: 2014-07-16 14:57 |
Disregard the last messages; it seems to be a deadlock due to subprocess.
|
|
Date | User | Action | Args
2022-04-11 14:57:57 | admin | set | github: 64518
2014-07-16 14:57:48 | Sebastian.Kreft.Deezer | set | nosy: tim.peters, bquinlan, mark.dickinson, pitrou, python-dev, sbt, Sebastian.Kreft.Deezer; messages: + msg223219
2014-06-18 12:31:15 | Sebastian.Kreft.Deezer | set | messages: + msg220934
2014-06-18 12:22:50 | glangford | set | nosy: - glangford
2014-06-16 12:31:59 | glangford | set | messages: + msg220716
2014-06-16 12:12:52 | Sebastian.Kreft.Deezer | set | messages: + msg220715
2014-06-16 12:00:52 | glangford | set | messages: + msg220712
2014-06-16 09:37:52 | vstinner | set | nosy: - vstinner
2014-06-16 09:24:34 | Sebastian.Kreft.Deezer | set | messages: + msg220704
2014-06-10 08:28:40 | Sebastian.Kreft.Deezer | set | messages: + msg220150
2014-06-05 08:24:19 | Sebastian.Kreft.Deezer | set | messages: + msg219786
2014-06-04 18:47:55 | glangford | set | messages: + msg219769
2014-06-04 08:22:33 | Sebastian.Kreft.Deezer | set | messages: + msg219752
2014-06-03 14:06:44 | glangford | set | messages: + msg219689
2014-06-03 10:12:11 | vstinner | set | messages: + msg219676
2014-06-03 09:35:41 | Sebastian.Kreft.Deezer | set | nosy: + Sebastian.Kreft.Deezer; messages: + msg219674
2014-02-01 01:21:03 | pitrou | set | nosy: + pitrou; messages: + msg209858
2014-02-01 01:06:27 | bquinlan | set | messages: + msg209856
2014-02-01 01:03:46 | vstinner | set | messages: + msg209855
2014-02-01 00:52:48 | bquinlan | set | status: open -> closed; messages: + msg209854
2014-02-01 00:51:39 | python-dev | set | nosy: + python-dev; messages: + msg209853
2014-01-30 22:26:00 | glangford | set | messages: + msg209741
2014-01-29 14:41:00 | bquinlan | set | assignee: bquinlan
2014-01-27 12:49:14 | glangford | set | files: + issue20319-no-sort.patch; messages: + msg209439
2014-01-26 13:07:44 | glangford | set | files: + issue20319.patch; messages: + msg209306
2014-01-26 12:22:29 | glangford | set | messages: + msg209300
2014-01-26 03:35:39 | glangford | set | messages: + msg209272
2014-01-26 03:18:54 | bquinlan | set | messages: + msg209271
2014-01-23 18:54:15 | bquinlan | set | messages: + msg208977
2014-01-23 13:31:37 | vstinner | set | files: + futures.patch; keywords: + patch; messages: + msg208920
2014-01-23 13:20:39 | glangford | set | messages: + msg208918
2014-01-23 13:13:15 | vstinner | set | nosy: + vstinner; messages: + msg208917
2014-01-23 13:12:50 | vstinner | set | nosy: + sbt
2014-01-23 12:31:42 | mark.dickinson | set | nosy: + tim.peters
2014-01-23 12:31:25 | mark.dickinson | set | messages: + msg208915
2014-01-23 12:29:56 | mark.dickinson | set | nosy: + mark.dickinson
2014-01-22 18:54:39 | glangford | set | messages: + msg208840
2014-01-21 21:24:11 | glangford | set | nosy: + bquinlan
2014-01-21 00:18:39 | glangford | create |