This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: More informative Queue class: new method that returns number of unfinished tasks
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.7
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: rhettinger, serhiy.storchaka, slytomcat
Priority: normal Keywords:

Created on 2017-02-20 09:12 by slytomcat, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 192 closed slytomcat, 2017-02-20 10:00
Messages (11)
msg288189 - (view) Author: (slytomcat) * Date: 2017-02-20 09:12
Class queue.Queue control the number of unfinished tasks via method task_done(). But it is only possible to get the information about all task done (via join() method). 
I'm sure that exposing the number of unfinished tasks (unfinished_tasks class variable) can be very useful in many situations when you need more control over the process.

But it is not good idea to provide write access to this internal variable (as it controls internal queue class status). Better way - provide RO access via class method like qsize or empty. It can be look like this:

    def unfinished(self):
        return self.unfinished_tasks


One example of this method usage: there is not optimal function _adjust_thread_count in concurrent.futures.ThreadPoolExecutor with following comment: 
    # TODO(bquinlan): Should avoid creating new threads if there are more
    # idle threads than items in the work queue.

It can be easily done with following condition:

    if self._work_queue.unfinished() <= len(self._threads):
        return
msg288261 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-21 03:56
> I'm sure that exposing the number of unfinished tasks
> (unfinished_tasks class variable) can be very useful in many situations

Sorry, but I don't share your certainty of the usefulness of this method.  Since the task_done/join API was added many years ago, I've not seen any use cases arise where we needed to quantify the number of unfinished tasks.  Have you seen an actual need in real code or is this PR more of an intuitive guess that the extra method might be useful?

In the examples of task_done/join that I've seen, the number of unfinished tasks is typically in the range: qsize() plus between zero and the number of parallel consumers.  Where the actual value it falls this range doesn't seem very useful.  One the queue is empty, join() just waits to the parallel consumers to complete their one final task that is already under way.

Also, I'm reluctant to expand the API for several reasons.  Keeping it small makes it more intelligible (giving users additional options for things theh rarely need makes it more difficult to make correct decisions in the common cases). Also, it would be nice to avoid a ripple effect into the other APIs such as multiprocessing which are supersets of this API. And lastly, I'm disinclined to have another informational method like empty, full, and qsize which have to be documented as potentially introducing unexpected race conditions in user code (all three methods return information that may be completely out-of-date or wrong by the time the caller sees the result).
msg288269 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-21 05:41
I concur with Raymond.
msg288277 - (view) Author: (slytomcat) * Date: 2017-02-21 08:25
Raymond, Serhiy, thanks for your opinions.

I agree that this method like empty, full, and qsize returns information that may be out-of-date in time of its usage.

But like those empty, full, and qsize it provides information that helps to make some decisions.

What I disagree that this information can be correctly estimated based on qsize and number of parallel consumers as there is no information what exactly consumers do at the moment. Such estimation will be absolutely useless for a decision making.

In the example (see https://github.com/slytomcat/cpython/commit/ea313986d86c083e81624fbc8a7ab6a75784e15d) we need to estimate the number of unfinished tasks and comparing it to the number of active threads we can decide: should we create a new thread or not. There is no reason to wait until all tasks done (via join) we need to make decision in working conditions.

Actually I've implemented this solution (Queue.unfinished + modified concurrent.futures.ThreadPoolExecutor) in my project (see https://github.com/slytomcat/yandex-disk-client) and it works perfectly: the number of working threads is increasing only on massive flow of tasks, while weak flow of tasks keeps low number of working threads.
And even when Queue.unfinished gives wrong (outdated) information, it can lead to creation one new thread that is not really required at the moment. That is not a big mistake, I suppose (comparing to creation all N threads during placing first N tasks in queue without considering the volume of concurrent tasks).
msg288294 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-21 13:09
> Should avoid creating new threads if there are more
# idle threads than items in the work queue.

Shouldn't this just check to see if qsize() is greater than zero?  That would mean that there are no idle threads.  I'm not seeing why there would be any idle threads if there is work in the queue (the queue unblocks idle threads).

I don't think "unfinished() - num_threads" makes sense.  The meaning of "unfinished() - qsize()" is the number of non-idle worker threads (work is taken out of the queue but has not yet reported that the task is done).
msg288305 - (view) Author: (slytomcat) * Date: 2017-02-21 15:38
Not exactly.... there are 3 cases:

If qsize <> 0 it means there is no idle consumers threads, all of them must be busy: we need to create one more. No doubt.

If qsize = 0 it means one of two cases: 
- all consumers threads are busy: we need to create one more
- some or all consumers threads are idle: no need create new one.

But there is no way to distinguish two last cases.

That's why I consider that (unfinished() <= num_threads) gives clear key for decision.
msg288306 - (view) Author: (slytomcat) * Date: 2017-02-21 15:55
One more problem that adjusting of number of threads is performed exactly after placing the new task in the queue. In in some cases we can find that qsuze <> 0 but it doesn't mean that there is no idle threads. It can be equal to 1 (just queued task) as no threads manage to get it yet (don't forgot about GIL).

So even qsuze <> 0 dosn't mean that we need to create a new thread.

But (unfinished() <= num_threads) works perfect in this case also.
msg288307 - (view) Author: (slytomcat) * Date: 2017-02-21 15:59
num_threads - unfinished() = the estimation for number of idle threads.
msg288357 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-02-22 14:01
> And even when Queue.unfinished gives wrong (outdated) information, 
> it can lead to creation one new thread that is not really required 
> at the moment. That is not a big mistake, I suppose (comparing to
> creation all N threads during placing first N tasks in queue 
> without considering the volume of concurrent tasks).

IMO, this isn't worth expanding the Queue API.  The thread pool concept explicitly allows up to poolsize threads and occasionally having one extra thread within the allowed poolsize isn't important.   Also, it is unclear why a new thread is being launched if there in nothing in the work queue (unfinished tasks is only interesting when qsize==0).

For other users, the new method is problematic because it is distracting and prone to misuse (potentially out-of-date upon return and potentially meaningless if task_done isn't being used).  I believe the new method will cause more problems than it fixes.

Also, I really don't like the mission creep.  Queue objects are primarily about managing a queue of inputs.  Monitoring and managing consumers is another (mostly orthogonal) realm.
msg288376 - (view) Author: (slytomcat) * Date: 2017-02-22 17:51
I can't fully agree with this:
>Queue objects are primarily about managing a queue of inputs.  
>Monitoring and managing consumers is another (mostly orthogonal) realm.

Monitoring of consumers is already added via task_done() and join() methods. At least this two methods allows to understand that all consumers are in idle state.
The unfinished() - is just functional extension to this existing functionality.

>For other users, the new method is problematic because it is distracting
>and prone to misuse (potentially out-of-date upon return and potentially 
>meaningless if task_done isn't being used).  I believe the new method 
>will cause more problems than it fixes.

This sounds reasonable because I also understand that this method is not usable in all cases as in most cases task_done is not used.

Probably my idea for method unfinished is really not so good...

Actually I've manage to find unpublished Queue.unfinished_tasks variable just in several minutes when I tried to find solution for threaded PoolExecutor. Hope that any curious developer can find it too.

I don't mind if you close this CR. 

Thanks for pleasant discussion.
msg288377 - (view) Author: (slytomcat) * Date: 2017-02-22 17:54
I've closed pull request on GitHub.
History
Date User Action Args
2022-04-11 14:58:43adminsetgithub: 73789
2017-02-22 17:54:15slytomcatsetstatus: open -> closed

messages: + msg288377
stage: resolved
2017-02-22 17:51:41slytomcatsetmessages: + msg288376
2017-02-22 14:01:16rhettingersetmessages: + msg288357
2017-02-21 15:59:50slytomcatsetmessages: + msg288307
2017-02-21 15:55:12slytomcatsetmessages: + msg288306
2017-02-21 15:38:22slytomcatsetmessages: + msg288305
2017-02-21 13:09:23rhettingersetmessages: + msg288294
2017-02-21 08:25:26slytomcatsetmessages: + msg288277
2017-02-21 05:41:22serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg288269
2017-02-21 03:56:52rhettingersetassignee: rhettinger
messages: + msg288261
2017-02-20 10:02:35slytomcatsettype: enhancement
2017-02-20 10:00:23slytomcatsetpull_requests: + pull_request158
2017-02-20 09:12:24slytomcatcreate