Message 360017 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	belopolsky, christian.heimes, eric.smith, gdr@garethrees.org, lukasz.langa, martin.panter, orsenthil, pablogsal, remi.lapeyre, rhettinger, terry.reedy, tim.peters, tshepang
Date	2020-01-15.02:03:59
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1579053840.4.0.695709367978.issue17005@roundup.psfhosted.org>
In-reply-to

Content
> I am slightly confused about what .prepare() should do. Why > is this step necessary? To say "I'm done adding edges". Any call to add() after prepare() should raise an exception. Likewise, e.g., any call to get_ready() before prepare() should raise an exception. In a bog-standard topsort implementation, saving for each node a sequence of successors and a count of predecessors, this is also the time to collect all the nodes that have no predecessors (the first batch returned by get_ready()). Much the same could be done without prepare() by get_ready() making a special case out of the first time it's called. That's more magical, though. "I'm done adding edges" is utterly non-magical. > - Why we need the .done() method here? Why not instead make get_ready() > simply a generator so you can just write > > for node in self.get_ready(): The point of done() is to enable maximum exploitation of parallelism. As already sketched, if a user doesn't care about that, fine, a different method (like static_order()) can generate all the nodes in _some_ static topsort order, with no need for done(). But suppose a user does care about parallelism. Consider graph A -> B A -> C A -> D B -> D Output A B C D is a topsort, but useless unless the user is content to "do" one node at a time. Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... something iterable). A is handed out to worker processes/threads, but get_ready() will return an empty iterable until done(A) is called. Indeed, if "doing" A fails, it's NOT the case that anything else can ever be started. If/when "doing A" succeeds, then done(A) is called, and the next get_ready() returns [B, C]. Those can be done in parallel, but D can't be started until done(B) is called. done(B) may or may not be called before done(C) is called - the topsort itself has no way to know in advance, nor _should_ it impose its own view of that. Note that D does not depend on C, so there's no need to wait for _both_ in [B, C] to finish. It's necessary and sufficient that B be marked done() for D to be ready to go. > It seems that the .done() is very tight to use this API as a "task > scheduler" but maybe I am doing something here in my understanding > of the API. done() says nothing about how the user "should" schedule work items, but instead allows get_ready() to return all the work items whose predecessors have been marked done() (but haven't already been passed out by get_ready()). That's the maximum set of nodes that _can_ be worked on at the time. The topsort class itself has no policy about how or when they "should" be worked on, get_ready() is just identifying all the possibilities that exist. Which is impossible to know unless the class is also told which nodes it already passed out have finished - the purpose of done(). is_active() eventually returns False when all the nodes passed out by get_ready() have been marked done(), _and_ there are no more nodes ready to pass out. At that point, there's a cycle in the input relations if and only if there's some node get_ready() never passed out. In my prototype implementation, that's another thing prepare() does: checks for a cycle, and raises CycleError if there is one. The user can catch & ignore that if they like, and continue calling get_ready() and done() until no more progress can be made. I think it's more likely, though, that the user would stare at the cycle attached to the CycleError instance, do whatever it takes to break the cycle, and start over again.

> I am slightly confused about what .prepare() should do. Why
> is this step necessary?

To say "I'm done adding edges".  Any call to add() after prepare() should raise an exception.  Likewise, e.g., any call to get_ready() before prepare() should raise an exception.  In a bog-standard topsort implementation, saving for each node a sequence of successors and a count of predecessors, this is also the time to collect all the nodes that have no predecessors (the first batch returned by get_ready()).

Much the same could be done without prepare() by get_ready() making a special case out of the first time it's called.  That's more magical, though.  "I'm done adding edges" is utterly non-magical.

> - Why we need the .done() method here? Why not instead make get_ready()
> simply a generator so you can just write
>
>    for node in self.get_ready():

The point of done() is to enable maximum exploitation of parallelism.  As already sketched, if a user doesn't care about that, fine, a different method (like static_order()) can generate all the nodes in _some_ static topsort order, with no need for done().

But suppose a user does care about parallelism.  Consider graph

A -> B
A -> C
A -> D
B -> D

Output A B C D is a topsort, but useless unless the user is content to "do" one node at a time.

Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... something iterable).  A is handed out to worker processes/threads, but get_ready() will return an empty iterable until done(A) is called.  Indeed, if "doing" A fails, it's NOT the case that anything else can ever be started.

If/when "doing A" succeeds, then done(A) is called, and the next get_ready() returns [B, C].  Those can be done in parallel, but D can't be started until done(B) is called.  done(B) may or may not be called before done(C) is called - the topsort itself has no way to know in advance, nor _should_ it impose its own view of that.  Note that D does not depend on C, so there's no need to wait for _both_ in [B, C] to finish.  It's necessary and sufficient that B be marked done() for D to be ready to go.

> It seems that the .done() is very tight to use this API as a "task
> scheduler" but maybe I am doing something here in my understanding
> of the API.

done() says nothing about how the user "should" schedule work items, but instead allows get_ready() to return all the work items whose predecessors have been marked done() (but haven't already been passed out by get_ready()).  That's the maximum set of nodes that _can_ be worked on at the time.  The topsort class itself has no policy about how or when they "should" be worked on, get_ready() is just identifying all the possibilities that exist.  Which is impossible to know unless the class is also told which nodes it already passed out have finished - the purpose of done().

is_active() eventually returns False when all the nodes passed out by get_ready() have been marked done(), _and_ there are no more nodes ready to pass out.  At that point, there's a cycle in the input relations if and only if there's some node get_ready() never passed out.

In my prototype implementation, that's another thing prepare() does:  checks for a cycle, and raises CycleError if there is one.  The user can catch & ignore that if they like, and continue calling get_ready() and done() until no more progress can be made.  I think it's more likely, though, that the user would stare at the cycle attached to the CycleError instance, do whatever it takes to break the cycle, and start over again.

History
Date	User	Action	Args
2020-01-15 02:04:00	tim.peters	set	recipients: + tim.peters, rhettinger, terry.reedy, belopolsky, orsenthil, eric.smith, christian.heimes, lukasz.langa, tshepang, gdr@garethrees.org, martin.panter, pablogsal, remi.lapeyre
2020-01-15 02:04:00	tim.peters	set	messageid: <1579053840.4.0.695709367978.issue17005@roundup.psfhosted.org>
2020-01-15 02:04:00	tim.peters	link	issue17005 messages
2020-01-15 02:03:59	tim.peters	create