Message360017
> I am slightly confused about what .prepare() should do. Why
> is this step necessary?
To say "I'm done adding edges". Any call to add() after prepare() should raise an exception. Likewise, e.g., any call to get_ready() before prepare() should raise an exception. In a bog-standard topsort implementation, saving for each node a sequence of successors and a count of predecessors, this is also the time to collect all the nodes that have no predecessors (the first batch returned by get_ready()).
Much the same could be done without prepare() by get_ready() making a special case out of the first time it's called. That's more magical, though. "I'm done adding edges" is utterly non-magical.
> - Why we need the .done() method here? Why not instead make get_ready()
> simply a generator so you can just write
>
> for node in self.get_ready():
The point of done() is to enable maximum exploitation of parallelism. As already sketched, if a user doesn't care about that, fine, a different method (like static_order()) can generate all the nodes in _some_ static topsort order, with no need for done().
But suppose a user does care about parallelism. Consider graph
A -> B
A -> C
A -> D
B -> D
Output A B C D is a topsort, but useless unless the user is content to "do" one node at a time.
Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... something iterable). A is handed out to worker processes/threads, but get_ready() will return an empty iterable until done(A) is called. Indeed, if "doing" A fails, it's NOT the case that anything else can ever be started.
If/when "doing A" succeeds, then done(A) is called, and the next get_ready() returns [B, C]. Those can be done in parallel, but D can't be started until done(B) is called. done(B) may or may not be called before done(C) is called - the topsort itself has no way to know in advance, nor _should_ it impose its own view of that. Note that D does not depend on C, so there's no need to wait for _both_ in [B, C] to finish. It's necessary and sufficient that B be marked done() for D to be ready to go.
> It seems that the .done() is very tight to use this API as a "task
> scheduler" but maybe I am doing something here in my understanding
> of the API.
done() says nothing about how the user "should" schedule work items, but instead allows get_ready() to return all the work items whose predecessors have been marked done() (but haven't already been passed out by get_ready()). That's the maximum set of nodes that _can_ be worked on at the time. The topsort class itself has no policy about how or when they "should" be worked on, get_ready() is just identifying all the possibilities that exist. Which is impossible to know unless the class is also told which nodes it already passed out have finished - the purpose of done().
is_active() eventually returns False when all the nodes passed out by get_ready() have been marked done(), _and_ there are no more nodes ready to pass out. At that point, there's a cycle in the input relations if and only if there's some node get_ready() never passed out.
In my prototype implementation, that's another thing prepare() does: checks for a cycle, and raises CycleError if there is one. The user can catch & ignore that if they like, and continue calling get_ready() and done() until no more progress can be made. I think it's more likely, though, that the user would stare at the cycle attached to the CycleError instance, do whatever it takes to break the cycle, and start over again. |
|
Date |
User |
Action |
Args |
2020-01-15 02:04:00 | tim.peters | set | recipients:
+ tim.peters, rhettinger, terry.reedy, belopolsky, orsenthil, eric.smith, christian.heimes, lukasz.langa, tshepang, gdr@garethrees.org, martin.panter, pablogsal, remi.lapeyre |
2020-01-15 02:04:00 | tim.peters | set | messageid: <1579053840.4.0.695709367978.issue17005@roundup.psfhosted.org> |
2020-01-15 02:04:00 | tim.peters | link | issue17005 messages |
2020-01-15 02:03:59 | tim.peters | create | |
|