This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Improve graphlib.TopologicalSort by removing the prepare step
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: larry Nosy List: Dennis Sweeney, eric.smith, larry, pablogsal, tim.peters
Priority: normal Keywords:

Created on 2022-03-28 16:20 by larry, last changed 2022-04-11 14:59 by admin.

Messages (12)
msg416182 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-03-28 16:20
I've maintained my own topological sort class for a while now.  The API I came up with (or was it Tim and I?) was adopted for graphlib.TopologicalSort().  Some of my method names are slightly different, but the APIs are so similar, I can do a little renaming and run Lib/test/test_graphlib using my implementation.  (For clarity I'm going to write the rest of this issue using the names from graphlib.)
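For context, the stdlib API under discussion (spelled TopologicalSorter in the graphlib module, though this issue calls it TopologicalSort) currently has the two-phase shape Larry describes: add nodes, then prepare, then iterate with get_ready/done. A minimal usage example:

```python
from graphlib import TopologicalSorter

ts = TopologicalSorter()
ts.add("b", "a")    # "b" depends on "a"
ts.add("c", "b")    # "c" depends on "b"
ts.prepare()        # freezes the graph; raises CycleError if there is a cycle

order = []
while ts.is_active():
    ready = ts.get_ready()   # nodes whose predecessors have all been done
    order.extend(ready)
    for node in ready:
        ts.done(node)

print(order)        # ['a', 'b', 'c']
```

After prepare() is called, add() is no longer allowed; that modal split is exactly what this issue proposes to remove.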

Recently I made an improvement to my version: it no longer has the modal behavior where there's a "before prepare" phase where you add nodes and an "after prepare" phase where you can call get_ready and done.

Instead, in my new and improved version the graph is always maintained in a "ready" state.  You can call add, get_ready, and done in any order, as much as you like.  prepare doesn't do anything besides the cycle check--but read on!

This approach required tweaking some semantics behind the scenes.  My graph object now maintains an internal "dirty" flag.  It's set to True after any edges are added, and only gets cleared after checking for cycles.  prepare runs this cycle check, but get_ready *also* runs this cycle check.  (Though in both cases, only when the dirty flag is set, of course.)  In theory this means you no longer need the prepare call.  However, it's still useful, because you can call it to check for a CycleError.

So, on the one hand, this means that get_ready can throw a CycleError, which it never did before.  On the other hand, it won't throw for existing users of the library, because they never add edges after their prepare call.  So this semantic change shouldn't break existing users, assuming they're not misusing the existing API.

Another wrinkle: adding a dependency on a node that's already been handed out by get_ready (but hasn't been returned by done) is ignored.  Again, not something existing users of graphlib can do, so this shouldn't break code.

There's one final semantic change I made worth knowing about: when you return a node via done, the graph simply forgets about it.  This means you could add it a second time (!) and it could go for a complete ride through the graph again, wheeee!  This also means that when all nodes have been returned by done, the graph is essentially returned to its initial pristine state.  This too seems completely harmless, because with the existing library it's illegal to add nodes to a graph after calling prepare, so nobody's doing it, so it won't break any code.
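The semantics described above can be sketched roughly as follows. This is hypothetical illustration code, not Larry's actual implementation; the class name, attribute names, and details are invented, and it uses naive Kahn's-algorithm cycle checking rather than anything optimized:

```python
from graphlib import CycleError

class AlwaysReadySorter:
    """Sketch: the graph is always usable; a dirty flag defers the
    cycle check to the next prepare() or get_ready() call."""

    def __init__(self):
        self._preds = {}          # node -> set of not-yet-done predecessors
        self._succs = {}          # node -> set of dependents
        self._handed_out = set()  # returned by get_ready, not yet done
        self._dirty = False       # set whenever an edge is added

    def add(self, node, *predecessors):
        self._preds.setdefault(node, set())
        self._succs.setdefault(node, set())
        for pred in predecessors:
            if pred in self._handed_out:
                continue  # dependencies on already-handed-out nodes are ignored
            self._preds.setdefault(pred, set())
            self._succs.setdefault(pred, set())
            self._preds[node].add(pred)
            self._succs[pred].add(node)
            self._dirty = True

    def _check_cycles(self):
        # Kahn's algorithm over a scratch copy of the in-degrees.
        indegree = {n: len(p) for n, p in self._preds.items()}
        stack = [n for n, d in indegree.items() if d == 0]
        visited = 0
        while stack:
            node = stack.pop()
            visited += 1
            for succ in self._succs.get(node, ()):
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    stack.append(succ)
        if visited != len(indegree):
            raise CycleError("nodes are in a cycle")
        self._dirty = False

    def prepare(self):
        # prepare is now nothing but an explicit cycle check
        if self._dirty:
            self._check_cycles()

    def get_ready(self):
        # get_ready *also* runs the cycle check, so prepare is optional
        if self._dirty:
            self._check_cycles()
        ready = tuple(n for n, p in self._preds.items()
                      if not p and n not in self._handed_out)
        self._handed_out.update(ready)
        return ready

    def done(self, node):
        # the graph simply forgets a finished node
        self._handed_out.discard(node)
        for succ in self._succs.pop(node, set()):
            self._preds[succ].discard(node)
        del self._preds[node]
```

With this sketch, a node returned via done can be added again and flow through the graph a second time, and get_ready raises CycleError if edges added since the last check form a cycle.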

I'm happy to contribute my version.  But I was lazy and didn't worry about speed or memory use in my implementation; it's strewn with sets and things.  I figured I could always optimize it later.  But "later" could become "now" if we were interested in trying to merge this behavior into Python's graphlib.

Interested?
msg416306 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2022-03-29 22:42
My personal usage of a topological sort is restricted to maybe 100 entries max, so I can't really test this in any meaningful way.

That, and the project that uses it is stuck on 3.6, so I haven't switched to the graphlib version yet.
msg416321 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-30 05:50
Out of curiosity, what are the use cases for adding nodes after get_ready has already produced nodes?

I was wondering about avoiding the need to call prepare() by having it automatically do the cycle-checking at the first get_ready() call and then raising ValueError if add() is called any time thereafter.

Assuming we do want to be able to add() after a get_ready(), is there a reason that "forgetting" already-produced nodes is the correct behavior, as opposed to remembering all nodes ever added, and raising iff the addition creates a cycle among all nodes ever added or depends on an already-yielded node?
msg416322 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-30 05:54
> depends on an already-yielded node

I mean "creates a new not-yet-yielded dependency for an already-yielded node".
msg416323 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-03-30 06:06
I'm using my graph library to manage a list of tasks that need doing in some sort of proper order.  One task may spawn other tasks at runtime, and we don't necessarily know what the tasks will be until runtime.  It's way more convenient to simply add such tasks on demand, rather than trying to preemptively pre-add all such possible tasks before preparing the graph, or creating additional graphs.

For example, consider a tool that downloads and digests zip files full of music from an online store.  Your initial tasks might represent "download the zip file", then "decompress and examine the contents of the zip file".  You could then iterate over the contents of the zip file, adding different tasks based on what you find--one pipeline of tasks for media files (FLAC/MP3/OGG/etc), another for the playlist, a third if you don't *find* a playlist, a fourth for image files, etc.  (Not that you'd necessarily write such a tool this way, but it's at least plausible.)

The new nodes needn't be connected to the existing nodes for this to still be useful.  You could reuse the same graph object for all your tasks.
msg416324 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-03-30 06:29
> Assuming we do want to be able to add() after a get_ready(), is there
> a reason that "forgetting" already-produced nodes is the correct
> behavior, as opposed to remembering all nodes ever added, and
> raising iff the addition creates a cycle among all nodes ever
> added or depends on an already-yielded node?

I'm not sure "correct" applies here, because I don't have a sense that one behavior is conceptually more correct than the other.  But in implementing my API change, forgetting about the done nodes seemed more practical.

The "benefit" to remembering done nodes: the library can ignore dependencies to them in the future, forever.  Adding a dependency to a node that's already been marked as "done" doesn't make much conceptual sense to me, but as a practical thing maybe it's useful?  I'm not sure, but it doesn't seem all that useful.

I can only come up with a marginal reason why remembering done nodes is useful.  Let's say all your tasks fan out from some fundamental task, like "download master list of work" or something.  One of your tasks might discover additional tasks that need to run, and conceptually those tasks might depend on your "download master list of work" task.  If the graph remembers the done list forever, then adding that dependency is harmless.  If the graph forgets about done nodes, then adding that dependency could re-introduce that task to the graph, which could goof things up.  So maybe it's a little bit of a footgun?  But on the other hand: you already know you're running, and you're a task that was dependent on the master list of work, which means you implicitly know that dependency has been met.  So just skip adding the redundant dependency and you're fine.

On the other hand, forgetting about the nodes has a definite practical benefit: the graph consumes less memory.  If you use a graph object for a long time, the list of done nodes it's holding references to would continue to grow and grow and grow.  If we forget about done nodes, we free up all that memory, and done membership testing maybe gets faster.

I guess I'm not married to the behavior.  If someone had a great conceptual or practical reason why remembering the done nodes forever was better, I'd be willing to listen to reason.
msg416404 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-03-30 23:38
Having slept on it, I think the footgun is slightly bigger than I gave it credit for.  Also the problem of nodes lingering forever is easily solved: give the user control.

My next iteration will keep the done nodes around, but I'll also add a forget() method that compels the graph to forget all done nodes.
msg416405 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-31 00:02
Various kinds of tasks:

- "Power switch must be on." Needs to done the first time. _May_ need to be done again later (if some later task turns the power off again). Can be done any number of times without harm (beyond the expense of checking), so long as the switch is on.

- "Ensure gas tank is full." Probably needs to be done anew for every new added task that depends on it.

- "Remove outermost layer of skin." Probably shouldn't be done more than once ;-)

- "Flip Najdorf switch." Who knows? Doing it a second time - or failing to do it a second time - may be necessary, harmless, or deadly.

I'd rather not bother with any of this dicey guessing. While some dynamism may be attractive, what it all "should mean" appears to be a rat's nest, and depends on domain knowledge graphlib can't possibly have.

I doubt there is a compelling default. As is, two tasks are considered to be "the same" task if and only if they compare equal, so that's the least _surprising_ default for tasks added later.

"Two tasks that compare equal may or may not be considered the same task, depending on the execution history at the time the question is posed" is at best expedient, at worst disastrous.
msg416407 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-03-31 00:08
I'm not sure I follow you.  What do you suggest graphlib is guessing at?  I thought we were discussing what graphlib's (documented, intentional) behavior should be; if there was any guessing going on, it was us doing it, guessing at what behavior the user would prefer.

I appreciate you corresponding on the issue, but I'm having difficulty understanding what light you're shining on the problem.
msg416408 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-31 00:57
I believe I'm elaborating on your "footgun".

It doesn't matter to me whether we pick some scheme and document it, _if_ that scheme is incoherent, impossible to remember, error-prone, etc.

That's how, e.g., regular expression syntax was designed. "I know! ++*??+) doesn't have a coherent meaning now, so let's say it means to match a prime number of the pattern it follows" ;-)

That is, an error-prone extension is worse than no extension at all.

The algorithm now isn't guessing at anything: it's saying up front that two tasks are the same task if and only if they compare equal. Context, execution history, ..., nothing else is relevant. It's simple. Complicating a simple thing may open new possibilities, but creates new traps too.

One trap is pretty obviously making "the rules" for when two tasks are the same task depend on the execution history at the time the question is asked.

That goes away, though, if the current rule is retained ("the same iff =="), but can be explicitly overridden by .forget() (or some such).

That doesn't make it a no-brainer, though. For example, do

.add(A, B)

and run until A and B are marked done. Now do

.add(B, C)

What then? We're back in a guessing game again. We've just been told that B depends on C first. But we already did B. Depending on _domain_ knowledge we cannot have, that may or may not be worthy of raising an exception.

You can make up any rule you want about that and arbitrarily declare victory, but the chance that a user will realize it's not the behavior _their_ domain needs is approximately 0 until after they've been burned by it. So now it's error-prone, at least to them.

FWIW, in that specific case I'd say "tough beans - you told us too late that B depends on C to stop B the first time, but now that you've told us we'll respect it in future".

Another twist: assuming that's "the rule", what does

.add(B, C)

really mean? If B really is "the same task" as it was the first time around, well, it's already been marked done. Are we supposed to do it _again_ now? Why? Why not?

It's hard for me to find any _compelling_ sense here - just masses of more-or-less arbitrary implementation decisions. In which case "hard to remember" follows soon after.

None of that haunts the current API. It all follows quite directly from what "depends on" means to essentially everyone.
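Concretely, the ambiguous late addition in Tim's example cannot arise under the current stdlib API at all, because graphlib's add() refuses to run once prepare() has been called:

```python
from graphlib import TopologicalSorter

ts = TopologicalSorter()
ts.add("A", "B")         # A depends on B
ts.prepare()

try:
    ts.add("B", "C")     # the late addition from the example above
    outcome = "allowed"
except ValueError:
    outcome = "refused"

print(outcome)           # refused
```

Any relaxation of that rule is what forces the "what does a late edge mean?" question onto the library.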
msg416592 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-04-02 21:50
I agree that the API should have as few surprises as possible.  AFAICT you haven't made any terminal pronouncements like "it's impossible to add this feature without too many unacceptable surprises", so I'll proceed assuming we can find an API (and semantics) that we're all happy with.

I've come around on the idea that forgetting "done" nodes is too surprising.  The least surprising behavior is: once we've added a node to the graph, the graph remembers it.

Now I'll propose a second simple rule, which you've already touched on: you can't add a dependency after a node has been returned by get_ready().  Attempting this always throws an exception; which exception specifically TBD.  "Tough beans".

I think those two rules are easy enough to remember and to reason about.  They meaningfully expand the utility of the library with minimal surprise.  The library behaves identically for users of the existing API but permits new use cases too.  Adding this functionality would therefore mean fewer users would discover too late that their use case isn't supported.


As far as my "long-lived graph objects will consume more and more memory over time" caveat, there's a better solution than "forget": graph.remove(*nodes).  (My version already has a "remove" method, and I forgot that graphlib's doesn't.)  Allowing the user to remove a node from the graph gives them explicit control, and the semantics should be obvious and unsurprising; if you--the user--remove a node, and later you--the user--re-add that node to the graph, it behaves identically to any other node the graph has never seen before.  Removing a node intuitively removes all edges to that node.

Two notes on "remove" if we decide to go that route.  First, I'd ensure you can remove a node at any time. Nodes have three externally visible states wrt TopologicalSort:

1) added but not published by get_ready,
2) published by get_ready but not returned using done, and
3) done.  You should be able to remove a node in any of those three states.

Removing a node in 2) should be equivalent to calling done before calling remove; that is, if you're removing the node anyway, you don't need to call done.

Second, the current underlying implementation would make remove really slow.  Nodes don't remember their predecessors, only their successors, so removing a node would be O(n).  If we expected remove to get a lot of use, we'd probably want to change how the graph is stored to speed up this operation.
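To illustrate why: with only successor links, the one way to find the edges pointing *at* a node is to scan every node's successor set. A hypothetical helper over a successors-only adjacency mapping:

```python
def remove_slow(successors, node):
    """Remove `node` from a successors-only adjacency mapping.

    Without predecessor links, every node's successor set must be
    scanned to find edges into `node`: O(number of nodes)."""
    successors.pop(node, None)
    for succs in successors.values():
        succs.discard(node)
    return successors
```

Storing a predecessor set alongside each successor set would make removal proportional to the node's degree instead.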
msg416594 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2022-04-02 21:50
One final aside.  Let me preface this by saying: I'm not proposing the following for graphlib.TopologicalSort.  It's obviously too late to change that object this much.  It's just something I'm thinking about--maybe I'll use this in my own library.

Where we are now, the graphlib.TopologicalSort object is simultaneously a static representation of a graph--which the object doesn't provide a public API for you to examine!--and a stateful single-use iterator over that graph.  My proposed change to the API seems to increase the tension between these two sets of semantics.  Perhaps a better set of semantics, that more successfully maps my use case to the graph object, would be as follows:

    * you can add() nodes and edges at any time.
    * get_ready() always returns the current list of nodes with no predecessors.  There is no internal "we already told you about this one" state.
    * replace done() with remove(), which removes the node and all edges from/to it from the graph.
    * static_order() is still fine.

I think this would make it easy to reason about the object's behavior, and would be a better match to my use case where you're continually adding (and removing?) nodes, not just during an initial "populate the graph" phase.
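Those four semantics can be sketched as follows. This is hypothetical code for the alternative design, not graphlib.TopologicalSort; all names are invented:

```python
class DynamicGraph:
    """Sketch of a stateless-iteration graph: add() any time,
    get_ready() is a pure query, remove() replaces done()."""

    def __init__(self):
        self._preds = {}   # node -> set of predecessors
        self._succs = {}   # node -> set of successors

    def add(self, node, *predecessors):
        self._preds.setdefault(node, set())
        self._succs.setdefault(node, set())
        for pred in predecessors:
            self._preds.setdefault(pred, set())
            self._succs.setdefault(pred, set())
            self._preds[node].add(pred)
            self._succs[pred].add(node)

    def get_ready(self):
        # Stateless: always the current nodes with no predecessors.
        return tuple(n for n, p in self._preds.items() if not p)

    def remove(self, node):
        # Removing a node removes all edges from/to it.
        for succ in self._succs.pop(node, set()):
            self._preds[succ].discard(node)
        for pred in self._preds.pop(node, set()):
            self._succs[pred].discard(node)

    def static_order(self):
        # Destructive sketch: repeatedly drain the ready set.
        while self._preds:
            ready = self.get_ready()
            if not ready:
                raise ValueError("graph contains a cycle")
            yield from ready
            for node in ready:
                self.remove(node)
```

Because get_ready() carries no "already told you" state, calling it twice in a row returns the same nodes; progress happens only through remove().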

Again, not appropriate to consider for graphlib.TopologicalSort.
History
Date User Action Args
2022-04-11 14:59:57  admin           set     github: 91301
2022-04-02 21:50:46  larry           set     messages: + msg416594
2022-04-02 21:50:30  larry           set     messages: + msg416592
2022-03-31 00:57:07  tim.peters      set     messages: + msg416408
2022-03-31 00:08:32  larry           set     messages: + msg416407
2022-03-31 00:02:25  tim.peters      set     messages: + msg416405
2022-03-30 23:38:43  larry           set     messages: + msg416404
2022-03-30 06:29:14  larry           set     messages: + msg416324
2022-03-30 06:06:50  larry           set     messages: + msg416323
2022-03-30 05:54:44  Dennis Sweeney  set     messages: + msg416322
2022-03-30 05:50:51  Dennis Sweeney  set     nosy: + Dennis Sweeney; messages: + msg416321
2022-03-29 22:42:20  eric.smith      set     messages: + msg416306
2022-03-28 16:20:34  larry           create