Message 359702 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	belopolsky, christian.heimes, eric.smith, gdr@garethrees.org, lukasz.langa, martin.panter, pablogsal, remi.lapeyre, rhettinger, terry.reedy, tim.peters, tshepang
Date	2020-01-10.02:24:42
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1578623083.29.0.958363917748.issue17005@roundup.psfhosted.org>
In-reply-to

Content
Let's stir this up a bit ;-) I recently had occasion to exchange ideas with Larry Hastings about topsorts for use in a package manager he's writing. I thought his API for adding edges was ... perfect: add(node, dependson) So, e.g., add(A, B, C) says A depends on B, and on C, but says nothing else about B and C. This is almost always the way topsorts show up in real life: you know what a thing depends on* directly, but have scant idea how things may be in the opposite direction. For example, you know that baking a cake requires (among other things) flour, but have no real idea of the universe of other things that require flour. Likewise Larry knows which packages each package requires, but not the reverse. Etc. Nodes with no edges are trivial to add then: add(A). If you're building input to a topsort from a graph, also trivial: for n, p in node2predecessors.items(): topsort_instance.add(n, *p) and it doesn't matter whether the predecessors in the original graph were stored in a list, set, tuple, or any other iterable container. Nothing special about an empty collection of predecessors either. The other big thing that came up is that most topsort programs were useless for his goal: downloading and installing packages takes significant wall clock time, and there's huge opportunity for exploiting parallelism. But a flat sequence in topsort order gives no clue about what _can_ be done in parallel. Instead you really want several methods, like prepare() to say that you're done building the graph; and, get_ready() to get all nodes ready to go, which haven't already been returned by get_ready() calls (initially, this is the collection of nodes with no predecessors, which prepare() can know); and, done(node) to say that `node` (returned by a previous call to get_ready()) is finished now, so that the next call to get_ready() can return all (if any) nodes for which `node` was the last non-done predecessor; and, is_active() to say whether the topsort can make more progress (is_active() returns True iff there are still nodes ready to go that haven't yet been passed out by get_ready(), or if the number of nodes marked done() is less than the number that have been passed out by get_ready()). These are all easy to code, and allow the user to extract all the opportunities for parallelism that theoretically exist. There is no static order that can do so, since the opportunities that exist at any time depend on the times and order in which nodes are marked done() in real life - and that may vary from one run to the next. Of course a deterministic static order can be derived from those, like, e.g., def static_order(self): self.prepare() while self.is_active(): for node in self.get_ready(): yield node self.done(node) For parallel use, e.g., self.prepare() while instance.is_active(): for node in instance.get_ready(): inq.put(node) node = outq.get() instance.done(node) where worker threads or processes take nodes to work on off of queue `inq`, then, when the work for a node is done, put the node on queue `outq`. Am I seriously suggesting this for Python? Sure. It's fun to advance the practical state of the art :-)

Let's stir this up a bit ;-)  I recently had occasion to exchange ideas with Larry Hastings about topsorts for use in a package manager he's writing.  I thought his API for adding edges was ... perfect:

    add(node, *dependson)

So, e.g., add(A, B, C) says A depends on B, and on C, but says nothing else about B and C.  This is almost always the way topsorts show up in real life:  you know what a thing depends *on* directly, but have scant idea how things may be in the opposite direction.  For example, you know that baking a cake requires (among other things) flour, but have no real idea of the universe of other things that require flour.  Likewise Larry knows which packages each package requires, but not the reverse.  Etc.

Nodes with no edges are trivial to add then:  add(A).

If you're building input to a topsort from a graph, also trivial:

    for n, p in node2predecessors.items():
        topsort_instance.add(n, *p)

and it doesn't matter whether the predecessors in the original graph were stored in a list, set, tuple, or any other iterable container.  Nothing special about an empty collection of predecessors either.

The other big thing that came up is that most topsort programs were useless for his goal:  downloading and installing packages takes significant wall clock time, and there's huge opportunity for exploiting parallelism.  But a flat sequence in topsort order gives no clue about what _can_ be done in parallel.  Instead you really want several methods, like

    prepare()

to say that you're done building the graph; and,

    get_ready()

to get all nodes ready to go, which haven't already been returned by get_ready() calls (initially, this is the collection of nodes with no predecessors, which prepare() can know); and,

    done(node)

to say that `node` (returned by a previous call to get_ready()) is finished now, so that the next call to get_ready() can return all (if any) nodes for which `node` was the last non-done predecessor; and,

    is_active()

to say whether the topsort can make more progress (is_active() returns True iff there are still nodes ready to go that haven't yet been passed out by get_ready(), or if the number of nodes marked done() is less than the number that have been passed out by get_ready()).

These are all easy to code, and allow the user to extract all the opportunities for parallelism that theoretically exist.  There is no static order that can do so, since the opportunities that exist at any time depend on the times and order in which nodes are marked done() in real life - and that may vary from one run to the next.

Of course a deterministic static order can be derived from those, like, e.g.,

    def static_order(self):
        self.prepare()
        while self.is_active():
            for node in self.get_ready():
                yield node
                self.done(node)

For parallel use, e.g.,

    self.prepare()
    while instance.is_active():
        for node in instance.get_ready():
            inq.put(node)
        node = outq.get()
        instance.done(node)

where worker threads or processes take nodes to work on off of queue `inq`, then, when the work for a node is done, put the node on queue `outq`.

Am I seriously suggesting this for Python?  Sure.  It's fun to advance the practical state of the art :-)

History
Date	User	Action	Args
2020-01-10 02:24:43	tim.peters	set	recipients: + tim.peters, rhettinger, terry.reedy, belopolsky, eric.smith, christian.heimes, lukasz.langa, tshepang, gdr@garethrees.org, martin.panter, pablogsal, remi.lapeyre
2020-01-10 02:24:43	tim.peters	set	messageid: <1578623083.29.0.958363917748.issue17005@roundup.psfhosted.org>
2020-01-10 02:24:43	tim.peters	link	issue17005 messages
2020-01-10 02:24:42	tim.peters	create