Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a topological sort algorithm #61207

Closed
rhettinger opened this issue Jan 20, 2013 · 64 comments
Closed

Add a topological sort algorithm #61207

rhettinger opened this issue Jan 20, 2013 · 64 comments
Labels
3.9 only security fixes type-feature A feature request or enhancement

Comments

@rhettinger
Copy link
Contributor

BPO 17005
Nosy @tim-one, @rhettinger, @terryjreedy, @abalkin, @orsenthil, @vstinner, @ericvsmith, @tiran, @merwok, @ambv, @gareth-rees, @vadmium, @wimglenn, @Zaharid, @DimitrisJim, @pablogsal, @miss-islington, @remilapeyre, @gaborbernat, @maresb, @adriangb
PRs
  • bpo-17005: Add a topological sort algorithm #11583
  • bpo-17005: Minor improvements to the documentation of TopologicalSorter #18155
  • bpo-17005: Move topological sort functionality to its own module #20558
  • [3.9] bpo-17005: Move topological sort functionality to its own module (GH-20558) #20561
  • Files
  • mro_merge.py
  • topological_sort.py: topological_sort.py
  • tsort.1.py: Early experiment with only a predecessor dict
  • ts.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-12-05.06:31:18.135>
    created_at = <Date 2013-01-20.21:39:12.282>
    labels = ['type-feature', '3.9']
    title = 'Add a topological sort algorithm'
    updated_at = <Date 2021-12-17.18:50:01.169>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2021-12-17.18:50:01.169>
    actor = 'eric.araujo'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-12-05.06:31:18.135>
    closer = 'rhettinger'
    components = []
    creation = <Date 2013-01-20.21:39:12.282>
    creator = 'rhettinger'
    dependencies = []
    files = ['28800', '48361', '48362', '48842']
    hgrepos = []
    issue_num = 17005
    keywords = ['patch', 'patch', 'patch']
    message_count = 63.0
    messages = ['180319', '180321', '180329', '190419', '333806', '333810', '333816', '333831', '334000', '334002', '334010', '334012', '334014', '334100', '334103', '334104', '334107', '343584', '343586', '344143', '359672', '359702', '360016', '360017', '360018', '360019', '360020', '360022', '360157', '360214', '360258', '360567', '360580', '365614', '365618', '365622', '365631', '365632', '365638', '365639', '365644', '370488', '370494', '370496', '370500', '370501', '370503', '370506', '370508', '370509', '370510', '370511', '370514', '370515', '370516', '370517', '370519', '370520', '370521', '375979', '375980', '407319', '408807']
    nosy_count = 23.0
    nosy_names = ['tim.peters', 'rhettinger', 'terry.reedy', 'belopolsky', 'orsenthil', 'vstinner', 'eric.smith', 'christian.heimes', 'eric.araujo', 'lukasz.langa', 'tshepang', 'gdr@garethrees.org', 'martin.panter', 'wim.glenn', 'Zahari.Dim', 'Paddy McCarthy', 'Jim Fasarakis-Hilliard', 'pablogsal', 'miss-islington', 'remi.lapeyre', 'gaborjbernat', 'maresb', 'adriangb']
    pr_nums = ['11583', '18155', '20558', '20561']
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17005'
    versions = ['Python 3.9']

    @rhettinger
    Copy link
    Contributor Author

    I suggest adding a topological sort algorithm to the standard library.

    In addition to being a fundamental algorithm, it is immediately useful in demonstrating how the MRO computation works and for pure Python implementations of MRO logic. IIRC, the pgen code was also expressed in pure Python for the same reason.

    I've attached a first-draft of the algorithm and an alternative that only implements a topological merge. This is just an early draft and there are a number of open points:

    • which module to put it in
    • a better implementation may be possible (perhaps using fewer dictionaries and sets).

    @rhettinger rhettinger added the type-feature A feature request or enhancement label Jan 20, 2013
    @tiran
    Copy link
    Member

    tiran commented Jan 20, 2013

    Good idea!

    I'm using http://pypi.python.org/pypi/topsort/0.9 for a couple of years. The initial code was written by Tim Peters.

    @rhettinger rhettinger self-assigned this Jan 20, 2013
    @ericvsmith
    Copy link
    Member

    +1

    I'll note (by inspection only) your example code doesn't work under Python 3.x! :)

    (print as a statement)

    @abalkin
    Copy link
    Member

    abalkin commented May 31, 2013

    +1

    I had "tsort" in my own utilities library for so long that I thought it was in stdlib already. I found this issue after unsuccessfully searching docs.python.org. :-)

    I think functools will be a fine place for it. It is somewhat related to total ordering and solves the problem which is common when implementing functional mini-languages.

    Another possibility is shutil given that tsort is a standard POSIX command, <http://pubs.opengroup.org/onlinepubs/009695299/utilities/tsort.html\>, but I think this will be too obscure.

    @abalkin
    Copy link
    Member

    abalkin commented Jan 17, 2019

    Here is an implementation that I've used for years. It is somewhat shorter than the one in PR 11583:

    class CycleError(LogicalError, ValueError):
        """dependencies cycle detected
    """
    
    def tsort(pairs):
        """topological sort
    Just like unix tsort(1)
    
        >>> tsort([(1, 2), (7, 8), (8, 10), (7, 4), (2, 3), (4, 10)])
        [1, 7, 2, 8, 4, 3, 10]
        >>> try:
        ...     tsort([(1,2), (2,1)])
        ... except CycleError as e:
        ...     print(e)
        ([], Counter({1: 1, 2: 1}), {1: [2], 2: [1]})
        """
        # inspired by http://mail.python.org/pipermail/python-list/1999-July/002831.html
        successors = {}
        predecessor_counts = collections.Counter()
        for x, y in pairs:
            successors.setdefault(x, []).append(y)
            predecessor_counts.setdefault(x, 0)
            predecessor_counts[y] += 1
        ordered = [x for x in predecessor_counts
                   if predecessor_counts[x] == 0]
        for x in ordered:
            del predecessor_counts[x]
            for y in successors.get(x, ()):
                predecessor_counts[y] -= 1
                if predecessor_counts[y] == 0:
                    ordered.append(y)
        if predecessor_counts:
            raise CycleError(ordered, predecessor_counts, successors)
        return ordered

    @rhettinger
    Copy link
    Contributor Author

    Thanks. I have some API nits to work to make this more broadly useful. I'll all those on the PR comments soonish :-)

    @pablogsal
    Copy link
    Member

    The one in PR 11583 is twice as fast:

    timeit for -> topsort([(2,11),(9,11),(9,8),(9,10),(10,11),(10,3),(11,7),(11,5),(8,7),(8,3)])
    12.4 µs ± 59.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    timeit for -> tsort([(2,11),(9,11),(9,8),(9,10),(10,11),(10,3),(11,7),(11,5),(8,7),(8,3)])
    29.1 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented Jan 17, 2019

    As the name been already discussed ?

    I fear that topsort might only be clear to people already knowing what it does. topoligical_sort would be more discoverable and explicit and one can always do

        from functools import topological_sort as tsort

    if he wants to save some typing later.

    @gareth-rees
    Copy link
    Mannequin

    gareth-rees mannequin commented Jan 18, 2019

    I approve in general with the principle of including a topological sort algorithm in the standard library. However, I have three problems with the approach in PR 11583:

    1. The name "topsort" is most naturally parsed as "top sort" which could be misinterpreted (as a sort that puts items on top in some way). If the name must be abbreviated then "toposort" would be better.

    2. "Topological sort" is a terrible name: the analogy with topological graph theory is (i) unlikely to be helpful to anyone; and (ii) not quite right. I know that the name is widely used in computing, but a name incorporating "linearize" or "linear order" or "total order" would be much clearer.

    3. The proposed interface is not suitable for all cases! The function topsort takes a list of directed edges and returns a linear order on the vertices in those edges (if any linear order exists). But this means that if there are any isolated vertices (that is, vertices with no edges) in the dependency graph, then there is no way of passing those vertices to the function. This means that (i) it is inconvenient to use the proposed interface because you have to find the isolated vertices in your graph and add them to the linear order after calling the function; (ii) it is a bug magnet because many programmers will omit this step, meaning that their code will unexpectedly fail when their graph has an isolated vertex. The interface needs to be redesigned to take the graph in some other representation.

    @pablogsal
    Copy link
    Member

    1. The name "topsort" is most naturally parsed as "top sort" which could be misinterpreted (as a sort that puts items on top in some way). If the name must be abbreviated then "toposort" would be better.

    I totally agree that topsort is a bad name, I used it more or less as a dummy for starting the discussion about the implementation.

    1. "Topological sort" is a terrible name: the analogy with topological graph theory is (i) unlikely to be helpful to anyone; and (ii) not quite right. I know that the name is widely used in computing, but a name incorporating "linearize" or "linear order" or "total order" would be much clearer.

    Topological sort (not as the function name) but as an operation is a very well known concept and is well defined. If you are referring to not use "Topological Sort" in the docstrings or the documentation, I strongly oppose.

    Regarding the interface, I am more happy to change it once there is an agreement. I am still awaiting Raymond's comments regarding this so we can start discussing.

    @gareth-rees
    Copy link
    Mannequin

    gareth-rees mannequin commented Jan 18, 2019

    Just to elaborate on what I mean by "bug magnet". (I'm sure Pablo understands this, but there may be other readers who would like to see it spelled out.)

    Suppose that you have a directed graph represented as a mapping from a vertex to an iterable of its out-neighbours. Then the "obvious" way to get a total order on the vertices in the graph would be to generate the edges and pass them to topsort:

        def edges(graph):
            return ((v, w) for v, ww in graph.items() for w in ww)
        order = topsort(edges(graph))

    This will appear to work fine if it is never tested with a graph that has isolated vertices (which would be an all too easy omission).

    To handle isolated vertices you have to remember to write something like this:

        reversed_graph = {v: [] for v in graph}
        for v, ww in graph.items():
            for w in ww:
                reversed_graph[w].append(v)
        order = topsort(edges(graph)) + [
              v for v, ww in graph.items() if not ww and not reversed_graph[v]]

    I think it likely that beginner programmers will forget to do this and be surprised later on when their total order is missing some of the vertices.

    @terryjreedy
    Copy link
    Member

    I think 'toposort' is a good name for a function that implements 'topological sorting'.
    https://en.wikipedia.org/wiki/Topological_sorting

    @terryjreedy terryjreedy added the 3.8 only security fixes label Jan 18, 2019
    @ericvsmith
    Copy link
    Member

    This is why I prefer the API exposed by https://pypi.org/project/toposort/

    list(toposort({2: {11},
                   9: {11, 8, 10},
                   10: {11, 3},
                   11: {7, 5},
                   8: {7, 3},
                  }))

    returns [{3, 5, 7}, {8, 11}, {2, 10}, {9}]

    For an node with no edges, use an empty set:

    list(toposort({100: set(),
    2: {11},
    9: {11, 8, 10},
    10: {11, 3},
    11: {7, 5},
    8: {7, 3},
    }))
    [{3, 100, 5, 7}, {8, 11}, {2, 10}, {9}]

    I also don't think we should provide multiple APIs. Let's just provide one, and recipes for any helpers, if needed. For example, to flatten the result into a list. Or to take a list of edges as the input.

    @pablogsal
    Copy link
    Member

    I have updated the PR to receive a dictionary of sets as in Eric V. Smith's package. I have maintained the guts of the current algorithm as it scales much better:

    >> test_data = {x:{x+n for n in range(100)} for x in range(1000)}

    >>> %timeit list(toposort.toposort(test_data)) # The one in PyPi
    910 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    >> %timeit list(functools.toposort(test_data)) # In this PR

    69.3 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    >>> list(functools.toposort(l)) == list(toposort.toposort(l))
    True

    @pablogsal
    Copy link
    Member

    Although I think the API with the dictionary may be cleaner, I still do not like it completely.

    1. One of the reasons is that implementing C3 directly with the topological sort is not straightforward as the different nodes that are at the same level are unordered and therefore any permutation of all these is a valid topological sort, while C3 comes from a stable sort. The version with pairs is stable on the other hand.

    2. Topological sorting usually is well-defined on totally connected graphs, so I do not know what exactly it means to topologically sort two disjoint graphs. This was one of the main drawbacks of the tuple-based approach, but I think it may be a good property.

    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented Jan 20, 2019

    1. Topological sorting usually is well-defined on totally connected graphs, so I do not know what exactly it means to topologically sort two disjoint graphs. This was one of the main drawbacks of the tuple-based approach, but I think it may be a good property.

    To give a use-case, I'm currently using topological sort to order a list of tasks where edges represent dependencies between tasks. Sometime a group of tasks does not share a dependency with another group any relative order between those two groups is correct:

    A -> B

    C
    / \
    D E
    \ /
    F

    The order (A, B, C, D, E, F) would be correct in this example as would (C, A, E, B, D, F).

    I think the general topological sort in Python should be able to handle such inputs.

    @pablogsal
    Copy link
    Member

    (A, B, C, D, E, F) would not be correct as B is order 2 and "C" and "A" is order 1. The correct orders are the inner-group permutations of:

    [{'A', 'F'}, {'B', 'D', 'E'}, {'C'}]

    (Assuming that the lower diagram goes from top to bottom)

    @rhettinger
    Copy link
    Contributor Author

    Attaching some of my 2013 work on this. Here are some API notes:

    • spell-out the name topological_sort()

    • allow input as ordered pairs like the Unix tsort command
      https://en.wikipedia.org/wiki/Tsort#Examples

    • allow more convenient input as dependency sequences (like graphviz):
      [['a', 'b', 'c', 'x], ['b', 'd', 'e', 'y']]
      is short for and equivalent to:
      [(a,b), (b,c), (c,x), (b,d), (d, e), (e, y)]

    • return both the sorted sequence and cycles
      (both are individually useful and the latter
      is helpful in debugging in only the former is wanted)

    • desire to match the C3 MRO computation

    • would like the ordering to be stable and deterministic
      (this precludes picking arbitrary elements from sets).

    @pablogsal
    Copy link
    Member

    • allow input as ordered pairs like the Unix tsort command
    • allow more convenient input as dependency sequences (like graphviz):

    This is how my first proposal started (and I still like it a bit more than the dictionary input), but there are some concerns (check other comments) regarding this API, like representing isolated nodes or disjoint graphs.

    return both the sorted sequence and cycles

    Regarding the output, I like returning a collection of sets, where every set represents all possible elements of the same order in the result. This also helps if the user has some expectation regarding the ordering. For example, in:

    ['ABDGI', 'BEG', 'CEH', 'KCFHJ']

    the results starting with

    ['A', 'B', 'D'

    and

    ['A', 'B', 'K'

    are both valid.

    With the current implementation, this is the equivalent of C3 linearization:

          from itertools import tee
          from collections import defaultdict
          def c3_linearization(inheritance_seqs):
             graph = defaultdict(set)
             for seq in inheritance_seqs:
                 a, b = tee(seq)
                 next(b, None)
                 for child, parent in zip(a,b):
                     graph[child].add(parent)
             retun ((list(group) for group in functools.toposort(graph)), [])
             return tuple(reversed(order))
      >>> class A: pass
      >>> class B(A): pass
      >>> class C(A): pass
      >>> class D(B, C): pass
    
       >> D.__mro__
      (__main__.D, __main__.B, __main__.C, __main__.A, object)
    
       >> c3_linearization([(D, B, A, object), (D, C, A, object)])
      [{__main__.D}, {__main__.B, __main__.C}, {__main__.A}, {object}]
    

    What do you think?

    @rhettinger
    Copy link
    Contributor Author

    Unless Łukasz gives us a nod to work out this API for the second beta, we'll need to defer this to 3.9. IMO, the API in the patch is not user friendly. It started on the right path but became contorted (for both inputs and outputs) to serve unusual cases.

    @gareth-rees
    Copy link
    Mannequin

    gareth-rees mannequin commented Jan 9, 2020

    I'd like to push back on the idea that graphs with isolated vertices are "unusual cases" as suggested by Raymond.

    A very common use case (possibly the most common) for topological sorting is job scheduling. In this use case you have a collection of jobs, some of which have dependencies on other jobs, and you want to output a schedule according to which the jobs can be executed so that each job is executed after all its dependencies.

    In this use case, any job that has no dependencies, and is not itself a dependency of any other job, is an isolated vertex in the dependency graph. This means that the proposed interface (that is, the interface taking only pairs of vertices) will not be suitable for this use case. Any any programmer who tries to use it for this use case will be setting themselves up for failure.

    @tim-one
    Copy link
    Member

    tim-one commented Jan 10, 2020

    Let's stir this up a bit ;-) I recently had occasion to exchange ideas with Larry Hastings about topsorts for use in a package manager he's writing. I thought his API for adding edges was ... perfect:

        add(node, *dependson)

    So, e.g., add(A, B, C) says A depends on B, and on C, but says nothing else about B and C. This is almost always the way topsorts show up in real life: you know what a thing depends *on* directly, but have scant idea how things may be in the opposite direction. For example, you know that baking a cake requires (among other things) flour, but have no real idea of the universe of other things that require flour. Likewise Larry knows which packages each package requires, but not the reverse. Etc.

    Nodes with no edges are trivial to add then: add(A).

    If you're building input to a topsort from a graph, also trivial:

        for n, p in node2predecessors.items():
            topsort_instance.add(n, *p)

    and it doesn't matter whether the predecessors in the original graph were stored in a list, set, tuple, or any other iterable container. Nothing special about an empty collection of predecessors either.

    The other big thing that came up is that most topsort programs were useless for his goal: downloading and installing packages takes significant wall clock time, and there's huge opportunity for exploiting parallelism. But a flat sequence in topsort order gives no clue about what _can_ be done in parallel. Instead you really want several methods, like

        prepare()

    to say that you're done building the graph; and,

        get_ready()

    to get all nodes ready to go, which haven't already been returned by get_ready() calls (initially, this is the collection of nodes with no predecessors, which prepare() can know); and,

        done(node)

    to say that node (returned by a previous call to get_ready()) is finished now, so that the next call to get_ready() can return all (if any) nodes for which node was the last non-done predecessor; and,

        is_active()

    to say whether the topsort can make more progress (is_active() returns True iff there are still nodes ready to go that haven't yet been passed out by get_ready(), or if the number of nodes marked done() is less than the number that have been passed out by get_ready()).

    These are all easy to code, and allow the user to extract all the opportunities for parallelism that theoretically exist. There is no static order that can do so, since the opportunities that exist at any time depend on the times and order in which nodes are marked done() in real life - and that may vary from one run to the next.

    Of course a deterministic static order can be derived from those, like, e.g.,

        def static_order(self):
            self.prepare()
            while self.is_active():
                for node in self.get_ready():
                    yield node
                    self.done(node)

    For parallel use, e.g.,

        self.prepare()
        while instance.is_active():
            for node in instance.get_ready():
                inq.put(node)
            node = outq.get()
            instance.done(node)

    where worker threads or processes take nodes to work on off of queue inq, then, when the work for a node is done, put the node on queue outq.

    Am I seriously suggesting this for Python? Sure. It's fun to advance the practical state of the art :-)

    @pablogsal
    Copy link
    Member

    Am I seriously suggesting this for Python? Sure. It's fun to advance the practical state of the art :-)

    I think this API looks very interesting! I have some questions before start implementing it to play a bit with it:

    • I am slightly confused about what .prepare() should do. Why is this step necessary?

    • Why we need the .done() method here? Why not instead make get_ready() simply a generator so you can just write

        for node in self.get_ready():

    It seems that the .done() is very tight to use this API as a "task scheduler" but maybe I am doing something here in
    my understanding of the API.

    @tim-one
    Copy link
    Member

    tim-one commented Jan 15, 2020

    I am slightly confused about what .prepare() should do. Why
    is this step necessary?

    To say "I'm done adding edges". Any call to add() after prepare() should raise an exception. Likewise, e.g., any call to get_ready() before prepare() should raise an exception. In a bog-standard topsort implementation, saving for each node a sequence of successors and a count of predecessors, this is also the time to collect all the nodes that have no predecessors (the first batch returned by get_ready()).

    Much the same could be done without prepare() by get_ready() making a special case out of the first time it's called. That's more magical, though. "I'm done adding edges" is utterly non-magical.

    • Why we need the .done() method here? Why not instead make get_ready()
      simply a generator so you can just write

      for node in self.get_ready():

    The point of done() is to enable maximum exploitation of parallelism. As already sketched, if a user doesn't care about that, fine, a different method (like static_order()) can generate all the nodes in _some_ static topsort order, with no need for done().

    But suppose a user does care about parallelism. Consider graph

    A -> B
    A -> C
    A -> D
    B -> D

    Output A B C D is a topsort, but useless unless the user is content to "do" one node at a time.

    Instead get_ready() first returns [A] (or a tuple, or a generator, or a set ... something iterable). A is handed out to worker processes/threads, but get_ready() will return an empty iterable until done(A) is called. Indeed, if "doing" A fails, it's NOT the case that anything else can ever be started.

    If/when "doing A" succeeds, then done(A) is called, and the next get_ready() returns [B, C]. Those can be done in parallel, but D can't be started until done(B) is called. done(B) may or may not be called before done(C) is called - the topsort itself has no way to know in advance, nor _should_ it impose its own view of that. Note that D does not depend on C, so there's no need to wait for _both_ in [B, C] to finish. It's necessary and sufficient that B be marked done() for D to be ready to go.

    It seems that the .done() is very tight to use this API as a "task
    scheduler" but maybe I am doing something here in my understanding
    of the API.

    done() says nothing about how the user "should" schedule work items, but instead allows get_ready() to return all the work items whose predecessors have been marked done() (but haven't already been passed out by get_ready()). That's the maximum set of nodes that _can_ be worked on at the time. The topsort class itself has no policy about how or when they "should" be worked on, get_ready() is just identifying all the possibilities that exist. Which is impossible to know unless the class is also told which nodes it already passed out have finished - the purpose of done().

    is_active() eventually returns False when all the nodes passed out by get_ready() have been marked done(), _and_ there are no more nodes ready to pass out. At that point, there's a cycle in the input relations if and only if there's some node get_ready() never passed out.

    In my prototype implementation, that's another thing prepare() does: checks for a cycle, and raises CycleError if there is one. The user can catch & ignore that if they like, and continue calling get_ready() and done() until no more progress can be made. I think it's more likely, though, that the user would stare at the cycle attached to the CycleError instance, do whatever it takes to break the cycle, and start over again.

    @pablogsal
    Copy link
    Member

    Fair enough, I will read this carefully again and try to sketch a prototype soon :)

    @tim-one
    Copy link
    Member

    tim-one commented Jan 15, 2020

    I'll add ts.py, which was a work-in-progress that implements a minor variation of most everything I typed about. If nothing else, its _find_cycle is useful as a non-recursive linear-time cycle finder (recursion is deadly here because recursive depth-first search can easily "blow the stack" on larger graphs).

    There's also "if 1:"/"else:" blocks that set up parallel cases, using threads or processes, and two ways of managing the parallelism (the one I showed before, and a more elaborate one that puts an upper bound on how large the queues can grow - which is sometimes "a problem" for multiprocessing.queue).

    @pablogsal
    Copy link
    Member

    Disregarding the API, what do you think about the approach of #11583 for the implementation? Under my benchmarks (check previous comments) it seems to perform very good with regards to memory and time.

    @pablogsal
    Copy link
    Member

    Is also notable to mention that you can also provide the graph as a dictionary to the constructor:

    >> graph = {D: {B, C}, C: {A}, B: {A}, A:{object}}
    >> ts = TopologicalSorter(graph)

    @rhettinger
    Copy link
    Contributor Author

    How about I post a PR so we can talk about something concrete. Then you two can either fight it to its death or you can join me in making it is good as possible, hopefully the latter :-)

    I am not happy with the current API but do accept that both of you are in satisfied with it.

    @maresb
    Copy link
    Mannequin

    maresb mannequin commented May 31, 2020

    It's great to have this feature in the standard library, but it really seems to clutter the functools documentation. Everything else in functools applies directly to functions and methods. Suddenly reading graph theory terminology was disorienting for me. (Due to context, I expected "node" to mean some new type of Python function.) It makes the documentation significantly longer, and IMHO abstruse. Would it be sensible to move this to a separate module?

    @pablogsal
    Copy link
    Member

    Any suggestions for the new module? I assume we can move this to a new topsort/topological_sort or similar...

    What do people think?

    @remilapeyre
    Copy link
    Mannequin

    remilapeyre mannequin commented May 31, 2020

    Could it make sense to have this in the often proposed imath module?

    It's integers per se but Both number theory and graph theory are part of discrete mathematics so it may feel more at home there?

    @DimitrisJim
    Copy link
    Mannequin

    DimitrisJim mannequin commented May 31, 2020

    It does seem out of place in functools, intensified by it's odd interjection among the other functools objects.

    Considering heapq and bisect exist as standalone modules, the idea that topological sorting could go in its own module wouldn't be without precedent.

    @pablogsal
    Copy link
    Member

    If we move it, I would prefer a new module that somehow makes clear the scope, something like graphutils or graphtheory or graph (although this las one will probably collide with many existing things).

    @gaborbernat
    Copy link
    Mannequin

    gaborbernat mannequin commented May 31, 2020

    I like graphutils for what it's worth.

    @DimitrisJim
    Copy link
    Mannequin

    DimitrisJim mannequin commented May 31, 2020

    The downside I see with any graph prefixed names is the fact that it implies a larger collection of graph operations.

    Add that to the fact that people might be more tempted to propose many graph related algorithms/utilities to a module with the same name.

    A more localized name solves that.

    @pablogsal
    Copy link
    Member

    The downside I see with any graph prefixed names is the fact that it implies a larger collection of graph operations.

    I don't see that an issue. In the general term is better to have a general name and discuss further improvements than just name the module "topological_sort" and now being cornered if we ever want to add new graph-related stuff.

    We can always say "no" to new functionality but changing the name of the module is not possible once is released.

    @pablogsal
    Copy link
    Member

    I would like to hear Raymond and Tim advise on what the best name for the new module should be :)

    @maresb
    Copy link
    Mannequin

    maresb mannequin commented May 31, 2020

    dependencytools?

    But I don't think it qualifies yet for the plural...

    @tim-one
    Copy link
    Member

    tim-one commented May 31, 2020

    functools is clearly a poor place for this. imath would also be. graph_stuff_probably_limited_to_a_topsort is the only accurate name ;-) Off-the-wall possibilities include misclib (stuff that just doesn't fit anywhere else - yet) and cslib (Computer Science-y stuff).

    Whichever name wins, we should probably look to ensure the name isn't also of a package already in (non-trivial) use.

    @rhettinger
    Copy link
    Contributor Author

    I vote for it being in its own module (like difflib). That would also be a good place to put my proposed tsort() function with its simpler API.

    @pablogsal
    Copy link
    Member

    I checked and from the list of proposed names with "graph" prefixes, only "graphutils" is not colliding with something on Pypi.

    For the record "cslib" and "misclib" are also available but the scope of those feel much much bigger.

    I am going to open a PR to move this to "graphutils" for now. We can discuss better in the PR for better names given that seems that there is an agreement on moving this out of functools.

    @ericvsmith
    Copy link
    Member

    I'd prefer a new module just for this. As for names, I like toposort over topsort.

    I already have a library on PyPI named toposort. Personally, I don't have a problem with the stdlib taking over that name, and I'd just deprecate mine at 3.9, or whatever. I'm not sure if there are any ramifications of doing that, however.

    Or, pick another unused name, like toposortlib or something.

    @DimitrisJim
    Copy link
    Mannequin

    DimitrisJim mannequin commented May 31, 2020

    Another option, graphlib[1], does exist on PyPI but is not maintained and currently read-only by the author. Other flavors[2][3] of the same name also don't seem to have much adoption so they shouldn't confuse if a name like graphlib was chosen.

    [1] https://github.com/bruth/graphlib/
    [2] https://github.com/MengLiuPurdue/graph_lib
    [3] https://github.com/EmileTrotignon/GraphLib

    @pablogsal
    Copy link
    Member

    Opened #20558. I have *initially* proposed "graphlib" as it feels more in-line with other stdlib module names and as Jim pointed it does not collide with anything major. Also, I have proposed this initial name to not keep the scope too short in case we want to add new things in the future (although we can always say "no" and just keep this functionality).

    @pablogsal
    Copy link
    Member

    New changeset 2f172d8 by Pablo Galindo in branch 'master':
    bpo-17005: Move topological sort functionality to its own module (GH-20558)
    2f172d8

    @miss-islington
    Copy link
    Contributor

    New changeset 0a67463 by Miss Islington (bot) in branch '3.9':
    bpo-17005: Move topological sort functionality to its own module (GH-20558)
    0a67463

    @PaddyMcCarthy
    Copy link
    Mannequin

    PaddyMcCarthy mannequin commented Aug 27, 2020

    I've been playing with Python 3.9.0rc1 and was looking at a particular graph to see when it released tasks for processing.
    I ran the following code:

    from functools import reduce
    from pprint import pprint as pp
    from collections import defaultdict
    from graphlib import TopologicalSorter
    from heapq import heapify, heappush, heappop, heappushpop

    print("""

    ### TASK DEPENDENCY

    A -\> B -\> C
    ↓    ↓    ↓
    D -\> E -\> F
    

    """)
    graph3 = {"A": {"B", "D"},
    "B": {"C", "E"},
    "C": {"F"},
    "D": {"E"},
    "E": {"F"},
    }
    tasktime = {task: 2 for task in 'ABCDEF'}

    def simulate(graph, tasktime):
        print("\n## NEW SIMULATION")
        t = 0
        heap = []
        heapify(heap)
        print("\n# Task runtimes:")
        for node, tm in tasktime.items():
            print(f"  {node}: {tm}")
    print("\n# OUTPUT. (:> for task start times, <: for stop times).\n")
    ts = TopologicalSorter(graph)
    ts.prepare()
    while ts.is_active():
        for node in ts.get_ready():
            finish = t + tasktime[node]
            heappush(heap, (finish, node))
            print(f"{'  ' * t}{node}:> @{t}")
        t, node = heappop(heap)
        print(f"{'  ' * t}{node}<: @{t}")
        ts.done(node)
    
    simulate(graph3, tasktime)
    tasktime['C'] = 1
    simulate(graph3, tasktime)

    I got the following output:

    ### TASK DEPENDENCY

    A -> B -> C
    ↓    ↓    ↓
    D -> E -> F
    

    ## NEW SIMULATION

    # Task runtimes:
    A: 2
    B: 2
    C: 2
    D: 2
    E: 2
    F: 2

    # OUTPUT. (:> for task start times, <: for stop times).

    F:> @0
    F<: @2
    C:> @2
    E:> @2
    C<: @4
    E<: @4
    B:> @4
    D:> @4
    B<: @6
    D<: @6
    A:> @6
    A<: @8

    ## NEW SIMULATION

    # Task runtimes:
    A: 2
    B: 2
    C: 1
    D: 2
    E: 2
    F: 2

    # OUTPUT. (:> for task start times, <: for stop times).

    F:> @0
    F<: @2
    C:> @2
    E:> @2
    C<: @3
    E<: @4
    B:> @4
    D:> @4
    B<: @6
    D<: @6
    A:> @6
    A<: @8

    >>

    Note that in the second simulation, C finish, but B isn't then immediately started. I have my own code that also works like this but it isn't optimal.

    Thanks guys.

    @PaddyMcCarthy
    Copy link
    Mannequin

    PaddyMcCarthy mannequin commented Aug 27, 2020

    Please ignore my earlier Message-id <1598493715.04.0.06462575371.bpo-17005@roundup.psfhosted.org>.

    I missed a dependency in cutting down a larger example. Sorry.

    @rhettinger rhettinger removed their assignment Dec 5, 2020
    @merwok merwok added 3.9 only security fixes and removed 3.8 only security fixes labels Sep 16, 2021
    @adriangb
    Copy link
    Mannequin

    adriangb mannequin commented Nov 29, 2021

    As part of working on a tool that deals with dependencies, I was building my own topological sort. I iterated through various APIs (iterable of tasks, iterable of parallelizable groups of tasks, etc.) until I found the (now stdlib) version which ended up being exactly the API I needed to most efficiently execute dependencies. So, kudos on the design!

    I actually ended up re-writing it in Rust, partly because I wanted a good project to learn Rust, partly because I wanted to be able to modify the API a bit. Namely:

    1. I needed the ability to re-execute the same DAG multiple times without re-checking for cycles and re-adding all nodes (so basically copying npredecessors before executing).
    2. I needed the ability to remove nodes from the graph. The real-world application is removing pruning subgraphs corresponding to cached dependencies. Again, I wanted to do this without rebuilding the entire thing (removing nodes can never lead to a cycle, and it is possible to keep track of new leaf nodes as you remove them instead of iterating over the entire graph again to find leaf nodes).

    Here's the implementation in case anyone is interested: https://github.com/adriangb/graphlib2

    The algorithm is the same, but I had to change the data structures somewhat to cope w/ Rusts' borrowing rules (namely I can't hold a mutable reference to two values in node2nodeinfo at the same time, which the current implementation does here

    successor_info = n2i[successor]
    , so I split them out into two separate mappings).

    @merwok
    Copy link
    Member

    merwok commented Dec 17, 2021

    See https://bugs.python.org/issue46071 for request to change the API
    (I preferred adding a note here than adding all people to nosy there)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @vdwees
    Copy link

    vdwees commented May 22, 2022

    For what it's worth, I thought I would share my use case for graphlib: scheduling coroutines. It's something I jokingly call structured concurrency 2.0 😄

    import asyncio
    import graphlib
    
    
    class TopologicalExecutor(graphlib.TopologicalSorter):
        """Executes a group of coroutines with topological constraints"""
    
        async def execute(self):
            self.prepare()
            # Similar to static_order(), but executes concurrently when possible
            ready = {asyncio.create_task(coro) for coro in self.get_ready()}
            while ready:
                done, _ = await asyncio.wait(ready, return_when=asyncio.FIRST_COMPLETED)
                self.done(*(task.get_coro() for task in done))
                ready -= done
                ready |= {asyncio.create_task(coro) for coro in self.get_ready()}
    
    
    async def async_print(sg, n):
        await asyncio.sleep(0)
        ws = "\t" * n
        print(f"{ws}{sg}{n if n else ''}", flush=True)
    
    
    # start
    start = async_print("start", 0)
    # subgroup a
    a1 = async_print("a", 1)
    a2 = async_print("a", 2)
    a3 = async_print("a", 3)
    # subgroup b
    b1 = async_print("b", 1)
    b2 = async_print("b", 2)
    b3 = async_print("b", 3)
    # subgroup c
    c1 = async_print("c", 1)
    c2 = async_print("c", 2)
    c3 = async_print("c", 3)
    # finalizer
    stop = async_print("stop", 0)
    
    
    await TopologicalExecutor(
        {
            # start
            start: (),
            # subgroup a
            a1: (start,),
            a2: (a1,),
            a3: (a2,),
            # subgroup b
            b1: (start,),
            b2: (b1,),
            b3: (b2,),
            # subgroup c
            c1: (start,),
            c2: (c1,),
            c3: (c2,),
            # stop
            stop: (a3, b3, c3),
        }
    ).execute()
    # prints
    # start
    # 	a1
    # 	b1
    # 	c1
    # 		a2
    # 		b2
    # 		c2
    # 			b3
    # 			c3
    # 			a3
    # stop

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    10 participants