classification
Title: Use range in itertools roundrobin recipe
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dubslow, docs@python, rhettinger, serhiy.storchaka, terry.reedy, tim.peters
Priority: normal Keywords: patch

Created on 2017-11-21 00:54 by terry.reedy, last changed 2017-11-23 21:33 by rhettinger. This issue is now closed.

Pull Requests
URL Status Linked
PR 4486 merged rhettinger, 2017-11-21 04:17
PR 4487 merged python-dev, 2017-11-21 08:27
PR 4497 merged rhettinger, 2017-11-22 00:10
Messages (15)
msg306605 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-11-21 00:54
The itertools roundrobin recipe has an outer loop executed a preset number of times.  It is currently implemented with two assignments and a while loop.
https://docs.python.org/3/library/itertools.html#itertools-recipes
These can be replaced with a for loop using a reversed range.

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    nexts = cycle(iter(it).__next__ for it in iterables)
    for current_len in reversed(range(1, len(iterables)+1)):
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            nexts = cycle(islice(nexts, current_len - 1))

I think this is easier to understand.  So do some other participants in the current python-ideas thread 'Rewriting the "roundrobin" recipe in the itertools documentation'.

I changed 'pending' to 'current_len' because, to me, 'pending' should be the set of iter_nexts and not their number.

I originally avoided the '-1' in the islice call by decrementing both range arguments by 1 and calling the loop variable 'reduced_len'.  But having the loop variable be the size of the nexts iterable in the next outer iteration seemed confusing and not worth the trivial efficiency gain.

An independent change would be to replace 'next' with 'iter' on the basis that reusing the builtin name is not good and because 'nexts' is awkward to pronounce.

I will produce a PR if any version is preferred to the current one.
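A quick self-contained check of the proposed variant (repeating the definition above together with the imports that the itertools-recipes context normally supplies) confirms it matches the docstring:

```python
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    nexts = cycle(iter(it).__next__ for it in iterables)
    for current_len in reversed(range(1, len(iterables) + 1)):
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # The old cycle resumes just past the iterator that raised,
            # so its next current_len - 1 entries are the survivors.
            nexts = cycle(islice(nexts, current_len - 1))

assert list(roundrobin('ABC', 'D', 'EF')) == list('ADEBFC')
assert list(roundrobin()) == []
```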
---

The OP proposed, and some participants like, an accept-reject algorithm based on zip_longest.

def roundrobin(*iters):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Perhaps "flat_zip_nofill" is a better name, or something similar
    sentinel = object()
    for tup in it.zip_longest(*iters, fillvalue=sentinel):
        yield from (x for x in tup if x is not sentinel)

I dislike creating tuples we don't want, with values we don't want, with an arbitrarily small acceptance ratio.  I also note that zip_longest is properly used in grouper, whereas roundrobin is the only recipe using cycle.
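To make the objection concrete, a self-contained sketch of the zip_longest version (using a plain import rather than the `it` alias above) with one long and several short iterables:

```python
from itertools import zip_longest

def roundrobin(*iters):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    sentinel = object()
    for tup in zip_longest(*iters, fillvalue=sentinel):
        yield from (x for x in tup if x is not sentinel)

# Nine one-item iterables alongside one 1000-item iterable:
# zip_longest builds 1000 ten-slot tuples, and 9 * 999 of the
# 10000 slots hold sentinels created only to be filtered back out
# (roughly a 10% acceptance ratio).
iters = [['x']] * 9 + [list(range(1000))]
out = list(roundrobin(*iters))
assert len(out) == 9 + 1000
```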
msg306608 - (view) Author: (Dubslow) Date: 2017-11-21 01:27
Note that this follows a bit of discussion on python-ideas, in two threads:

https://mail.python.org/pipermail/python-ideas/2017-November/047920.html

https://mail.python.org/pipermail/python-ideas/2017-November/047989.html

I agree the zip_longest-derived solution is both easier to read/understand and also not really accurate, especially in extreme cases.

But, and see the second thread, I think it might just be better to add this functionality itself to itertools, under the name zip_flat -- per the second thread, there's a lot of confusion around the topic beyond just the suitability of the current recipe (including such things as lack of a clear name, legibility, efficiency, and brevity -- a fair number of people are looking for one- or two-liners, even if that doesn't really exist). (One alternative to zip_flat would be to add a second keyword argument to zip_longest that *doesn't* use a fillvalue, producing variable-length tuples when the supplied iters run out. Then the recipe and the entire conversation/topic could be replaced by "yield from zip_longest(*iters, usefill=False)".)

Despite that opinion, this suggested change is better than nothing.
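Neither zip_flat nor a usefill keyword exists in itertools; the proposed semantics (variable-width tuples, no fill value) could be sketched as a hypothetical wrapper:

```python
from itertools import zip_longest

def zip_nofill(*iterables):
    """Like zip_longest, but yields shorter tuples instead of filling.

    Hypothetical sketch of the proposed usefill=False behavior;
    itertools.zip_longest has no such keyword.
    """
    sentinel = object()
    for tup in zip_longest(*iterables, fillvalue=sentinel):
        # Drop the padding, keeping only the values actually produced.
        yield tuple(x for x in tup if x is not sentinel)

assert list(zip_nofill('ABC', 'D', 'EF')) == [('A', 'D', 'E'), ('B', 'F'), ('C',)]
```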
msg306611 - (view) Author: (Dubslow) Date: 2017-11-21 01:50
Regarding the current bug more specifically, I think a few comments would go a long way to helping readers understand the code.

And also, I do think the (1, +1, -1) version is less readable, simply because it doesn't follow the most common usage patterns of range; your first version using (0, 0, 0) (with implicit zeroes, of course) is cleaner. It's much more apparent how long the loop is at first glance ("loop once per iter" vs "loop... from what to what? oh, once per iter"). Perhaps a version that sets step=-1 instead of using reversed() would be a better case for the off-by-one variant.

Such a version with both proposed changes (which are independent and can be considered or accepted separately) looks like this:


def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # This recipe is also called other names like "alternate", "interleave", or
    # "merge". "zip_flat" would also be an accurate name.
    # Algorithm: cycle around the __next__ methods of each iterable. When an
    # iter runs out, remove its __next__ from the cycle.
    nexts = cycle(iter(it).__next__ for it in iterables)
    for reduced_len in reversed(range(len(iterables))):
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            nexts = cycle(islice(nexts, reduced_len))
            # This skips one round of the cycle, starting the next round
            # without the current iter


The last comment is probably the least valuable, though it does point out the slight quirk of how the last line works. I suppose this is the case for having the loop variable be named the actual length, but I think most Python users are accustomed to the last element of a list having index length-1. At any rate, I think the readability gain in the for statement is more than the readability loss in the islice().
msg306612 - (view) Author: (Dubslow) Date: 2017-11-21 01:51
Perhaps the loop variable could be renamed to "len_minus_1" or some such something which is more semantic to the algorithm.
msg306613 - (view) Author: (Dubslow) Date: 2017-11-21 01:54
Er, in my first message, make that "(yield from tup for tup in zip_longest(*iters, usefill=False))"
msg306623 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-11-21 03:15
IMO, George Sakkis's original version is better than TJR's proposed revision, which has three separate adjustments by one and adds two extraneous functions which aren't central to the example.  Dubslow's alternative at least avoids the three offsets by one; however, it has the disadvantage that the meaning of the loop induction variable is somewhat indirect and creates its own source of confusion.

I agree that some algorithmic comment should be added, probably just for the last line.

Let's not add any of the listed alternative names to the comments.  Each one is either  confusing or wrong.  The word "merge" already has an established, different meaning as in "sort/merge".  For example,  heapq.merge('ABC', 'D', 'EF') gives ['A', 'B', 'C', 'D', 'E', 'F'].  The word "alternate" tends to mean "toggle back-and-forth" in common usage.  The term "zip_flat" has no meaning to me, it has no hits on Google, and the closest match is a recipe on StackOverflow that does something different.  And "interleave" is not suggestive of looping back over the iterables until each is exhausted.  Also, we may yet use interleave() to mean something else.

In contrast, "round robin" has a well established meaning that matches what this recipe does. Until now, not a single reader has ever claimed to not know what it means.  https://en.wikipedia.org/wiki/Round-robin_scheduling

FWIW, the recipe has several benefits.  1) Give a way to implement round robin iteration without re-inventing the wheel, 2) Demonstrate ways to use cycle() and islice(), 3) Demonstrate the useful optimization technique of factoring the try/except out of the for-loop, 4) Demonstrate the useful optimization technique of calling bound methods, and 5) Be concise (I've left longer or more complex recipes for the ASPN cookbook or StackOverflow).

Ideally, I would prefer to not add two extra builtin lookups (the recipe is sometime used on short inputs which would be affected by the extra overhead).  Also, I prefer the visual weight to be on the central message of tight inner loops that exploit itertools rather than having the visual weight shift to the for-loop which is the least important part.

Can you suggest a concise single-line comment that would make the last line clearer about what it is doing?  Also, I'm open to changing the name of the "pending" variable but think "current_len" and "reduced_len" are not improvements.
msg306624 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-11-21 03:30
I agree the current recipe strikes a very nice balance among competing interests, and is educational on several counts.

s/pending/numactive/

    # Remove the iterator we just exhausted from the cycle.
    numactive -= 1
    nexts = cycle(islice(nexts, numactive))
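Applied to the recipe as it stands in the docs (which uses "pending"), Tim's substitution would read roughly:

```python
from itertools import cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    numactive = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while numactive:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            numactive -= 1
            nexts = cycle(islice(nexts, numactive))

assert list(roundrobin('ABC', 'D', 'EF')) == list('ADEBFC')
```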
msg306628 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-21 06:37
roundrobin() has quadratic computational complexity. For example, list(roundrobin(*([[1]]*N))). Is there a way to implement it with linear complexity?
msg306630 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-21 07:16
I agree that using a binary tree would be excessive here. I was just wondering if there is a simple way to get rid of the quadratic complexity. In any case this does not matter much unless you work with many hundreds or thousands of iterables.
msg306631 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-21 07:25
Actually I tried to test whether this implementation can cause a crash due to using deeply nested iterators (as in issue14010). For example, in next(roundrobin(*([[]]*N + [[1]]))). But if there is such a problem here, roundrobin() becomes unusable due to quadratic time for smaller N (tens of thousands).
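The quadratic cost comes from the removal step: each cycle(islice(nexts, pending)) re-copies the remaining bound methods into a fresh cycle, so N removals sum to roughly N**2/2 copies. A self-contained sketch of the current docs recipe that makes the growth observable (timing numbers vary by machine, so none are asserted):

```python
from itertools import cycle, islice
from timeit import timeit

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    pending = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            # Re-copies the remaining 'pending' bound methods into a
            # fresh cycle: N removals cost about N**2 / 2 copies total.
            nexts = cycle(islice(nexts, pending))

for n in (500, 1000):
    t = timeit(lambda: list(roundrobin(*([[1]] * n))), number=1)
    print(n, t)  # doubling n should roughly quadruple the time
```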
msg306670 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-11-21 17:04
While we're on the topic, I had some thought of also adding a similar recipe to https://docs.python.org/3/library/collections.html#deque-recipes .  The alternative recipe is slower in common cases but doesn't degrade when there is a huge number of iterables:

    def roundrobin(*iterables):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        nexts = deque(iter(it).__next__ for it in iterables)
        while nexts:
            try:
                while True:
                    yield nexts[0]()
                    nexts.rotate(-1)
            except StopIteration:
                nexts.popleft()
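Repeated with its import so it runs standalone, the deque variant checks out against the docstring; here removal of an exhausted iterator is a single O(1) popleft rather than a re-copy, which is why it doesn't degrade with many iterables:

```python
from collections import deque

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    nexts = deque(iter(it).__next__ for it in iterables)
    while nexts:
        try:
            while True:
                yield nexts[0]()
                # Move the just-used iterator to the back of the queue.
                nexts.rotate(-1)
        except StopIteration:
            # The head raised: drop it in O(1) and resume.
            nexts.popleft()

assert list(roundrobin('ABC', 'D', 'EF')) == list('ADEBFC')
assert list(roundrobin(*([[1]] * 1000))) == [1] * 1000
```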
msg306673 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-21 17:28
Today I have published a similar recipe on Python-Ideas. It uses popleft/append instead of __getitem__/rotate.

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    nexts = deque(iter(it).__next__ for it in iterables)
    popleft = nexts.popleft
    append = nexts.append
    while nexts:
        next = popleft()
        try:
            yield next()
        except StopIteration:
            pass
        else:
            append(next)

It is faster (10-25%) in all microbenchmarks that I did (Steven's benchmarks for a small number of iterables and my examples for a large number of iterables).
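Repeated with its import so it runs standalone: each yielded item costs one popleft plus one append, both O(1), so this variant is also linear in the total number of items:

```python
from collections import deque

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    nexts = deque(iter(it).__next__ for it in iterables)
    popleft = nexts.popleft
    append = nexts.append
    while nexts:
        next = popleft()
        try:
            yield next()
        except StopIteration:
            # Exhausted: simply don't re-queue it.
            pass
        else:
            append(next)

assert list(roundrobin('ABC', 'D', 'EF')) == list('ADEBFC')
assert list(roundrobin(*([[1]] * 1000))) == [1] * 1000
```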
msg306680 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-11-21 19:35
Really, I don't give a whit about the micro-benchmarks -- that completely misses the point.  The recipes are primarily intended to have educational value on ways to use itertools, deques, and whatnot.

For itertools, I'm satisfied with the new variable name and the additional comment.  For collections, there is an open question about whether adding an extra example would make users better off.  Beyond that, I have little interest in exploring all the little variants or wasting further time on this (that is what ASPN, StackOverflow, and blog posts are for).
msg306681 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-21 19:39
But why use less efficient code? Is my code worse?
msg306860 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-11-23 21:32
New changeset 0858495a50e19defde786a4ec050ec182e920f46 by Raymond Hettinger in branch 'master':
bpo-32099 Add deque variant of roundrobin() recipe (#4497)
https://github.com/python/cpython/commit/0858495a50e19defde786a4ec050ec182e920f46
History
Date User Action Args
2017-11-23 21:33:29 rhettinger set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-11-23 21:32:25 rhettinger set messages: + msg306860
2017-11-22 00:10:06 rhettinger set pull_requests: + pull_request4434
2017-11-21 19:39:49 serhiy.storchaka set messages: + msg306681
2017-11-21 19:35:04 rhettinger set messages: + msg306680
2017-11-21 17:28:56 serhiy.storchaka set messages: + msg306673
2017-11-21 17:04:18 rhettinger set messages: + msg306670
2017-11-21 08:27:03 python-dev set pull_requests: + pull_request4425
2017-11-21 07:25:56 serhiy.storchaka set messages: + msg306631
2017-11-21 07:16:56 serhiy.storchaka set messages: + msg306630
2017-11-21 07:06:13 rhettinger set messages: - msg306629
2017-11-21 07:02:00 rhettinger set messages: + msg306629
2017-11-21 06:37:27 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg306628
2017-11-21 04:17:45 rhettinger set keywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request4424
2017-11-21 03:30:25 tim.peters set nosy: + tim.peters
messages: + msg306624
2017-11-21 03:15:21 rhettinger set messages: + msg306623
2017-11-21 02:29:01 rhettinger set assignee: docs@python -> rhettinger
2017-11-21 01:54:21 Dubslow set messages: + msg306613
2017-11-21 01:51:13 Dubslow set messages: + msg306612
2017-11-21 01:50:04 Dubslow set messages: + msg306611
2017-11-21 01:27:01 Dubslow set nosy: + Dubslow
messages: + msg306608
2017-11-21 00:54:27 terry.reedy create