classification
Title: Odd behavior when unpacking `itertools.groupby`
Type: behavior Stage: resolved
Components: Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Matt Gilson, mgilson, rhettinger, serhiy.storchaka, tim.peters
Priority: low Keywords: patch

Created on 2017-05-11 17:33 by Matt Gilson, last changed 2017-09-24 11:01 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
groupby-fix.patch mgilson, 2017-05-12 21:29 review
groupby-invalid.diff serhiy.storchaka, 2017-09-24 11:01
Pull Requests
URL Status Linked Edit
PR 1568 closed serhiy.storchaka, 2017-05-13 06:51
PR 1569 merged serhiy.storchaka, 2017-05-13 06:51
Messages (20)
msg293511 - (view) Author: Matt Gilson (Matt Gilson) Date: 2017-05-11 17:33
There is some odd behavior when unpacking the groups from an itertools.groupby result.  For example:

    from itertools import groupby
    from operator import itemgetter

    inputs = ((x > 5, x) for x in range(10))
    (_, a), (_, b) = groupby(inputs, key=itemgetter(0))
    print(list(a))
    print(list(b))

On CPython, this results in:

    []
    [(True, 9)]

I would expect it to print out 2 empty lists since the second group would have to be consumed to make sure that there isn't another value yielded from groupby (If there was another value to yield, we'd get a ValueError since we couldn't unpack the correct number of items).

This is the behavior that PyPy had prior to re-implementing to be consistent with CPython in https://bitbucket.org/pypy/pypy/commits/6093ff1a44e6b17f09db83aa80aea562a738c286
msg293521 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-05-12 03:41
FYI, the CPython behavior matches the pure python implementation show in the docs:

Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "copyright", "credits" or "license()" for more information.
>>> class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
    def _grouper(self, tgtkey):
        while self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

>>> from operator import itemgetter
>>> inputs = ((x > 5, x) for x in range(10))
>>> (_, a), (_, b) = groupby(inputs, key=itemgetter(0))
>>> print(list(a))
[]
>>> print(list(b))
[(True, 9)]
msg293526 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-05-12 06:33
I suppose that when the input iterator is exhausted, we could poison the currkey to make the subordinate iterator ends as well:

    def __next__(self):
        while self.currkey == self.tgtkey:
            try:    
                self.currvalue = next(self.it)
            except StopIteration:
                self.currkey = object()
                raise StopIteration from None
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
msg293529 - (view) Author: Matthew Gilson (mgilson) * Date: 2017-05-12 07:20
I think that works to solve the problem that I pointed out.  In my stack overflow question (http://stackoverflow.com/a/43926058/748858) it has been pointed out that there are other opportunities for weirdness here.

Specifically, if if I skip processing 2 groups and then I process a third group whose key is the same as the first:


inputs = [(x > 5, x) for x in range(10)]
inputs += [(False, 10), (True, 11)]

g = groupby(inputs2 + [(True, 11)], key=itemgetter(0))
_, a = next(g)
_, b = next(g)
_, c = next(g)

print(list(a))
print(list(b))

Both `a` and `b` should probably be empty at this point, but they aren't.  

What if you kept track of the last iterable group and just consumed it at whenever `next` is called?  I think then you also need to keep track of whether or not the input iterable has been completely consumed, but that's not too bad either:

_sentinel = object()

class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.last_group = self.currkey = self.currvalue = _sentinel
        self.empty = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.last_group is not _sentinel:
            for _ in self.last_group:
                pass
        if self.empty:
            raise StopIteration

        if self.currvalue is _sentinel:
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                self.empty = True
                raise
            self.currkey = self.keyfunc(self.currvalue)
        self.last_group = self._grouper(self.currkey, self.currvalue)
        return (self.currkey, self.last_group)

    def _grouper(self, tgtkey, currvalue):
        while self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                self.empty = True
                return
            self.currkey = self.keyfunc(self.currvalue)

I haven't tested this to make sure it passes the test suite -- I also don't know if this would have major performance implications or anything.  If it did have severe performance implications, then it probably isn't worthwhile...
msg293530 - (view) Author: Matthew Gilson (mgilson) * Date: 2017-05-12 07:23
Oops.  You don't need to pass `self.currvalue` to `_grouper`. I didn't end up using it in there...
msg293533 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-12 09:18
Does it make sense using a grouper after advancing the groupby iterator? Currently it is possible by accident:

>>> from itertools import groupby, count
>>> it = groupby(count(), lambda i: (i//10)%2)
>>> _, even = next(it)
>>> _, odd = next(it)
>>> print(list(even))
[]
>>> print(list(odd))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> print(list(even))
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
>>> print(list(odd))
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
>>> print(list(even))
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

But I think that it would be more consistent to implement one of following variant:

1. Invalidate groupers after creating a new grouper. There should be only one valid grouper related to the groupby object. Using of invalid grouper should raise either StopIteration or RuntimeError.

>>> it = groupby(count(), lambda i: (i//10)%2)
>>> _, even = next(it)
>>> _, odd = next(it)
>>> print(list(even))
[]
>>> print(list(odd))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> print(list(even))
[]
>>> print(list(odd))
[]

or

>>> it = groupby(count(), lambda i: (i//10)%2)
>>> _, even = next(it)
>>> _, odd = next(it)
>>> print(list(even))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError
>>> print(list(odd))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> print(list(even))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError
>>> print(list(odd))
[]

2. Duplicate the source iterator using tee() when create a new grouper. All groupers can be used independently from the groupby object and other groupers.

>>> it = groupby(range(100), lambda i: (i//10)%2)
>>> _, even = next(it)
>>> _, odd = next(it)
>>> _ = list(it)  # exhaust the source iterator
>>> print(list(even))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> print(list(odd))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> print(list(even))
[]
>>> print(list(odd))
[]

I think resolving this issue will also fix issue30347.
msg293552 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-12 15:44
Here is a Python implementation of the idea #2:

from itertools import tee
def groupby(iterable, key=None):
    if key is None:
        key = lambda x: x
    def grouper(currvalue, it, tgtkey):
        yield currvalue
        for currvalue in it:
            if key(currvalue) != tgtkey:
                break
            yield currvalue
    it = iter(iterable)
    tgtkey = init = object()
    while True:
        try:
            currvalue = next(it)
        except StopIteration:
            return
        currkey = key(currvalue)
        if tgtkey is init or currkey != tgtkey:
            tgtkey = currkey
            it, it2 = tee(it)
            yield (currkey, grouper(currvalue, it2, tgtkey))
msg293565 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-05-12 19:38
I have no interest in #2.   Itertools are all about *not* storing data in memory unless that is an intrinsic requirement for the operation (tee() and cycle() for example).

Also, I would like any changes to be minimal and have the lowest possible risk of breaking code.  The groupby() tool is about 13 years old and has served its users well.  These little corner cases we're working on now don't seem to correspond to how users actually use groupby or how it was intended to be used (either eat the sub-iterator right away before advancing the outer iterator or ignore it entirely).
msg293568 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-12 20:01
I agree and prefer #1. This will save users from incorrect using and us from reports about odd behavior.

The implementation is simple. The groupby object generates an unique identifier when creates a new grouper. This may be a sequential integer, just object(), or the grouper object itself (but the latter case creates a reference loop). The grouper object saves the parent's identifier when created and compare the owned reference with the current parent's identifier before every use (in __next__ and __reduce__). If they don't match, the grouper object is not valid.
msg293574 - (view) Author: Matthew Gilson (mgilson) * Date: 2017-05-12 21:29
Tracking which group the grouper _should_ be on using an incrementing integer seems to work pretty well.

In additional to the tests in `test_itertools.py`, I've gotten the following tests to pass:

class TestGroupBy(unittest.TestCase):
    def test_unpacking(self):
        iterable = 'AAAAABBBBB'
        (_, a), (_, b) = groupby(iterable)
        self.assertEqual(list(a), [])
        self.assertEqual(list(b), [])

    def test_weird_iterating(self):
        g = groupby('AAAABBBBAAAAABBBBB')
        _, a = next(g)
        _, b = next(g)
        _, aa = next(g)
        self.assertEqual(list(a), [])
        self.assertEqual(list(b), [])
        self.assertEqual(list(aa), list('AAAAA'))

If I was to submit this as a PR,

1. where would I want to add these tests?
2. should I update the documentation for the "equivalent" python version to match exactly?
msg293578 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-05-12 21:51
> If I was to submit this as a PR, ...

Please don't jump to writing code yet.  We need to decide what should be done first.  

I'm not really excited about making a change at all.  No users seem to have ever been affected by this.  The CPython code matches what it is documented to do (by the pure python code in the documentation).  In a way, it is just an obscure implementation artifact.

If we do make a small change, I would like it to be 3.7 only.  No need to risk breaking existing code for something that technically isn't even a bug (because it matches the documentation).  As the title suggests, it is more of an oddity than anything else.
msg293599 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-13 06:54
PR 1568 and PR 1569 are two subvariants of #1. PR 1568 makes an inner iterator invalid and raising RuntimeError when used. PR 1569 makes an inner iterator exhausted.
msg293621 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-05-13 17:39
Users certainly have been fooled by this, although "unpacking" is a red herring.  I've been burned by it, and I've seen StackOverflow puzzles related to the same thing.  I think this is the heart of it:  given a finite iterable I, it usually makes no semantic difference whether a program does:

    for thing in I:
    
or

    for thing in list(I):
    
But when `I` is obtained from `groupby()`, `thing` contains an iterator that shares state with `I` itself, so they're not at all the same.  It's surprising just because it's so uncommon.  "unpacking" is one special case of this.

I'd like to see an attempt to invoke a sub-iterator raise an exception if the primary iterator has moved beyond it.  The behavior in that case now is undefined, utterly accidental, and useful only for surprising people ;-)  Nobody _intends_ to do this, unless they have no idea how `groupby()` is meant to be used.
msg294643 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-05-28 12:29
Raymond, could you please look at these patches? Both are small. What do you prefer?
msg294722 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-05-29 22:37
I've looked at the patches but am still thinking about what I prefer before committing to a re-design.  Please give it a little time.  There is zero urgency here (nothing is broken, no user complaints, etc).   I have a bunch of other stuff on my plate for a little while and this isn't the top priority.  I had asked to hold off on PRs until a decision is made and not rush into coding (which is usually the easy part).
msg302718 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-21 20:07
Ping. Can you make your decision Raymond?

This issue is a stopper for issue30347 which fixes a crash. I don't want to merge the fix for issue30347 until this issue be solved, because both proposed PRs solve issue30347 too, and I don't want to make unneeded changes or move code forth and back.
msg302843 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-09-24 10:01
Go ahead with PR 1569 to exhaust the inner iterator when the outer iterator advances.

Rationale:  The OP expected the inner iterator to be exhausted.  That is also what the PyPy team originally implemented and it is what I would have expected.

Also, there are some loose analogies elsewhere in the language.  Raising StopIteration is what next(somefileobject) does when there is a seek-to-end between next() calls.  A list iterator raises StopIteration when there is a del somelist[:] between next() calls.  When zip(it1, it2) terminates on the shortest input, it leaves the other iterator alive, allowing it to run and terminally normally with StopIteration. And though I don't have a use case for it, I would expect that next(inner_iterator, default_value) would return a value from the input stream or the default value but would not fail with a RuntimeError (this would apply to islice() as well).
msg302844 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-24 10:36
New changeset c247caf33f6e6000d828db4762d1cb12edf3cd57 by Serhiy Storchaka in branch 'master':
bpo-30346: An iterator produced by the itertools.groupby() iterator (#1569)
https://github.com/python/cpython/commit/c247caf33f6e6000d828db4762d1cb12edf3cd57
msg302846 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-24 10:39
Thank you for your review and explanations Raymond. This makes sense to me.
msg302847 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-24 11:01
Added the patch for PR 1568 against the current master just for history.
History
Date User Action Args
2017-09-24 11:01:22serhiy.storchakasetfiles: + groupby-invalid.diff

messages: + msg302847
2017-09-24 10:39:26serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg302846

stage: patch review -> resolved
2017-09-24 10:36:13serhiy.storchakasetmessages: + msg302844
2017-09-24 10:01:18rhettingersetassignee: rhettinger -> serhiy.storchaka
messages: + msg302843
2017-09-21 20:07:23serhiy.storchakasetmessages: + msg302718
2017-05-29 22:37:56rhettingersetmessages: + msg294722
2017-05-28 12:29:40serhiy.storchakasetmessages: + msg294643
2017-05-13 17:39:11tim.peterssetnosy: + tim.peters
messages: + msg293621
2017-05-13 06:54:21serhiy.storchakasetmessages: + msg293599
stage: patch review
2017-05-13 06:51:38serhiy.storchakasetpull_requests: + pull_request1664
2017-05-13 06:51:03serhiy.storchakasetpull_requests: + pull_request1663
2017-05-12 21:51:45rhettingersetmessages: + msg293578
versions: + Python 3.7, - Python 3.3, Python 3.5
2017-05-12 21:29:53mgilsonsetfiles: + groupby-fix.patch
keywords: + patch
messages: + msg293574
2017-05-12 20:01:59serhiy.storchakasetmessages: + msg293568
2017-05-12 19:38:26rhettingersetmessages: + msg293565
2017-05-12 15:44:17serhiy.storchakasetmessages: + msg293552
2017-05-12 09:18:30serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg293533
2017-05-12 07:23:41mgilsonsetmessages: + msg293530
2017-05-12 07:20:24mgilsonsetnosy: + mgilson
messages: + msg293529
2017-05-12 06:33:03rhettingersetmessages: + msg293526
2017-05-12 03:42:01rhettingersetpriority: normal -> low
2017-05-12 03:41:44rhettingersetmessages: + msg293521
2017-05-11 17:54:37serhiy.storchakasetassignee: rhettinger

nosy: + rhettinger
2017-05-11 17:33:01Matt Gilsoncreate