This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: itertools.groupby not working as expected
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Jiba, petri.lehtinen
Priority: normal Keywords:

Created on 2012-05-16 09:51 by Jiba, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg160822 - (view) Author: Jiba (Jiba) Date: 2012-05-16 09:51
In some situation, itertools.groupby fails to group the objects, and produces several groups with the same key. For example, the following code :


from itertools import *

class P(object):
  def __init__(self, key):
    self.key = key

p1 = P(1)
p2 = P(2)
p3 = P(1)

for key, ps in groupby([p1, p2, p3], lambda p: p.key):
  print "group", key
  for p in ps:
    print "  - object", p


Produces the following result :

group 1
  - object <__main__.P object at 0xb73d6acc>
group 2
  - object <__main__.P object at 0xb73d6aec>
group 1
  - object <__main__.P object at 0xb73d6b0c>


While I would expect to have only a single group 1, e.g. something like :

group 1
  - object <__main__.P object at 0xb73d6acc>
  - object <__main__.P object at 0xb73d6b0c>
group 2
  - object <__main__.P object at 0xb73d6aec>


It seems that this bug also affects Python 3 (tested on Python 3.1.2)
msg160825 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-05-16 10:23
groupby() changes the group when the key changes in the input it iterates. If you want to have p1 and p3 to go to the same group, you need to sort the input by P.key first.

This is clearly documented, too:

    The operation of groupby() is similar to the uniq filter in Unix.
    It generates a break or new group every time the value of the key 
    function changes (which is why it is usually necessary to have 
    sorted the data using the same key function). That behavior differs
    from SQL’s GROUP BY which aggregates common elements regardless of
    their input order.
msg160828 - (view) Author: Jiba (Jiba) Date: 2012-05-16 11:08
Ok, I understand.

However, in my initial problem, the sequence passed to groupby was a set, e.g. (modifying my previous example) :

       groupby(set([p1, p2, p3]), lambda p: p.key)

If I understand well how groupby() works, the result of a groupby performed on a set is unpredictable, since it depends of the order of the items when iterating over the set. Perhaps the behavior of groupby() should be modified for unsorted sequences, possibly not taking the order into account, or raising an error ?
msg160829 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2012-05-16 11:17
You're right, the result over a set would be unpredictable.

The point of the itertools module is to be able to a) cope with massive amounts of data and b) be a set of tools instead of complete solutions for all problems.

Because of both of the points above, groupby() doesn't load all the data into memory or attempt to sort the data by itself. Furthermore, there's no way for groupby() to know whether the iterable it's passed is going to yield sorted or unsorted data.

It's your responsibility to know whether the iterable you're passing is already sorted or not, and sort it first, if it's possible and there's a need to do so.
History
Date User Action Args
2022-04-11 14:57:30adminsetgithub: 59033
2012-05-16 11:17:40petri.lehtinensetmessages: + msg160829
2012-05-16 11:08:15Jibasetmessages: + msg160828
2012-05-16 10:23:07petri.lehtinensetstatus: open -> closed

nosy: + petri.lehtinen
messages: + msg160825

resolution: not a bug
2012-05-16 09:51:34Jibacreate