This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: de-duping function in itertools
Type: enhancement Stage:
Components: Library (Lib) Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: LambertDW, rhettinger, thomaspinckney3
Priority: normal Keywords:

Created on 2008-12-10 02:47 by thomaspinckney3, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg77477 - (view) Author: Tom Pinckney (thomaspinckney3) * Date: 2008-12-10 02:47
Any interest in an itertools de-duping function? I find I have to write 
this over and over for different projects:

def deduped(iter,key=None):
    keys = set()
    for x in iter:
        if key:
            k = key(x)
        else:
            k = x
        if k in keys:
            continue
        keys.add(k)
        yield(x)
msg77481 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-10 03:06
Am curious about your use cases.  In what contexts are you needing to
eliminate duplicates but retain the original ordering?  Am wondering if
the proposals for an ordered dictionary will meet your needs?  Also, am
looking to see what typifies the balance of needs your use cases -- an
iterator so large that you wouldn't want to loop over it twice but a set
of unique items that doesn't grow especially large (see the code
random.sample for a similar situation).  Another thought is that the
proposal for a Bag class (multiset) may solve the problem more generally.
msg77482 - (view) Author: Tom Pinckney (thomaspinckney3) * Date: 2008-12-10 03:47
My latest need for something like this was something like this:

src1 = db_query(query_1)
src2 = db_query(query_2)
results = deduped(src1 + src2, key=lambda x: x.field2)

Basically, I wanted data from src1 if it existed and otherwise from src2 
, while preserving the order of src1 (I didn't care about order of 
src2).

A previous example was reading from a file and wanting to de-dupe lines 
based on a field in each line. Again order mattered to me since I wanted 
to process the non-duped lines in the file in order.

A final example was generating a bunch of error messages from a variety 
of sources and then wanting to make sure there were no duplicate errors. 
Instead of: 

errors = set(errors)

I find this much clearer:

errors = deduped(errors)

In reality all of these examples probably do not need to be written as a 
generator. The lists being de-duped are probably not so huge in practice 
as to preclude instantiating a new list (given the reality of multi-gig 
RAM machines etc). It just seemed particularly clear to write this using 
a yield.

An ordered dictionary would probably work for me too. I don't think a 
Bag would given it's lack of ordering. 

I do find it very simple to just be able to apply deduped() to any 
existing sequence/iterator and not have to be more verbose about 
explicitly iterating and filling in an ordered dictionary somehow.
msg78015 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-12-18 08:38
My inclination is to not include this as a basic C coded itertool
because it holds potentially all of the data in memory (generally, not a
characteristic of an itertool) and because I don't see it as a basic
building block (itertools are intended to be elemental, composable
components of an iterator algebra).  Also, the pure python equivalent of
dedup() is both easy to write and runs efficiently (it gains little from
being recoded in C).

Instead, I'm think of adding two recipes to the itertools docs:

def unique_everseen(iterable, key=None):
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D    
    seen = set()
    seen_add = seen.add
    if key is None:
        for elem in iterable:
            if elem not in seen:
                seen_add(elem)
                yield elem
    else:
        for elem in iterable:
            k = key(elem)
            if k not in seen:
                seen_add(k)
                yield elem

def unique_lastseen(iterable, key=None):
    # unique_lastseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_lastseen('ABBCcAD', str.lower) --> A B C A D
    return imap(next, imap(itemgetter(1), groupby(iterable, key)))
msg78029 - (view) Author: David W. Lambert (LambertDW) Date: 2008-12-18 15:44
(but of course with imap in version 2.7 and with something else in
version 3.x)
msg78878 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-01-02 21:21
Added recipes to the itertools docs: r68177 .
History
Date User Action Args
2022-04-11 14:56:42adminsetgithub: 48865
2009-01-02 21:21:32rhettingersetstatus: open -> closed
resolution: accepted
messages: + msg78878
2008-12-18 15:44:00LambertDWsetnosy: + LambertDW
messages: + msg78029
2008-12-18 08:38:20rhettingersetmessages: + msg78015
2008-12-10 03:47:14thomaspinckney3setmessages: + msg77482
2008-12-10 03:06:15rhettingersetmessages: + msg77481
versions: + Python 3.1, Python 2.7, - Python 2.6
2008-12-10 03:00:13benjamin.petersonsetassignee: rhettinger
nosy: + rhettinger
2008-12-10 02:47:59thomaspinckney3create