Issue 4615: de-duping function in itertools

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48865

classification

Title:	de-duping function in itertools
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:	Python 3.1, Python 2.7

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	LambertDW, rhettinger, thomaspinckney3
Priority:	normal	Keywords:

Created on 2008-12-10 02:47 by thomaspinckney3, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg77477 - (view)	Author: Tom Pinckney (thomaspinckney3) *	Date: 2008-12-10 02:47
Any interest in an itertools de-duping function? I find I have to write this over and over for different projects: def deduped(iter,key=None): keys = set() for x in iter: if key: k = key(x) else: k = x if k in keys: continue keys.add(k) yield(x)
msg77481 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2008-12-10 03:06
Am curious about your use cases. In what contexts are you needing to eliminate duplicates but retain the original ordering? Am wondering if the proposals for an ordered dictionary will meet your needs? Also, am looking to see what typifies the balance of needs your use cases -- an iterator so large that you wouldn't want to loop over it twice but a set of unique items that doesn't grow especially large (see the code random.sample for a similar situation). Another thought is that the proposal for a Bag class (multiset) may solve the problem more generally.
msg77482 - (view)	Author: Tom Pinckney (thomaspinckney3) *	Date: 2008-12-10 03:47
My latest need for something like this was something like this: src1 = db_query(query_1) src2 = db_query(query_2) results = deduped(src1 + src2, key=lambda x: x.field2) Basically, I wanted data from src1 if it existed and otherwise from src2 , while preserving the order of src1 (I didn't care about order of src2). A previous example was reading from a file and wanting to de-dupe lines based on a field in each line. Again order mattered to me since I wanted to process the non-duped lines in the file in order. A final example was generating a bunch of error messages from a variety of sources and then wanting to make sure there were no duplicate errors. Instead of: errors = set(errors) I find this much clearer: errors = deduped(errors) In reality all of these examples probably do not need to be written as a generator. The lists being de-duped are probably not so huge in practice as to preclude instantiating a new list (given the reality of multi-gig RAM machines etc). It just seemed particularly clear to write this using a yield. An ordered dictionary would probably work for me too. I don't think a Bag would given it's lack of ordering. I do find it very simple to just be able to apply deduped() to any existing sequence/iterator and not have to be more verbose about explicitly iterating and filling in an ordered dictionary somehow.
msg78015 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2008-12-18 08:38
My inclination is to not include this as a basic C coded itertool because it holds potentially all of the data in memory (generally, not a characteristic of an itertool) and because I don't see it as a basic building block (itertools are intended to be elemental, composable components of an iterator algebra). Also, the pure python equivalent of dedup() is both easy to write and runs efficiently (it gains little from being recoded in C). Instead, I'm think of adding two recipes to the itertools docs: def unique_everseen(iterable, key=None): # unique_everseen('AAAABBBCCDAABBB') --> A B C D # unique_everseen('ABBCcAD', str.lower) --> A B C D seen = set() seen_add = seen.add if key is None: for elem in iterable: if elem not in seen: seen_add(elem) yield elem else: for elem in iterable: k = key(elem) if k not in seen: seen_add(k) yield elem def unique_lastseen(iterable, key=None): # unique_lastseen('AAAABBBCCDAABBB') --> A B C D A B # unique_lastseen('ABBCcAD', str.lower) --> A B C A D return imap(next, imap(itemgetter(1), groupby(iterable, key)))
msg78029 - (view)	Author: David W. Lambert (LambertDW)	Date: 2008-12-18 15:44
(but of course with imap in version 2.7 and with something else in version 3.x)
msg78878 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2009-01-02 21:21
Added recipes to the itertools docs: r68177 .

History
Date	User	Action	Args
2022-04-11 14:56:42	admin	set	github: 48865
2009-01-02 21:21:32	rhettinger	set	status: open -> closed resolution: accepted messages: + msg78878
2008-12-18 15:44:00	LambertDW	set	nosy: + LambertDW messages: + msg78029
2008-12-18 08:38:20	rhettinger	set	messages: + msg78015
2008-12-10 03:47:14	thomaspinckney3	set	messages: + msg77482
2008-12-10 03:06:15	rhettinger	set	messages: + msg77481 versions: + Python 3.1, Python 2.7, - Python 2.6
2008-12-10 03:00:13	benjamin.peterson	set	assignee: rhettinger nosy: + rhettinger
2008-12-10 02:47:59	thomaspinckney3	create