Issue 40230
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2020-04-08 20:12 by Henry Carscadden, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (9)

msg366005 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-08 20:12
The product() function in itertools provides an implementation of the Cartesian product that, when run with many arguments, quickly gives out-of-memory errors. The current implementation creates a lot of unnecessary lists in this situation. A more appropriate implementation would use dynamic programming to avoid these out-of-memory issues.
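As context for the report (an editorial sketch, not part of the original message): itertools.product() materializes every input iterable up front, at construction time, which is where the memory for many or large inputs goes.

```python
import itertools

consumed = []

def tracked(n):
    # Generator that records each item the moment product() pulls it.
    for i in range(n):
        consumed.append(i)
        yield i

# Merely constructing the product drains both input iterators in full:
p = itertools.product(tracked(3), tracked(3))
print(consumed)   # [0, 1, 2, 0, 1, 2] -- both inputs already exhausted

first = next(p)
print(first)      # (0, 0) -- tuples themselves are produced lazily
```

Construction is eager, iteration is lazy: the memory cost is proportional to the total number of input elements, not to the (much larger) number of output tuples.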

---

msg366031 | Author: Dennis Sweeney (Dennis Sweeney) | Date: 2020-04-09 01:49
The trouble is that itertools.product accepts iterators, and there is no guaranteed way of "restarting" an arbitrary iterator in Python. Consider:

```python
>>> a = iter([1, 2, 3])
>>> b = iter([4, 5, 6])
>>> next(a)
1
>>> next(b)
4
>>> from itertools import product
>>> list(product(a, b))
[(2, 5), (2, 6), (3, 5), (3, 6)]
```

Since there's no way to get back to items you've already consumed, the current approach is to consume all of the iterators to begin with and store their items in arrays, then lazily produce tuples of the items at the right indices of those arrays.

Perhaps one could consume lazily from the iterators, say, only filling up the pools as they're needed and not storing the contents of the first iterator, but this would mean sometimes producing a product iterator that was doomed to cause a memory error eventually. If you really need this behavior, you could do it in Python:

```python
def lazy_product(*iterables):
    if not iterables:
        yield ()
        return
    it0 = iterables[0]
    for x in it0:
        for rest in lazy_product(*iterables[1:]):
            yield (x,) + rest
```

The above could surely be optimized, as maybe you're suggesting, but this would be a backward-incompatible change for itertools.product.
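An editorial sketch of why such a lazy variant would be backward-incompatible: with one-shot iterators it silently diverges from itertools.product, because the non-first inputs get iterated once per element of the first input and cannot be restarted.

```python
import itertools

def lazy_product(*iterables):
    # Recursive lazy Cartesian product: never stores the first iterable.
    if not iterables:
        yield ()
        return
    for x in iterables[0]:
        for rest in lazy_product(*iterables[1:]):
            yield (x,) + rest

# With restartable inputs (lists) the two agree:
lazy = list(lazy_product([1, 2], [3, 4]))
eager = list(itertools.product([1, 2], [3, 4]))
print(lazy == eager)   # True

# With one-shot iterators they do not: the inner iterator is exhausted
# after the first pass and yields nothing for x = 2.
broken = list(lazy_product(iter([1, 2]), iter([3, 4])))
print(broken)          # [(1, 3), (1, 4)]
```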

---

msg366033 | Author: Tim Peters (tim.peters) | Date: 2020-04-09 03:53
Possibly related: https://bugs.python.org/issue10109

Henry, I'm not clear at all about what you're saying. Please give at least one specific, concrete example of the behavior you're objecting to, and specify the behavior you want to see instead.

---

msg366103 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-10 04:10
Hey, Tim, I just wanted to add a note of clarification. I was working on an approximation algorithm for a network science tool that is being released soon. Initially, I relied on itertools.product(), but when I moved to testing on non-trivial graphs, I immediately had out-of-memory errors. This happened even on the large-memory nodes of the computing cluster that I was using. The issue is that a lot of extra lists are created as the number of iterables passed to product() grows, although the actual number of elements in the iterables is relatively small. Here is the workaround that I used to handle many arguments:

```python
def product(*args) -> Iterable:
    '''
    Takes in several iterables and returns a generator that yields the
    cartesian product of the elements in the iterables
    :param args: sets which you would like the cartesian product of
    :return: generator
    '''
    if len(args) == 1 and type(args) == list:
        return args
    numArgs = len(args)
    indices = np.zeros(numArgs, dtype=np.uint)
    checkLen = len(args[0])
    while indices[0] < checkLen:
        curr_term = [args[0][indices[0]]]
        for i in range(1, numArgs):
            curr_term.append(args[i][indices[i]])
        indices[-1] += np.uint(1)
        for i in range(1, numArgs):
            updated = False
            if indices[-i] >= len(args[-i]):
                indices[-i] = np.uint(0)
                indices[-i - 1] += np.uint(1)
                updated = True
            if not updated:
                break
        yield curr_term
```

---

msg366107 | Author: Tim Peters (tim.peters) | Date: 2020-04-10 04:51
Henry, I have to ask again: please give at least one specific, concrete example of the behavior you're objecting to. English isn't helping, and I still have no idea what your complaint is. What I'm looking for:

```python
for i in itertools.product(???):
    pass
```

where you replace the ??? with executable code (preferably not using numpy or any other extension package) that provokes the MemoryError you're talking about.

For example, here I'm constructing a case with a million arguments. There's no problem at all:

```python
>>> import itertools
>>> args = [range(100) for i in range(1_000_000)]
>>> for i in itertools.product(*args):
...     print(len(i))
```

[and it prints 1000000 over & over until I get bored and kill it]

Note if it _did_ provoke a problem, we probably wouldn't care - there are no plausible realistic use cases for passing a million arguments to product().

You haven't given us a way to reproduce your complaint, or even a clue as to the number of arguments you're talking about. The above code was my best guess as to what you _might_ be talking about. But since there's no MemoryError in sight, apparently not.

I'm suspecting this may be an XY problem[1], and especially because you posted "a solution" instead of an actual problem ;-)

[1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

---

msg366113 | Author: Raymond Hettinger (rhettinger) | Date: 2020-04-10 10:27
> when I moved to testing on non-trivial graphs, I immediately had
> Out of Memory Errors.

I'm going to hazard a guess that the input to product() was a graph traversal iterator that got trapped in an undetected cycle. Feeding an infinite iterator into product() would eat all available memory, just as it would with list().
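An editorial sketch of the kind of trap being described here (the graph and walk function are hypothetical): a cycle turns a traversal into an infinite iterator, and product(), like list(), must exhaust its inputs before yielding anything. islice() caps the walk so the sketch terminates.

```python
import itertools

# Hypothetical two-node graph containing a cycle: a -> b -> a -> ...
graph = {"a": ["b"], "b": ["a"]}

def walk(node):
    # Naive traversal with no visited-set: never terminates on a cycle.
    while True:
        yield node
        node = graph[node][0]

# Feeding walk("a") straight into product() would allocate without bound,
# exactly like list(walk("a")) would.  Capped with islice() for safety:
steps = list(itertools.islice(walk("a"), 6))
print(steps)   # ['a', 'b', 'a', 'b', 'a', 'b']
```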

---

msg366132 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-10 15:13
@Tim Peters For example, the following should reproduce the error:

```python
many_arguments = [[1, 2] for i in range(50)]
for term in product(*many_arguments):
    print(term)
```

In my application, I was taking the Cartesian product of the sets of all simple paths to several nodes. This regularly produced a lot of arguments and crashed the compute nodes on which my jobs were running. Perhaps this is not a salient concern for most uses, but I just wanted to make you aware of the issue.
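For scale (an editorial aside): this example is tiny in memory but astronomically large in output. The product iterator stores only the 50 two-element pools, yet a full iteration would yield 2**50 tuples, so printing them all is infeasible regardless of memory.

```python
import itertools
import math

many_arguments = [[1, 2] for i in range(50)]
p = itertools.product(*many_arguments)

# Stored state: 50 pools of 2 elements.  Yielded output: 2**50 tuples.
total = math.prod(len(pool) for pool in many_arguments)
print(total)        # 1125899906842624 == 2**50

first = next(p)
print(first[:5])    # (1, 1, 1, 1, 1) -- tuples arrive lazily, one at a time
```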

---

msg366139 | Author: Tim Peters (tim.peters) | Date: 2020-04-10 16:43
Henry, no, I see no problem while running your example. It's been running on my box for over 5 minutes now, and memory use remains trivial.

Note that in the code I posted for you, instead of [1, 2] I used range(100), and instead of 50 I used a million: the same kind of thing, but I used _far_ larger numbers. And still didn't have a problem. Although, yes, that did consume about a gigabyte to materialize a million instances of tuple(range(100)) under the covers.

The code you posted also works fine for me, and with minor memory consumption, if I replace your "50" with "1000000":

```python
>>> many_arguments = [[1, 2] for i in range(1000000)]
>>> for term in product(*many_arguments):
...     pass
```

100% of a CPU is used for as long as I let it run, but memory use jumps just a little at the start. Which is what I expected. It just doesn't take all that much memory to create a million 2-tuples - which is what the product() implementation does.

I believe you saw a MemoryError, but at this point I have to guess you misdiagnosed the cause. In any case, if you can't supply an example that reproduces the problem, we're going to have to close this.

Perhaps there's some weird memory-allocation flaw on the _platform_ you're using? Extreme fragmentation? Without an example that exhibits a problem, there's just no way to guess from here :-(
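A rough editorial check of this point (using 100,000 arguments rather than the million above, to keep the sketch light): the iterator's stored state is essentially one small pool tuple per argument, so memory stays in the megabytes even as the number of output tuples explodes.

```python
import itertools
import sys

n = 100_000   # the run above used 1_000_000; smaller here to keep this cheap
many_arguments = [[1, 2] for i in range(n)]
p = itertools.product(*many_arguments)

# product() stores one 2-element pool per argument: megabytes, not gigabytes.
approx_bytes = n * sys.getsizeof((1, 2))
print(approx_bytes // 2**20, "MiB, rough lower bound for the stored pools")

first = next(p)
print(len(first))   # 100000 -- each output tuple is built lazily on demand
```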

---

msg366144 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-10 18:08
Tim, I'm guessing it was a misdiagnosis then. In any case, something changed about my codebase that alleviated the problem. Thanks for looking into it.
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:29 | admin | set | github: 84411
2020-04-10 18:39:35 | rhettinger | set | status: open -> closed; resolution: not a bug; stage: test needed -> resolved
2020-04-10 18:08:45 | Henry Carscadden | set | messages: + msg366144
2020-04-10 16:43:50 | tim.peters | set | messages: + msg366139
2020-04-10 15:13:50 | Henry Carscadden | set | messages: + msg366132
2020-04-10 10:27:12 | rhettinger | set | messages: + msg366113
2020-04-10 04:51:44 | tim.peters | set | messages: + msg366107
2020-04-10 04:10:15 | Henry Carscadden | set | messages: + msg366103
2020-04-10 03:42:17 | rhettinger | set | assignee: rhettinger
2020-04-09 03:53:45 | tim.peters | set | nosy: + tim.peters; messages: + msg366033
2020-04-09 01:51:40 | Dennis Sweeney | set | versions: + Python 3.9, - Python 3.7
2020-04-09 01:49:09 | Dennis Sweeney | set | nosy: + Dennis Sweeney; messages: + msg366031
2020-04-08 20:38:18 | zach.ware | set | nosy: + rhettinger, - eric.araujo, dstufft; components: + Library (Lib), - Distutils; stage: test needed
2020-04-08 20:12:06 | Henry Carscadden | create |