Issue 40230
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.
Created on 2020-04-08 20:12 by Henry Carscadden, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (9)

msg366005 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-08 20:12
The product() function in itertools provides an implementation of the Cartesian product that, when run with many arguments, quickly gives out-of-memory errors. The current implementation creates a lot of unnecessary lists in this situation. A more appropriate implementation would use dynamic programming to avoid these out-of-memory issues.
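As context for the report (an editorial sketch, not part of the original message): itertools.product() materializes every input iterable up front, at construction time, which is where the memory for many or large inputs goes.

```python
import itertools

consumed = []

def tracked(n):
    # Generator that records each item the moment product() pulls it.
    for i in range(n):
        consumed.append(i)
        yield i

# Merely constructing the product drains both input iterators in full:
p = itertools.product(tracked(3), tracked(3))
print(consumed)   # [0, 1, 2, 0, 1, 2] -- both inputs already exhausted

first = next(p)
print(first)      # (0, 0) -- tuples themselves are produced lazily
```

Construction is eager, iteration is lazy: the memory cost is proportional to the total number of input elements, not to the (much larger) number of output tuples.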

---

msg366031 | Author: Dennis Sweeney (Dennis Sweeney) | Date: 2020-04-09 01:49
The trouble is that itertools.product accepts iterators, and there is no guaranteed way of "restarting" an arbitrary iterator in Python. Consider:

```python
>>> a = iter([1, 2, 3])
>>> b = iter([4, 5, 6])
>>> next(a)
1
>>> next(b)
4
>>> from itertools import product
>>> list(product(a, b))
[(2, 5), (2, 6), (3, 5), (3, 6)]
```

Since there's no way to get back to items you've already consumed, the current approach is to consume all of the iterators to begin with and store their items in arrays, then lazily produce tuples of the items at the right indices of those arrays.

Perhaps one could consume lazily from the iterators, say, only filling up the pools as they're needed and not storing the contents of the first iterator, but this would mean sometimes producing a product iterator that was doomed to cause a memory error eventually. If you really need this behavior, you could do it in Python:

```python
def lazy_product(*iterables):
    if not iterables:
        yield ()
        return
    it0 = iterables[0]
    for x in it0:
        for rest in lazy_product(*iterables[1:]):
            yield (x,) + rest
```

The above could surely be optimized, as maybe you're suggesting, but this would be a backward-incompatible change for itertools.product.
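An editorial sketch of why such a lazy variant would be backward-incompatible: with one-shot iterators it silently diverges from itertools.product, because the non-first inputs get iterated once per element of the first input and cannot be restarted.

```python
import itertools

def lazy_product(*iterables):
    # Recursive lazy Cartesian product: never stores the first iterable.
    if not iterables:
        yield ()
        return
    for x in iterables[0]:
        for rest in lazy_product(*iterables[1:]):
            yield (x,) + rest

# With restartable inputs (lists) the two agree:
lazy = list(lazy_product([1, 2], [3, 4]))
eager = list(itertools.product([1, 2], [3, 4]))
print(lazy == eager)   # True

# With one-shot iterators they do not: the inner iterator is exhausted
# after the first pass and yields nothing for x = 2.
broken = list(lazy_product(iter([1, 2]), iter([3, 4])))
print(broken)          # [(1, 3), (1, 4)]
```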

---

msg366033 | Author: Tim Peters (tim.peters) | Date: 2020-04-09 03:53
Possibly related: https://bugs.python.org/issue10109

Henry, I'm not clear at all about what you're saying. Please give at least one specific, concrete example of the behavior you're objecting to, and specify the behavior you want to see instead.

---

msg366103 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-10 04:10
Hey, Tim, I just wanted to add a note of clarification. I was working on an approximation algorithm for a network science tool that is being released soon. Initially, I relied on itertools.product(), but when I moved to testing on non-trivial graphs, I immediately had out-of-memory errors. This happened even on the large-memory nodes of the computing cluster that I was using. The issue is that a lot of extra lists are created as the number of iterables passed to product() grows, although the actual number of elements in the iterables is relatively small. Here is the workaround that I used to handle many arguments:

```python
def product(*args) -> Iterable:
    '''
    Takes in several iterables and returns a generator that yields the
    cartesian product of the elements in the iterables
    :param args: sets which you would like the cartesian product of
    :return: generator
    '''
    if len(args) == 1 and type(args) == list:
        return args
    numArgs = len(args)
    indices = np.zeros(numArgs, dtype=np.uint)
    checkLen = len(args[0])
    while indices[0] < checkLen:
        curr_term = [args[0][indices[0]]]
        for i in range(1, numArgs):
            curr_term.append(args[i][indices[i]])
        indices[-1] += np.uint(1)
        for i in range(1, numArgs):
            updated = False
            if indices[-i] >= len(args[-i]):
                indices[-i] = np.uint(0)
                indices[-i - 1] += np.uint(1)
                updated = True
            if not updated:
                break
        yield curr_term
```

---

msg366107 | Author: Tim Peters (tim.peters) | Date: 2020-04-10 04:51
Henry, I have to ask again: please give at least one specific, concrete example of the behavior you're objecting to. English isn't helping, and I still have no idea what your complaint is. What I'm looking for:

```python
for i in itertools.product(???):
    pass
```

where you replace the ??? with executable code (preferably not using numpy or any other extension package) that provokes the MemoryError you're talking about.

For example, here I'm constructing a case with a million arguments. There's no problem at all:

```python
>>> import itertools
>>> args = [range(100) for i in range(1_000_000)]
>>> for i in itertools.product(*args):
...     print(len(i))
```

[and it prints 1000000 over & over until I get bored and kill it]

Note if it _did_ provoke a problem, we probably wouldn't care - there are no plausible realistic use cases for passing a million arguments to product().

You haven't given us a way to reproduce your complaint, or even a clue as to the number of arguments you're talking about. The above code was my best guess as to what you _might_ be talking about. But since there's no MemoryError in sight, apparently not.

I'm suspecting this may be an XY problem[1], and especially because you posted "a solution" instead of an actual problem ;-)

[1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

---

msg366113 | Author: Raymond Hettinger (rhettinger) | Date: 2020-04-10 10:27
> when I moved to testing on non-trivial graphs, I immediately had
> Out of Memory Errors.

I'm going to hazard a guess that the input to product() was a graph traversal iterator that got trapped in an undetected cycle. Feeding an infinite iterator into product() would eat all available memory, just as it would with list().
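An editorial sketch of the kind of trap being described here (the graph and walk function are hypothetical): a cycle turns a traversal into an infinite iterator, and product(), like list(), must exhaust its inputs before yielding anything. islice() caps the walk so the sketch terminates.

```python
import itertools

# Hypothetical two-node graph containing a cycle: a -> b -> a -> ...
graph = {"a": ["b"], "b": ["a"]}

def walk(node):
    # Naive traversal with no visited-set: never terminates on a cycle.
    while True:
        yield node
        node = graph[node][0]

# Feeding walk("a") straight into product() would allocate without bound,
# exactly like list(walk("a")) would.  Capped with islice() for safety:
steps = list(itertools.islice(walk("a"), 6))
print(steps)   # ['a', 'b', 'a', 'b', 'a', 'b']
```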

---

msg366132 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-10 15:13
@Tim Peters For example, the following should reproduce the error:

```python
many_arguments = [[1, 2] for i in range(50)]
for term in product(*many_arguments):
    print(term)
```

In my application, I was taking the Cartesian product of the sets of all simple paths to several nodes. This regularly produced a lot of arguments and crashed the compute nodes on which my jobs were running. Perhaps this is not a salient concern for most uses, but I just wanted to make you aware of the issue.
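For scale (an editorial aside): this example is tiny in memory but astronomically large in output. The product iterator stores only the 50 two-element pools, yet a full iteration would yield 2**50 tuples, so printing them all is infeasible regardless of memory.

```python
import itertools
import math

many_arguments = [[1, 2] for i in range(50)]
p = itertools.product(*many_arguments)

# Stored state: 50 pools of 2 elements.  Yielded output: 2**50 tuples.
total = math.prod(len(pool) for pool in many_arguments)
print(total)        # 1125899906842624 == 2**50

first = next(p)
print(first[:5])    # (1, 1, 1, 1, 1) -- tuples arrive lazily, one at a time
```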

---

msg366139 | Author: Tim Peters (tim.peters) | Date: 2020-04-10 16:43
Henry, no, I see no problem while running your example. It's been running on my box for over 5 minutes now, and memory use remains trivial.

Note that in the code I posted for you, instead of [1, 2] I used range(100), and instead of 50 I used a million: the same kind of thing, but I used _far_ larger numbers. And still didn't have a problem. Although, yes, that did consume about a gigabyte to materialize a million instances of tuple(range(100)) under the covers.

The code you posted also works fine for me, and with minor memory consumption, if I replace your "50" with "1000000":

```python
>>> many_arguments = [[1, 2] for i in range(1000000)]
>>> for term in product(*many_arguments):
...     pass
```

100% of a CPU is used for as long as I let it run, but memory use jumps just a little at the start. Which is what I expected. It just doesn't take all that much memory to create a million 2-tuples - which is what the product() implementation does.

I believe you saw a MemoryError, but at this point I have to guess you misdiagnosed the cause. In any case, if you can't supply an example that reproduces the problem, we're going to have to close this.

Perhaps there's some weird memory-allocation flaw on the _platform_ you're using? Extreme fragmentation? Without an example that exhibits a problem, there's just no way to guess from here :-(
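A rough editorial check of this point (using 100,000 arguments rather than the million above, to keep the sketch light): the iterator's stored state is essentially one small pool tuple per argument, so memory stays in the megabytes even as the number of output tuples explodes.

```python
import itertools
import sys

n = 100_000   # the run above used 1_000_000; smaller here to keep this cheap
many_arguments = [[1, 2] for i in range(n)]
p = itertools.product(*many_arguments)

# product() stores one 2-element pool per argument: megabytes, not gigabytes.
approx_bytes = n * sys.getsizeof((1, 2))
print(approx_bytes // 2**20, "MiB, rough lower bound for the stored pools")

first = next(p)
print(len(first))   # 100000 -- each output tuple is built lazily on demand
```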

---

msg366144 | Author: Henry Carscadden (Henry Carscadden) | Date: 2020-04-10 18:08
Tim, I'm guessing it was a misdiagnosis then. In any case, something changed about my codebase that alleviated the problem. Thanks for looking into it.
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:59:29 | admin | set | github: 84411
2020-04-10 18:39:35 | rhettinger | set | status: open -> closed; resolution: not a bug; stage: test needed -> resolved
2020-04-10 18:08:45 | Henry Carscadden | set | messages: + msg366144
2020-04-10 16:43:50 | tim.peters | set | messages: + msg366139
2020-04-10 15:13:50 | Henry Carscadden | set | messages: + msg366132
2020-04-10 10:27:12 | rhettinger | set | messages: + msg366113
2020-04-10 04:51:44 | tim.peters | set | messages: + msg366107
2020-04-10 04:10:15 | Henry Carscadden | set | messages: + msg366103
2020-04-10 03:42:17 | rhettinger | set | assignee: rhettinger
2020-04-09 03:53:45 | tim.peters | set | nosy: + tim.peters; messages: + msg366033
2020-04-09 01:51:40 | Dennis Sweeney | set | versions: + Python 3.9, - Python 3.7
2020-04-09 01:49:09 | Dennis Sweeney | set | nosy: + Dennis Sweeney; messages: + msg366031
2020-04-08 20:38:18 | zach.ware | set | nosy: + rhettinger, - eric.araujo, dstufft; components: + Library (Lib), - Distutils; stage: test needed
2020-04-08 20:12:06 | Henry Carscadden | create |