classification
Title: Itertools.product() Out of Memory Errors
Type: resource usage Stage: resolved
Components: Library (Lib) Versions: Python 3.9
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dennis Sweeney, Henry Carscadden, rhettinger, tim.peters
Priority: normal Keywords:

Created on 2020-04-08 20:12 by Henry Carscadden, last changed 2020-04-10 18:39 by rhettinger. This issue is now closed.

Messages (9)
msg366005 - (view) Author: Henry Carscadden (Henry Carscadden) Date: 2020-04-08 20:12
The product method in itertools provides an implementation of the Cartesian product that when run on with many arguments quickly gives out of memory errors. The current implementation creates a lot of unnecessary lists in this situation. A more appropriate implementation uses dynamic programming to avoid these out of memory issues.
msg366031 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python triager) Date: 2020-04-09 01:49
The trouble is that itertools.product accepts iterators, and there is no guaranteed way of "restarting" an arbitrary iterator in Python. Consider:

    >>> a = iter([1,2,3])
    >>> b = iter([4,5,6])
    >>> next(a)
    1
    >>> next(b)
    4
    >>> from itertools import product
    >>> list(product(a, b))
    [(2, 5), (2, 6), (3, 5), (3, 6)]

Since there's no way to get back to items you've already consumed, the current approach is to consume all of the iterators to begin with and store their items in arrays, then lazily produce tuples of the items at the right indices of those arrays.

Perhaps one could consume lazily from the iterators, say, only filling up the pools as they're needed and not storing the contents of the first iterator, but this would mean sometimes producing a product iterator that was doomed to cause a memory error eventually. If you really need this behavior you could do this in Python:

    def lazy_product(*iterables):
        if not iterables:
            yield ()
            return
        it0 = iterables[0]
        for x in it0:
            print(f"{x=}")
            for rest in lazy_product(*iterables[1:]):
                print(f"{rest=}")
                yield (x,) + rest

The above could surely be optimized as maybe you're suggesting, but this would be a backward-incompatible change for itertools.product.
msg366033 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-04-09 03:53
Possibly related:

    https://bugs.python.org/issue10109

Henry, I'm not clear at all about what you're saying.  Please give at least one specific, concrete example of the behavior you're objecting to, and specify the behavior you want to see instead.
msg366103 - (view) Author: Henry Carscadden (Henry Carscadden) Date: 2020-04-10 04:10
Hey, Tim, I just wanted a note of clarification. I was working on an approximation algorithm for a network science tool that is being released soon. Initially, I relied on itertools.product(), but when I moved to testing on non-trivial graphs, I immediately had Out of Memory Errors. This was even on the nodes with large memories on the computing cluster that I was using. The issue is a lot of extra lists are added as the number of iterables passed product grows although the actual number of elements in the iterables is relatively small. 
Here is the workaround that I used to handle many arguments:

def product(*args) -> Iterable:
    '''
    Takes in several iterables and returns a generator
    that yields the cartesian product of the elements in the iterables
    :param args: sets which you would like the cartesian product of
    :return: generator
    '''
    if len(args) == 1 and type(args) == list:
        return args
    numArgs = len(args)
    indices = np.zeros(numArgs, dtype=np.uint)
    checkLen = len(args[0])
    while indices[0] < checkLen:
        curr_term = [args[0][indices[0]]]
        for i in range(1, numArgs):
            curr_term.append(args[i][indices[i]])
        indices[-1] += np.uint(1)
        for i in range(1, numArgs):
            updated = False
            if indices[-i] >= len(args[-i]):
                indices[-i] = np.uint(0)
                indices[-i - 1] += np.uint(1)
                updated = True
            if not updated:
                break
        yield curr_term
msg366107 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-04-10 04:51
Henry, I have to ask again:  please give at least one specific, concrete example of the behavior you're objecting to.  English isn't helping, and I still have no idea what your complaint is.

What I'm looking for:

    for i in itertools.product(???):
        pass
 
where you replace the ??? with executable code (preferably not using numpy or any other extension package) that provokes the MemoryError you're talking about.

For example, here I'm constructing a case with a million arguments.  There's no problem at all:

>>> import itertools
>>> args = [range(100) for i in range(1_000_000)]
>>> for i in itertools.product(*args):
...     print(len(i))
[and it prints 1000000 over & over until I get bored and kill it]

Note if it _did_ provoke a problem, we probably wouldn't care - there are no plausible realistic use cases for passing a million arguments to product().

You haven't given us a way to reproduce your complaint, or even a clue as to the number of arguments you're talking about.  The above code was my best guess as to what you _might_ be talking about.  But since there's no MemoryError in sight, apparently not.

I'm suspecting this may be an XY problem[1], and especially because you posted "a solution" instead of an actual problem ;-)

[1] https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
msg366113 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-04-10 10:27
> when I moved to testing on non-trivial graphs, I immediately had 
> Out of Memory Errors.

I'm going to hazard a guess that the input to product() was a graph traversal iterator that got trapped in an undetected cycle.  Feeding an infinite iterator into product() would eat all available memory, just as it would with list().
msg366132 - (view) Author: Henry Carscadden (Henry Carscadden) Date: 2020-04-10 15:13
@Tim Peters
For example,
the following should reproduce the error:
many_arguments = [[1,2] for i in range(50)]
for term in product(*many_arguments):
    print(term)
In my application, I was taking the Cartesian product of the sets of all simple path to several nodes. This regularly produced a lot of arguments and crashed the compute nodes on which my jobs were running. Perhaps, this is not a salient concern for most uses, but I just wanted to make you aware of the issue.
msg366139 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2020-04-10 16:43
Henry, no, I see no problem while running your example.  It's been running on my box for over 5 minutes now, and memory use remains trivial.

Note that in the code I posted for you, instead of [1, 2] I used range(100), and instead of 50 I used a million:  the same kind of thing, but I used _far_ larger numbers.  And still didn't have a problem.  Although, yes, that did consume about a gigabyte to materialize a million instances of tuple(range(100)) under the covers.

The code you posted also works fine for me, and with minor memory consumption, if I replace your "50" with "1000000":

>>> many_arguments = [[1,2] for i in range(1000000)]
>>> for term in product(*many_arguments):
...     pass

100% of a CPU is used for as long as I let it run, but memory use jumps just a little at the start.  Which is what I expected.  It just doesn't take all that much memory to create a million 2-tuples - which is what the `product()` implementation does.

I believe you saw a MemoryError, but at this point I have to guess you misdiagnosed the cause.  In any case, if you can't supply an example that reproduces the problem, we're going to have to close this.

Perhaps there's some weird memory-allocation flaw on the _platform_ you're using?  Extreme fragmentation?  Without an example that exhibits a problem, there's just no way to guess from here :-(
msg366144 - (view) Author: Henry Carscadden (Henry Carscadden) Date: 2020-04-10 18:08
Tim, I'm guessing it was a misdiagnosis then. In any case, something changed about my codebase that alleviated the problem. Thanks for looking into it.
History
Date User Action Args
2020-04-10 18:39:35rhettingersetstatus: open -> closed
resolution: not a bug
stage: test needed -> resolved
2020-04-10 18:08:45Henry Carscaddensetmessages: + msg366144
2020-04-10 16:43:50tim.peterssetmessages: + msg366139
2020-04-10 15:13:50Henry Carscaddensetmessages: + msg366132
2020-04-10 10:27:12rhettingersetmessages: + msg366113
2020-04-10 04:51:44tim.peterssetmessages: + msg366107
2020-04-10 04:10:15Henry Carscaddensetmessages: + msg366103
2020-04-10 03:42:17rhettingersetassignee: rhettinger
2020-04-09 03:53:45tim.peterssetnosy: + tim.peters
messages: + msg366033
2020-04-09 01:51:40Dennis Sweeneysetversions: + Python 3.9, - Python 3.7
2020-04-09 01:49:09Dennis Sweeneysetnosy: + Dennis Sweeney
messages: + msg366031
2020-04-08 20:38:18zach.waresetnosy: + rhettinger, - eric.araujo, dstufft

components: + Library (Lib), - Distutils
stage: test needed
2020-04-08 20:12:06Henry Carscaddencreate