
itertools.tee doesn't have a __sizeof__ method #63248

Closed
pitrou opened this issue Sep 19, 2013 · 30 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pitrou
Member

pitrou commented Sep 19, 2013

BPO 19048
Nosy @loewis, @rhettinger, @amauryfa, @pitrou, @serhiy-storchaka
Files
  • tee_sizeof.patch
  • gettotalsizeof.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2014-05-30.09:47:38.703>
    created_at = <Date 2013-09-19.09:19:50.648>
    labels = ['type-bug', 'library']
    title = "itertools.tee doesn't have a __sizeof__ method"
    updated_at = <Date 2014-05-30.09:47:38.702>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2014-05-30.09:47:38.702>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2014-05-30.09:47:38.703>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2013-09-19.09:19:50.648>
    creator = 'pitrou'
    dependencies = []
    files = ['31813', '31822']
    hgrepos = []
    issue_num = 19048
    keywords = ['patch']
    message_count = 30.0
    messages = ['198048', '198050', '198051', '198052', '198053', '198054', '198055', '198061', '198065', '198070', '198074', '198087', '198089', '198093', '198094', '198098', '198101', '198102', '198104', '198105', '198106', '198107', '198108', '198110', '198111', '198113', '198114', '198124', '198130', '198178']
    nosy_count = 5.0
    nosy_names = ['loewis', 'rhettinger', 'amaury.forgeotdarc', 'pitrou', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19048'
    versions = ['Python 3.4']

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    An itertools.tee object can cache an arbitrary number of objects (pointers), but its sys.getsizeof() value will always remain the same.
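[Editor's note] A minimal sketch of the behavior being reported, assuming CPython's tee implementation: advancing one tee far ahead forces the other to buffer thousands of pending items, yet `sys.getsizeof()` of the lagging tee reports the same fixed size throughout.

```python
# Sketch (CPython-specific): the lagging tee's reported size never grows,
# no matter how many items its internal buffer is caching.
import sys
import itertools

a, b = itertools.tee(range(10000))
size_before = sys.getsizeof(b)
for _ in range(5000):
    next(a)                      # b's side now caches ~5000 pending items
size_after = sys.getsizeof(b)
# size_before == size_after: the cached items are invisible to getsizeof().
```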

    @pitrou pitrou added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Sep 19, 2013
    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I find the implementation of itertools.tee a bit weird: why does teedataobject have to be a PyObject? It seems to complicate things and make them less optimal.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

Anyway, here is a patch.

    @serhiy-storchaka
    Member

    This is a duplicate of bpo-15475.

    @serhiy-storchaka
    Member

Hmm, no, I'm wrong, it's not a duplicate.

    @serhiy-storchaka
    Member

    I'm not sure that sys.getsizeof() should recursively count all Python subobjects. That is why I had omitted tee() in my patch.

    >>> sys.getsizeof([[]])
    36
    >>> sys.getsizeof([list(range(10000))])
    36

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I'm not sure that sys.getsizeof() should recursively count all Python
    subobjects.

    Those are private subobjects. They are not visible to the programmer
    (except perhaps by calling __reduce__ or __setstate__).

    @serhiy-storchaka
    Member

They are visible by calling gc.get_referents(). A high-level function can use this to compute the recursive size of objects.

    >>> import sys, gc, itertools
    >>> def gettotalsizeof(*args, seen=None):
    ...     if seen is None:
    ...         seen = {}
    ...     sum = 0
    ...     for obj in args:
    ...         if id(obj) not in seen:
    ...             seen[id(obj)] = obj
    ...             sum += sys.getsizeof(obj)
    ...             sum += gettotalsizeof(*gc.get_referents(obj), seen=seen)
    ...     return sum
    ... 
    >>> a, b = itertools.tee(range(10000))
    >>> sum(next(a) for i in range(1000))
    499500
    >>> gettotalsizeof(a)
    750
    >>> gettotalsizeof(b)
    18734

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

They are visible by calling gc.get_referents().

That's totally beside the point. The point is that those
    objects are invisible in normal conditions, not that they can't
    be read using advanced implementation-dependent tricks.

    @serhiy-storchaka
    Member

The point is that your patch breaks functions like gettotalsizeof(). It makes it impossible to get the total size of a general object.

It would be better to add gettotalsizeof() to the stdlib (or add an optional parameter to sys.getsizeof() for recursive counting).

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

The point is that your patch breaks functions like gettotalsizeof().
It makes it impossible to get the total size of a general object.

    The thing is, "Total size" is generally meaningless. It can include
    things such as the object's type, or anything transitively referenced
    by the object, such as modules.

    It will be better to add gettotalsizeof() to the stdlib (or add an
    optional parameter to sys.getsizeof() for recursive counting).

    This patch has *nothing* to do with recursive counting. It counts
    the internal arrays of itertools.tee() as part of its memory size,
    which is reasonable and expected. It does *not* count memory recursively:
    it doesn't count the size of the itertools.tee()'s cached objects,
    for example.

    Recursive counting doesn't make sense with Python. Where do you stop
    counting?
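[Editor's note] A small sketch of the "where do you stop?" problem: following gc.get_referents() from an ordinary function immediately reaches its module's globals dict, so a naive recursive walk soon spans most of the interpreter.

```python
# Sketch: gc.get_referents() on a plain function includes f.__globals__,
# the entire module namespace -- a natural stopping point is not obvious.
import gc

def f():
    pass

refs = gc.get_referents(f)
# The module globals dict is among the referents of any ordinary function.
has_globals = any(r is f.__globals__ for r in refs)
```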

    @amauryfa
    Member

    I like the definition of __sizeof__ that was discussed some time ago:
    http://bugs.python.org/issue14520#msg157798

With that definition (do we have it somewhere in the docs, by the way?)
the current code gives the correct answer.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I like the definition of __sizeof__ that was discussed some time ago:
    http://bugs.python.org/issue14520#msg157798

    The problem is that that definition isn't helpful.
    If we ever change itertools.tee to use non-PyObjects internally, suddenly
    its sys.getsizeof() would have to return much larger numbers despite
    visible behaviour not having changed at all (and despite the memory
    overhead being actually lower).

    And gc.get_referents() is really a low-level debugging tool, certainly
    not a "reflection API" (inspect would serve that role).

    @serhiy-storchaka
    Member

    Isn't sys.getsizeof() a low-level debugging tool?

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    Isn't sys.getsizeof() a low-level debugging tool?

    What would it help debug exactly? :-)
    I would hope it gives remotely useful information about the passed
    object.

    @amauryfa
    Member

    getsizeof() is interesting only if it gives sensible results when used correctly, especially if you want to sum these values and get a global memory usage.

    One usage is to traverse objects through gc.get_referents(); in this case the definition above is correct.

    Now, are you suggesting to traverse objects differently? With dir(), or __dict__?

    (btw, this discussion explains why pypy still does not implement getsizeof())

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    getsizeof() is interesting only if it gives sensible results when used
    correctly, especially if you want to sum these values and get a global
    memory usage.

    "Getting a global memory usage" isn't a correct use of getsizeof(),
    though, because it totally ignores the memory allocation overhead (not
    to mention fragmentation, or any memory areas that may have been
    allocated without being accounted for by __sizeof__).

    If you want global Python memory usage, use sys._debugmallocstats(), not
    sys.getsizeof().

    One usage is to traverse objects through gc.get_referents(); in this
    case the definition above is correct.

    What are the intended semantics? get_referents() can give you references
    you didn't expect, such as type objects, module objects...

    Now, are you suggesting to traverse objects differently? With dir(),
    or __dict__?

    sys.getsizeof() gives you the memory usage of a given Python object, it
    doesn't guarantee that "traversing objects" will give you the right
    answer for any question.

    @loewis
    Mannequin

    loewis mannequin commented Sep 19, 2013

    The problem is that that definition isn't helpful.
If we ever change itertools.tee to use non-PyObjects internally,
suddenly its sys.getsizeof() would have to return much larger numbers
    despite visible behaviour not having changed at all (and despite the
    memory overhead being actually lower).

    I see no problem with that. If the internal representation changes, nobody should be surprised if sizeof changes.

    I would hope it gives remotely useful information about the passed object.

    It certainly does: it reports the memory consumption of the object itself,
    not counting the memory of other objects.

I proposed a precise definition of what "an other object" is. If you don't like it,
please propose a different definition that still allows automatically summing up the
memory of a graph of objects.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I see no problem with that. If the internal representation changes,
    nobody should be surprised if sizeof changes.

    Who is "nobody"? Users aren't aware of internal representation changes.
    It sounds like you want sys.getsizeof() to be a tool for language
    implementors anyway.

I proposed a precise definition of what "an other object" is. If you don't like it,
please propose a different definition that still allows automatically summing up the
memory of a graph of objects.

What is the use case for "summing up the memory of a graph of objects"?
How do you stop walking your graph if it spans the whole graph of Python
objects?

    @serhiy-storchaka
    Member

Well. itertools._tee is one Python object and itertools._tee_dataobject is another Python object. sys.getsizeof() gives you the memory usage of these objects separately.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

Well. itertools._tee is one Python object and
itertools._tee_dataobject is another Python object. sys.getsizeof()
gives you the memory usage of these objects separately.

    This is great... And how do I know that I need to use gc.get_referents()
    to get those objects in case I'm measuring the memory consumption of a
    teeobject (rather than, say, trusting __dict__, or simply trusting the
    getsizeof() output at face value)?

    If sys.getsizeof() is only useful for people who know *already* how an
    object is implemented internally, then it's actually useless, because
    those people can just as well do the calculation themselves.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    Here is why using get_referents() is stupid in the general case:

    >>> class C: pass
    ... 
    >>> c = C()
    >>> gc.get_referents(c)
    [<class '__main__.C'>]

    With your method, measuring c's memory consumption also includes the memory consumption of its type object.
    (and of course this is only a trivial example... one can only imagine what kind of mess it is with a non-trivial object)

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    (By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)

    @serhiy-storchaka
    Member

    How do you stop walking your graph if it spans the whole graph of Python objects?

    We can stop at specific types of objects (for example types and modules).

    If sys.getsizeof() is only useful for people who know *already* how an
    object is implemented internally, then it's actually useless, because
    those people can just as well do the calculation themselves.

That's why sys.getsizeof() is a low-level tool. We need a high-level tool in the stdlib. Even imperfect recursive counting would be better than sys.getsizeof(), which is confusing for novices.

    (By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)

    Yes, I know, and I think it is wrong.

Here is an improved version of gettotalsizeof():

    import sys, gc

    def gettotalsizeof(*args, exclude_types=(type, type(sys))):
        seen = {}
        stack = []
        for obj in args:
            if id(obj) not in seen:
                seen[id(obj)] = obj
                stack.append(obj)
        sum = 0
        while stack:
            obj = stack.pop()
            sum += sys.getsizeof(obj)
            for obj in gc.get_referents(obj):
                if id(obj) not in seen and not isinstance(obj, exclude_types):
                    seen[id(obj)] = obj
                    stack.append(obj)
        return sum

    >>> gettotalsizeof(sys)
    206575
    >>> gettotalsizeof(gc)
    2341
    >>> gettotalsizeof(sys.getsizeof)
    60
    >>> gettotalsizeof(gettotalsizeof)
    60854
    >>> class C: pass
    ... 
    >>> gettotalsizeof(C)
    805
    >>> gettotalsizeof(C())
    28

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

That's why sys.getsizeof() is a low-level tool. We need a high-level
tool in the stdlib. Even imperfect recursive counting would be better
than sys.getsizeof(), which is confusing for novices.

    Ok, but I need to see a satisfying version of "gettotalsizeof" before
    I'm convinced (see below).

    > Here is improved version of gettotalsizeof():
    > 
    [...]
    > >>> gettotalsizeof(gettotalsizeof)
    > 60854

    Why that big? Does it make sense?

    What if say, a large object is "shared" between many small objects?
    Should it count towards the memory size of any of those small objects?
    What if that object is actually immortal (it is also a module global,
    for example)?

    @serhiy-storchaka
    Member

Optionally we could also not count objects which are referenced from outside of the graph of objects (this isn't so easy to implement in Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True) will count only the bare list and the square root of 22, because 1 and 'abc' are interned.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

Optionally we could also not count objects which are referenced from
outside of the graph of objects (this isn't so easy to implement in
Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True)
will count only the bare list and the square root of 22, because 1 and
'abc' are interned.

    That's only part of the equation. What if I have an object which
    references, for example, a logging.Logger? Loggers are actually eternal
    (they live in a global dictionary somewhere in the logging module), but
    gettotalsizeof() will still count it.
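[Editor's note] A sketch of this objection, using only real stdlib API (`logging.getLogger`; the logger name "example" is arbitrary): the logger lives forever in logging's module-level registry, yet a get_referents()-based walk charges its size to our small wrapper list.

```python
# Sketch: a shared, effectively immortal logger inflates the walked total
# of a tiny list that merely references it.
import gc
import logging
import sys

holder = [logging.getLogger("example")]
own_size = sys.getsizeof(holder)
walked_size = own_size + sum(sys.getsizeof(o) for o in gc.get_referents(holder))
# walked_size exceeds own_size solely because of the shared logger.
```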

    @serhiy-storchaka
    Member

Here is an advanced function which counts only objects on which there are no external references.

    >>> import itertools
    >>> a, b = itertools.tee(range(10000))
    >>> max(zip(a, range(100)))
    (99, 99)
    >>> sys.getsizeof(a)
    32
    >>> gettotalinnersizeof(a)
    32
    >>> gettotalinnersizeof(b)
    292
    >>> gettotalinnersizeof(a, b)
    608

The total size of a and b is larger than the sum of the sizes of a and b, because it includes the size of one teedataobject shared between a and b, and one shared range iterator.
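[Editor's note] The sharing itself can be observed with gc.get_referents(); this sketch relies on a CPython implementation detail (two fresh tees initially point at one shared teedataobject).

```python
# Sketch (CPython detail): a and b share a referent, so counting them
# independently would bill the shared buffer twice, while counting them
# together bills it once -- hence the non-additive totals above.
import gc
import itertools

a, b = itertools.tee(range(10))
shared = set(map(id, gc.get_referents(a))) & set(map(id, gc.get_referents(b)))
# Non-empty set: at least one object is referenced by both tees.
```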

    @loewis
    Mannequin

    loewis mannequin commented Sep 20, 2013

    Antoine: in (my experience of) memory analysis, the size of a single object is mostly irrelevant. If you need to know how much memory something consumes, you typically want to know the memory of a set of objects. So this is the case that really must be supported.

For that, users will have to use libraries that know how to count memory. It's not asking too much that the authors of such libraries know about the internals of Python (such as the existence of sys.getsizeof, or gc.get_referents). The question is: can such a library be reasonably implemented? For that, it is important that getsizeof behaves uniformly across objects.

If you really don't like the proposed uniformity, please propose a different rule. However, don't give deviations in other places (OrderedDict) as a reason to break the rule here as well. Instead, if OrderedDict.__sizeof__ is broken, it needs to be fixed as well.

    @rhettinger rhettinger self-assigned this Sep 20, 2013
    @rhettinger
    Contributor

    getsizeof() is interesting only if it gives sensible results
    when used correctly, especially if you want to sum these values
    and get a global memory usage.

    If accounting for global memory usage is a goal, it needs to have a much more comprehensively thought out, implementation dependent approach. There are many issues (memory fragmentation, key-sharing dictionaries, dummy objects, list over-allocation, the minsize dictionary that is part of the dict object in addition to its variable sized portion, non-python objects held by Python objects, the extra few bytes per object consumed by the freelisting scheme in Objects/obmalloc.c etc).
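[Editor's note] One item on this list, list over-allocation, is easy to demonstrate: CPython grows a list's capacity in geometric chunks, so getsizeof() jumps in steps rather than per element.

```python
# Sketch: 100 appends produce far fewer than 100 distinct getsizeof()
# values, because capacity is over-allocated ahead of need.
import sys

xs = []
distinct_sizes = set()
for i in range(100):
    xs.append(i)
    distinct_sizes.add(sys.getsizeof(xs))
# Size changes only at the capacity-growth steps.
```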

    The thing is, "Total size" is generally meaningless.

    I concur. This is a pipe dream without a serious investment of time and without creating a new and unnecessary maintenance burden.

    (By the way, OrderedDict.__sizeof__ already breaks the
    rule you are trying to impose)

    FWIW, the way OrderedDict computes sizeof is probably typical of how anyone is currently using sys.getsizeof(). If you change the premise of how it operates, you're probably going to break the code written by the very few people in the world who care about sys.getsizeof():

        def __sizeof__(self):
            sizeof = _sys.getsizeof
            n = len(self) + 1                       # number of links including root
            size = sizeof(self.__dict__)            # instance dictionary
            size += sizeof(self.__map) * 2          # internal dict and inherited dict
            size += sizeof(self.__hardroot) * n     # link objects
            size += sizeof(self.__root) * n         # proxy objects
            return size

I don't have any specific recommendation for itertools.tee other than that I think it doesn't really need a __sizeof__ method. The typical uses of tee are transient phenomena that temporarily use some memory and then disappear. I'm not sure that any mid-stream sizeof checks reveal information of any worth.

    Overall, this thread indicates that the entire concept of __sizeof__ has been poorly defined, unevenly implemented, and not really useful when aggregated.

    For those who are interested in profiling and optimizing Python's memory usage, I think we would be much better off providing a memory allocator hook that can know about every memory allocation and how those allocations have been arranged (revealing the fragmentation of the unused memory in the spaces between). Almost anything short of that will provide a grossly misleading picture of memory usage.
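[Editor's note] The tracemalloc module, added in Python 3.4 shortly after this discussion, provides an allocator-level view roughly in this spirit:

```python
# Sketch: tracemalloc hooks the allocator, so its totals reflect actual
# allocations rather than per-object __sizeof__ sums.
import tracemalloc

tracemalloc.start()
data = [bytes(10000) for _ in range(100)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
# `current` covers roughly 100 * 10000 bytes of payload plus overhead.
```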
