classification
Title: itertools.tee doesn't have a __sizeof__ method
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: amaury.forgeotdarc, loewis, pitrou, rhettinger, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-09-19 09:19 by pitrou, last changed 2014-05-30 09:47 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
tee_sizeof.patch pitrou, 2013-09-19 10:06 review
gettotalsizeof.py serhiy.storchaka, 2013-09-20 00:29
Messages (30)
msg198048 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 09:19
An itertools.tee object can cache an arbitrary number of objects (pointers), but its sys.getsizeof() value will always remain the same.
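The report is easy to reproduce in a short session (exact byte counts vary by platform and build, so none are shown here):

```python
import sys
import itertools

a, b = itertools.tee(range(10000))
size_before = sys.getsizeof(a)

# Advance one branch far ahead; tee must now cache thousands of item
# pointers internally so that the other branch can still yield them.
for _ in range(5000):
    next(a)

size_after = sys.getsizeof(a)
print(size_before == size_after)  # the cache is invisible to getsizeof()
```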
msg198050 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 10:04
I find the implementation of itertools.tee a bit weird: why does teedataobject have to be a PyObject? It seems to complicate things and make them less optimal.
msg198051 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 10:06
Anyway, here is a patch.
msg198052 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 10:43
This is a duplicate of issue15475.
msg198053 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 10:49
Hmm, no, I'm wrong, it's not a duplicate.
msg198054 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 11:02
I'm not sure that sys.getsizeof() should recursively count all Python subobjects. That is why I had omitted tee() in my patch.

>>> sys.getsizeof([[]])
36
>>> sys.getsizeof([list(range(10000))])
36
msg198055 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 11:05
> I'm not sure that sys.getsizeof() should recursively count all Python
> subobjects.

Those are private subobjects. They are not visible to the programmer
(except perhaps by calling __reduce__ or __setstate__).
msg198061 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 11:48
They are visible by calling gc.get_referents(). A high-level function can use this to count the recursive size of objects.

>>> import sys, gc, itertools
>>> def gettotalsizeof(*args, seen=None):
...     if seen is None:
...         seen = {}
...     sum = 0
...     for obj in args:
...         if id(obj) not in seen:
...             seen[id(obj)] = obj
...             sum += sys.getsizeof(obj)
...             sum += gettotalsizeof(*gc.get_referents(obj), seen=seen)
...     return sum
... 
>>> a, b = itertools.tee(range(10000))
>>> sum(next(a) for i in range(1000))
499500
>>> gettotalsizeof(a)
750
>>> gettotalsizeof(b)
18734
msg198065 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 13:03
> They are visible by calling gc.get_referents().

That's totally beside the point. The point is that those
objects are invisible in normal conditions, not that they can't
be read using advanced implementation-dependent tricks.
msg198070 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 13:39
The point is that your patch breaks functions like gettotalsizeof(). It makes it impossible to get the total size of a general object.

It would be better to add gettotalsizeof() to the stdlib (or add an optional parameter to sys.getsizeof() for recursive counting).
msg198074 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 14:02
> The point is that your patch breaks functions like gettotalsizeof().
> It makes it impossible to get the total size of a general object.

The thing is, "Total size" is generally meaningless. It can include
things such as the object's type, or anything transitively referenced
by the object, such as modules.

> It would be better to add gettotalsizeof() to the stdlib (or add an
> optional parameter to sys.getsizeof() for recursive counting).

This patch has *nothing* to do with recursive counting. It counts
the internal arrays of itertools.tee() as part of its memory size,
which is reasonable and expected. It does *not* count memory recursively:
it doesn't count the size of the itertools.tee()'s cached objects,
for example.

Recursive counting doesn't make sense with Python. Where do you stop
counting?
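The distinction can be illustrated with list, whose sizeof already behaves the way the patch proposes for tee: the internal pointer array is counted, the referenced elements are not.

```python
import sys

# The internal pointer array grows with the number of elements...
print(sys.getsizeof([None] * 10) < sys.getsizeof([None] * 10000))  # True

# ...but the elements themselves are never included: a one-element list
# reports the same size whether that element is tiny or huge.
print(sys.getsizeof([[]]) == sys.getsizeof([list(range(10000))]))  # True
```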
msg198087 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-09-19 15:47
I like the definition of __sizeof__ that was discussed some time ago:
http://bugs.python.org/issue14520#msg157798

With that definition (do we have it somewhere in the docs, by the way?)
the current code gives the correct answer.
msg198089 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 15:56
> I like the definition of __sizeof__ that was discussed some time ago:
> http://bugs.python.org/issue14520#msg157798

The problem is that that definition isn't helpful.
If we ever change itertools.tee to use non-PyObjects internally, suddenly
its sys.getsizeof() would have to return much larger numbers despite
visible behaviour not having changed at all (and despite the memory
overhead being actually lower).

And gc.get_referents() is really a low-level debugging tool, certainly
not a "reflection API" (inspect would serve that role).
msg198093 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 16:24
Isn't sys.getsizeof() a low-level debugging tool?
msg198094 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 17:06
> Isn't sys.getsizeof() a low-level debugging tool?

What would it help debug exactly? :-)
I would hope it gives remotely useful information about the passed
object.
msg198098 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2013-09-19 17:47
getsizeof() is interesting only if it gives sensible results when used correctly, especially if you want to sum these values and get a global memory usage.

One usage is to traverse objects through gc.get_referents(); in this case the definition above is correct.

Now, are you suggesting to traverse objects differently? With dir(), or __dict__?

(btw, this discussion explains why pypy still does not implement getsizeof())
msg198101 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:05
> getsizeof() is interesting only if it gives sensible results when used
> correctly, especially if you want to sum these values and get a global
> memory usage.

"Getting a global memory usage" isn't a correct use of getsizeof(),
though, because it totally ignores the memory allocation overhead (not
to mention fragmentation, or any memory areas that may have been
allocated without being accounted for by __sizeof__).

If you want global Python memory usage, use sys._debugmallocstats(), not
sys.getsizeof().

> One usage is to traverse objects through gc.get_referents(); in this
> case the definition above is correct.

What are the intended semantics? get_referents() can give you references
you didn't expect, such as type objects, module objects...

> Now, are you suggesting to traverse objects differently? With dir(),
> or __dict__?

sys.getsizeof() gives you the memory usage of a given Python object, it
doesn't guarantee that "traversing objects" will give you the right
answer for any question.
msg198102 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-09-19 18:06
> The problem is that that definition isn't helpful.
> If we ever change itertools.tee to use non-PyObjects internally,
> suddenly its sys.getsizeof() would have to return much larger numbers
> despite visible behaviour not having changed at all (and despite the 
> memory overhead being actually lower).

I see no problem with that. If the internal representation changes, nobody should be surprised if sizeof changes.

> I would hope it gives remotely useful information about the passed object.

It certainly does: it reports the memory consumption of the object itself, 
not counting the memory of other objects.

I proposed a precise definition of what "another object" is. If you don't like it,
please propose a different definition that still allows one to automatically sum up
the memory of a graph of objects.
msg198104 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:09
> I see no problem with that. If the internal representation changes,
> nobody should be surprised if sizeof changes.

Who is "nobody"? Users aren't aware of internal representation changes.
It sounds like you want sys.getsizeof() to be a tool for language
implementors anyway.

> I proposed a precise definition of what "another object" is. If you don't like it,
> please propose a different definition that still allows one to automatically sum up
> the memory of a graph of objects.

What is the use case for "summing up the memory of a graph of objects"?
How do you stop walking your graph if it spans the whole graph of Python
objects?
msg198105 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 18:11
Well, itertools._tee is one Python object and itertools._tee_dataobject is another Python object. sys.getsizeof() gives you the memory usage of these objects separately.
msg198106 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:15
> Well, itertools._tee is one Python object and
> itertools._tee_dataobject is another Python object. sys.getsizeof()
> gives you the memory usage of these objects separately.

This is great... And how do I know that I need to use gc.get_referents()
to get those objects in case I'm measuring the memory consumption of a
teeobject (rather than, say, trusting __dict__, or simply trusting the
getsizeof() output at face value)?

If sys.getsizeof() is only useful for people who know *already* how an
object is implemented internally, then it's actually useless, because
those people can just as well do the calculation themselves.
msg198107 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:17
Here is why using get_referents() is stupid in the general case:

>>> class C: pass
... 
>>> c = C()
>>> gc.get_referents(c)
[<class '__main__.C'>]

With your method, measuring c's memory consumption also includes the memory consumption of its type object.
(and of course this is only a trivial example... one can only imagine what kind of mess it is with a non-trivial object)
msg198108 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:19
(By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)
msg198110 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 18:32
> How do you stop walking your graph if it spans the whole graph of Python objects?

We can stop at specific types of objects (for example types and modules).

> If sys.getsizeof() is only useful for people who know *already* how an
> object is implemented internally, then it's actually useless, because
> those people can just as well do the calculation themselves.

That's why sys.getsizeof() is a low-level tool. We need a high-level tool in the stdlib. Even imperfect recursive counting would be better than sys.getsizeof(), which is confusing for novices.

> (By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)

Yes, I know, and I think it is wrong.

Here is an improved version of gettotalsizeof():

def gettotalsizeof(*args, exclude_types=(type, type(sys))):
    seen = {}
    stack = []
    for obj in args:
        if id(obj) not in seen:
            seen[id(obj)] = obj
            stack.append(obj)
    sum = 0
    while stack:
        obj = stack.pop()
        sum += sys.getsizeof(obj)
        for obj in gc.get_referents(obj):
            if id(obj) not in seen and not isinstance(obj, exclude_types):
                seen[id(obj)] = obj
                stack.append(obj)
    return sum


>>> gettotalsizeof(sys)
206575
>>> gettotalsizeof(gc)
2341
>>> gettotalsizeof(sys.getsizeof)
60
>>> gettotalsizeof(gettotalsizeof)
60854
>>> class C: pass
... 
>>> gettotalsizeof(C)
805
>>> gettotalsizeof(C())
28
msg198111 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:39
> It's why sys.getsizeof() is a low-level tool. We need high-level tool
> in the stdlib. Even imperfect recursive counting will be better than
> confusing for novices sys.getsizeof().

Ok, but I need to see a satisfying version of "gettotalsizeof" before
I'm convinced (see below).

> Here is improved version of gettotalsizeof():
> 
[...]
> >>> gettotalsizeof(gettotalsizeof)
> 60854

Why that big? Does it make sense?

What if, say, a large object is "shared" between many small objects?
Should it count towards the memory size of any of those small objects?
What if that object is actually immortal (it is also a module global,
for example)?
msg198113 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-19 18:44
Optionally we can also not count objects which are referenced from outside of the graph of objects (this isn't so easy to implement in Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True) will count only the bare list and the square root of 22, because 1 and 'abc' are interned.
msg198114 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-19 18:45
> Optionally we can also not count objects which are referenced from
> outside of the graph of objects (this isn't so easy to implement in
> Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True)
> will count only the bare list and the square root of 22, because 1 and
> 'abc' are interned.

That's only part of the equation. What if I have an object which
references, for example, a logging.Logger? Loggers are actually eternal
(they live in a global dictionary somewhere in the logging module), but
gettotalsizeof() will still count it.
msg198124 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-20 00:29
Here is a more advanced function which counts only objects on which there are no external references.

>>> import itertools
>>> a, b = itertools.tee(range(10000))
>>> max(zip(a, range(100)))
(99, 99)
>>> sys.getsizeof(a)
32
>>> gettotalinnersizeof(a)
32
>>> gettotalinnersizeof(b)
292
>>> gettotalinnersizeof(a, b)
608

The total size of a and b together is larger than the sum of their individual sizes because it includes the size of the one teedataobject shared between a and b, and of the one shared range iterator.
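The uploaded gettotalsizeof.py is not quoted in the thread, so here is a rough reconstruction of the idea (the traversal strategy and the external-reference test below are my guesses, not Serhiy's actual code): collect the reachable graph, then count a non-root object only when every gc-tracked referrer is itself inside that graph.

```python
import gc
import sys
import types

def gettotalinnersizeof(*args):
    # Collect the transitive closure of objects reachable from args.
    seen = {}
    stack = list(args)
    while stack:
        obj = stack.pop()
        if id(obj) not in seen:
            seen[id(obj)] = obj
            stack.extend(gc.get_referents(obj))

    roots = {id(obj) for obj in args}
    bookkeeping = {id(seen), id(stack), id(args)}
    total = 0
    for obj in seen.values():
        if id(obj) in roots:
            total += sys.getsizeof(obj)
            continue
        # Count a non-root object only if every gc-tracked referrer is
        # itself inside the graph; ignore our own bookkeeping containers
        # and stack frames, which hold purely temporary references.
        referrers = [r for r in gc.get_referrers(obj)
                     if id(r) not in bookkeeping
                     and not isinstance(r, types.FrameType)]
        if all(id(r) in seen for r in referrers):
            total += sys.getsizeof(obj)
    return total
```

With this heuristic, an object shared between two tee branches is "external" when only one branch is measured, but "internal" when both are passed in together, matching the numbers in the session above.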
msg198130 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-09-20 06:04
Antoine: in (my experience of) memory analysis, the size of a single object is mostly irrelevant. If you need to know how much memory something consumes, you typically want to know the memory of a set of objects. So this is the case that really must be supported.

For that, users will have to use libraries that know how to count memory. It's not asking too much that the authors of such libraries know about internals of Python (such as the existence of sys.getsizeof, or gc.get_referents). The question is: can such a library be reasonably implemented? For that, it is important that getsizeof behaves uniformly across objects.

If you really don't like the proposed uniformity, please propose a different rule. However, don't give deviations in other places (OrderedDict) as a reason to break the rule here as well. Instead, if OrderedDict.__sizeof__ is broken, it needs to be fixed as well.
msg198178 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-09-21 00:12
> getsizeof() is interesting only if it gives sensible results 
> when used correctly, especially if you want to sum these values
> and get a global memory usage.

If accounting for global memory usage is a goal, it needs a much more comprehensively thought-out, implementation-dependent approach.  There are many issues (memory fragmentation, key-sharing dictionaries, dummy objects, list over-allocation, the minsize dictionary that is part of the dict object in addition to its variable-sized portion, non-Python objects held by Python objects, the extra few bytes per object consumed by the freelisting scheme in Objects/obmalloc.c, etc.).

> The thing is, "Total size" is generally meaningless. 

I concur.  This is a pipe dream without a serious investment of time and without creating a new and unnecessary maintenance burden.

> (By the way, OrderedDict.__sizeof__ already breaks the
> rule you are trying to impose)

FWIW, the way OrderedDict computes sizeof is probably typical of how anyone is currently using sys.getsizeof().   If you change the premise of how it operates, you're probably going to break the code written by the very few people in the world who care about sys.getsizeof():

    def __sizeof__(self):
        sizeof = _sys.getsizeof
        n = len(self) + 1                       # number of links including root
        size = sizeof(self.__dict__)            # instance dictionary
        size += sizeof(self.__map) * 2          # internal dict and inherited dict
        size += sizeof(self.__hardroot) * n     # link objects
        size += sizeof(self.__root) * n         # proxy objects
        return size

I don't have any specific recommendation for itertools.tee other than that I think it doesn't really need a __sizeof__ method.  The typical uses of tee are transient phenomena that temporarily use some memory and then disappear.  I'm not sure that any mid-stream sizeof checks reveal information of any worth.

Overall, this thread indicates that the entire concept of __sizeof__ has been poorly defined, unevenly implemented, and not really useful when aggregated.

For those who are interested in profiling and optimizing Python's memory usage, I think we would be much better off providing a memory allocator hook that can know about every memory allocation and how those allocations have been arranged (revealing the fragmentation of the unused memory in the spaces between).  Almost anything short of that will provide a grossly misleading picture of memory usage.
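For reference, an allocator-level hook along the lines Raymond describes did materialize as the tracemalloc module (PEP 454, added in Python 3.4): it instruments the raw allocators and reports bytes actually allocated, rather than per-object estimates. A minimal illustration:

```python
import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

data = [list(range(100)) for _ in range(1000)]  # allocate some real memory

after, peak = tracemalloc.get_traced_memory()
print("allocated:", after - before, "bytes; peak:", peak)
tracemalloc.stop()
```

Unlike summing sys.getsizeof() over a graph, this measures every allocation the interpreter makes, including ones no __sizeof__ method accounts for.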
History
Date User Action Args
2014-05-30 09:47:38rhettingersetstatus: open -> closed
resolution: rejected
2013-09-21 00:12:26rhettingersetmessages: + msg198178
2013-09-20 23:06:45rhettingersetassignee: rhettinger
versions: - Python 3.3
2013-09-20 06:04:27loewissetmessages: + msg198130
2013-09-20 00:29:42serhiy.storchakasetfiles: + gettotalsizeof.py

messages: + msg198124
2013-09-19 18:45:56pitrousetmessages: + msg198114
2013-09-19 18:44:07serhiy.storchakasetmessages: + msg198113
2013-09-19 18:39:55pitrousetmessages: + msg198111
2013-09-19 18:32:32serhiy.storchakasetmessages: + msg198110
2013-09-19 18:19:38pitrousetmessages: + msg198108
2013-09-19 18:17:46pitrousetmessages: + msg198107
2013-09-19 18:15:57pitrousetmessages: + msg198106
2013-09-19 18:11:29serhiy.storchakasetmessages: + msg198105
2013-09-19 18:09:24pitrousetmessages: + msg198104
2013-09-19 18:06:32loewissetmessages: + msg198102
2013-09-19 18:05:25pitrousetmessages: + msg198101
2013-09-19 17:47:50amaury.forgeotdarcsetmessages: + msg198098
2013-09-19 17:06:06pitrousetmessages: + msg198094
2013-09-19 16:24:40serhiy.storchakasetmessages: + msg198093
2013-09-19 15:56:10pitrousetmessages: + msg198089
2013-09-19 15:47:17amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg198087
2013-09-19 14:02:21pitrousetmessages: + msg198074
2013-09-19 13:39:52serhiy.storchakasetnosy: + loewis
messages: + msg198070
2013-09-19 13:03:47pitrousetmessages: + msg198065
2013-09-19 11:48:13serhiy.storchakasetmessages: + msg198061
2013-09-19 11:05:41pitrousetmessages: + msg198055
2013-09-19 11:02:04serhiy.storchakasetmessages: + msg198054
2013-09-19 10:50:20serhiy.storchakasetstage: needs patch -> patch review
2013-09-19 10:49:20serhiy.storchakasetsuperseder: Correct __sizeof__ support for itertools ->
messages: + msg198053
2013-09-19 10:43:16serhiy.storchakasetsuperseder: Correct __sizeof__ support for itertools
messages: + msg198052
2013-09-19 10:06:09pitrousetfiles: + tee_sizeof.patch
keywords: + patch
messages: + msg198051
2013-09-19 10:04:12pitrousetmessages: + msg198050
2013-09-19 09:19:50pitroucreate