Issue19048
This issue tracker has been migrated to GitHub, and is currently read-only. For more information, see the GitHub FAQs in Python's Developer Guide.
Created on 2013-09-19 09:19 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files

File name | Uploaded | Description | Edit
---|---|---|---
tee_sizeof.patch | pitrou, 2013-09-19 10:06 | | review
gettotalsizeof.py | serhiy.storchaka, 2013-09-20 00:29 | |
Messages (30)
msg198048 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 09:19 |
An itertools.tee object can cache an arbitrary number of objects (pointers), but its sys.getsizeof() value will always remain the same. |
|||
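[Editorial note: the behaviour reported above can be reproduced with a short sketch. The exact byte counts are platform- and version-dependent, so only the invariant is shown; this snippet is illustrative and not part of the original report.]

```python
# Sketch reproducing the report: a tee object's reported size stays
# constant even while its internal cache grows.
import sys
import itertools

a, b = itertools.tee(range(10000))
size_empty = sys.getsizeof(a)

# Advance `a` far ahead of `b`; the tee machinery must now cache
# thousands of pending items so that `b` can still yield them.
for _ in range(5000):
    next(a)

size_full = sys.getsizeof(a)
print(size_empty == size_full)  # the reported size did not grow
```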
msg198050 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 10:04 |
I find the implementation of itertools.tee a bit weird: why does teedataobject have to be a PyObject? It seems to complicate things and make them less optimal. |
|||
msg198051 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 10:06 |

Anyway, here is a patch. |
|||
msg198052 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 10:43 |
This is a duplicate of issue15475. |
|||
msg198053 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 10:49 |

Hmm, no, I'm wrong, it's not a duplicate. |
|||
msg198054 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 11:02 |

I'm not sure that sys.getsizeof() should recursively count all Python subobjects. That is why I had omitted tee() in my patch.

>>> sys.getsizeof([[]])
36
>>> sys.getsizeof([list(range(10000))])
36
|||
msg198055 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 11:05 |

> I'm not sure that sys.getsizeof() should recursively count all Python
> subobjects.

Those are private subobjects. They are not visible to the programmer (except perhaps by calling __reduce__ or __setstate__).
|||
msg198061 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 11:48 |

They are visible by calling gc.get_referents(). A high-level function can use this to count the recursive size of objects.

>>> import sys, gc, itertools
>>> def gettotalsizeof(*args, seen=None):
...     if seen is None:
...         seen = {}
...     sum = 0
...     for obj in args:
...         if id(obj) not in seen:
...             seen[id(obj)] = obj
...             sum += sys.getsizeof(obj)
...             sum += gettotalsizeof(*gc.get_referents(obj), seen=seen)
...     return sum
...
>>> a, b = itertools.tee(range(10000))
>>> sum(next(a) for i in range(1000))
499500
>>> gettotalsizeof(a)
750
>>> gettotalsizeof(b)
18734
|||
msg198065 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 13:03 |

> They are visible by calling gc.get_referents().

That's totally beside the point. The point is that those objects are invisible in normal conditions, not that they can't be read using advanced implementation-dependent tricks.
|||
msg198070 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 13:39 |

The point is that your patch breaks functions like gettotalsizeof(). It makes it impossible to get the total size of a general object. It would be better to add gettotalsizeof() to the stdlib (or add an optional parameter to sys.getsizeof() for recursive counting).
|||
msg198074 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 14:02 |

> The point is that your patch breaks functions like gettotalsizeof().
> It makes it impossible to get the total size of a general object.

The thing is, "total size" is generally meaningless. It can include things such as the object's type, or anything transitively referenced by the object, such as modules.

> It would be better to add gettotalsizeof() to the stdlib (or add an
> optional parameter to sys.getsizeof() for recursive counting).

This patch has *nothing* to do with recursive counting. It counts the internal arrays of itertools.tee() as part of its memory size, which is reasonable and expected. It does *not* count memory recursively: it doesn't count the size of itertools.tee()'s cached objects, for example. Recursive counting doesn't make sense in Python. Where do you stop counting?
|||
msg198087 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2013-09-19 15:47 |

I like the definition of __sizeof__ that was discussed some time ago: http://bugs.python.org/issue14520#msg157798 With that definition (do we have it somewhere in the docs, by the way?) the current code gives the correct answer.
|||
msg198089 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 15:56 |

> I like the definition of __sizeof__ that was discussed some time ago:
> http://bugs.python.org/issue14520#msg157798

The problem is that that definition isn't helpful. If we ever change itertools.tee to use non-PyObjects internally, suddenly its sys.getsizeof() would have to return much larger numbers despite visible behaviour not having changed at all (and despite the memory overhead being actually lower). And gc.get_referents() is really a low-level debugging tool, certainly not a "reflection API" (inspect would serve that role).
|||
msg198093 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 16:24 |
Isn't sys.getsizeof() a low-level debugging tool? |
|||
msg198094 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 17:06 |

> Isn't sys.getsizeof() a low-level debugging tool?

What would it help debug exactly? :-) I would hope it gives remotely useful information about the passed object.
|||
msg198098 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2013-09-19 17:47 |

getsizeof() is interesting only if it gives sensible results when used correctly, especially if you want to sum these values and get a global memory usage.

One usage is to traverse objects through gc.get_referents(); in this case the definition above is correct. Now, are you suggesting to traverse objects differently? With dir(), or __dict__?

(btw, this discussion explains why pypy still does not implement getsizeof())
|||
msg198101 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:05 |

> getsizeof() is interesting only if it gives sensible results when used
> correctly, especially if you want to sum these values and get a global
> memory usage.

"Getting a global memory usage" isn't a correct use of getsizeof(), though, because it totally ignores the memory allocation overhead (not to mention fragmentation, or any memory areas that may have been allocated without being accounted for by __sizeof__). If you want global Python memory usage, use sys._debugmallocstats(), not sys.getsizeof().

> One usage is to traverse objects through gc.get_referents(); in this
> case the definition above is correct.

What are the intended semantics? get_referents() can give you references you didn't expect, such as type objects, module objects...

> Now, are you suggesting to traverse objects differently? With dir(),
> or __dict__?

sys.getsizeof() gives you the memory usage of a given Python object; it doesn't guarantee that "traversing objects" will give you the right answer for any question.
|||
msg198102 - Author: Martin v. Löwis (loewis) * | Date: 2013-09-19 18:06 |

> The problem is that that definition isn't helpful.
> If we ever change itertools.tee to use non-PyObjects internally,
> suddenly its sys.getsizeof() would have to return much larger numbers
> despite visible behaviour not having changed at all (and despite the
> memory overhead being actually lower).

I see no problem with that. If the internal representation changes, nobody should be surprised if sizeof changes.

> I would hope it gives remotely useful information about the passed object.

It certainly does: it reports the memory consumption of the object itself, not counting the memory of other objects. I proposed a precise definition of what "an other object" is. If you don't like it, please propose a different definition that still allows automatically summing up the memory of a graph of objects.
|||
msg198104 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:09 |

> I see no problem with that. If the internal representation changes,
> nobody should be surprised if sizeof changes.

Who is "nobody"? Users aren't aware of internal representation changes. It sounds like you want sys.getsizeof() to be a tool for language implementors anyway.

> I proposed a precise definition of what "an other object" is. If you don't like it,
> please propose a different definition that still allows automatically summing up the
> memory of a graph of objects.

What is the use case for "summing up the memory of a graph of objects"? How do you stop walking your graph if it spans the whole graph of Python objects?
|||
msg198105 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 18:11 |

Well. itertools._tee is one Python object and itertools._tee_dataobject is another Python object. sys.getsizeof() gives you the memory usage of these objects separately.
|||
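[Editorial note: the two objects Storchaka describes can indeed be measured separately. A sketch, using gc.get_referents() to reach the private teedataobject; byte counts vary by platform, so only the relationship is shown. This snippet is illustrative and not part of the original thread.]

```python
# Sketch: a tee object and the teedataobject it references are
# separate Python objects, each with its own getsizeof() value.
import sys
import gc
import itertools

a, b = itertools.tee(range(100))
next(a)  # make sure the shared teedataobject cache exists

# gc.get_referents() exposes the otherwise-private internals.
internals = gc.get_referents(a)
combined = sys.getsizeof(a) + sum(sys.getsizeof(o) for o in internals)
print(combined > sys.getsizeof(a))  # the internals add measurable size
```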
msg198106 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:15 |

> Well. itertools._tee is one Python object and
> itertools._tee_dataobject is another Python object. sys.getsizeof()
> gives you the memory usage of these objects separately.

This is great... And how do I know that I need to use gc.get_referents() to get those objects in case I'm measuring the memory consumption of a teeobject (rather than, say, trusting __dict__, or simply taking the getsizeof() output at face value)? If sys.getsizeof() is only useful for people who *already* know how an object is implemented internally, then it's actually useless, because those people can just as well do the calculation themselves.
|||
msg198107 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:17 |

Here is why using get_referents() is stupid in the general case:

>>> class C: pass
...
>>> c = C()
>>> gc.get_referents(c)
[<class '__main__.C'>]

With your method, measuring c's memory consumption also includes the memory consumption of its type object. (And of course this is only a trivial example... one can only imagine what kind of mess it is with a non-trivial object.)
|||
msg198108 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:19 |
(By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose) |
|||
msg198110 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 18:32 |

> How do you stop walking your graph if it spans the whole graph of Python objects?

We can stop at specific types of objects (for example types and modules).

> If sys.getsizeof() is only useful for people who *already* know how an object is implemented internally, then it's actually useless, because those people can just as well do the calculation themselves.

That's why sys.getsizeof() is a low-level tool. We need a high-level tool in the stdlib. Even imperfect recursive counting would be better than sys.getsizeof(), which is confusing for novices.

> (By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)

Yes, I know, and I think it is wrong. Here is an improved version of gettotalsizeof():

def gettotalsizeof(*args, exclude_types=(type, type(sys))):
    seen = {}
    stack = []
    for obj in args:
        if id(obj) not in seen:
            seen[id(obj)] = obj
            stack.append(obj)
    sum = 0
    while stack:
        obj = stack.pop()
        sum += sys.getsizeof(obj)
        for obj in gc.get_referents(obj):
            if id(obj) not in seen and not isinstance(obj, exclude_types):
                seen[id(obj)] = obj
                stack.append(obj)
    return sum

>>> gettotalsizeof(sys)
206575
>>> gettotalsizeof(gc)
2341
>>> gettotalsizeof(sys.getsizeof)
60
>>> gettotalsizeof(gettotalsizeof)
60854
>>> class C: pass
...
>>> gettotalsizeof(C)
805
>>> gettotalsizeof(C())
28
|||
msg198111 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:39 |

> That's why sys.getsizeof() is a low-level tool. We need a high-level tool
> in the stdlib. Even imperfect recursive counting would be better than
> sys.getsizeof(), which is confusing for novices.

Ok, but I need to see a satisfying version of "gettotalsizeof" before I'm convinced (see below).

> Here is an improved version of gettotalsizeof():
> [...]
> >>> gettotalsizeof(gettotalsizeof)
> 60854

Why that big? Does it make sense? What if, say, a large object is "shared" between many small objects? Should it count towards the memory size of any of those small objects? What if that object is actually immortal (it is also a module global, for example)?
|||
msg198113 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-19 18:44 |

Optionally we could also not count objects which are referenced from outside of the graph of objects (this isn't so easy to implement in Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True) would count only the bare list and the square root of 22, because 1 and 'abc' are interned.
|||
msg198114 - Author: Antoine Pitrou (pitrou) * | Date: 2013-09-19 18:45 |

> Optionally we could also not count objects which are referenced from
> outside of the graph of objects (this isn't so easy to implement in
> Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True)
> would count only the bare list and the square root of 22, because 1
> and 'abc' are interned.

That's only part of the equation. What if I have an object which references, for example, a logging.Logger? Loggers are actually eternal (they live in a global dictionary somewhere in the logging module), but gettotalsizeof() will still count it.
|||
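[Editorial note: the shared-reference problem Pitrou raises can be demonstrated directly. The naive_total helper below is hypothetical, written only to illustrate how measuring containers one at a time double-counts a shared object; it is not code from the thread.]

```python
# Sketch of the double-counting problem: a naive recursive walk
# attributes a shared object's size to every container measured.
import sys
import gc

def naive_total(obj, seen=None):
    """Hypothetical recursive size helper, in the spirit of gettotalsizeof."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    return sys.getsizeof(obj) + sum(
        naive_total(ref, seen) for ref in gc.get_referents(obj))

big = list(range(100000))   # one large shared object
holder1 = [big]
holder2 = [big]

# Measured one at a time, each holder appears to "own" the big list...
separate = naive_total(holder1) + naive_total(holder2)
# ...but measured together, the shared list is counted only once.
together = naive_total([holder1, holder2])
print(separate > together)
```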
msg198124 - Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2013-09-20 00:29 |

Here is an advanced function which counts only objects to which there are no external references.

>>> import itertools
>>> a, b = itertools.tee(range(10000))
>>> max(zip(a, range(100)))
(99, 99)
>>> sys.getsizeof(a)
32
>>> gettotalinnersizeof(a)
32
>>> gettotalinnersizeof(b)
292
>>> gettotalinnersizeof(a, b)
608

The total size of a and b is larger than the sum of their separate sizes. That is because it includes the size of one teedataobject shared between a and b, and one shared range iterator.
|||
msg198130 - Author: Martin v. Löwis (loewis) * | Date: 2013-09-20 06:04 |

Antoine: in (my experience of) memory analysis, the size of a single object is mostly irrelevant. If you need to know how much memory something consumes, you typically want to know the memory of a set of objects. So this is the case that really must be supported. For that, users will have to use libraries that know how to count memory. It's not asking too much that the authors of such libraries know about internals of Python (such as the existence of sys.getsizeof, or gc.get_referents).

The question is: can such a library be reasonably implemented? For that, it is important that getsizeof behaves uniformly across objects. If you really don't like the proposed uniformity, please propose a different rule. However, don't give deviations in other places (OrderedDict) as a reason to break the rule here as well. Instead, if OrderedDict.__sizeof__ is broken, it needs to be fixed as well.
|||
msg198178 - Author: Raymond Hettinger (rhettinger) * | Date: 2013-09-21 00:12 |

> getsizeof() is interesting only if it gives sensible results
> when used correctly, especially if you want to sum these values
> and get a global memory usage.

If accounting for global memory usage is a goal, it needs a much more comprehensively thought-out, implementation-dependent approach. There are many issues (memory fragmentation, key-sharing dictionaries, dummy objects, list over-allocation, the minsize dictionary that is part of the dict object in addition to its variable-sized portion, non-Python objects held by Python objects, the extra few bytes per object consumed by the freelisting scheme in Objects/obmalloc.c, etc.).

> The thing is, "total size" is generally meaningless.

I concur. This is a pipe dream without a serious investment of time and without creating a new and unnecessary maintenance burden.

> (By the way, OrderedDict.__sizeof__ already breaks the
> rule you are trying to impose)

FWIW, the way OrderedDict computes sizeof is probably typical of how anyone is currently using sys.getsizeof(). If you change the premise of how it operates, you're probably going to break the code written by the very few people in the world who care about sys.getsizeof():

def __sizeof__(self):
    sizeof = _sys.getsizeof
    n = len(self) + 1                    # number of links including root
    size = sizeof(self.__dict__)         # instance dictionary
    size += sizeof(self.__map) * 2       # internal dict and inherited dict
    size += sizeof(self.__hardroot) * n  # link objects
    size += sizeof(self.__root) * n      # proxy objects
    return size

I don't have any specific recommendation for itertools.tee other than that I think it doesn't really need a __sizeof__ method. The typical uses of tee are a transient phenomenon that temporarily uses some memory and then disappears. I'm not sure that any mid-stream sizeof checks reveal information of any worth. Overall, this thread indicates that the entire concept of __sizeof__ has been poorly defined, unevenly implemented, and not really useful when aggregated.

For those who are interested in profiling and optimizing Python's memory usage, I think we would be much better off providing a memory allocator hook that can know about every memory allocation and how those allocations have been arranged (revealing the fragmentation of the unused memory in the spaces between). Almost anything short of that would provide a grossly misleading picture of memory usage.
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:57:51 | admin | set | github: 63248 |
2014-05-30 09:47:38 | rhettinger | set | status: open -> closed resolution: rejected |
2013-09-21 00:12:26 | rhettinger | set | messages: + msg198178 |
2013-09-20 23:06:45 | rhettinger | set | assignee: rhettinger versions: - Python 3.3 |
2013-09-20 06:04:27 | loewis | set | messages: + msg198130 |
2013-09-20 00:29:42 | serhiy.storchaka | set | files: + gettotalsizeof.py messages: + msg198124 |
2013-09-19 18:45:56 | pitrou | set | messages: + msg198114 |
2013-09-19 18:44:07 | serhiy.storchaka | set | messages: + msg198113 |
2013-09-19 18:39:55 | pitrou | set | messages: + msg198111 |
2013-09-19 18:32:32 | serhiy.storchaka | set | messages: + msg198110 |
2013-09-19 18:19:38 | pitrou | set | messages: + msg198108 |
2013-09-19 18:17:46 | pitrou | set | messages: + msg198107 |
2013-09-19 18:15:57 | pitrou | set | messages: + msg198106 |
2013-09-19 18:11:29 | serhiy.storchaka | set | messages: + msg198105 |
2013-09-19 18:09:24 | pitrou | set | messages: + msg198104 |
2013-09-19 18:06:32 | loewis | set | messages: + msg198102 |
2013-09-19 18:05:25 | pitrou | set | messages: + msg198101 |
2013-09-19 17:47:50 | amaury.forgeotdarc | set | messages: + msg198098 |
2013-09-19 17:06:06 | pitrou | set | messages: + msg198094 |
2013-09-19 16:24:40 | serhiy.storchaka | set | messages: + msg198093 |
2013-09-19 15:56:10 | pitrou | set | messages: + msg198089 |
2013-09-19 15:47:17 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg198087 |
2013-09-19 14:02:21 | pitrou | set | messages: + msg198074 |
2013-09-19 13:39:52 | serhiy.storchaka | set | nosy: + loewis messages: + msg198070 |
2013-09-19 13:03:47 | pitrou | set | messages: + msg198065 |
2013-09-19 11:48:13 | serhiy.storchaka | set | messages: + msg198061 |
2013-09-19 11:05:41 | pitrou | set | messages: + msg198055 |
2013-09-19 11:02:04 | serhiy.storchaka | set | messages: + msg198054 |
2013-09-19 10:50:20 | serhiy.storchaka | set | stage: needs patch -> patch review |
2013-09-19 10:49:20 | serhiy.storchaka | set | superseder: Correct __sizeof__ support for itertools -> messages: + msg198053 |
2013-09-19 10:43:16 | serhiy.storchaka | set | superseder: Correct __sizeof__ support for itertools messages: + msg198052 |
2013-09-19 10:06:09 | pitrou | set | files: + tee_sizeof.patch keywords: + patch messages: + msg198051 |
2013-09-19 10:04:12 | pitrou | set | messages: + msg198050 |
2013-09-19 09:19:50 | pitrou | create |