itertools.tee doesn't have a __sizeof__ method #63248
Comments
An itertools.tee object can cache an arbitrary number of objects (pointers), but its sys.getsizeof() value will always remain the same.
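A quick way to observe the reported behavior (a sketch; exact byte counts vary by platform and Python version):

import sys, itertools

a, b = itertools.tee(range(10000))
for _ in range(1000):
    next(a)               # items consumed from a are cached for b

# Roughly a thousand items are now queued for b, but the reported size
# of the tee object itself does not change:
print(sys.getsizeof(b))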
I find the implementation of itertools.tee a bit weird: why does teedataobject have to be a PyObject? It seems to complicate things and make them less efficient.
Anyway, here is a patch.
This is a duplicate of bpo-15475.
Hmm, no, I'm wrong, it's not a duplicate.
I'm not sure that sys.getsizeof() should recursively count all Python subobjects. That is why I had omitted tee() in my patch.

>>> sys.getsizeof([[]])
36
>>> sys.getsizeof([list(range(10000))])
36
Those are private subobjects. They are not visible to the programmer.
They are visible by calling gc.get_referents(). A high-level function can use this to compute the recursive size of objects.

>>> import sys, gc, itertools
>>> def gettotalsizeof(*args, seen=None):
...     if seen is None:
...         seen = {}
...     sum = 0
...     for obj in args:
...         if id(obj) not in seen:
...             seen[id(obj)] = obj
...             sum += sys.getsizeof(obj)
...             sum += gettotalsizeof(*gc.get_referents(obj), seen=seen)
...     return sum
...
>>> a, b = itertools.tee(range(10000))
>>> sum(next(a) for i in range(1000))
499500
>>> gettotalsizeof(a)
750
>>> gettotalsizeof(b)
18734
That's totally beside the point. The point is that those […]
The point is that your patch breaks functions like gettotalsizeof(). It makes it impossible to get the total size of a general object. It would be better to add gettotalsizeof() to the stdlib (or add an optional parameter to sys.getsizeof() for recursive counting).
The thing is, "total size" is generally meaningless. It can include […]
This patch has *nothing* to do with recursive counting. It counts the memory privately held by the tee object. Recursive counting doesn't make sense with Python. Where do you stop […]
I like the definition of __sizeof__ that was discussed some time ago. With that definition (do we have it somewhere in the docs, by the way?) […]
The problem is that that definition isn't helpful. And gc.get_referents() is really a low-level debugging tool, certainly […]
Isn't sys.getsizeof() a low-level debugging tool?
What would it help debug exactly? :-)
getsizeof() is interesting only if it gives sensible results when used correctly, especially if you want to sum these values to get a global memory usage. One usage is to traverse objects through gc.get_referents(); in this case the definition above is correct. Now, are you suggesting traversing objects differently? With dir(), or __dict__? (By the way, this discussion explains why PyPy still does not implement getsizeof().)
"Getting a global memory usage" isn't a correct use of getsizeof(), If you want global Python memory usage, use sys._debugmallocstats(), not
What are the intended semantics? get_referents() can give you references […]
sys.getsizeof() gives you the memory usage of a given Python object; it […]
I see no problem with that. If the internal representation changes, nobody should be surprised if sizeof changes.
It certainly does: it reports the memory consumption of the object itself, […] I proposed a precise definition of what "an other object" is. If you don't like it, […]
Who is "nobody"? Users aren't aware of internal representation changes.
What is the use case for "summing up the memory of a graph of objects"?
Well. itertools._tee is one Python object and itertools._tee_dataobject is another Python object. sys.getsizeof() gives you the memory usage of these objects separately.
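A quick way to see the two objects involved (a sketch; the exact repr and sizes vary by version):

import sys, gc, itertools

a, b = itertools.tee('abcdef')
# The tee object and its internal data object are distinct Python objects;
# the shared data object shows up among the tee object's referents:
print(gc.get_referents(a))
print(sys.getsizeof(a))   # the size of the tee object alone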
This is great... And how do I know that I need to use gc.get_referents()? If sys.getsizeof() is only useful for people who *already* know how an […]
Here is why using get_referents() is stupid in the general case:

>>> class C: pass
...
>>> c = C()
>>> gc.get_referents(c)
[<class '__main__.C'>]

With your method, measuring c's memory consumption also includes the memory consumption of its type object.
(By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose.)
We can stop at specific types of objects (for example types and modules).
It's why sys.getsizeof() is a low-level tool. We need a high-level tool in the stdlib. Even imperfect recursive counting would be better than sys.getsizeof(), which is confusing for novices.
Yes, I know, and I think it is wrong. Here is an improved version of gettotalsizeof():

def gettotalsizeof(*args, exclude_types=(type, type(sys))):
    seen = {}
    stack = []
    for obj in args:
        if id(obj) not in seen:
            seen[id(obj)] = obj
            stack.append(obj)
    sum = 0
    while stack:
        obj = stack.pop()
        sum += sys.getsizeof(obj)
        for obj in gc.get_referents(obj):
            if id(obj) not in seen and not isinstance(obj, exclude_types):
                seen[id(obj)] = obj
                stack.append(obj)
    return sum

>>> gettotalsizeof(sys)
206575
>>> gettotalsizeof(gc)
2341
>>> gettotalsizeof(sys.getsizeof)
60
>>> gettotalsizeof(gettotalsizeof)
60854
>>> class C: pass
...
>>> gettotalsizeof(C)
805
>>> gettotalsizeof(C())
28
Ok, but I need to see a satisfying version of "gettotalsizeof" before […]

> Here is an improved version of gettotalsizeof():
>
> [...]
> >>> gettotalsizeof(gettotalsizeof)
> 60854

Why that big? Does it make sense? What if, say, a large object is "shared" between many small objects?
Optionally we can also not count objects which are referenced from outside of the graph of objects (this isn't so easy to implement in Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True) would count only the bare list and the square root of 22, because 1 and 'abc' are interned.
That's only part of the equation. What if I have an object which […]
Here is an advanced function which counts only objects to which there are no external references.

>>> import itertools
>>> a, b = itertools.tee(range(10000))
>>> max(zip(a, range(100)))
(99, 99)
>>> sys.getsizeof(a)
32
>>> gettotalinnersizeof(a)
32
>>> gettotalinnersizeof(b)
292
>>> gettotalinnersizeof(a, b)
608

The total size of a and b together is larger than the sum of their separate sizes because it includes the teedataobject shared between a and b, and the one shared range iterator.
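The definition of gettotalinnersizeof() is not included in the thread. Below is one possible reconstruction of the idea, iterating to a fixed point over reference counts; the function name, the exclude_types default, and all internals here are assumptions, not the author's actual code:

import sys, gc
from collections import Counter

def gettotalinnersizeof(*args, exclude_types=(type, type(sys))):
    # Sum the sizes of the roots plus every reachable object whose
    # references all come from inside the counted set.
    objs = {id(obj): obj for obj in args}

    def collect():
        # Depth-first collection of everything reachable from the roots.
        stack = list(objs.values())
        while stack:
            for ref in gc.get_referents(stack.pop()):
                if id(ref) not in objs and not isinstance(ref, exclude_types):
                    objs[id(ref)] = ref
                    stack.append(ref)
    collect()

    roots = {id(obj) for obj in args}
    counted = set(objs)
    changed = True
    while changed:
        changed = False
        # How many references into each object come from objects still counted.
        internal = Counter(id(r) for oid in counted
                           for r in gc.get_referents(objs[oid]))
        for oid in list(counted):
            if oid in roots:
                continue          # the arguments themselves are always counted
            obj = objs[oid]
            # Subtract bookkeeping references: getrefcount's own argument,
            # the objs dict value, and the local variable obj.
            if sys.getrefcount(obj) - 3 - internal[oid] > 0:
                counted.discard(oid)   # externally referenced: don't count it
                changed = True
    return sum(sys.getsizeof(objs[oid]) for oid in counted)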
Antoine: in (my experience of) memory analysis, the size of a single object is mostly irrelevant. If you need to know how much memory something consumes, you typically want to know the memory of a set of objects. So this is the case that really must be supported.

For that, users will have to use libraries that know how to count memory. It's not asking too much that the authors of such libraries know about internals of Python (such as the existence of sys.getsizeof, or gc.get_referents). The question is: can such a library reasonably be implemented? For that, it is important that getsizeof behaves uniformly across objects.

If you really don't like the proposed uniformity, please propose a different rule. However, don't give deviations in other places (OrderedDict) as a reason to break the rule here as well. Instead, if OrderedDict.__sizeof__ is broken, it needs to be fixed as well.
If accounting for global memory usage is a goal, it needs a much more comprehensively thought out, implementation-dependent approach. There are many issues (memory fragmentation, key-sharing dictionaries, dummy objects, list over-allocation, the minsize dictionary that is part of the dict object in addition to its variable-sized portion, non-Python objects held by Python objects, the extra few bytes per object consumed by the freelisting scheme in Objects/obmalloc.c, etc.).
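As one concrete illustration of the over-allocation point (a sketch; the exact numbers depend on the CPython version and platform):

import sys

xs = []
for i in range(20):
    xs.append(i)
    # getsizeof reports the over-allocated capacity, which grows in jumps,
    # not the logical size of the list's contents.
    print(len(xs), sys.getsizeof(xs))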
I concur. This is a pipe dream without a serious investment of time and without creating a new and unnecessary maintenance burden.
FWIW, the way OrderedDict computes sizeof is probably typical of how anyone is currently using sys.getsizeof(). If you change the premise of how it operates, you're probably going to break the code written by the very few people in the world who care about sys.getsizeof():

def __sizeof__(self):
    sizeof = _sys.getsizeof
    n = len(self) + 1                    # number of links including root
    size = sizeof(self.__dict__)         # instance dictionary
    size += sizeof(self.__map) * 2       # internal dict and inherited dict
    size += sizeof(self.__hardroot) * n  # link objects
    size += sizeof(self.__root) * n      # proxy objects
    return size

I don't have any specific recommendation for itertools.tee other than that I think it doesn't really need a __sizeof__ method. The typical uses of tee are a transient phenomenon that temporarily uses some memory and then disappears. I'm not sure that any mid-stream sizeof checks reveal information of any worth.

Overall, this thread indicates that the entire concept of __sizeof__ has been poorly defined, unevenly implemented, and not really useful when aggregated. For those who are interested in profiling and optimizing Python's memory usage, I think we would be much better off providing a memory allocator hook that can know about every memory allocation and how those allocations have been arranged (revealing the fragmentation of the unused memory in the spaces between). Almost anything short of that will provide a grossly misleading picture of memory usage.
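An allocator hook along these lines later landed as the tracemalloc module (Python 3.4). A minimal sketch of how it can be used to measure what a pair of tee iterators actually allocates; the printed numbers are illustrative only:

import tracemalloc, itertools

tracemalloc.start()
a, b = itertools.tee(range(10000))
for _ in range(1000):
    next(a)                 # force the shared cache to grow
current, peak = tracemalloc.get_traced_memory()
print(current, peak)        # bytes currently traced / peak since start()
tracemalloc.stop()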