
itertools.tee doesn't have a __sizeof__ method #63248

Closed
pitrou opened this issue Sep 19, 2013 · 30 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@pitrou
Member

pitrou commented Sep 19, 2013

BPO 19048
Nosy @loewis, @rhettinger, @amauryfa, @pitrou, @serhiy-storchaka
Files
  • tee_sizeof.patch
  • gettotalsizeof.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2014-05-30.09:47:38.703>
    created_at = <Date 2013-09-19.09:19:50.648>
    labels = ['type-bug', 'library']
    title = "itertools.tee doesn't have a __sizeof__ method"
    updated_at = <Date 2014-05-30.09:47:38.702>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2014-05-30.09:47:38.702>
    actor = 'rhettinger'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2014-05-30.09:47:38.703>
    closer = 'rhettinger'
    components = ['Library (Lib)']
    creation = <Date 2013-09-19.09:19:50.648>
    creator = 'pitrou'
    dependencies = []
    files = ['31813', '31822']
    hgrepos = []
    issue_num = 19048
    keywords = ['patch']
    message_count = 30.0
    messages = ['198048', '198050', '198051', '198052', '198053', '198054', '198055', '198061', '198065', '198070', '198074', '198087', '198089', '198093', '198094', '198098', '198101', '198102', '198104', '198105', '198106', '198107', '198108', '198110', '198111', '198113', '198114', '198124', '198130', '198178']
    nosy_count = 5.0
    nosy_names = ['loewis', 'rhettinger', 'amaury.forgeotdarc', 'pitrou', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19048'
    versions = ['Python 3.4']

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    An itertools.tee object can cache an arbitrary number of objects (pointers), but its sys.getsizeof() value will always remain the same.
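[Editor's note] A minimal sketch of the behavior being reported, assuming CPython's tee implementation: advancing one tee far ahead forces the other to buffer thousands of pending items, yet `sys.getsizeof()` of the lagging tee reports the same fixed size throughout.

```python
# Sketch (CPython-specific): the lagging tee's reported size never grows,
# no matter how many items its internal buffer is caching.
import sys
import itertools

a, b = itertools.tee(range(10000))
size_before = sys.getsizeof(b)
for _ in range(5000):
    next(a)                      # b's side now caches ~5000 pending items
size_after = sys.getsizeof(b)
# size_before == size_after: the cached items are invisible to getsizeof().
```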

    @pitrou pitrou added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Sep 19, 2013
    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I find the implementation of itertools.tee a bit weird: why does teedataobject have to be a PyObject? It seems to complicate things and make them less optimal.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

Anyway, here is a patch.

    @serhiy-storchaka
    Member

    This is a duplicate of bpo-15475.

    @serhiy-storchaka
    Member

Hmm, no, I'm wrong, it's not a duplicate.

    @serhiy-storchaka
    Member

    I'm not sure that sys.getsizeof() should recursively count all Python subobjects. That is why I had omitted tee() in my patch.

    >>> sys.getsizeof([[]])
    36
    >>> sys.getsizeof([list(range(10000))])
    36

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I'm not sure that sys.getsizeof() should recursively count all Python
    subobjects.

    Those are private subobjects. They are not visible to the programmer
    (except perhaps by calling __reduce__ or __setstate__).

    @serhiy-storchaka
    Member

They are visible by calling gc.get_referents(). A high-level function can use this to compute the recursive size of objects.

    >>> import sys, gc, itertools
    >>> def gettotalsizeof(*args, seen=None):
    ...     if seen is None:
    ...         seen = {}
    ...     sum = 0
    ...     for obj in args:
    ...         if id(obj) not in seen:
    ...             seen[id(obj)] = obj
    ...             sum += sys.getsizeof(obj)
    ...             sum += gettotalsizeof(*gc.get_referents(obj), seen=seen)
    ...     return sum
    ... 
    >>> a, b = itertools.tee(range(10000))
    >>> sum(next(a) for i in range(1000))
    499500
    >>> gettotalsizeof(a)
    750
    >>> gettotalsizeof(b)
    18734

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

They are visible by calling gc.get_referents().

That's totally beside the point. The point is that those
    objects are invisible in normal conditions, not that they can't
    be read using advanced implementation-dependent tricks.

    @serhiy-storchaka
    Member

The point is that your patch breaks functions like gettotalsizeof(). It makes it impossible to get the total size of a general object.

It would be better to add gettotalsizeof() to the stdlib (or add an optional parameter to sys.getsizeof() for recursive counting).

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

The point is that your patch breaks functions like gettotalsizeof().
It makes it impossible to get the total size of a general object.

    The thing is, "Total size" is generally meaningless. It can include
    things such as the object's type, or anything transitively referenced
    by the object, such as modules.

    It will be better to add gettotalsizeof() to the stdlib (or add an
    optional parameter to sys.getsizeof() for recursive counting).

    This patch has *nothing* to do with recursive counting. It counts
    the internal arrays of itertools.tee() as part of its memory size,
    which is reasonable and expected. It does *not* count memory recursively:
    it doesn't count the size of the itertools.tee()'s cached objects,
    for example.

    Recursive counting doesn't make sense with Python. Where do you stop
    counting?
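[Editor's note] A small sketch of the "where do you stop?" problem: following gc.get_referents() from an ordinary function immediately reaches its module's globals dict, so a naive recursive walk soon spans most of the interpreter.

```python
# Sketch: gc.get_referents() on a plain function includes f.__globals__,
# the entire module namespace -- a natural stopping point is not obvious.
import gc

def f():
    pass

refs = gc.get_referents(f)
# The module globals dict is among the referents of any ordinary function.
has_globals = any(r is f.__globals__ for r in refs)
```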

    @amauryfa
    Member

    I like the definition of __sizeof__ that was discussed some time ago:
    http://bugs.python.org/issue14520#msg157798

With that definition (do we have it somewhere in the docs, by the way?)
the current code gives the correct answer.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I like the definition of __sizeof__ that was discussed some time ago:
    http://bugs.python.org/issue14520#msg157798

    The problem is that that definition isn't helpful.
    If we ever change itertools.tee to use non-PyObjects internally, suddenly
    its sys.getsizeof() would have to return much larger numbers despite
    visible behaviour not having changed at all (and despite the memory
    overhead being actually lower).

    And gc.get_referents() is really a low-level debugging tool, certainly
    not a "reflection API" (inspect would serve that role).

    @serhiy-storchaka
    Member

    Isn't sys.getsizeof() a low-level debugging tool?

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    Isn't sys.getsizeof() a low-level debugging tool?

    What would it help debug exactly? :-)
    I would hope it gives remotely useful information about the passed
    object.

    @amauryfa
    Member

    getsizeof() is interesting only if it gives sensible results when used correctly, especially if you want to sum these values and get a global memory usage.

    One usage is to traverse objects through gc.get_referents(); in this case the definition above is correct.

    Now, are you suggesting to traverse objects differently? With dir(), or __dict__?

    (btw, this discussion explains why pypy still does not implement getsizeof())

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    getsizeof() is interesting only if it gives sensible results when used
    correctly, especially if you want to sum these values and get a global
    memory usage.

    "Getting a global memory usage" isn't a correct use of getsizeof(),
    though, because it totally ignores the memory allocation overhead (not
    to mention fragmentation, or any memory areas that may have been
    allocated without being accounted for by __sizeof__).

    If you want global Python memory usage, use sys._debugmallocstats(), not
    sys.getsizeof().

    One usage is to traverse objects through gc.get_referents(); in this
    case the definition above is correct.

    What are the intended semantics? get_referents() can give you references
    you didn't expect, such as type objects, module objects...

    Now, are you suggesting to traverse objects differently? With dir(),
    or __dict__?

    sys.getsizeof() gives you the memory usage of a given Python object, it
    doesn't guarantee that "traversing objects" will give you the right
    answer for any question.

    @loewis
    Mannequin

    loewis mannequin commented Sep 19, 2013

    The problem is that that definition isn't helpful.
If we ever change itertools.tee to use non-PyObjects internally,
suddenly its sys.getsizeof() would have to return much larger numbers
    despite visible behaviour not having changed at all (and despite the
    memory overhead being actually lower).

    I see no problem with that. If the internal representation changes, nobody should be surprised if sizeof changes.

    I would hope it gives remotely useful information about the passed object.

    It certainly does: it reports the memory consumption of the object itself,
    not counting the memory of other objects.

I proposed a precise definition of what "an other object" is. If you don't like it,
please propose a different definition that still allows automatically summing up the
memory of a graph of objects.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    I see no problem with that. If the internal representation changes,
    nobody should be surprised if sizeof changes.

    Who is "nobody"? Users aren't aware of internal representation changes.
    It sounds like you want sys.getsizeof() to be a tool for language
    implementors anyway.

I proposed a precise definition of what "an other object" is. If you don't like it,
please propose a different definition that still allows automatically summing up the
memory of a graph of objects.

What is the use case for "summing up the memory of a graph of objects"?
How do you stop walking your graph if it spans the whole graph of Python
objects?

    @serhiy-storchaka
    Member

Well. itertools._tee is one Python object and itertools._tee_dataobject is another Python object. sys.getsizeof() gives you the memory usage of these objects separately.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

Well. itertools._tee is one Python object and
itertools._tee_dataobject is another Python object. sys.getsizeof()
gives you the memory usage of these objects separately.

    This is great... And how do I know that I need to use gc.get_referents()
    to get those objects in case I'm measuring the memory consumption of a
    teeobject (rather than, say, trusting __dict__, or simply trusting the
    getsizeof() output at face value)?

    If sys.getsizeof() is only useful for people who know *already* how an
    object is implemented internally, then it's actually useless, because
    those people can just as well do the calculation themselves.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    Here is why using get_referents() is stupid in the general case:

    >>> class C: pass
    ... 
    >>> c = C()
    >>> gc.get_referents(c)
    [<class '__main__.C'>]

    With your method, measuring c's memory consumption also includes the memory consumption of its type object.
    (and of course this is only a trivial example... one can only imagine what kind of mess it is with a non-trivial object)

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

    (By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)

    @serhiy-storchaka
    Member

    How do you stop walking your graph if it spans the whole graph of Python objects?

    We can stop at specific types of objects (for example types and modules).

    If sys.getsizeof() is only useful for people who know *already* how an
    object is implemented internally, then it's actually useless, because
    those people can just as well do the calculation themselves.

That's why sys.getsizeof() is a low-level tool. We need a high-level tool in the stdlib. Even imperfect recursive counting would be better than sys.getsizeof(), which is confusing for novices.

    (By the way, OrderedDict.__sizeof__ already breaks the rule you are trying to impose)

    Yes, I know, and I think it is wrong.

Here is an improved version of gettotalsizeof():

    import sys, gc

    def gettotalsizeof(*args, exclude_types=(type, type(sys))):
        seen = {}
        stack = []
        for obj in args:
            if id(obj) not in seen:
                seen[id(obj)] = obj
                stack.append(obj)
        sum = 0
        while stack:
            obj = stack.pop()
            sum += sys.getsizeof(obj)
            for obj in gc.get_referents(obj):
                if id(obj) not in seen and not isinstance(obj, exclude_types):
                    seen[id(obj)] = obj
                    stack.append(obj)
        return sum

    >>> gettotalsizeof(sys)
    206575
    >>> gettotalsizeof(gc)
    2341
    >>> gettotalsizeof(sys.getsizeof)
    60
    >>> gettotalsizeof(gettotalsizeof)
    60854
    >>> class C: pass
    ... 
    >>> gettotalsizeof(C)
    805
    >>> gettotalsizeof(C())
    28

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

That's why sys.getsizeof() is a low-level tool. We need a high-level
tool in the stdlib. Even imperfect recursive counting would be better
than sys.getsizeof(), which is confusing for novices.

    Ok, but I need to see a satisfying version of "gettotalsizeof" before
    I'm convinced (see below).

    > Here is improved version of gettotalsizeof():
    > 
    [...]
    > >>> gettotalsizeof(gettotalsizeof)
    > 60854

    Why that big? Does it make sense?

    What if say, a large object is "shared" between many small objects?
    Should it count towards the memory size of any of those small objects?
    What if that object is actually immortal (it is also a module global,
    for example)?

    @serhiy-storchaka
    Member

Optionally we could also not count objects which are referenced from outside of the graph of objects (this isn't so easy to implement in Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True) will count only the bare list and the square root of 22, because 1 and 'abc' are interned.

    @pitrou
    Member Author

    pitrou commented Sep 19, 2013

Optionally we could also not count objects which are referenced from
outside of the graph of objects (this isn't so easy to implement in
Python). I.e. gettotalsizeof([1, 'abc', math.sqrt(22)], inner=True)
will count only the bare list and the square root of 22, because 1 and
'abc' are interned.

    That's only part of the equation. What if I have an object which
    references, for example, a logging.Logger? Loggers are actually eternal
    (they live in a global dictionary somewhere in the logging module), but
    gettotalsizeof() will still count it.
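[Editor's note] A sketch of this objection, using only real stdlib API (`logging.getLogger`; the logger name "example" is arbitrary): the logger lives forever in logging's module-level registry, yet a get_referents()-based walk charges its size to our small wrapper list.

```python
# Sketch: a shared, effectively immortal logger inflates the walked total
# of a tiny list that merely references it.
import gc
import logging
import sys

holder = [logging.getLogger("example")]
own_size = sys.getsizeof(holder)
walked_size = own_size + sum(sys.getsizeof(o) for o in gc.get_referents(holder))
# walked_size exceeds own_size solely because of the shared logger.
```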

    @serhiy-storchaka
    Member

Here is an advanced function which counts only objects on which there are no external references.

    >>> import itertools
    >>> a, b = itertools.tee(range(10000))
    >>> max(zip(a, range(100)))
    (99, 99)
    >>> sys.getsizeof(a)
    32
    >>> gettotalinnersizeof(a)
    32
    >>> gettotalinnersizeof(b)
    292
    >>> gettotalinnersizeof(a, b)
    608

The total size of a and b is larger than the sum of the sizes of a and b, because it includes the size of one teedataobject shared between a and b, and one shared range iterator.
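[Editor's note] The sharing itself can be observed with gc.get_referents(); this sketch relies on a CPython implementation detail (two fresh tees initially point at one shared teedataobject).

```python
# Sketch (CPython detail): a and b share a referent, so counting them
# independently would bill the shared buffer twice, while counting them
# together bills it once -- hence the non-additive totals above.
import gc
import itertools

a, b = itertools.tee(range(10))
shared = set(map(id, gc.get_referents(a))) & set(map(id, gc.get_referents(b)))
# Non-empty set: at least one object is referenced by both tees.
```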

    @loewis
    Mannequin

    loewis mannequin commented Sep 20, 2013

    Antoine: in (my experience of) memory analysis, the size of a single object is mostly irrelevant. If you need to know how much memory something consumes, you typically want to know the memory of a set of objects. So this is the case that really must be supported.

For that, users will have to use libraries that know how to count memory. It's not asking too much that the authors of such libraries know about the internals of Python (such as the existence of sys.getsizeof, or gc.get_referents). The question is: can such a library be reasonably implemented? For that, it is important that getsizeof behaves uniformly across objects.

If you really don't like the proposed uniformity, please propose a different rule. However, don't give deviations in other places (OrderedDict) as a reason to break the rule here as well. Instead, if OrderedDict.__sizeof__ is broken, it needs to be fixed as well.

    @rhettinger rhettinger self-assigned this Sep 20, 2013
    @rhettinger
    Contributor

    getsizeof() is interesting only if it gives sensible results
    when used correctly, especially if you want to sum these values
    and get a global memory usage.

    If accounting for global memory usage is a goal, it needs to have a much more comprehensively thought out, implementation dependent approach. There are many issues (memory fragmentation, key-sharing dictionaries, dummy objects, list over-allocation, the minsize dictionary that is part of the dict object in addition to its variable sized portion, non-python objects held by Python objects, the extra few bytes per object consumed by the freelisting scheme in Objects/obmalloc.c etc).
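[Editor's note] One item on this list, list over-allocation, is easy to demonstrate: CPython grows a list's capacity in geometric chunks, so getsizeof() jumps in steps rather than per element.

```python
# Sketch: 100 appends produce far fewer than 100 distinct getsizeof()
# values, because capacity is over-allocated ahead of need.
import sys

xs = []
distinct_sizes = set()
for i in range(100):
    xs.append(i)
    distinct_sizes.add(sys.getsizeof(xs))
# Size changes only at the capacity-growth steps.
```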

    The thing is, "Total size" is generally meaningless.

    I concur. This is a pipe dream without a serious investment of time and without creating a new and unnecessary maintenance burden.

    (By the way, OrderedDict.__sizeof__ already breaks the
    rule you are trying to impose)

    FWIW, the way OrderedDict computes sizeof is probably typical of how anyone is currently using sys.getsizeof(). If you change the premise of how it operates, you're probably going to break the code written by the very few people in the world who care about sys.getsizeof():

        def __sizeof__(self):
            sizeof = _sys.getsizeof
            n = len(self) + 1                       # number of links including root
            size = sizeof(self.__dict__)            # instance dictionary
            size += sizeof(self.__map) * 2          # internal dict and inherited dict
            size += sizeof(self.__hardroot) * n     # link objects
            size += sizeof(self.__root) * n         # proxy objects
            return size

I don't have any specific recommendation for itertools.tee other than that I think it doesn't really need a __sizeof__ method. The typical uses of tee are transient phenomena that temporarily use some memory and then disappear. I'm not sure that any mid-stream sizeof checks reveal information of any worth.

    Overall, this thread indicates that the entire concept of __sizeof__ has been poorly defined, unevenly implemented, and not really useful when aggregated.

    For those who are interested in profiling and optimizing Python's memory usage, I think we would be much better off providing a memory allocator hook that can know about every memory allocation and how those allocations have been arranged (revealing the fragmentation of the unused memory in the spaces between). Almost anything short of that will provide a grossly misleading picture of memory usage.
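[Editor's note] The tracemalloc module, added in Python 3.4 shortly after this discussion, provides an allocator-level view roughly in this spirit:

```python
# Sketch: tracemalloc hooks the allocator, so its totals reflect actual
# allocations rather than per-object __sizeof__ sums.
import tracemalloc

tracemalloc.start()
data = [bytes(10000) for _ in range(100)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
# `current` covers roughly 100 * 10000 bytes of payload plus overhead.
```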
