
sizeof set after set_merge() is doubled from 3.5 #74135

Closed
methane opened this issue Mar 30, 2017 · 12 comments
Assignees: rhettinger
Labels: 3.7 (EOL) end of life · interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@methane
Member

methane commented Mar 30, 2017

BPO 29949
Nosy @rhettinger, @cameron-simpson, @methane, @serhiy-storchaka
PRs
  • bpo-29949: Fix set memory usage regression #943
  • [3.6] bpo-29949: Fix set memory usage regression #945
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2017-04-01.11:53:48.062>
    created_at = <Date 2017-03-30.18:09:13.367>
    labels = ['interpreter-core', '3.7']
    title = 'sizeof set after set_merge() is doubled from 3.5'
    updated_at = <Date 2017-04-01.14:29:33.601>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2017-04-01.14:29:33.601>
    actor = 'methane'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2017-04-01.11:53:48.062>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2017-03-30.18:09:13.367>
    creator = 'methane'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 29949
    keywords = ['3.6regression']
    message_count = 12.0
    messages = ['290868', '290888', '290906', '290908', '290909', '290911', '290912', '290915', '290928', '290934', '290970', '290984']
    nosy_count = 5.0
    nosy_names = ['rhettinger', 'cameron', 'methane', 'serhiy.storchaka', 'jgosmann']
    pr_nums = ['943', '945']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue29949'
    versions = ['Python 3.6', 'Python 3.7']

    @methane
    Member Author

    methane commented Mar 30, 2017

    (original thread is https://mail.python.org/pipermail/python-list/2017-March/720391.html)

Commit 4897300 doubles the size of set objects created by set_merge().
set_merge() is used by the constructors of set and frozenset.

    $ /usr/bin/python3
    Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
    [GCC 6.2.0 20160927] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> s = set(range(10))
    >>> sys.getsizeof(frozenset(s))
    736
    $ python3
    Python 3.6.0 (default, Dec 30 2016, 20:49:54)
    [GCC 6.2.0 20161005] on linux
    Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
    >>> s = set(range(10))
    >>> sys.getsizeof(frozenset(s))
    1248

    @methane methane added 3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Mar 30, 2017
    @methane
    Member Author

    methane commented Mar 31, 2017

    https://gist.github.com/methane/8faf12621cdb2166019bbcee65987e99
    This patch fixes the regression.

But I think the result is still larger than ideal. See this code:

    minused *= 2;
    /* Find the smallest table size > minused. */
    /* XXX speed-up with intrinsics */
    for (newsize = PySet_MINSIZE;
         newsize <= minused && newsize > 0;
         newsize <<= 1)
        ;

When the original minused is X, newsize will be about 2X ~ 4X.
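A minimal Python model of that loop (a sketch; it assumes PySet_MINSIZE is 8, its value in CPython's Objects/setobject.c) confirms the 2X ~ 4X range:

    def new_table_size(minused, minsize=8):
        # Mirror the C loop: double minused, then take the smallest
        # power-of-two table size strictly greater than it.
        minused *= 2
        newsize = minsize
        while newsize <= minused:
            newsize <<= 1
        return newsize

    for x in (10, 100, 1000):
        print(x, new_table_size(x))
    # 10 32      -> 3.2X
    # 100 256    -> 2.56X
    # 1000 2048  -> 2.05X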

For set.add(), reserving extra space for further add() calls makes sense.
But for set_merge(), the intention is to avoid repeated resizes while merging.
There may be no further add() calls, especially for frozenset(s).
So a fill of 30% ~ 60% seems better than 25% ~ 50%.

What do you think, Raymond?

    @rhettinger rhettinger self-assigned this Mar 31, 2017
    @rhettinger
    Contributor

    I'll look at this over the next week or two. I don't really like the proposed patch at all but will carefully think through the speed/space trade-offs.

    @rhettinger
    Contributor

    Hmm, I wonder why I'm not seeing the same sizes you are seeing.

    $ cat setsize.py
    from sys import getsizeof
    print( [getsizeof(frozenset(range(n))) for n in range(20)] )
    $ python3.4 setsize.py
    [224, 224, 224, 224, 224, 224, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736]
    $ python3.5 setsize.py
    [224, 224, 224, 224, 224, 224, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736]
    $ python3.6 setsize.py
    [224, 224, 224, 224, 224, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736]
    $ python3.4 --version
    Python 3.4.4
    $ python3.5 --version
    Python 3.5.3
    $ python3.6 --version
    Python 3.6.1

    @methane
    Member Author

    methane commented Mar 31, 2017

    See set_update_internal().
    https://github.com/python/cpython/blob/master/Objects/setobject.c#L969-L1016

This happens only when the iterable is a set or a dict.

    >>> import sys
    >>> sys.getsizeof(set(range(10)))
    736
    >>> sys.getsizeof(set(set(range(10))))
    1248
    >>> sys.getsizeof(set(dict.fromkeys(range(10))))
    1248

    @serhiy-storchaka
    Member

    $ ./python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(range(n))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'

    3.5: [(0, 112), (6, 368), (22, 1136), (86, 4208), (342, 16496), (1366, 65648), (5462, 262256)]
    3.6: [(0, 112), (5, 368), (21, 1136), (85, 4208), (341, 16496), (1365, 65648), (5461, 262256)]
    3.7: [(0, 112), (5, 368), (19, 1136), (77, 4208), (307, 16496), (1229, 65648), (4915, 262256)]

    $ ./python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(set(range(n)))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'

    3.5: [(0, 112), (6, 240), (8, 368), (16, 624), (32, 1136), (64, 2160), (128, 4208), (256, 8304), (512, 16496), (1024, 32880), (2048, 65648), (4096, 131184)]
    3.6: [(0, 112), (5, 368), (8, 624), (16, 1136), (32, 2160), (64, 4208), (128, 8304), (256, 16496), (512, 32880), (1024, 65648), (2048, 131184), (4096, 262256)]
    3.7: [(0, 112), (5, 368), (8, 624), (16, 1136), (32, 2160), (64, 4208), (128, 8304), (256, 16496), (512, 32880), (1024, 65648), (2048, 131184), (4096, 262256)]

    frozenset(range(n)) is slightly larger in 3.7 than in 3.6. It is 4 times larger for about 10% of sizes.

    frozenset(set(range(n))) is 2 times larger in 3.6 than in 3.5 for all sizes >= 16.
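The one-liners above, unrolled into a readable script (same logic: record each n at which getsizeof jumps, i.e. each internal table resize):

    from sys import getsizeof

    N = 6000
    sizes = [getsizeof(frozenset(set(range(n)))) for n in range(N)]
    # Report (n, size) at every point where the reported size changes.
    print([(n, sizes[n]) for n in range(N) if n == 0 or sizes[n] != sizes[n - 1]])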

    @methane
    Member Author

    methane commented Mar 31, 2017

    frozenset(range(n)) is slightly larger in 3.7 than in 3.6. It is 4 times larger for about 10% of sizes.

This is intentional: commit 5cd87a8

The load factor was reduced from 66% to 60% (about -10%).
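Ignoring exact boundary conditions, the effect can be sketched as follows: the smallest power-of-two table that keeps n entries at or under the fill limit moves up to the next power of two slightly earlier under a 60% limit than under a 66% limit, consistent with the earlier resize points (21 -> 19, 85 -> 77) in the measurements above. The min_slots helper below is a hypothetical model, not CPython's actual resize check:

    def min_slots(n, max_fill, minsize=8):
        # Smallest power-of-two table keeping n entries at or under max_fill.
        slots = minsize
        while n > slots * max_fill:
            slots *= 2
        return slots

    for n in (19, 20, 21):
        print(n, min_slots(n, 2/3), min_slots(n, 3/5))
    # 19 32 32
    # 20 32 64   <- the 60% limit already needs the next table size here
    # 21 32 64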

    @rhettinger
    Contributor

I think the best thing to do is to undo the refactoring in 4897300. It was intended to be neutral but did affect set_update_internal() for small sets.

    @methane
    Member Author

    methane commented Mar 31, 2017

I agree. Before thinking about rebalancing size versus speed,
resolving the regression is important.

    @rhettinger
    Contributor

Do you want to prepare a PR for me? I'm not yet set up with the ways of GitHub. Please limit the PR to just unwinding the refactoring in the simplest possible way.

If in the future you want to chat about speed/space trade-offs, we can do that offline. I've spent years thinking about this, interacting with users, discussing with other devs, speaking on the topic, and working through use cases for sets. The original reasons for the choices made are largely still true today. I would be happy to walk you through the history (the tracker isn't a good place to do this; IRC would serve us better).

    @methane
    Member Author

    methane commented Apr 1, 2017

    New changeset e82cf86 by INADA Naoki in branch 'master':
    bpo-29949: Fix set memory usage regression (GH-943)

    @methane methane closed this as completed Apr 1, 2017
    @methane
    Member Author

    methane commented Apr 1, 2017

    New changeset efde51a by INADA Naoki in branch '3.6':
    bpo-29949: Fix set memory usage regression (GH-945)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022