Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in memory use of shared key dictionaries for "compact dicts" #84297

Closed
markshannon opened this issue Mar 30, 2020 · 15 comments
Closed
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@markshannon
Copy link
Member

BPO 40116
Nosy @malemburg, @rhettinger, @methane, @markshannon, @serhiy-storchaka
PRs
  • bpo-40116: Add insertion order bit-vector to dict values to allow dicts to share keys more freely. #28520
  • bpo-40116: dict: Add regression test for iteration order. #31550
  • Files
  • compact_dict_prevents_key_sharing.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-01-20.11:50:36.639>
    created_at = <Date 2020-03-30.15:13:56.918>
    labels = ['interpreter-core', 'type-bug']
    title = 'Regression in memory use of shared key dictionaries for "compact dicts"'
    updated_at = <Date 2022-03-03.04:06:40.503>
    user = 'https://github.com/markshannon'

    bugs.python.org fields:

    activity = <Date 2022-03-03.04:06:40.503>
    actor = 'methane'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-01-20.11:50:36.639>
    closer = 'Mark.Shannon'
    components = ['Interpreter Core']
    creation = <Date 2020-03-30.15:13:56.918>
    creator = 'Mark.Shannon'
    dependencies = []
    files = ['49013']
    hgrepos = []
    issue_num = 40116
    keywords = ['patch']
    message_count = 15.0
    messages = ['365319', '365320', '365322', '365325', '365326', '365390', '402443', '402465', '402468', '402493', '402494', '403299', '413884', '413894', '414403']
    nosy_count = 5.0
    nosy_names = ['lemburg', 'rhettinger', 'methane', 'Mark.Shannon', 'serhiy.storchaka']
    pr_nums = ['28520', '31550']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue40116'
    versions = []

    @markshannon
    Copy link
    Member Author

    The current implementation of dicts prevents keys from being shared when the order of attribute differs from the first instance created.
    This can potentially use a considerably larger amount of memory than expected.

    Consider the class:

    class C:
    
       opt = DEFAULT
    
       def __init__(self, attr, optional=None):
           if optional:
               self.opt = optional
           self.attr = attr

    This is a reasonable way to write a class, but has unpredictable memory use.
    In the attached example, per-instance dict size goes from 104 bytes to 232 bytes when sharing is prevented.

    The language specification says that the dicts maintain insertion order, but the wording implies that this only to explicit dictionaries, not instance attribute or other namespace dicts.

    Either we should allow key sharing in these cases, or clarify the documentation.

    @markshannon markshannon added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Mar 30, 2020
    @serhiy-storchaka
    Copy link
    Member

    But the following class should not lead to unlimited memory consumption when create new instances:

    class C:
        count = 0
        def __init__(self):
            count = self.__class__.count
            self.__class__.count = count + 1
            setattr(self, f'a{count}', count)

    @markshannon
    Copy link
    Member Author

    Indeed it shouldn't.
    How is that relevant?

    @markshannon
    Copy link
    Member Author

    Just to clarify.

    class AlwaysShared:
    
       opt = DEFAULT
    
       def __init__(self, attr, optional=None):
           self.attr = attr
           if optional:
               self.opt = optional
    
    class SometimesShared:
    
       opt = DEFAULT
    
       def __init__(self, attr, optional=None):
           if optional:
               self.opt = optional
           self.attr = attr

    The class AlwaysShared always has shared keys, whereas the class SometimesShared is not shared if optional is True for first instantiation, but False for a later instantiation.

    @serhiy-storchaka
    Copy link
    Member

    I think the current behavior is a guard against such pitfall. If you allow to add new keys when not all other keys are set, you can end with sharing growing set of keys.

    @markshannon
    Copy link
    Member Author

    You can't end up with a growing set of keys.
    The problem is to do with the *order* of insertion, not the number of keys.

    @markshannon
    Copy link
    Member Author

    This can be mitigated, if not entirely fixed, by storing an ordering bit vector in the values.

    This way all instances of the class SometimesShared in the example above can share the keys.

    The keys might be ("optional", "attr")

    For any instances with only "attr" as an attibute, the values would be (NULL, value) and the order would be (1,)

    The downsides of this approach are:

    1. Each values, and thus dict needs an extra 64 bit value.
    2. Shared keys have a maximum size of 16.

    Overall, I expect the improved sharing to more than compensate for the disadvantages.

    @rhettinger
    Copy link
    Contributor

    Overall, I expect the improved sharing to more than
    compensate for the disadvantages.

    I expect the opposite. This makes all dicts pay a price (in space, initialization time, and complexity) for a micro-optimization of an uncommon case (the normal case is for __init__ to run and set all the keys in a consistent order). It is unlikely that the "benefits" to never be felt in real-word applications, but "disadvantages" would affect every Python program.

    The language specification says that the dicts maintain insertion
    order, but the wording implies that this only to explicit
    dictionaries, not instance attribute or other namespace dicts.

    That is a quite liberal reading of the spec. I would object to making instance and namespace dicts behave differently. That would be a behavior regression and we would forever have to wrestle with the difference.

    @malemburg
    Copy link
    Member

    On 22.09.2021 21:02, Raymond Hettinger wrote:

    > The language specification says that the dicts maintain insertion
    > order, but the wording implies that this only to explicit
    > dictionaries, not instance attribute or other namespace dicts.

    That is a quite liberal reading of the spec. I would object to making instance and namespace dicts behave differently. That would be a behavior regression and we would forever have to wrestle with the difference.

    I agree. Keeping the insertion order is essential for many common
    use cases, including those where a class or instance dict is used,
    e.g. namespaces used for data records, data caches, field
    definitions in data records, etc. (and yes, those often can be
    dynamically extended as well :-)).

    I think for the case you mention, a documentation patch would be
    better and more helpful for the programmers. Point them to slots
    and the sharing problem should go away in most cases :-)

    @markshannon
    Copy link
    Member Author

    Raymond,

    Only split dicts need the extra field.

    Classes where many instances do not have exactly the same set of attributes may be more common than you think.
    There are many reasons why some attributes may be added conditionally.

    PR 28520 actually makes the dictionary code a bit simpler, as we don't need to maintain the invariant that value arrays cannot have holes. Maintaining an order is simple and cheap:

    order = (order<<4) | insertion_index

    There are pros and cons to both schemes: PR 28520 and the current implementation.

    The problem with the current scheme is that it only works well for classes where all instances are initialized with exactly the same attributes, and in the same order.

    The PR 28520 scheme can handle those cases where order and key set differ a bit, but has a maximum size of 16 before the dict must be combined.

    We need as many instances as possible to have split dictionaries, to get faster-cpython/ideas#72 working well as it will make the cost of not sharing even greater, relatively.

    @serhiy-storchaka
    Copy link
    Member

    There are several common idioms which do not work well with shared dictionaries.

    1. Class attributes as defaults. If most instances of the class have the default value for some attribute, it can be set as the class attribute. It saves memory because most instances will have smaller dict. But if the first instance has such attribute, it cancels shared dict for all following instances without this attribute.

    2. cached_property (and analogs). The cached instance attributes can be added in arbitrary order, canceling shared dict for this instance and maybe for all instances created later.

    3. Some classes delete attributes in close() or __exit__() methods to avoid reference loops and to release resources earlier. Since shared dicts do not support deletion, such closed objects have now larger size than non-closed objects.

    @markshannon
    Copy link
    Member Author

    New changeset a7252f8 by Mark Shannon in branch 'main':
    bpo-40116: Add insertion order bit-vector to dict values to allow dicts to share keys more freely. (GH-28520)
    a7252f8

    @methane
    Copy link
    Member

    methane commented Feb 24, 2022

    I found regression caused by #72706.

    class C:
        def __init__(self, n):
            if n:
                self.a = 1
                self.b = 2
                self.c = 3
            else:
                self.c = 1
                self.b = 2
                self.a = 3
    
    
    o1 = C(True)
    o2 = C(False)
    print(o2.__dict__)  # {'c': 1, 'b': 2, 'a': 3}
    
    d1 = {}
    d1.update(o2.__dict__)  # {'a': 3, 'b': 2, 'c': 1}
    print(d1)
    

    @methane
    Copy link
    Member

    methane commented Feb 24, 2022

    PyDict_Keys(), PyDict_Values(), and PyDict_Items() don't respect insertion order too.

    @methane
    Copy link
    Member

    methane commented Mar 3, 2022

    New changeset 4f74052 by Inada Naoki in branch 'main':
    bpo-40116: dict: Add regression test for iteration order. (GH-31550)
    4f74052

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants