Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot pickle self-referencing sets #53515

Closed
abalkin opened this issue Jul 15, 2010 · 11 comments
Closed

Cannot pickle self-referencing sets #53515

abalkin opened this issue Jul 15, 2010 · 11 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@abalkin
Copy link
Member

abalkin commented Jul 15, 2010

BPO 9269
Nosy @rhettinger, @jcea, @abalkin, @pitrou, @jackdied, @avassalotti, @merwok, @alex
Superseder
  • bpo-17810: Implement PEP 3154 (pickle protocol 4)
  • Files
  • cycle.py: Script reproducing the bug
  • self_referential-sets.patch: Adds support for self-referential sets
  • sets-test.patch: Adds testing of pickling/unpickling sets
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/avassalotti'
    closed_at = <Date 2013-11-30.05:07:31.220>
    created_at = <Date 2010-07-15.23:01:40.612>
    labels = ['type-feature', 'library']
    title = 'Cannot pickle self-referencing sets'
    updated_at = <Date 2013-11-30.05:07:31.219>
    user = 'https://github.com/abalkin'

    bugs.python.org fields:

    activity = <Date 2013-11-30.05:07:31.219>
    actor = 'alexandre.vassalotti'
    assignee = 'alexandre.vassalotti'
    closed = True
    closed_date = <Date 2013-11-30.05:07:31.220>
    closer = 'alexandre.vassalotti'
    components = ['Library (Lib)']
    creation = <Date 2010-07-15.23:01:40.612>
    creator = 'belopolsky'
    dependencies = []
    files = ['18021', '26533', '26539']
    hgrepos = []
    issue_num = 9269
    keywords = ['patch']
    message_count = 11.0
    messages = ['110397', '110402', '110403', '110407', '110408', '110409', '110410', '110423', '110617', '166530', '166570']
    nosy_count = 13.0
    nosy_names = ['rhettinger', 'jcea', 'belopolsky', 'grubert', 'pitrou', 'jackdied', 'alexandre.vassalotti', 'schmir', 'eric.araujo', 'alex', 'zzzeek', 'bcroq', 'mstefanro']
    pr_nums = []
    priority = 'low'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '17810'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue9269'
    versions = ['Python 3.4']

    @abalkin
    Copy link
    Member Author

    abalkin commented Jul 15, 2010

    Attached script, cycle.py demonstrates a simplification of the behavior reported by mike bayer in msg77200.

    Essentially, the script attempts to pickle a set that contains a class instance that has an attribute referring back to the set:

    class C:
        pass
    c = C()
    cycle = set([c])
    c.foo = cycle

    An attempt to pickle the *cycle* object triggers an assertion error in 2.7 or in 3.2 with disabled _pickle acceleration and produces a broken cycle in 3.2 or if cPickle is used instead of pickle in 2.7.

    $ python3 cycle.py
    FAIILURE
    ..
    $ python2 cycle.py
    Traceback (most recent call last):
      ..
      File ".../pickle.py", line 244, in memoize
        assert id(obj) not in self.memo
    AssertionError

    If you run cycle.py with an argument, it uses a dict instead of set to create the cycle and shows that the cycles with dict can be pickled correctly:

    $ python3 cycle.py dict
    SUCCESS
    ..

    After reporting success or failure, cycle.py, prints a disassembly of the pickle stream which makes it clear what happens:

    In case of dict, we see the following:

    $ python3 cycle.py dict
    SUCCESS
     ..
        2: }    EMPTY_DICT
        3: q    BINPUT     0
     ..
       26: X    BINUNICODE 'foo'
    ..
       36: h    BINGET     0
       38: s    SETITEM
    ..
       40: N    NONE
       41: s    SETITEM

    An empty dict is created and saved in the memo. Then a C object is built with foo attribute is set to the dict retrieved from the memo. Finally, the same dict is updated with (C object, None) key-value pair. The result is the cycle identical to the one we built in python code.

    The sets, however, are built differently. There is no pickle opcode to add items to a set, so all set items must exist by the time set is built. So here is what we see:

    $ python3 cycle.py
    FAIILURE
        2: c    GLOBAL     'builtins set'
       16: q    BINPUT     0
    .. instead of empty set the constructor is saved in memo
       42: X    BINUNICODE 'foo'
       52: h    BINGET     0
    ..
       63: R    REDUCE
    .. a set object containing c is constructed 
       66: s    SETITEM  
    .. and assigned to c.foo
       72: R    REDUCE
    .. another set object is constructed containing c
     
    As a result, we have
    cycle = {c}
    c.foo = {c}

    Instead of

    c.foo = cycle

    @abalkin abalkin added type-bug An unexpected behavior, bug, or error interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jul 15, 2010
    @abalkin abalkin self-assigned this Jul 15, 2010
    @abalkin
    Copy link
    Member Author

    abalkin commented Jul 16, 2010

    I am reclassifying this as an RFE because as a bug, this is a duplicate of bpo-1062277. The later contains an excellent description of the problem by Dima Dorfman:

     Our pickle implementations don't check for reduce
     cycles. This is somewhat cosmetic except that stack
     overflow protection is imperfect (for cPickle), causing
     crashes, and some kinds of cycles trigger asserts (in
     pickle).  [[msg47267](https://bugs.python.org/msg47267)]
    

    This is undeniably a bug and the solution offered in bpo-1062277 is quite reasonable: detect reduce cycles and bail out with an informative message. This will not solve the problem of pickling self-referencing sets, but at least once documented can be defended as expected behavior.

    What remains after bpo-1062277 is a feature request to allow pickling of self-referencing sets. I would argue that this is really a pathological case. Sets are not supposed to contain mutable items and immutable objects cannot create reference cycles among themselves.

    The test case in cycle.py tricks set into accepting mutable objects by creating a class with default __hash__. This falls into a category of "don't do it".

    I am lowering the priority and adding bpo-1062277 as a dependency. Once bpo-1062277 is fixed, I would not mind closing this as "won't fix".

    @abalkin abalkin added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Jul 16, 2010
    @zzzeek
    Copy link
    Mannequin

    zzzeek mannequin commented Jul 16, 2010

    where is it defined that sets are not "supposed" to contain mutable items? such a requirement vastly limits the usefulness of sets.

    Consider that relational database rows are mutable. A result set containing multiple rows which each have a primary key comprises a set, hashed on primary key. But other attributes of each row can be updated. Surely this is not controversial ? What Python data structure should I (and a whole bunch of Python ORMs) be using to represent mutable, relational database rows, unordered and unique on primary key, in memory ?

    @jackdied
    Copy link
    Contributor

    Mike, it is better to think of database rows as immutable tuples. During the course of a query the contents of the database are considered static - hence all that locking and kvetching about this or that database not having "true" foreign key support. If database rows were mutable the results of a JOIN could be nonsensical.

    @abalkin
    Copy link
    Member Author

    abalkin commented Jul 16, 2010

    On Thu, Jul 15, 2010 at 8:26 PM, mike bayer <report@bugs.python.org> wrote:
    ..

    where is it defined that sets are not "supposed" to contain mutable items?   such a requirement vastly limits the usefulness of sets.

    Well, there is no such requirement. The actual requirement is that
    they should be hashable. For built-in types, however hashable is the
    same as immutable. Arguably, user-defined classes should emulate
    that.

    Consider that relational database rows are mutable.  A result set containing multiple rows which each have a primary key comprises a set, hashed on primary key.  But other attributes of each row can be updated.   Surely this is not controversial ?  What Python data structure should I (and a whole bunch of Python ORMs) be using to represent mutable, relational database rows, unordered and unique on primary key, in memory ?

    Why wouldn't you represent a result set as a dict mapping primary key
    (tuple) to list of column values?

    @abalkin
    Copy link
    Member Author

    abalkin commented Jul 16, 2010

    On Thu, Jul 15, 2010 at 8:52 PM, Jack Diederich <report@bugs.python.org> wrote:
    ..

     If database rows were mutable the results of a JOIN could be nonsensical.

    And if your result set is self-referential you have a bigger problem
    on your hands than not being able to pickle it. :-)

    @zzzeek
    Copy link
    Mannequin

    zzzeek mannequin commented Jul 16, 2010

    OK, more specifically, here's the kind of situation where items in a set are mutable:

    company = Session.query(Company).first()

    # company.employees is a set()
    company.employees

    # each employee references the parent company
    for e in company.employees:
        assert e.company is company

    So nothing is mutated relationally in this case. It's just a plain old bidirectional structure. If two objects are related via many-to-many, then you might have a set in both directions. I'm not sure if this specific situation produces the pickle bug, however, it's been awhile since I've seen which specific case causes the problem. But there are mutable items in a set in these examples and it doesn't seem unreasonable to me.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 16, 2010

    The test case in cycle.py tricks set into accepting mutable objects by > creating a class with default __hash__. This falls into a category of > "don't do it".

    I beg to differ. There is a reason we allow people to define __hash__ and that's to define arbitrary hashable types (not only immutable ones).
    Furthermore, the default __hash__ (equivalent to id()) is also perfectly useful in some cases.
    And in object-oriented designs it is very common to have reference cycles.

    @abalkin
    Copy link
    Member Author

    abalkin commented Jul 17, 2010

    Upon further investigation, I am no longer convinced that "reduce cycles" are necessarily fatal to pickling. I am attaching a script testcycle.py that allows creating cycles using various containers. It takes the name of container class as an argument and prints SUCCESS/FAILURE followed by annotated pickle disassembly. It turns out that self-referential tuples are pickled successfully even though being immutable, they must receive their state through arguments:

    $ ./python.exe testcycle.py tuple
    SUCCESS
        0: \x80 PROTO      3              Protocol version indicator.
        2: c    GLOBAL     '__main__ C'   Push a global object (module.attr) on the stack.
       14: )    EMPTY_TUPLE               Push an empty tuple.
       15: \x81 NEWOBJ                    Build an object instance.
       16: q    BINPUT     1              Store the stack top into the memo.  The stack is not popped.
       18: }    EMPTY_DICT                Push an empty dict.
       19: X    BINUNICODE 'foo'          Push a Python Unicode string object.
       27: h    BINGET     1              Read an object from the memo and push it on the stack.
       29: \x85 TUPLE1                    Build a one-tuple out of the topmost item on the stack.
       30: q    BINPUT     4              Store the stack top into the memo.  The stack is not popped.
       32: s    SETITEM                   Add a key+value pair to an existing dict.
       33: b    BUILD                     Finish building an object, via __setstate__ or dict update.
       34: 0    POP                       Discard the top stack item, shrinking the stack by one item.
       35: h    BINGET     4              Read an object from the memo and push it on the stack.
       37: .    STOP                      Stop the unpickling machine.
    highest protocol among opcodes = 2

    There is a comment in save_tuple() that explains how it works:

            # Subtle.  d was not in memo when we entered save_tuple(), so                                                           
            # the process of saving the tuple's elements must have saved                                                            
            # the tuple itself:  the tuple is recursive.  The proper action                                                         
            # now is to throw away everything we put on the stack, and                                                              
            # simply GET the tuple (it's already constructed).  This check                                                          
            # could have been done in the "for element" loop instead, but                                                           
            # recursive tuples are a rare thing.        
    

    I wonder if the same trick could be used in save_reduce() to resolve bpo-1062277 by pickling "reduce cycles" correctly instead of bailing out when they are encountered.

    @terryjreedy terryjreedy added stdlib Python modules in the Lib dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Aug 9, 2010
    @mstefanro
    Copy link
    Mannequin

    mstefanro mannequin commented Jul 26, 2012

    I have attached a fix to this issue (and implicitly bpo-1062277).

    This patch allows pickling self-referential sets by implementing a set.__reduce__ which uses states as opposed to ctor parameters.

    Before:
    >>> s=set([1,2,3])
    >>> s.__reduce__()
    (<class 'set'>, ([1, 2, 3],), None)
    >>> len(pickle.dumps(s,1))
    38
    
    After:
    >>> s=set([1,2,3])
    >>> s.__reduce__()
    (<class 'set'>, (), [1, 2, 3])
    >>> len(pickle.dumps(s,1))
    36

    Basically what this does is: instead of unpickling the set by doing set([1,2,3]) it does s=set(); s.__setstate__([1,2,3]).
    States are supported in all versions of pickle so this shouldn't break anything.
    Creating empty data structures and then filling them is the way pickle does it for all mutable containers in order to allow self-references (with the exception of sets, of course).

    Since memoization is performed after the object is created but before its state is set, pickling an object's state can contain references to oneself.

        class A:
            pass
    
        a=A()
        s=set([a])
        a.s=s
    
        s_=loads(dumps(s,1))
        next(iter(s_)).s is s_ # True

    Note that this fix only applies for sets, not frozensets. Frozensets are a different matter, because their immutability makes it impossible to set their state. Self-referential frozensets are currently supported in my implementation of pickle4 using a trick similar to what tuples use. But the trick works more easily there because frozensets have their own opcodes, like tuples.

    Also note that applying this patch makes Lib/test/pickletester.py:test_pickle_to_2x fail (DATA3 and DATA6 there contain pickled data of sets, which naturally have changed).
    I'll upload a patch fixing this as well as adding one or more test for sets soon.

    @mstefanro
    Copy link
    Mannequin

    mstefanro mannequin commented Jul 27, 2012

    Attaching patch for fixing a test and adding better testing of sets.

    @avassalotti avassalotti self-assigned this Nov 30, 2013
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants