Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unpickling does not intern attribute names #49334

Closed
jakemcguire mannequin opened this issue Jan 27, 2009 · 16 comments
Closed

unpickling does not intern attribute names #49334

jakemcguire mannequin opened this issue Jan 27, 2009 · 16 comments
Assignees
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@jakemcguire
Copy link
Mannequin

jakemcguire mannequin commented Jan 27, 2009

BPO 5084
Nosy @jcea, @pitrou, @avassalotti
Files
  • pickle.py.diff: diff to pickle to intern attribute names
  • cPickle.c.diff: take three.
  • pickletester.diff: add a unit test to verify attribute names are interned
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2009-05-02.21:42:49.807>
    created_at = <Date 2009-01-27.21:52:19.380>
    labels = ['library', 'performance']
    title = 'unpickling does not intern attribute names'
    updated_at = <Date 2009-05-02.21:42:49.789>
    user = 'https://bugs.python.org/jakemcguire'

    bugs.python.org fields:

    activity = <Date 2009-05-02.21:42:49.789>
    actor = 'pitrou'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2009-05-02.21:42:49.807>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2009-01-27.21:52:19.380>
    creator = 'jakemcguire'
    dependencies = []
    files = ['12880', '13178', '13835']
    hgrepos = []
    issue_num = 5084
    keywords = ['patch']
    message_count = 16.0
    messages = ['80670', '80678', '80688', '80689', '80690', '80918', '82686', '82690', '82691', '82692', '82720', '86919', '86920', '86930', '86939', '86982']
    nosy_count = 7.0
    nosy_names = ['collinwinter', 'jcea', 'ggenellina', 'pitrou', 'alexandre.vassalotti', 'jyasskin', 'jakemcguire']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue5084'
    versions = ['Python 3.1', 'Python 2.7']

    @jakemcguire
    Copy link
    Mannequin Author

    jakemcguire mannequin commented Jan 27, 2009

    Instance attribute names are normally interned - this is done in
    PyObject_SetAttr (among other places). Unpickling (in pickle and
    cPickle) directly updates __dict__ on the instance object. This
    bypasses the interning so you end up with many copies of the strings
    representing your attribute names, which wastes a lot of space, both in
    RAM and in pickles of sequences of objects created from pickles. Note
    that the native python memcached client uses pickle to serialize
    objects.

    >>> import pickle
    >>> class C(object):
    ...   def __init__(self, x):
    ...     self.long_attribute_name = x
    ...
    >>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None), 
    pickle.HIGHEST_PROTOCOL)) for i in range(100)], 
    pickle.HIGHEST_PROTOCOL))
    3658
    >>> len(pickle.dumps([C(None) for i in range(100)], 
    pickle.HIGHEST_PROTOCOL))
    1441
    >>>

    Interning the strings on unpickling makes the pickles smaller, and at
    least for cPickle actually makes unpickling sequences of many objects
    slightly faster. I have included proposed patches to cPickle.c and
    pickle.py, and would appreciate any feedback.

    @jakemcguire jakemcguire mannequin added stdlib Python modules in the Lib dir performance Performance or resource usage labels Jan 27, 2009
    @ggenellina
    Copy link
    Mannequin

    ggenellina mannequin commented Jan 27, 2009

    Either my browser got crazy, or you uploaded the same patch (.py) twice.

    @avassalotti
    Copy link
    Member

    The patch for cPickle doesn't do the same thing as the pickle one. In
    the cPickle one, you are only interning slot attributes, which, I
    believe, is not what you intended. :-)

    @avassalotti avassalotti self-assigned this Jan 28, 2009
    @jakemcguire
    Copy link
    Mannequin Author

    jakemcguire mannequin commented Jan 28, 2009

    Are you sure? I may not have enough context in my diff, which I
    should have done against an anonymous svn checkout, but it looks like
    the slot attributes get set several lines after my diff. "while
    (PyDict_Next(slotstate, ...))" as opposed to the "while
    (PyDict_Next(state, ...))" in my change...

    -jake

    On Jan 27, 2009, at 6:54 PM, Alexandre Vassalotti wrote:

    Alexandre Vassalotti <alexandre@peadrop.com> added the comment:

    The patch for cPickle doesn't do the same thing as the pickle one. In
    the cPickle one, you are only interning slot attributes, which, I
    believe, is not what you intended. :-)

    ----------
    assignee: -> alexandre.vassalotti
    nosy: +alexandre.vassalotti


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue5084\>


    @avassalotti
    Copy link
    Member

    Oh, you are right. I was looking at py3k's version of pickle, which uses
    PyDict_Update instead of a while loop; that is why I got confused.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 1, 2009

    Why do you call PyString_AsString() followed by PyString_FromString()?
    Strings are immutable so you shouldn't neek to take a copy.

    Besides, it would be nice to add a test.

    @jakemcguire
    Copy link
    Mannequin Author

    jakemcguire mannequin commented Feb 24, 2009

    The fromstring/asstring dance was due to my incomplete understanding of
    refcounting. PyDict_Next returns a borrowed reference but
    PyString_InternInPlace expects an owned reference. Thanks to Kirk
    McDonald, I have a new patch that does the refcounting correctly.

    What sort of test did you have in mind?

    @pitrou
    Copy link
    Member

    pitrou commented Feb 25, 2009

    The test should check that the unpickled strings have been interned.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 25, 2009

    In your patch, I'm not sure where the name variable is coming from.
    Have you checked it works?

    @pitrou
    Copy link
    Member

    pitrou commented Feb 25, 2009

    To give an example of what the test could check:

    >>> class C(object):
    ...  def __init__(self):
    ...    self.some_long_attribute_name = 5
    ... 
    >>> c = C()
    >>> c.__dict__
    {'some_long_attribute_name': 5}
    >>> sorted(map(id, c.__dict__))
    [140371243499696]
    >>> import pickle
    >>> d = pickle.loads(pickle.dumps(c, -1))
    >>> d
    <__main__.C object at 0x7faaba1b0390>
    >>> d.__dict__
    {'some_long_attribute_name': 5}
    >>> sorted(map(id, d.__dict__))
    [140371243501232]

    The sorted(map(id, d.__dict__)) should have been the same before and
    after pickling.

    @jakemcguire
    Copy link
    Mannequin Author

    jakemcguire mannequin commented Feb 25, 2009

    Ugh. Clearly I didn't check to see if it worked or not, as it didn't even
    compile. A new diff, tested and verified to work, is attached. I'll work
    on creating a test.

    @pitrou
    Copy link
    Member

    pitrou commented May 1, 2009

    Jake, are you string working on a test?

    @pitrou
    Copy link
    Member

    pitrou commented May 1, 2009

    (s/string/still/, of course...)

    @jakemcguire
    Copy link
    Mannequin Author

    jakemcguire mannequin commented May 2, 2009

    This fell through the cracks. But the following unit test seems like it
    does the right thing (fails with the old module, works with the new ones).

    @pitrou
    Copy link
    Member

    pitrou commented May 2, 2009

    Thanks, I'll take a look very soon.

    @pitrou pitrou assigned pitrou and unassigned avassalotti May 2, 2009
    @pitrou
    Copy link
    Member

    pitrou commented May 2, 2009

    Committed in r72223, r72224. Thanks!

    @pitrou pitrou closed this as completed May 2, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants