classification
Title: unpickling does not intern attribute names
Type: resource usage Stage: patch review
Components: Library (Lib) Versions: Python 3.1, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: pitrou Nosy List: alexandre.vassalotti, collinwinter, ggenellina, jakemcguire, jcea, jyasskin, pitrou
Priority: normal Keywords: patch

Created on 2009-01-27 21:52 by jakemcguire, last changed 2009-05-02 21:42 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
pickle.py.diff jakemcguire, 2009-01-27 21:52 diff to pickle to intern attribute names
cPickle.c.diff jakemcguire, 2009-02-25 21:17 take three.
pickletester.diff jakemcguire, 2009-05-02 02:38 add a unit test to verify attribute names are interned
Messages (16)
msg80670 - (view) Author: Jake McGuire (jakemcguire) Date: 2009-01-27 21:52
Instance attribute names are normally interned - this is done in 
PyObject_SetAttr (among other places).  Unpickling (in pickle and 
cPickle) directly updates __dict__ on the instance object.  This 
bypasses the interning so you end up with many copies of the strings 
representing your attribute names, which wastes a lot of space, both in 
RAM and in pickles of sequences of objects created from pickles.  Note 
that the native python memcached client uses pickle to serialize 
objects.

>>> import pickle
>>> class C(object):
...   def __init__(self, x):
...     self.long_attribute_name = x
...
>>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None), 
pickle.HIGHEST_PROTOCOL)) for i in range(100)], 
pickle.HIGHEST_PROTOCOL))
3658
>>> len(pickle.dumps([C(None) for i in range(100)], 
pickle.HIGHEST_PROTOCOL))
1441
>>>

Interning the strings on unpickling makes the pickles smaller, and at 
least for cPickle actually makes unpickling sequences of many objects 
slightly faster.  I have included proposed patches to cPickle.c and 
pickle.py, and would appreciate any feedback.
msg80678 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-27 23:30
Either my browser got crazy, or you uploaded the same patch (.py) twice.
msg80688 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-28 02:54
The patch for cPickle doesn't do the same thing as the pickle one. In
the cPickle one, you are only interning slot attributes, which, I
believe, is not what you intended. :-)
msg80689 - (view) Author: Jake McGuire (jakemcguire) Date: 2009-01-28 03:00
Are you sure?  I may not have enough context in my diff, which I  
should have done against an anonymous svn checkout, but it looks like  
the slot attributes get set several lines after my diff.  "while  
(PyDict_Next(slotstate, ...))" as opposed to the "while  
(PyDict_Next(state, ...))" in my change...

-jake

On Jan 27, 2009, at 6:54 PM, Alexandre Vassalotti wrote:

>
> Alexandre Vassalotti <alexandre@peadrop.com> added the comment:
>
> The patch for cPickle doesn't do the same thing as the pickle one. In
> the cPickle one, you are only interning slot attributes, which, I
> believe, is not what you intended. :-)
>
> ----------
> assignee:  -> alexandre.vassalotti
> nosy: +alexandre.vassalotti
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue5084>
> _______________________________________
msg80690 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2009-01-28 03:03
Oh, you are right. I was looking at py3k's version of pickle, which uses
PyDict_Update instead of a while loop; that is why I got confused.
msg80918 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-01 20:53
Why do you call PyString_AsString() followed by PyString_FromString()?
Strings are immutable so you shouldn't neek to take a copy.

Besides, it would be nice to add a test.
msg82686 - (view) Author: Jake McGuire (jakemcguire) Date: 2009-02-24 23:32
The fromstring/asstring dance was due to my incomplete understanding of 
refcounting.  PyDict_Next returns a borrowed reference but 
PyString_InternInPlace expects an owned reference.  Thanks to Kirk 
McDonald, I have a new patch that does the refcounting correctly.

What sort of test did you have in mind?
msg82690 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-25 02:03
The test should check that the unpickled strings have been interned.
msg82691 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-25 02:07
In your patch, I'm not sure where the `name` variable is coming from.
Have you checked it works?
msg82692 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-25 02:11
To give an example of what the test could check:

>>> class C(object):
...  def __init__(self):
...    self.some_long_attribute_name = 5
... 
>>> c = C()
>>> c.__dict__
{'some_long_attribute_name': 5}
>>> sorted(map(id, c.__dict__))
[140371243499696]
>>> import pickle
>>> d = pickle.loads(pickle.dumps(c, -1))
>>> d
<__main__.C object at 0x7faaba1b0390>
>>> d.__dict__
{'some_long_attribute_name': 5}
>>> sorted(map(id, d.__dict__))
[140371243501232]

The `sorted(map(id, d.__dict__))` should have been the same before and
after pickling.
msg82720 - (view) Author: Jake McGuire (jakemcguire) Date: 2009-02-25 21:17
Ugh.  Clearly I didn't check to see if it worked or not, as it didn't even 
compile.  A new diff, tested and verified to work, is attached.  I'll work 
on creating a test.
msg86919 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-01 21:54
Jake, are you string working on a test?
msg86920 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-01 21:55
(s/string/still/, of course...)
msg86930 - (view) Author: Jake McGuire (jakemcguire) Date: 2009-05-02 02:38
This fell through the cracks.  But the following unit test seems like it 
does the right thing (fails with the old module, works with the new ones).
msg86939 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-02 11:41
Thanks, I'll take a look very soon.
msg86982 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-05-02 21:42
Committed in r72223, r72224. Thanks!
History
Date User Action Args
2009-05-02 21:42:49pitrousetstatus: open -> closed
resolution: fixed
messages: + msg86982
2009-05-02 11:41:39pitrousetassignee: alexandre.vassalotti -> pitrou
messages: + msg86939
2009-05-02 11:40:40pitrousetstage: needs patch -> patch review
2009-05-02 02:38:05jakemcguiresetfiles: + pickletester.diff

messages: + msg86930
2009-05-01 21:55:32pitrousetmessages: + msg86920
2009-05-01 21:54:58pitrousetpriority: normal

stage: needs patch
messages: + msg86919
versions: + Python 3.1, Python 2.7
2009-02-25 21:19:13jakemcguiresetfiles: - cPickle.c.diff
2009-02-25 21:19:08jakemcguiresetfiles: - cPickle.c.diff
2009-02-25 21:17:28jakemcguiresetfiles: + cPickle.c.diff
messages: + msg82720
2009-02-25 02:11:36pitrousetmessages: + msg82692
2009-02-25 02:07:01pitrousetmessages: + msg82691
2009-02-25 02:03:09pitrousetmessages: + msg82690
2009-02-24 23:32:52jakemcguiresetfiles: + cPickle.c.diff
messages: + msg82686
2009-02-03 02:20:50collinwintersetnosy: + collinwinter, jyasskin
2009-02-01 20:53:38pitrousetnosy: + pitrou
messages: + msg80918
2009-01-29 10:56:14jceasetnosy: + jcea
2009-01-28 03:03:10alexandre.vassalottisetmessages: + msg80690
2009-01-28 03:00:02jakemcguiresetmessages: + msg80689
2009-01-28 02:54:53alexandre.vassalottisetassignee: alexandre.vassalotti
messages: + msg80688
nosy: + alexandre.vassalotti
2009-01-27 23:39:02jakemcguiresetfiles: + cPickle.c.diff
2009-01-27 23:38:46jakemcguiresetfiles: - cPickle.c.diff
2009-01-27 23:30:05ggenellinasetnosy: + ggenellina
messages: + msg80678
2009-01-27 21:52:58jakemcguiresetfiles: + pickle.py.diff
2009-01-27 21:52:19jakemcguirecreate