Author jakemcguire
Recipients jakemcguire
Date 2009-01-27.21:52:16
Message-id <1233093140.55.0.0122548653801.issue5084@psf.upfronthosting.co.za>
In-reply-to
Content
Instance attribute names are normally interned; this is done in 
PyObject_SetAttr (among other places).  Unpickling (in both pickle and 
cPickle) updates __dict__ on the instance object directly, which 
bypasses the interning.  You end up with many copies of the strings 
representing your attribute names, which wastes a lot of space, both in 
RAM and in pickles of sequences of objects that were themselves created 
from pickles (the pickler only shares strings that are the same object, 
not strings that merely compare equal).  Note that the native Python 
memcached client uses pickle to serialize objects.

>>> import pickle
>>> class C(object):
...   def __init__(self, x):
...     self.long_attribute_name = x
...
>>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None),
...     pickle.HIGHEST_PROTOCOL)) for i in range(100)],
...     pickle.HIGHEST_PROTOCOL))
3658
>>> len(pickle.dumps([C(None) for i in range(100)],
...     pickle.HIGHEST_PROTOCOL))
1441
>>>
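
The size difference above comes from the pickler's memo, which is keyed 
on object identity.  A minimal sketch of that effect in isolation, 
written for Python 3 (where intern lives at sys.intern; on 2.x it is the 
builtin intern).  The names copies and shared are only illustrative, and 
the exact byte counts depend on the interpreter and protocol, so none 
are shown:

```python
import pickle
import sys

# Build 100 equal but distinct string objects. Joining two pieces at run
# time produces a fresh, non-interned str on every iteration.
copies = ["".join(["long_", "attribute_name"]) for _ in range(100)]

# Map every copy onto the single interned object with the same value.
shared = [sys.intern(s) for s in copies]

# The pickler writes a memoized object once and emits short back-references
# afterwards, so the list of shared strings serializes to fewer bytes.
size_copies = len(pickle.dumps(copies, pickle.HIGHEST_PROTOCOL))
size_shared = len(pickle.dumps(shared, pickle.HIGHEST_PROTOCOL))

print(size_copies, size_shared)
```

The two lists compare equal; only the identity of their elements 
differs, which is the same situation as attribute names that were 
unpickled without interning.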

Interning the strings on unpickling makes the pickles smaller and, at 
least for cPickle, actually makes unpickling sequences of many objects 
slightly faster.  I have included proposed patches to cPickle.c and 
pickle.py, and would appreciate any feedback.
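
Until something like this is in the unpickler itself, the same effect 
can be approximated per class.  The following is a rough sketch of such 
a workaround, not the attached patches: it assumes Python 3 
(sys.intern), a plain instance __dict__ as the pickled state, and a 
hypothetical helper roundtrip used only for illustration.

```python
import pickle
import sys


class C(object):
    def __init__(self, x):
        self.long_attribute_name = x

    def __setstate__(self, state):
        # Called by the unpickler with the pickled instance dict.
        # Intern each string key before storing it, so all instances
        # share one string object per attribute name.
        for key, value in state.items():
            if type(key) is str:
                key = sys.intern(key)
            self.__dict__[key] = value


def roundtrip(obj):
    # Hypothetical helper: pickle an object and load it straight back.
    return pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
```

With this in place, every C instance returned by roundtrip(C(None)) 
should hold the same key object in its __dict__, so pickling a list of 
them can reuse the memoized attribute name instead of writing it once 
per instance.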