
Author kristjan.jonsson
Recipients christian.heimes, gregory.p.smith, kristjan.jonsson, loewis, pitrou, serhiy.storchaka
Date 2012-11-19.16:18:20
Message-id <1353341901.46.0.734791005055.issue16475@psf.upfronthosting.co.za>
In-reply-to
Content
Ok, I did some tests with my recode module.  The following are the sizes of the marshal data:

test2To3 ... 24748 24748 212430 212430
test3To3 ... 18420 17848 178969 174806
test4To3 ... 18425 18411 178969 178550

The columns:
a) test_marshal.py without transform
b) test_marshal.py with recode.intern() (folding common objects)
c) and d) the same two measurements for the decimal.py module (the largest one in lib)

The lines:
1) Version 2 of the protocol.
2) Version 3 of the protocol (object instancing and the works)
3) Version 4 (a dummy version that only instances strings)

As expected, there is essentially no difference between versions 3 and 4 unless I employ the recode module to fold common subobjects.  Folding brings an additional saving of some 3%, bringing the total reduction relative to version 2 up to 28% and 18% respectively.
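The percentages can be rechecked from the table with a few lines of Python. This is only a sketch of the arithmetic: the byte counts are copied from the version 2 and version 3 rows above, while the dictionary layout and the `reduction` helper are names made up for this illustration.

```python
# Sizes in bytes, copied from the version 2 and version 3 rows of the table.
sizes = {
    "test_marshal.py": {"v2": 24748, "v3": 18420, "v3_folded": 17848},
    "decimal.py": {"v2": 212430, "v3": 178969, "v3_folded": 174806},
}


def reduction(before, after):
    """Percentage saved when going from `before` bytes to `after` bytes."""
    return 100.0 * (before - after) / before


for name, s in sizes.items():
    total = reduction(s["v2"], s["v3_folded"])   # version 2 -> version 3 + folding
    extra = reduction(s["v3"], s["v3_folded"])   # effect of folding alone
    print(f"{name}: total {total:.1f}%, folding alone {extra:.1f}%")
```

Rounded to whole percent, the totals come out at the 28% and 18% quoted above.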

Note that the transform is a simple recursive folding of objects.  Common argument lists, such as (self), are subject to this.  No renaming of local variables or other stripping is performed.
So, although the "recode" module is a work in progress and not the subject of this "defect", its use shows how important it is to support proper instancing in serialization protocols.
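To make the idea of folding concrete, here is a minimal sketch. It is not the recode module, which is not shown in this message; `fold` is a hypothetical helper written for this illustration, and it only handles tuples and a few scalar types. It relies on the instancing that marshal version 3 provides in current CPython releases.

```python
import marshal


def fold(obj, seen):
    """Hypothetical helper: return one shared instance per distinct value."""
    if isinstance(obj, tuple):
        # Fold the children first, then key the tuple on their identities.
        items = tuple(fold(item, seen) for item in obj)
        key = (tuple, tuple(id(item) for item in items))
        return seen.setdefault(key, items)
    if isinstance(obj, (str, bytes, int, float)):
        # repr() keeps 1 and 1.0 (and 0.0 and -0.0) apart.
        return seen.setdefault((type(obj), repr(obj)), obj)
    return obj  # anything else is left untouched


# Three equal but distinct argument tuples, built at runtime so that the
# compiler cannot merge them for us.
arg_lists = [tuple(["self", "other"]) for _ in range(3)]

seen = {}
folded = [fold(args, seen) for args in arg_lists]

plain_size = len(marshal.dumps(arg_lists, 3))
folded_size = len(marshal.dumps(folded, 3))
print(plain_size, folded_size)
```

After folding, all three list entries are the same tuple object, so a version 3 stream can write it once and refer back to it, which is where the size saving comes from. With version 2 there is no such back-reference, so folding has nothing to exploit.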

Implementation note:  The trick of using a bit flag on the type to indicate a slot reservation in the instance list is one that has been in use in CCP's own "Marshal" format, a proprietary serialization format based on marshal back in 2002 (adding many more special opcodes and other stuff).

Serhiy: There is no reason _not_ to reuse INT objects if we are doing it for other immutables too.  As you note, the size of the data is the same.  This will ensure that integers that are not cached can be folded into the same object, e.g. the value 123, if used in two functions, can be the same int object.
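The effect on integers can be observed with the marshal module as shipped in current CPython releases. The sketch below is an illustration under stated assumptions: it uses 123456 rather than 123, because CPython keeps a cache of small integers that would hide the effect, and the variable names are invented for the example.

```python
import marshal

n = int("123456")   # built at runtime, outside CPython's small-int cache
m = int("123456")   # equal value, but a separate object

# One int object referenced twice versus two equal int objects.
shared_bytes = marshal.dumps([n, n], 3)
separate_bytes = marshal.dumps([n, m], 3)

shared = marshal.loads(shared_bytes)
separate = marshal.loads(separate_bytes)

print(len(shared_bytes) == len(separate_bytes))  # is the data size the same?
print(shared[0] is shared[1])                    # identity after loading
print(separate[0] is separate[1])
```

The point is that the gain for integers is not in bytes on disk but in object identity after loading: a reference to an earlier int costs as much as writing the int again, yet the loaded list ends up holding one object instead of two.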

I should also point out that the marshal protocol takes care to be able to serialize lists, sets and frozensets correctly, the latter being added in version 2.4.  This despite the fact that code objects don't make use of these.
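As a quick check of that claim, the following sketch round-trips a list containing a set and a frozenset through marshal. The sample value is arbitrary and chosen only for the illustration.

```python
import marshal

# A list holding a set and a frozenset, none of which appear in code objects.
value = [1, 2, {3, 4}, frozenset({"a", "b"})]

restored = marshal.loads(marshal.dumps(value))

print(restored == value)
print(type(restored[2]).__name__, type(restored[3]).__name__)
```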