classification
Title: Optimize marshal format and add version token.
Type: enhancement Stage:
Components: Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: jcea, kristjan.jonsson, pitrou, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2013-10-13 22:30 by kristjan.jonsson, last changed 2022-04-11 14:57 by admin.

Files
File name       Uploaded
marshal.patch   kristjan.jonsson, 2013-10-13 22:30
marshal2.patch  kristjan.jonsson, 2013-10-14 12:30
Messages (19)
msg199818 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-10-13 22:30
Issue #19219 added new tokens that make the marshal format smaller and faster.
This patch adds two new tokens:
TYPE_SHORT_REF for which the ref index is a byte and
TYPE_VERSION for which the operand is the protocol version.

The former helps because it catches common singletons such as 0, 1, () and so on, which typically show up early in a marshal stream.  They then need only two bytes to encode.
This shrinks the code for the decimal.py module from 172K to 162K.
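
To illustrate the saving, here is a minimal sketch of a short reference token next to the existing one. It assumes the existing TYPE_REF token is the tag 'r' followed by a 4-byte little-endian index; the tag value used for TYPE_SHORT_REF and the helper names are hypothetical and are not taken from the patch.

```python
import struct

TYPE_REF = b"r"           # existing tag: followed by a 4-byte index
TYPE_SHORT_REF = b"\x01"  # HYPOTHETICAL tag value, chosen only for this sketch


def encode_ref(index):
    """Encode a back-reference, using the short form when the index fits in a byte."""
    if 0 <= index <= 0xFF:
        return TYPE_SHORT_REF + bytes([index])
    return TYPE_REF + struct.pack("<l", index)


def decode_ref(data):
    """Return the reference index stored in a token produced by encode_ref()."""
    tag, payload = data[:1], data[1:]
    if tag == TYPE_SHORT_REF:
        return payload[0]
    if tag == TYPE_REF:
        return struct.unpack("<l", payload[:4])[0]
    raise ValueError("not a reference token")
```

A reference to one of the first 256 objects then costs two bytes instead of five, which is where the size reduction for early singletons would come from.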


The second can help break backwards compatibility requirements in the future.  The format version (if 4 or larger) is now written into the stream, so that future formats can reassign opcodes if needed.

I don't reassign the version number, leaving it at the new value of 4.  This change is still backwards compatible with the previous '4' so there should be no problem.
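
A rough sketch of how an optional version token could be written and read. The tag byte, the one-byte operand and the function names are assumptions made for illustration only; the patch may encode the operand differently.

```python
TYPE_VERSION = b"\x02"    # HYPOTHETICAL tag value, chosen only for this sketch
FIRST_TAGGED_VERSION = 4  # the token is only written for version 4 or larger
LEGACY_VERSION = 3        # assumed meaning of a stream without the token


def tag_stream(payload, version):
    """Prefix the payload with a version token when the version calls for one."""
    if version >= FIRST_TAGGED_VERSION:
        return TYPE_VERSION + bytes([version]) + payload
    return payload


def split_version(stream):
    """Return (version, remaining bytes); untagged streams count as legacy."""
    if stream[:1] == TYPE_VERSION:
        if len(stream) < 2:
            raise ValueError("truncated version token")
        return stream[1], stream[2:]
    return LEGACY_VERSION, stream
```

Because the token is optional, a reader built this way still accepts older streams, which is what keeps the change backwards compatible. The tag must not collide with any existing type code.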

For size / performance comparison, try:
python.exe -m timeit -s "import decimal; c=compile(open(decimal.__file__).read(), decimal.__file__, 'exec'); import marshal; d=marshal.dumps(c); print(len(d))" "marshal.loads(d)"
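
The same measurement can be scripted. This is a sketch, not part of the patch: it uses a small stand-in source string so that it is self-contained, and the names SAMPLE_SOURCE and measure are made up for the example. To mirror the command above, pass the text of decimal.py instead.

```python
import marshal
import timeit

# Stand-in source; to mirror the command above, read the module source
# instead, e.g. open(decimal.__file__).read().
SAMPLE_SOURCE = """
def area(w, h):
    return w * h

SIZES = [(1, 2), (3, 4), (1, 2)]
TOTAL = sum(area(w, h) for w, h in SIZES)
"""


def measure(source, filename="<sample>", number=1000):
    """Return (marshalled size in bytes, average seconds per marshal.loads call)."""
    code = compile(source, filename, "exec")
    data = marshal.dumps(code)
    total = timeit.timeit(lambda: marshal.loads(data), number=number)
    return len(data), total / number


if __name__ == "__main__":
    size, seconds = measure(SAMPLE_SOURCE)
    print(f"{size} bytes, {seconds * 1e6:.2f} usec per load")
```

Running it on a patched and an unpatched interpreter gives the two numbers to compare.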
msg199819 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-13 22:39
Let's defer the version token until it is needed.
msg199820 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-13 22:42
Good idea to add a version. But it should not be a token, but a mandatory header at the beginning. For example, 16 unsigned bits at the beginning.

Many file formats use a "magic" key, like "MZ" for Windows executables or "GIF" for GIF pictures. What do you think of adding such a magic string (if one does not exist yet; I don't know the marshal format... is it documented somewhere)?
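
For comparison with the token approach, a mandatory header could look roughly like the sketch below. It wraps the existing marshal functions rather than changing the format; the magic string "PYM", the header layout and the function names are invented for this example.

```python
import marshal
import struct

MAGIC = b"PYM"                  # HYPOTHETICAL magic string
HEADER = struct.Struct("<3sH")  # magic + 16-bit unsigned version, little-endian


def dumps_with_header(obj, version=marshal.version):
    """Serialize obj and prefix the data with the magic string and version."""
    return HEADER.pack(MAGIC, version) + marshal.dumps(obj, version)


def loads_with_header(data):
    """Check the header, then deserialize the rest with marshal.loads()."""
    if len(data) < HEADER.size:
        raise ValueError("data too short for header")
    magic, version = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic string")
    if version > marshal.version:
        raise ValueError("unsupported marshal version %d" % version)
    return marshal.loads(data[HEADER.size:])
```

Unlike an optional token, a header like this makes every existing stream unreadable by the new loader, since old data does not start with the magic string.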
msg199821 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-13 22:44
I actually agree with Kristjan that an opcode is the least disruptive
choice here. I also agree it's useful if we want to be able to evolve
the protocol without being tied by backwards compatibility constraints.
msg199822 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-13 22:53
"I actually agree with Kristjan that an opcode is the least disruptive
choice here."

Do you mean that data serialized with Python 3.3 can be read with
Python 3.4? But not the opposite (version token unknown in Python
3.3)?
msg199823 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-13 22:57
> "I actually agree with Kristjan that an opcode is the least disruptive
> choice here."
> 
> Do you mean that data serialized with Python 3.3 can be read with
> Python 3.4? But not the opposite (version token unknown in Python
> 3.3)?

Yes, indeed. I'm not sure it's very important but it's safer in case
people have old "frozen" modules around.
msg199846 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-10-14 07:09
Unlike pickle, the marshal module makes no promises about keeping the format consistent between Python versions.
msg199850 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-14 07:54
The version token is needed only when we want to break backward compatibility (change the meaning of existing opcodes).
msg199856 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-14 09:21
> The version token is needed only when we want to break backward
> compatibility (change the meaning of existing opcodes).

This is true. For example, my first hunch was to make w_long()
emit variable-length data (0-254: one-byte integer; 255: four-byte
integer following the 255 prefix).
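
A small sketch of that variable-length scheme, written as standalone Python helpers rather than as a change to w_long() itself. The names write_long and read_long are made up, and the four-byte form is assumed to be a signed little-endian integer.

```python
import struct


def write_long(value):
    """Encode 0-254 in one byte; anything else as 0xFF plus four bytes."""
    if 0 <= value <= 254:
        return bytes([value])
    return b"\xff" + struct.pack("<l", value)


def read_long(data, pos=0):
    """Decode one integer at pos; return (value, position after it)."""
    first = data[pos]
    if first != 0xFF:
        return first, pos + 1
    (value,) = struct.unpack_from("<l", data, pos + 1)
    return value, pos + 5
```

Existing readers expect four bytes for every such integer, so this encoding changes the meaning of data that older versions would accept. That is the kind of change a version token would make possible.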
msg199867 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-10-14 11:28
Right, the idea of the version token is to introduce it now, as early as possible, even if it is not needed, for prudence.
For example, if version 5 decides to change the semantics of some of the opcodes, we can then support both kinds, in the future.  Read old files _and_ the new ones.

Despite the fact that we claim that we don't guarantee interoperability, in reality it is very desirable.  During development, for instance. Frozen modules, .pyc files, and all that.
msg199869 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-14 11:31
Should we support serialization in an older version? For example,
disable the new SMALL and ASCII tokens?
msg199873 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-14 11:46
> Should we support serialization in an older version? For example,
> disable the new SMALL and ASCII tokens?

It is officially supported with the "version" parameter:
http://docs.python.org/3.4/library/marshal.html#marshal.dump
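
A short example of that parameter, using only the documented marshal.dumps(value, version) call. The sample value is arbitrary.

```python
import marshal

value = ("spam", "eggs", 12345, ("spam", "eggs"))

# Ask for an older format explicitly, then use the interpreter's current one.
old = marshal.dumps(value, 2)
new = marshal.dumps(value, marshal.version)

# Both streams load back to the same object on the current interpreter.
assert marshal.loads(old) == value
assert marshal.loads(new) == value

print(len(old), len(new))
```

Passing a lower version keeps the writer from emitting tokens that an older reader would not know.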
msg199874 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-10-14 11:47
We have done so previously and should continue to do that.  Thanks for pointing out that the new SHORT_REF needs that fix :)
msg199878 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-14 12:10
> Right, the idea of the version token is to introduce it now, as early as possible, even if it is not needed, for prudence.

How will it help? If we break backward compatibility in version 10, then we will introduce the version token for version 10 and larger. If a file does not contain the version token, then it is compatible with version 9.
msg199879 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-10-14 12:30
Only output TYPE_SHORT_REF for version >= 4
msg199880 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-10-14 12:38
Quoting myself: "for prudence."
We probably should have had this from the beginning.  Adding this now makes it easier to make such changes in future, because then you don't have to re-invent the versioning mechanism.
msg199882 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-14 12:43
What is the difference between introducing the version token right now and introducing it only when (if?) we need it?
msg199885 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-10-14 13:45
Never put off to tomorrow what you can do today :)
msg235027 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-30 07:21
The largest modules in the stdlib have up to 800 references. So it would be worthwhile to have 4 opcodes for short (10-bit) ref indices.
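
One way to read this suggestion: the choice among four opcodes carries the top two bits of the index and a single operand byte carries the low eight. The sketch below assumes that layout; the opcode values and function names are hypothetical.

```python
SHORT_REF_BASE = 0x01  # HYPOTHETICAL: four consecutive opcodes 0x01..0x04


def encode_short_ref(index):
    """Encode a 10-bit reference index as opcode + one operand byte."""
    if not 0 <= index < 1024:
        raise ValueError("index does not fit in 10 bits")
    return bytes([SHORT_REF_BASE + (index >> 8), index & 0xFF])


def decode_short_ref(token):
    """Recover the index from a two-byte token made by encode_short_ref()."""
    opcode, low = token
    high = opcode - SHORT_REF_BASE
    if not 0 <= high < 4:
        raise ValueError("not a short reference opcode")
    return (high << 8) | low
```

With this layout every reference up to index 1023 still takes two bytes, which would cover the 800 references mentioned above.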
History
Date                 User              Action  Args
2022-04-11 14:57:51  admin             set     github: 63455
2015-01-30 07:21:10  serhiy.storchaka  set     messages: + msg235027
2014-10-14 15:54:15  skrah             set     nosy: - skrah
2013-10-15 10:31:36  jcea              set     nosy: + jcea
2013-10-14 13:45:49  kristjan.jonsson  set     messages: + msg199885
2013-10-14 12:43:12  serhiy.storchaka  set     messages: + msg199882
2013-10-14 12:38:36  kristjan.jonsson  set     messages: + msg199880
2013-10-14 12:30:06  kristjan.jonsson  set     files: + marshal2.patch; messages: + msg199879
2013-10-14 12:10:37  serhiy.storchaka  set     messages: + msg199878
2013-10-14 11:47:04  kristjan.jonsson  set     messages: + msg199874
2013-10-14 11:46:08  pitrou            set     messages: + msg199873
2013-10-14 11:31:55  vstinner          set     messages: + msg199869
2013-10-14 11:28:20  kristjan.jonsson  set     messages: + msg199867
2013-10-14 09:21:28  pitrou            set     messages: + msg199856
2013-10-14 07:54:14  serhiy.storchaka  set     messages: + msg199850
2013-10-14 07:09:22  rhettinger        set     nosy: + rhettinger; messages: + msg199846
2013-10-13 22:57:10  pitrou            set     messages: + msg199823
2013-10-13 22:53:12  vstinner          set     messages: + msg199822
2013-10-13 22:44:34  pitrou            set     messages: + msg199821
2013-10-13 22:42:37  vstinner          set     messages: + msg199820
2013-10-13 22:39:07  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg199819
2013-10-13 22:34:20  pitrou            set     nosy: + skrah
2013-10-13 22:30:08  kristjan.jonsson  create