msg199818 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) * |
Date: 2013-10-13 22:30 |
Issue #19219 added new tokens, making the marshal format smaller and faster.
This patch adds two new tokens:
TYPE_SHORT_REF for which the ref index is a byte and
TYPE_VERSION for which the operand is the protocol version.
The former helps because it catches common singletons such as 0, 1, () and so on, which typically show up early in a marshal stream. Each later reference to them then needs only two bytes to encode.
This shrinks the code for the decimal.py module from 172K to 162K.
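The two-byte claim can be illustrated with a small sketch. It assumes the existing reference opcode is 'r' followed by a 4-byte little-endian index, as in marshal.c at the time; the byte value chosen for TYPE_SHORT_REF and the helper names are hypothetical and are not taken from the patch.

```python
import struct

TYPE_REF = b"r"        # existing opcode: followed by a 4-byte little-endian index
TYPE_SHORT_REF = b"R"  # hypothetical byte value for the proposed opcode


def encode_ref(index: int) -> bytes:
    """Encode a back-reference, using the short form when the index fits in a byte."""
    if index < 0:
        raise ValueError("reference index must be non-negative")
    if index < 256:
        return TYPE_SHORT_REF + bytes([index])      # 2 bytes in total
    return TYPE_REF + struct.pack("<i", index)      # 5 bytes in total


def decode_ref(data: bytes) -> int:
    """Return the reference index stored at the start of data."""
    opcode = data[:1]
    if opcode == TYPE_SHORT_REF:
        return data[1]
    if opcode == TYPE_REF:
        return struct.unpack("<i", data[1:5])[0]
    raise ValueError("not a reference opcode")
```

A reference to one of the first 256 objects costs two bytes instead of five, which is where the saving for early singletons would come from.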
The second can help break backwards compatibility requirements in the future. The format (if 4 or larger) is now put into the stream, so that future new formats can re-assign opcodes if needed.
I don't reassign the version number, leaving it at the new value of 4. This change is still backwards compatible with the previous '4' so there should be no problem.
For size / performance comparison, try:
python.exe -m timeit -s "import decimal; c=compile(open(decimal.__file__).read(), decimal.__file__, 'exec'); import marshal; d=marshal.dumps(c); print(len(d))" "marshal.loads(d)"
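The same measurement can be written as a short script, which may be easier to adapt. This is only a sketch: the helper name `measure` and its parameters are made up for illustration, and the timing covers nothing but `marshal.loads`.

```python
import marshal
import timeit


def measure(source, filename="<sample>", number=1000):
    """Return (marshalled size in bytes, average seconds per marshal.loads call)."""
    code = compile(source, filename, "exec")
    data = marshal.dumps(code)
    total = timeit.timeit(lambda: marshal.loads(data), number=number)
    return len(data), total / number


if __name__ == "__main__":
    # Small inline sample; pass the text of a real module to compare builds.
    sample = "a = (0, 1, ())\nb = (0, 1, ())\n"
    size, per_call = measure(sample, number=100)
    print(size, per_call)
```

Passing the contents of decimal.py as `source` and `decimal.__file__` as `filename` should correspond to the one-liner above.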
|
msg199819 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-10-13 22:39 |
Let's defer the version token until it is needed.
|
msg199820 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2013-10-13 22:42 |
Good idea to add a version. But it should not be a token, but a mandatory header at the beginning. For example, 16 unsigned bits at the beginning.
Many file formats use a "magic" key, like "MZ" for Windows executables or "GIF" for GIF pictures. What do you think of adding such a magic string (if it does not exist yet; I don't know the marshal format... is it documented somewhere)?
|
msg199821 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-10-13 22:44 |
I actually agree with Kristjan that an opcode is the least disruptive
choice here. I also agree it's useful if we want to be able to evolve
the protocol without being tied by backwards compatibility constraints.
|
msg199822 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2013-10-13 22:53 |
"I actually agree with Kristjan that an opcode is the least disruptive
choice here."
Do you mean that data serialized with python 3.3 can be read with
python 3.4? But not the opposite (version token unknown in Python
3.3)?
|
msg199823 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-10-13 22:57 |
> "I actually agree with Kristjan that an opcode is the least disruptive
> choice here."
>
> Do you mean that data serialized with python 3.3 can be read with
> python 3.4? But not the opposite (version token unknown in Python
> 3.3)?
Yes, indeed. I'm not sure it's very important but it's safer in case
people have old "frozen" modules around.
|
msg199846 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2013-10-14 07:09 |
Unlike pickle, the marshal module makes no promises about keeping the format consistent between Python versions.
|
msg199850 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-10-14 07:54 |
The version token is needed only when we want to break backward compatibility (change the meaning of existing opcodes).
|
msg199856 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-10-14 09:21 |
> The version token is needed only when we want to break backward
> compatibility (change the meaning of existing opcodes).
This is true. For example, my first hunch was to make w_long()
emit variable-length data (0-254: a one-byte integer; 255: a four-byte
integer following the 255 prefix).
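A minimal sketch of that variable-length scheme follows. It assumes unsigned 32-bit values and little-endian byte order for simplicity (the real w_long() writes signed 32-bit values); the function names are illustrative only.

```python
import struct


def write_varint(value: int) -> bytes:
    """Encode 0-254 as one byte; larger values as 0xFF plus four bytes."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value out of range for this sketch")
    if value < 255:
        return bytes([value])
    return b"\xff" + struct.pack("<I", value)


def read_varint(data: bytes, pos: int = 0):
    """Decode one value starting at pos; return (value, next position)."""
    first = data[pos]
    if first < 255:
        return first, pos + 1
    (value,) = struct.unpack_from("<I", data, pos + 1)
    return value, pos + 5
```

Because this changes how every integer operand is laid out, an old reader could not parse the result, which is the kind of change a version marker would have to announce.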
|
msg199867 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) * |
Date: 2013-10-14 11:28 |
Right, the idea of the version token is to introduce it now, as early as possible, even if it is not needed, for prudence.
For example, if version 5 decides to change the semantics of some of the opcodes, we can then support both kinds, in the future. Read old files _and_ the new ones.
Despite the fact that we claim that we don't guarantee interoperability, in reality it is very desirable. During development, for instance. Frozen modules, .pyc files, and all that.
|
msg199869 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2013-10-14 11:31 |
Should we support serialization in an older version? For example,
disable the new SMALL and ASCII tokens?
|
msg199873 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-10-14 11:46 |
> Should we support serialization in an older version? For example,
> disable the new SMALL and ASCII tokens?
It is officially supported with the "version" parameter:
http://docs.python.org/3.4/library/marshal.html#marshal.dump
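As a quick illustration of the "version" parameter, the following sketch serializes one value with every format version the running interpreter supports and checks that each result loads back. The helper name and the sample value are arbitrary.

```python
import marshal


def dumps_all_versions(value):
    """Map each supported format version to the bytes it produces for value."""
    return {v: marshal.dumps(value, v) for v in range(marshal.version + 1)}


sample = ("spam", "spam", 0, 1, (), [1, 2, 3])
for version, blob in dumps_all_versions(sample).items():
    # Every version should round-trip; only the encoding (and size) differs.
    assert marshal.loads(blob) == sample
    print(version, len(blob))
```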
|
msg199874 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) * |
Date: 2013-10-14 11:47 |
We have done so previously and should continue to do that. Thanks for pointing out that the new SHORT_REF needs that fix :)
|
msg199878 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-10-14 12:10 |
> Right, the idea of the version token is to introduce it now, as early as possible, even if it is not needed, for prudence.
How will it help? If we break backward compatibility in version 10, then we will introduce the version token for version 10 and later. If a file does not contain the version token, then it is compatible with version 9.
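The fallback rule described here can be sketched as a reader that looks for an optional leading version token. The opcode byte, the one-byte operand, and the function name are all hypothetical; the sketch only works if that byte was never a valid first opcode in older streams.

```python
TYPE_VERSION = b"v"  # hypothetical opcode byte, assumed unused by older formats


def split_version(data: bytes, default_version: int):
    """Return (format version, remaining payload).

    A stream without the token is treated as default_version, i.e. the
    last format that existed before the token was introduced.
    """
    if len(data) >= 2 and data[:1] == TYPE_VERSION:
        return data[1], data[2:]
    return default_version, data
```

With this rule the token can be introduced at any later version and older streams still read correctly, as long as the reader knows which version to assume when the token is absent.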
|
msg199879 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) * |
Date: 2013-10-14 12:30 |
Only output TYPE_SHORT_REF for version >= 4
|
msg199880 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) * |
Date: 2013-10-14 12:38 |
Quoting myself: "for prudence."
We probably should have had this from the beginning. Adding this now makes it easier to make such changes in future, because then you don't have to re-invent the versioning mechanism.
|
msg199882 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2013-10-14 12:43 |
What is the difference between introducing the version token right now and introducing it only when (if?) we need it?
|
msg199885 - (view) |
Author: Kristján Valur Jónsson (kristjan.jonsson) * |
Date: 2013-10-14 13:45 |
Never put off to tomorrow what you can do today :)
|
msg235027 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2015-01-30 07:21 |
The largest modules in the stdlib have up to 800 references, so it would be worthwhile to have 4 opcodes for short (10-bit) ref indices.
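One way to read this suggestion: four consecutive opcodes carry the top two bits of the index and one operand byte carries the low eight, giving 4 x 256 = 1024 indices in two bytes. The sketch below assumes that layout; the opcode base value and function names are hypothetical.

```python
SHORT_REF_BASE = 0x01  # hypothetical first of four consecutive opcode values


def encode_short_ref(index: int) -> bytes:
    """Pack a 10-bit index: high 2 bits select the opcode, low 8 bits are the operand."""
    if not 0 <= index < 1024:
        raise ValueError("index does not fit in 10 bits")
    return bytes([SHORT_REF_BASE + (index >> 8), index & 0xFF])


def decode_short_ref(data: bytes) -> int:
    """Recover the index from an opcode byte and its one-byte operand."""
    high = data[0] - SHORT_REF_BASE
    if not 0 <= high < 4:
        raise ValueError("not a short-ref opcode")
    return (high << 8) | data[1]
```

An index of 800 then still fits in two bytes, whereas a single one-byte opcode form stops at 255.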
|
Date | User | Action | Args
2022-04-11 14:57:51 | admin | set | github: 63455
2015-01-30 07:21:10 | serhiy.storchaka | set | messages: + msg235027
2014-10-14 15:54:15 | skrah | set | nosy: - skrah
2013-10-15 10:31:36 | jcea | set | nosy: + jcea
2013-10-14 13:45:49 | kristjan.jonsson | set | messages: + msg199885
2013-10-14 12:43:12 | serhiy.storchaka | set | messages: + msg199882
2013-10-14 12:38:36 | kristjan.jonsson | set | messages: + msg199880
2013-10-14 12:30:06 | kristjan.jonsson | set | files: + marshal2.patch; messages: + msg199879
2013-10-14 12:10:37 | serhiy.storchaka | set | messages: + msg199878
2013-10-14 11:47:04 | kristjan.jonsson | set | messages: + msg199874
2013-10-14 11:46:08 | pitrou | set | messages: + msg199873
2013-10-14 11:31:55 | vstinner | set | messages: + msg199869
2013-10-14 11:28:20 | kristjan.jonsson | set | messages: + msg199867
2013-10-14 09:21:28 | pitrou | set | messages: + msg199856
2013-10-14 07:54:14 | serhiy.storchaka | set | messages: + msg199850
2013-10-14 07:09:22 | rhettinger | set | nosy: + rhettinger; messages: + msg199846
2013-10-13 22:57:10 | pitrou | set | messages: + msg199823
2013-10-13 22:53:12 | vstinner | set | messages: + msg199822
2013-10-13 22:44:34 | pitrou | set | messages: + msg199821
2013-10-13 22:42:37 | vstinner | set | messages: + msg199820
2013-10-13 22:39:07 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg199819
2013-10-13 22:34:20 | pitrou | set | nosy: + skrah
2013-10-13 22:30:08 | kristjan.jonsson | create |