classification
Title: Marshal: performance regression with versions 3 and 4
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: kristjan.jonsson, larry, loewis, pitrou, python-dev, serhiy.storchaka, vajrasky, vstinner
Priority: normal Keywords: patch

Created on 2014-01-28 10:10 by vstinner, last changed 2015-02-11 15:53 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
marshal3_numbers.patch vstinner, 2014-01-28 10:10 review
bench.py vstinner, 2014-01-28 10:14
marshal_small_ints_refs.patch serhiy.storchaka, 2014-02-13 21:35 review
marshal_refs_by_value.patch serhiy.storchaka, 2015-01-28 16:45 review
bench_issue20416.py serhiy.storchaka, 2015-01-28 16:46
marshal_refs_by_value_3.patch serhiy.storchaka, 2015-02-04 11:16 review
marshal_hashtable.patch serhiy.storchaka, 2015-02-04 11:30 review
marshal_hashtable_2.patch serhiy.storchaka, 2015-02-10 22:00 review
Messages (25)
msg209524 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-28 10:10
Attached patched disables references for int and float types.
msg209525 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-28 10:14
Use attached bench.py to compare performances.

Without the patch:
---
dumps v0: 389.6 ms
data size v0: 45582.9 kB
loads v0: 573.3 ms

dumps v1: 391.4 ms
data size v1: 45582.9 kB
loads v1: 558.0 ms

dumps v2: 166.9 ms
data size v2: 41395.4 kB
loads v2: 482.2 ms

dumps v3: 431.2 ms
data size v3: 41395.4 kB
loads v3: 543.8 ms

dumps v4: 361.8 ms
data size v4: 37000.9 kB
loads v4: 560.4 ms
---

With the patch:
---
dumps v0: 391.4 ms
data size v0: 45582.9 kB
loads v0: 578.2 ms

dumps v1: 392.3 ms
data size v1: 45582.9 kB
loads v1: 556.8 ms

dumps v2: 167.7 ms
data size v2: 41395.4 kB
loads v2: 484.6 ms

dumps v3: 170.3 ms
data size v3: 41395.4 kB
loads v3: 467.0 ms

dumps v4: 122.8 ms
data size v4: 37000.9 kB
loads v4: 468.9 ms
---

dumps v3 is 60% faster, loads v3 is also 14% *faster*.

dumps v4 is 66% faster, loads v4 is 16% faster.
msg209526 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-28 10:19
Performance of Python 3.3.3+:
---
dumps v0: 374.8 ms
data size v0: 45582.9 kB
loads v0: 625.3 ms

dumps v1: 374.6 ms
data size v1: 45582.9 kB
loads v1: 605.1 ms

dumps v2: 152.9 ms
data size v2: 41395.4 kB
loads v2: 556.5 ms
---

So with the patch, the Python 3.4 default version (4) is *faster* (dump 20% faster, load 16% faster) and produces *smaller files* (10% smaller).
msg209528 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-28 10:20
Oh by the way, on Python 3.4, the file size (on version 3 and 4) is unchanged with my patch. Writing a reference produces takes exactly the same size than an integer.
msg209530 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2014-01-28 10:25
I am doing clinic conversion for marshal module so I am adding myself to nosy list to make sure both tickets are synchronized.

http://bugs.python.org/issue20185
msg209531 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-28 10:36
> I am doing clinic conversion for marshal module so I am adding myself to nosy list to make sure both tickets are synchronized.

The Derby is suspended until the release of Python 3.4 final.

I consider this issue as an important performance regression that should be fixed before Python 3.4 final.
msg209532 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-28 10:54
Did you tested for numerous shared int and floats? [1000] * 1000000 and [1000.0] * 1000000? AFAIK this was important use cases for adding 3 or 4 versions.
msg209533 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-01-28 11:34
> Did you tested for numerous shared int and floats? [1000] * 1000000 and [1000.0] * 1000000? AFAIK this was important use cases for adding 3 or 4 versions.

Here are new benchmarks on Python 3.4 with:

    Integers: [1000] * 1000000
    Floats: [1000.0] * 1000000

Integers, without the patch:

    dumps v3: 62.8 ms
    data size v3: 4882.8 kB
    loads v3: 10.7 ms

Integers, with the patch:

    dumps v3: 18.6 ms (-70%)
    data size v3: 4882.8 kB (same size)
    loads v3: 27.7 ms (+158%)

Floats, without the patch:

    dumps v3: 62.5 ms
    data size v3: 4882.8 kB
    loads v3: 11.0 ms

Floats, with the patch:

    dumps v3: 29.3 ms (-53%)
    data size v3: 8789.1 kB (+80%)
    loads v3: 25.5 ms (+132%)

The version 3 was added by:
---
changeset:   82816:01372117a5b4
user:        Kristján Valur Jónsson <sweskman@gmail.com>
date:        Tue Mar 19 18:02:10 2013 -0700
files:       Doc/library/marshal.rst Include/marshal.h Lib/test/test_marshal.py Misc/NEWS Python/marshal.c
description:
Issue #16475: Support object instancing, recursion and interned strings in marshal
---

This issue tells about "sharing string constants, common tuples, even common code objects", not sharing numbers.

For real data, here are interesting numbers:
http://bugs.python.org/issue16475#msg176013

Integers only represent 4.8% of serialized data, and only 8.2% of these integers can be shared. (Floats represent 0.29%.) Whereas strings repsent 58% and 57% can be shared.
msg209534 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-01-28 11:35
For the record, format 3 was added through issue16475, format 4 was added through issue19219.

In msg175962, Kristjan argued that there is no reason _not_ to share int objects, e.g. across multiple code objects. Now it seems that this argument is flawed: there is a reason, namely the performance impact.

OTOH, I consider both use case (marshaling a large number of integers, and desiring to share ints across code objects) equally obscure: you shouldn't worry about marshal performance too much if you have loads of tiny int objects, and you shouldn't worry whether these ints get shared or not.

As a compromise, we could suppress the sharing for small int objects, since they are singletons, anyway. This would allow marshal to preserve/copy the object graph, while not impacting the use case that the original poster on python-dev presented.
msg209535 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-01-28 11:39
> Integers, without the patch:
> 
>     dumps v3: 62.8 ms
>     data size v3: 4882.8 kB
>     loads v3: 10.7 ms
> 
> Integers, with the patch:
> 
>     dumps v3: 18.6 ms (-70%)
>     data size v3: 4882.8 kB (same size)
>     loads v3: 27.7 ms (+158%)

As I wrote on python-dev, dumps performance isn't important for the pyc
use case, but loads performance is. Therefore it appears this patch goes
into the wrong direction.

You are also ignoring the *runtime* benefit of sharing objects: smaller
memory footprint of the actual Python process.
msg211164 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-13 19:29
> As a compromise, we could suppress the sharing for small int objects, since they are singletons, anyway.

This is implementation detail.

But we can use more efficient way to memoizating small int objects.

I also suppose than efficient C implementation of IdentityDict will significant decrease an overhead (may be 2 times). As far as this is probably a first case for which IdentityDict give significant benefit, I'll provide an implementation for 3.5.
msg211173 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-13 21:35
Here is a patch which special cases small integers. It decreases v3 and v4 dump time of bench.py by 10% without affecting load time. Of course in real cases the benefit will be much less.
msg211691 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2014-02-20 06:05
This is not a release blocker.  You guys can play with this for 3.5.  FWIW I prefer Serhiy's approach.
msg234903 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-28 16:45
Here is more general solution. For simple values (ints, floats, complex numbers, short strings) it is faster to use the value itself as a key than create new integer object (id).

Without the patch:

data              ver.  dumps(ms)  loads(ms)  size(KiB)

genData            2      103.9      186.4     4090.7
genData            3      250.3      196.8     4090.7
genData            4      223.5      182.5     3651.3

[1000]*10**6       2       98.6      134.8     4882.8
[1000]*10**6       3      491.1       62.2     4882.8
[1000]*10**6       4      494.9       62.1     4882.8

[1000.0]*10**6     2      173.5      158.4     8789.1
[1000.0]*10**6     3      494.8       62.2     4882.8
[1000.0]*10**6     4      491.4       62.8     4882.8

[1000.0j]*10**6    2      288.8      221.4    16601.6
[1000.0j]*10**6    3      493.6       62.4     4882.8
[1000.0j]*10**6    4      489.2       62.0     4882.8

20 pydecimals      2       85.0      114.7     3936.5
20 pydecimals      3       97.2       44.3     3373.4
20 pydecimals      4       86.2       40.0     3297.5


With marshal3_numbers.patch:

data              ver.  dumps(ms)  loads(ms)  size(KiB)

genData            3      108.4      187.5     4090.7
genData            4       83.0      179.3     3651.3

[1000]*10**6       3      104.2      145.8     4882.8
[1000]*10**6       4      103.9      147.0     4882.8

[1000.0]*10**6     3      177.4      154.5     8789.1
[1000.0]*10**6     4      177.1      164.2     8789.1

[1000.0j]*10**6    3      501.6       61.1     4882.8
[1000.0j]*10**6    4      501.6       62.3     4882.8

20 pydecimals      3       95.2       41.9     3373.4
20 pydecimals      4       83.5       38.5     3297.5


With marshal_refs_by_value.patch:

data              ver.  dumps(ms)  loads(ms)  size(KiB)

genData            3      150.4      197.0     4090.7
genData            4      122.1      184.8     3651.3

[1000]*10**6       3      169.1       62.3     4882.8
[1000]*10**6       4      169.2       62.2     4882.8

[1000.0]*10**6     3      313.5       62.2     4882.8
[1000.0]*10**6     4      312.8       62.3     4882.8

[1000.0j]*10**6    3      410.6       62.5     4882.8
[1000.0j]*10**6    4      410.5       62.3     4882.8

20 pydecimals      3       68.5       41.1     3373.4
20 pydecimals      4       57.5       39.8     3297.5

The marshal_refs_by_value.patch produces data so compact as unpatched code does, but dumps faster. Usually it dumps slower than marshal3_numbers.patch, but may produce smaller data and loads faster. Surprisingly it dumps the code of the _pydecimal module faster.

As side effect the patch can turn some simple equal values to identical objects.
msg235050 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-01-30 19:43
Here are results of the benchmark which measures dump and load time for all pyc files in the stdlib (including tests).

https://bitbucket.org/storchaka/cpython-stuff/src/default/marshal/marshalbench.py

$ find * -name __pycache__ -exec rm -rf '{}' +
$ ./python -m compileall -qq -r 99 Lib
$ find Lib -name '*.pyc' | xargs ./python marshalbench.py

Without the patch:

ver.  dumps   loads      size
              770.3   19941.2
0     695.7  1178.4   19941.2
1     680.4  1180.9   19941.2
2     635.9  1115.9   19941.2
3     896.3   757.3   19941.2
4     748.0   700.6   19941.2

With marshal3_numbers.patch:

ver.  dumps   loads      size
              750.5   19946.1
0     690.2  1173.7   19946.1
1     678.2  1166.6   19946.1
2     630.9  1105.2   19946.1
3     873.2   751.5   19946.1
4     724.6   687.6   19946.1

With marshal_refs_by_value.patch:

ver.  dumps   loads      size
              780.2   19935.8
0     712.7  1202.6   19935.8
1     713.8  1195.1   19935.8
2     676.5  1126.0   19935.8
3     686.1   762.7   19935.8
4     515.1   697.6   19935.8
msg235384 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-04 11:16
Here is a patch which adds separate dict for interned strings (otherwise they can be uninterned) and for bytes. It also slightly simplify the code.
msg235385 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-04 11:30
And here is alternative patch which uses a hashtable.

Both patches have about the same performance for *.pyc files, but marshal_hashtable.patch is much faster for duplicated values. Marshalling [1000]*10**6, [1000.0]*10**6 and [1000.0j]*10**6 with version 3 an 4 is so fast as marshalling [1000]*10**6 with version 2 (i.e. 5 times faster than current implementation).

data              ver.  dumps(ms)  loads(ms)  size(KiB)

genData            2       99.9      188.9     4090.7
genData            3      148.2      189.1     4090.7
genData            4      121.4      177.4     3651.3

[1000]*10**6       2       97.7      131.6     4882.8
[1000]*10**6       3       95.1       63.1     4882.8
[1000]*10**6       4       95.1       64.4     4882.8

[1000.0]*10**6     2      172.9      153.5     8789.1
[1000.0]*10**6     3       97.4       61.9     4882.8
[1000.0]*10**6     4       95.7       61.6     4882.8

[1000.0j]*10**6    2      288.6      228.2    16601.6
[1000.0j]*10**6    3       94.9       61.6     4882.8
[1000.0j]*10**6    4       95.1       62.2     4882.8

20 pydecimals      2       88.0      111.4     3929.6
20 pydecimals      3       57.0       51.4     3368.5
20 pydecimals      4       46.6       39.9     3292.8
msg235469 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-06 07:11
What are your thoughts Victor?
msg235479 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-02-06 12:39
Can you post the results of your hashtable patch with marshalbench.py?

PS: we don't care about versions older than 4.
msg235485 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-06 14:11
Here are new results with corrected marshalbench.py (more precise and with recalculated total size):

Without the patch:
ver.  dumps   loads      size
              746.5   19971.2
0     669.1  1149.2   26202.9
1     682.0  1130.1   26202.9
2     652.0  1134.8   26202.4
3     920.8   757.9   21316.7
4     771.2   691.4   19971.2

With marshal3_numbers.patch:
3     894.6   778.4   21321.6
4     733.1   704.2   19976.2

With marshal_refs_by_value_3.patch:
3     687.6   777.7   21310.3
4     535.5   701.5   19966.3

With marshal_hashtable.patch:
3     656.0   764.2   21316.7
4     512.9   696.6   19971.2
msg235713 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-10 22:00
Update patch addresses Victor's comments.
msg235721 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-02-11 00:09
Except of the minor suggestion that I added on the review, the patch looks good the me. It's quite simple and makes dumps() 34% faster (for protocol 4).
msg235722 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-02-11 00:09
(I didn't reproduce the benchmark, I just used Serhy numbers. I guess that using memcpy() doesn't make the code slower.)
msg235746 - (view) Author: Roundup Robot (python-dev) Date: 2015-02-11 13:55
New changeset 012364df2712 by Serhiy Storchaka in branch 'default':
Issue #20416: marshal.dumps() with protocols 3 and 4 is now 40-50% faster on
https://hg.python.org/cpython/rev/012364df2712
msg235753 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-11 15:53
Thanks for your review Antoine and Victor.
History
Date User Action Args
2015-02-11 15:53:15serhiy.storchakasetstatus: open -> closed
messages: + msg235753

assignee: serhiy.storchaka
resolution: fixed
stage: patch review -> resolved
2015-02-11 13:55:58python-devsetnosy: + python-dev
messages: + msg235746
2015-02-11 00:09:58vstinnersetmessages: + msg235722
2015-02-11 00:09:12vstinnersetmessages: + msg235721
2015-02-10 22:00:54serhiy.storchakasetfiles: + marshal_hashtable_2.patch

messages: + msg235713
2015-02-06 14:11:15serhiy.storchakasetmessages: + msg235485
2015-02-06 12:39:25pitrousetmessages: + msg235479
2015-02-06 07:11:58serhiy.storchakasetmessages: + msg235469
title: Marshal: special case int and float, don't use references -> Marshal: performance regression with versions 3 and 4
2015-02-04 11:30:45serhiy.storchakasetfiles: + marshal_hashtable.patch

messages: + msg235385
2015-02-04 11:16:49serhiy.storchakasetfiles: + marshal_refs_by_value_3.patch

messages: + msg235384
2015-01-30 19:43:35serhiy.storchakasetmessages: + msg235050
2015-01-28 16:46:03serhiy.storchakasetfiles: + bench_issue20416.py
2015-01-28 16:45:22serhiy.storchakasetfiles: + marshal_refs_by_value.patch

messages: + msg234903
components: + Interpreter Core
stage: patch review
2014-02-20 06:05:12larrysetpriority: release blocker -> normal

messages: + msg211691
versions: + Python 3.5, - Python 3.4
2014-02-13 21:35:45serhiy.storchakasetfiles: + marshal_small_ints_refs.patch

messages: + msg211173
2014-02-13 19:29:29serhiy.storchakasetmessages: + msg211164
2014-01-28 11:39:26pitrousetmessages: + msg209535
2014-01-28 11:35:30loewissetmessages: + msg209534
2014-01-28 11:34:43vstinnersetmessages: + msg209533
2014-01-28 10:54:50serhiy.storchakasetmessages: + msg209532
2014-01-28 10:50:55serhiy.storchakasetnosy: + pitrou, kristjan.jonsson, serhiy.storchaka
2014-01-28 10:36:08vstinnersetpriority: normal -> release blocker
nosy: + larry
messages: + msg209531

2014-01-28 10:25:57vajraskysetnosy: + vajrasky
messages: + msg209530
2014-01-28 10:20:21vstinnersetmessages: + msg209528
2014-01-28 10:19:31vstinnersetmessages: + msg209526
2014-01-28 10:14:45vstinnersetnosy: + loewis
2014-01-28 10:14:40vstinnersetfiles: + bench.py

messages: + msg209525
2014-01-28 10:10:48vstinnercreate