classification
Title: o(n*n) marshal.dumps performance for largish objects with patch
Type: performance Stage:
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: Nosy List: aaron_watters, akuchling, christian.heimes, djmitche, facundobatista, rhettinger
Priority: high Keywords: easy, patch

Created on 2008-01-11 14:21 by aaron_watters, last changed 2008-06-16 01:44 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
marshal.diff aaron_watters, 2008-01-11 14:21 proposed patch to improve dumps performance
unnamed aaron_watters, 2008-01-11 15:14
unnamed aaron_watters, 2008-01-11 16:21
marshal_limited.diff facundobatista, 2008-01-25 17:58 same idea but with a limit
Messages (10)
msg59710 - (view) Author: Aaron Watters (aaron_watters) Date: 2008-01-11 14:21
Much to my surprise I found that one of
my applications seemed to be running slow as a result
of marshal.dumps.  I think the culprit is the w_more(...)
function, which grows the marshal buffer in 1k units.
This means that a marshal of size 100k will have 100
reallocations and string copies.  Other parts of Python
(and java etc) have a proportional reallocation strategy
which reallocates a new size based on the existing size.
This mean a 100k marshal requires just 5 or so
reallocations and string copies (n log n versus n**2
asymptotic performance).

I humbly submit the following patch (based on python 2.6a0
source).  I think it is a strict improvement on the existing
code, but I've been wrong before (twice ;)).
  -- Aaron Watters
msg59711 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-01-11 14:25
Why not just double the size? The "doubling + 1024" address some
specific issue? If so, it should be commented.

Also, do you have an example of a marshal.dumps() that suffers from this
issue?

Thank you!
msg59715 - (view) Author: Aaron Watters (aaron_watters) Date: 2008-01-11 15:14
Facundo

1) the +1024 was an accelerator to jump up to over 1k at the first resize.
I think it's a good idea or at least doesn't hurt.

2) Here is an example program:

def test():
    from marshal import dumps
    from time import time
    testString = "abc"*100000000
    print "now testing"
    now = time()
    dump = dumps(testString)
    elapsed = time()-now
    print "elapsed", elapsed

if __name__=="__main__":
    test()

Here are two runs: the first with the old marshal and the second with the
patched marshal.  The second is
better than 2* faster than the first.

arw:/home/arw/test> ~/apache2/htdocs/pythonsrc/Python/python_old mtest1.py
now testing
elapsed 4.13367795944
arw:/home/arw/test> ~/apache2/htdocs/pythonsrc/Python/python mtest1.py
now testing
elapsed 1.7495341301
arw:/home/arw/test>

The example that inspired this research was very complicated and involved
millions of calls to dumps
which caused a number of anomalies (system calls went berzerk for some
reason, maybe paging).

   -- Aaron Watters

On Jan 11, 2008 9:25 AM, Facundo Batista <report@bugs.python.org> wrote:

>
> Facundo Batista added the comment:
>
> Why not just double the size? The "doubling + 1024" address some
> specific issue? If so, it should be commented.
>
> Also, do you have an example of a marshal.dumps() that suffers from this
> issue?
>
> Thank you!
>
> ----------
> nosy: +facundobatista
>
> __________________________________
> Tracker <report@bugs.python.org>
> <http://bugs.python.org/issue1792>
> __________________________________
>
msg59719 - (view) Author: Aaron Watters (aaron_watters) Date: 2008-01-11 16:21
also:

I just modified the code to do iterations using increasingly large data
sizes
and I see the kind of very unpleasant behaviour for the old implementation
(max time varies wildly from min time) that I saw in my more complex
program.  The new implementation doesn't have
these problems.   First the runs and then the modified code

runs

arw:/home/arw/test>
arw:/home/arw/test> ~/apache2/htdocs/pythonsrc/Python/python_old mtest1.pyold
old
0 40 elapsed max= 2.28881835938e-05 min= 4.76837158203e-06 ratio= 4.8
1 160 elapsed max= 1.59740447998e-05 min= 9.05990600586e-06 ratio=
1.76315789474
2 640 elapsed max= 2.40802764893e-05 min= 2.19345092773e-05 ratio=
1.09782608696
3 2560 elapsed max= 8.79764556885e-05 min= 3.981590271e-05 ratio=
2.20958083832
4 10240 elapsed max= 0.000290155410767 min= 0.000148057937622 ratio=
1.95974235105
5 40960 elapsed max= 0.000867128372192 min= 0.00060510635376 ratio=
1.43301812451
6 163840 elapsed max= 0.00739598274231 min= 0.00339317321777 ratio=
2.17966554244
7 655360 elapsed max= 0.0883929729462 min= 0.0139379501343 ratio=
6.34189189189
8 2621440 elapsed max= 1.69851398468 min= 0.0547370910645 ratio=
31.0304028155
9 10485760 elapsed max= 9.98945093155 min= 0.213104963303 ratio=
46.875730986
10 41943040 elapsed max= 132.281101942 min= 0.834150075912 ratio=
158.581897625
arw:/home/arw/test> ~/apache2/htdocs/pythonsrc/Python/python mtest1.py new
new
0 40 elapsed max= 2.19345092773e-05 min= 5.00679016113e-06 ratio=
4.38095238095
1 160 elapsed max= 1.00135803223e-05 min= 9.05990600586e-06 ratio=
1.10526315789
2 640 elapsed max= 3.19480895996e-05 min= 1.28746032715e-05 ratio=
2.48148148148
3 2560 elapsed max= 5.69820404053e-05 min= 3.981590271e-05 ratio=
1.43113772455
4 10240 elapsed max= 0.000186920166016 min= 0.000138998031616 ratio=
1.34476843911
5 40960 elapsed max= 0.00355315208435 min= 0.000746965408325 ratio=
4.75678263645
6 163840 elapsed max= 0.00326490402222 min= 0.00304794311523 ratio=
1.07118272841
7 655360 elapsed max= 0.0127630233765 min= 0.0122020244598 ratio=
1.04597588855
8 2621440 elapsed max= 0.0511522293091 min= 0.0484230518341 ratio=
1.05636112082
9 10485760 elapsed max= 0.198891878128 min= 0.187420129776 ratio=
1.06120873124
10 41943040 elapsed max= 0.758435964584 min= 0.729014158249 ratio=
1.04035834696
arw:/home/arw/test>

Above high ratio numbers indicate strange and unpleasant performance
variance.
For iteration 7 and higher the old implementation has a much
worse max time performance than the new one.

Here is the test code:

def test():
    from marshal import dumps
    from time import time
    size = 10
    for i in range(11):
        size = size*4
        testString = "abc"*size
        #print "now testing", i, size
        minelapsed = None
        for j in range(11):
            now = time()
            dump = dumps(testString)
            elapsed = time()-now
            if minelapsed is None:
                minelapsed = elapsed
                maxelapsed = elapsed
            else:
                minelapsed = min(elapsed, minelapsed)
                maxelapsed = max(elapsed, maxelapsed)
        print i, size, "elapsed max=", maxelapsed, "min=", minelapsed,
"ratio=", maxelapsed/minelapsed

if __name__=="__main__":
    import sys
    print sys.argv[1]
    test()

-- Aaron Watters

On Jan 11, 2008 10:14 AM, Aaron Watters < report@bugs.python.org> wrote:

>
> Aaron Watters added the comment:
>
> Facundo
>
> 1) the +1024 was an accelerator to jump up to over 1k at the first resize.
> I think it's a good idea or at least doesn't hurt.
>
> 2) Here is an example program:
>
> def test():
>    from marshal import dumps
>    from time import time
>    testString = "abc"*100000000
>    print "now testing"
>    now = time()
>    dump = dumps(testString)
>    elapsed = time()-now
>    print "elapsed", elapsed
>
> if __name__=="__main__":
>    test()
>
> Here are two runs: the first with the old marshal and the second with the
> patched marshal.  The second is
> better than 2* faster than the first.
>
> arw:/home/arw/test> ~/apache2/htdocs/pythonsrc/Python/python_old mtest1.py
> now testing
> elapsed 4.13367795944
> arw:/home/arw/test> ~/apache2/htdocs/pythonsrc/Python/python mtest1.py
> now testing
> elapsed 1.7495341301
> arw:/home/arw/test>
>
> The example that inspired this research was very complicated and involved
> millions of calls to dumps
> which caused a number of anomalies (system calls went berzerk for some
> reason, maybe paging).
>
>   -- Aaron Watters
>
> On Jan 11, 2008 9:25 AM, Facundo Batista <report@bugs.python.org> wrote:
>
> >
> > Facundo Batista added the comment:
> >
> > Why not just double the size? The "doubling + 1024" address some
> > specific issue? If so, it should be commented.
> >
> > Also, do you have an example of a marshal.dumps() that suffers from this
> > issue?
> >
> > Thank you!
> >
> > ----------
> > nosy: +facundobatista
> >
> > __________________________________
> > Tracker <report@bugs.python.org>
> > <http://bugs.python.org/issue1792>
> > __________________________________
> >
>
> Added file: http://bugs.python.org/file9124/unnamed
>
> __________________________________
> Tracker <report@bugs.python.org>
> <http://bugs.python.org/issue1792>
> __________________________________
>
msg59730 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-11 17:48
It looks like a reasonable and good patch. Aaron's tests clearly show
the speed up. However the maximum resize should be limited to a value
around 256 to 1024k:

   size = min(2*size + 1024, 512*1024);
msg60179 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-19 14:48
Yet another easy C task for the bug day.
msg61686 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-01-25 17:58
Christian: you can not do that, as you *must* increase the size.

Anyway, I think that is a good idea to not let it duplicate after some
point. Which point? It's arbitrary.

You suggested 512KB, but I think that is too low.

*My* arbitrary point is, when the newsize if greater than 32MB, only
increaase at 1MB steps.

I attach the patch.
msg66560 - (view) Author: Dustin J. Mitchell (djmitche) Date: 2008-05-10 18:52
This test passes -- is this ready to commit?

I see a reduction from 1.9s to 1.5s for the test script in msg59715.
msg66629 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2008-05-11 13:35
I've applied Facundo's version of the patch in r63059.
msg68255 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2008-06-16 01:44
Amended this fix to avoid quadratic behavior altogether.  See issue 
3116.
History
Date User Action Args
2008-06-16 01:44:33rhettingersetnosy: + rhettinger
messages: + msg68255
2008-05-11 13:35:09akuchlingsetstatus: open -> closed
resolution: accepted
messages: + msg66629
nosy: + akuchling
2008-05-10 18:52:23djmitchesetnosy: + djmitche
messages: + msg66560
2008-03-16 21:08:01georg.brandlsettype: resource usage -> performance
2008-01-25 17:58:33facundobatistasetfiles: + marshal_limited.diff
messages: + msg61686
2008-01-19 14:48:32christian.heimessetkeywords: + easy
messages: + msg60179
2008-01-11 17:48:21christian.heimessetpriority: high
keywords: + patch
messages: + msg59730
nosy: + christian.heimes
2008-01-11 16:21:46aaron_watterssetfiles: + unnamed
messages: + msg59719
2008-01-11 15:14:58aaron_watterssetfiles: + unnamed
messages: + msg59715
2008-01-11 14:25:49facundobatistasetnosy: + facundobatista
messages: + msg59711
2008-01-11 14:21:24aaron_watterscreate