classification
Title: Optimize BytesIO to do less reallocations when written, similarly to StringIO
Type: performance Stage: needs patch
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: Arfrever, eli.bendersky, haypo, jcon, meador.inge, ncoghlan, pitrou, serhiy.storchaka, tshepang
Priority: normal Keywords: patch

Created on 2012-07-17 12:12 by eli.bendersky, last changed 2012-10-14 13:17 by eli.bendersky.

Files
File name Uploaded Description Edit
bytesio_resized_bytes-2.patch serhiy.storchaka, 2012-07-20 20:30 review
Messages (24)
msg165715 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-07-17 12:12
From this pydev thread: http://mail.python.org/pipermail/python-dev/2012-July/120981.html

"BytesIO is actually missing an optimisation that is already used in
StringIO: the StringIO C implementation uses a fragment accumulator
internally, and collapses that into a single string object when
getvalue() is called. BytesIO is still using the old
"resize-the-buffer-as-you-go" strategy, and thus ends up repeatedly
reallocating the buffer as the data sequence grows incrementally.

It should be optimised to work the same way StringIO does (which is
effectively the same way that the monkeypatched version works)"
msg165716 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-07-17 12:49
This optimization for StringIO was done in issue #13149
msg165718 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-17 13:26
I am not see that BytesIO is slower than StringIO (both are about 30% slower than append/join).

$ ./python -m timeit -s "import io; d=[b'a'*10,b'bb'*5,b'ccc'*5]*1000"  "b=[]; w=b.append"  "for x in d: w(x)"  "b''.join(b)"
1000 loops, best of 3: 966 usec per loop

$ ./python -m timeit -s "import io; d=['a'*10,'bb'*5,'ccc'*5]*1000"  "b=[]; w=b.append"  "for x in d: w(x)"  "''.join(b)"
1000 loops, best of 3: 918 usec per loop

$ ./python -m timeit -s "import io; d=[b'a'*10,b'bb'*5,b'ccc'*5]*1000"  "b=io.BytesIO(); w=b.write"  "for x in d: w(x)"  "b.getvalue()"
1000 loops, best of 3: 1.22 msec per loop

$ ./python -m timeit -s "import io; d=['a'*10,'bb'*5,'ccc'*5]*1000"  "b=io.StringIO(); w=b.write"  "for x in d: w(x)"  "b.getvalue()"
1000 loops, best of 3: 1.24 msec per loop
msg165748 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-07-18 09:32
I wonder if this is a fair comparison, Serhiy. Strings are unicode underneath, so they have a large overhead per string (more data to copy around). Increasing the length of the strings changes the game because due to PEP 393, the overhead for ASCII-only Unicode strings is constant:

>>> import sys
>>> sys.getsizeof('a')
50
>>> sys.getsizeof(b'a')
34
>>> sys.getsizeof('a' * 1000)
1049
>>> sys.getsizeof(b'a' * 1000)
1033
>>> 

When re-running your tests with larger chunks, the results are quite interesting:

$ ./python -m timeit -s "import io; d=[b'a'*100,b'bb'*50,b'ccc'*50]*1000"  "b=io.BytesIO(); w=b.write"  "for x in d: w(x)"  "b.getvalue()"
1000 loops, best of 3: 509 usec per loop
$ ./python -m timeit -s "import io; d=['a'*100,'bb'*50,'ccc'*50]*1000"  "s=io.StringIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
1000 loops, best of 3: 282 usec per loop

So, it seems to me that BytesIO could use some optimization!
msg165760 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2012-07-18 11:58
I'd like to take a shot at this, if Antoine doesn't mind. I'll prepare a patch for bytesio.c

Question: what set of benchmarks would it be good to run to make sure this doesn't degrade the performance of BytesIO in various cases?
msg165780 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-18 14:29
> I wonder if this is a fair comparison, Serhiy.

I just used your examples.

> When re-running your tests with larger chunks, the results are quite interesting:

Well, now I see, that BytesIO slower StringIO.
msg165795 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-18 20:06
Here is a preliminary version of the patch. I am not sure that it is fully correct.

Microbenchmark results:

$ ./python -m timeit -s "import io; n=100; d=['a'*n,'bb'*n,'ccc'*n]*10000"  "s=io.StringIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
10 loops, best of 3: 25.5 msec per loop
$ ./python -m timeit -s "import io; n=100; d=[b'a'*n,b'bb'*n,b'ccc'*n]*10000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
10 loops, best of 3: 39.9 msec per loop
$ ./python-patched -m timeit -s "import io; n=100; d=[b'a'*n,b'bb'*n,b'ccc'*n]*10000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
10 loops, best of 3: 26.1 msec per loop

$ ./python -m timeit -s "import io; n=1000; d=['a'*n,'bb'*n,'ccc'*n]*1000"  "s=io.StringIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
100 loops, best of 3: 12.1 msec per loop
$ ./python -m timeit -s "import io; n=1000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
10 loops, best of 3: 26.5 msec per loop
$ ./python-patched -m timeit -s "import io; n=1000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
100 loops, best of 3: 13.6 msec per loop
msg165883 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-19 22:04
> Here is a preliminary version of the patch.

I don't understand the purpose of your patch. It just replaces a direct realloc() call with an indirect one (since _PyBytes_Resize() will call realloc() internally).
msg165884 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-19 22:06
> > Here is a preliminary version of the patch.
> 
> I don't understand the purpose of your patch. It just replaces a
> direct realloc() call with an indirect one (since _PyBytes_Resize()
> will call realloc() internally).

Ok, I understand. You're trying to make the getvalue() call cheaper,
right?
msg165901 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-20 06:42
> Ok, I understand. You're trying to make the getvalue() call cheaper,
> right?

Yes, it saves up to half of time on large data (on Linux). It would be 
interesting to see the results of these microbenchmarks on Windows.
msg165935 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-20 15:03
There seems to be a problem with the patch: when you store the getvalue() result somewhere (instead of discarding it), things get much slower:

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
1000 loops, best of 3: 913 usec per loop

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "global y; y = s.getvalue()"
100 loops, best of 3: 4.67 msec per loop

This does not happen without the patch.
msg165937 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-20 15:16
Under Windows (64-bit Windows 7 on a  VirtualBox VM), the patch increases performance slightly but not as much as under Linux:

-> before patch:

C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'*
n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.g
etvalue()"
10 loops, best of 3: 49.2 msec per loop

-> after patch:

C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'*
n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.g
etvalue()"
10 loops, best of 3: 41.7 msec per loop


And the join() approach is 10x faster (!):

C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'*
n,b'bb'*n,b'ccc'*n]*1000"  "b''.join(d)"
100 loops, best of 3: 4.63 msec per loop

... which points to a much less optimized realloc() under Windows.
msg165976 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-20 20:30
The patch updated. Fixed some errors, optimized initialization, added checks and comments. I think that now the patch is ready for review.
msg165977 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-20 20:32
> There seems to be a problem with the patch: when you store the getvalue() result somewhere (instead of discarding it), things get much slower:

This problem exists with the new patch?
msg165978 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-20 20:48
> Under Windows (64-bit Windows 7 on a  VirtualBox VM), the patch increases performance slightly but not as much as under Linux:

Thank you, Antoine. This is an expected result.

> And the join() approach is 10x faster (!):
> C:\t\cpython>pc\VS9.0\amd64\python.exe -m timeit -s "import io; n=2000; d=[b'a'*
> n,b'bb'*n,b'ccc'*n]*1000"  "b''.join(d)"
> 100 loops, best of 3: 4.63 msec per loop

To be fair, test body should be: "s=[]; w=s.append"  "for x in d: w(x)"  "b''.join(s)"

May be tuning resize strategy (overallocate not 1/8, but 1/4, 1/2, or 100%) can help.
msg165979 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-20 20:53
> > There seems to be a problem with the patch: when you store the
> getvalue() result somewhere (instead of discarding it), things get
> much slower:
> 
> This problem exists with the new patch?

I think you can run the benchmark yourself (I ran it under Linux).
msg165980 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-20 21:11
> I think you can run the benchmark yourself (I ran it under Linux).

I don't see any differences.
msg165981 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-20 21:18
Well, with the latest patch I get:

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "s.getvalue()"
1000 loops, best of 3: 982 usec per loop

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "global y; y = s.getvalue()"
100 loops, best of 3: 4.79 msec per loop
msg165995 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-21 06:30
Hmm. Without the ability to reproduce this effect, it will be difficult for me to 
get rid of him. What times for StringIO, append/join, unpatched BytesIO? Do 
this happen with a little different numbers (n=1500, 1600,...,3000)?
msg166005 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-21 10:15
> Hmm. Without the ability to reproduce this effect, it will be difficult for me to 
> get rid of him.

It's under 64-bit Linux, Intel Core i5 CPU. Are you sure you're testing
in non-debug mode?

That said, the numbers under Windows suggest me that Eli's original idea
(append and then join at the end) would be more robust.
msg166007 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-21 10:27
By the way, here is Matt Mackall's take on realloc():

http://www.selenic.com/pipermail/mercurial-devel/2011-October/034988.html
msg166008 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-21 10:37
StringIO:

$ ./python -m timeit -s "import io; n=2000; d=['a'*n,'bb'*n,'ccc'*n]*1000"  "s=io.StringIO(); w=s.write"  "for x in d: w(x)"  "global y; y = s.getvalue()"

-> Linux: 1000 loops, best of 3: 985 usec per loop
-> Windows: 100 loops, best of 3: 4.26 msec per loop

Unpatched BytesIO:

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "s=io.BytesIO(); w=s.write"  "for x in d: w(x)"  "global y; y = s.getvalue()"

-> Linux: 100 loops, best of 3: 2.44 msec per loop
-> Windows: 10 loops, best of 3: 38.4 msec per loop

b''.join():

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "l = list(d)"  "global y; y = b''.join(l)"

-> Linux: 1000 loops, best of 3: 821 usec per loop
-> Windows: 100 loops, best of 3: 4.09 msec per loop

bytearray:

$ ./python -m timeit -s "import io; n=2000; d=[b'a'*n,b'bb'*n,b'ccc'*n]*1000"  "b = bytearray()" "for x in d: b += x" "global y; y = b"

-> Linux: 1000 loops, best of 3: 834 usec per loop
-> Windows: 10 loops, best of 3: 37.8 msec per loop
msg166043 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-21 15:54
> It's under 64-bit Linux, Intel Core i5 CPU. Are you sure you're testing
> in non-debug mode?

I use 32-bit Linux.

> That said, the numbers under Windows suggest me that Eli's original idea
> (append and then join at the end) would be more robust.

I agree, it would be more robust, but much more complex solution. I will try to implement this approach too. Victor Stinner in their experiments with formatting preferred 
overallocation approach (_PyUnicodeWriter), therefore, the solution probably will be a hybrid.

However, even in itself the patch deserves attention. It not only significant speeds up writing under Linux, but also speeds up the initialization, which leads to faster reading.

./python -m timeit -s "import io; d=(b'a'*99+b'\n')*10000"  "s=io.BytesIO(d); r=s.readline"  "while r(): pass"

Unpatched:  100 loops, best of 3: 5.92 msec per loop
Patched:  100 loops, best of 3: 3.95 msec per loop

I will try to preserve this advantage.

> By the way, here is Matt Mackall's take on realloc():

Note, that list also uses realloc() inside and therefore you get the same behavior with other constant factor. Actually, appending to a list less than sizeof(void*) bytes at a 
time you will run into this sooner. Perhaps this is the cause that append/join approach under Windows slower than under Linux.

Thank you for measurements, this shows the scale of the issue.
msg171205 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-09-24 23:53
See also issue #15612 for a possible optimization on StringIO (use _PyUnicodeWriter).
History
Date User Action Args
2012-10-14 13:17:35eli.benderskysettitle: Optimize BytesIO to so less reallocations when written, similarly to StringIO -> Optimize BytesIO to do less reallocations when written, similarly to StringIO
2012-09-24 23:53:42hayposetmessages: + msg171205
2012-09-24 23:50:42hayposetnosy: + haypo
2012-08-05 11:16:30serhiy.storchakasetassignee: serhiy.storchaka
2012-07-21 15:54:30serhiy.storchakasetmessages: + msg166043
2012-07-21 10:37:26pitrousetmessages: + msg166008
2012-07-21 10:27:20pitrousetmessages: + msg166007
2012-07-21 10:15:33pitrousetmessages: + msg166005
2012-07-21 06:30:21serhiy.storchakasetmessages: + msg165995
2012-07-20 21:18:45pitrousetmessages: + msg165981
2012-07-20 21:11:59serhiy.storchakasetmessages: + msg165980
2012-07-20 20:53:33pitrousetmessages: + msg165979
2012-07-20 20:48:26serhiy.storchakasetmessages: + msg165978
2012-07-20 20:32:04serhiy.storchakasetmessages: + msg165977
2012-07-20 20:30:58serhiy.storchakasetfiles: - bytesio_resized_bytes.patch
2012-07-20 20:30:28serhiy.storchakasetfiles: + bytesio_resized_bytes-2.patch

messages: + msg165976
2012-07-20 15:16:08pitrousetmessages: + msg165937
2012-07-20 15:03:28pitrousetmessages: + msg165935
2012-07-20 06:53:36eli.benderskysetassignee: eli.bendersky -> (no value)
2012-07-20 06:42:59serhiy.storchakasetmessages: + msg165901
2012-07-20 03:15:38meador.ingesetnosy: + meador.inge
2012-07-19 22:06:20pitrousetmessages: + msg165884
2012-07-19 22:04:36pitrousetmessages: + msg165883
2012-07-18 20:06:28serhiy.storchakasetfiles: + bytesio_resized_bytes.patch
keywords: + patch
messages: + msg165795
2012-07-18 14:29:55serhiy.storchakasetmessages: + msg165780
2012-07-18 11:58:40eli.benderskysetassignee: eli.bendersky
messages: + msg165760
2012-07-18 09:32:42eli.benderskysetmessages: + msg165748
2012-07-17 21:27:10Arfreversetnosy: + Arfrever
2012-07-17 13:44:31jconsetnosy: + jcon
2012-07-17 13:26:48serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg165718
2012-07-17 12:49:10eli.benderskysetmessages: + msg165716
2012-07-17 12:38:23tshepangsetnosy: + tshepang
2012-07-17 12:12:45eli.benderskycreate