Classification
Title: FileIO.readall() has worst case O(n^2) complexity
Type: performance
Stage: resolved
Components: Library (Lib)
Versions: Python 3.4

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: haypo, pitrou, python-dev, sbt, schmir
Priority: normal
Keywords: patch

Created on 2012-08-21 22:25 by sbt, last changed 2013-05-18 14:19 by sbt. This issue is now closed.

Files
File name             Uploaded                Description
readall-benchmark.py  sbt, 2012-08-24 14:02
readall.patch         sbt, 2013-05-17 15:10   review
Messages (25)
msg168813 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-21 22:25
Piping significant amounts of data through a subprocess using Popen.communicate() is crazily slow on Windows.

The attached program just pushes data through mingw's cat.exe.

Python 3.3:
amount = 1 MB; time taken = 0.07 secs; rate = 13.51 MB/s
amount = 2 MB; time taken = 0.31 secs; rate = 6.51 MB/s
amount = 4 MB; time taken = 1.30 secs; rate = 3.08 MB/s
amount = 8 MB; time taken = 5.43 secs; rate = 1.47 MB/s
amount = 16 MB; time taken = 21.64 secs; rate = 0.74 MB/s
amount = 32 MB; time taken = 87.36 secs; rate = 0.37 MB/s

Python 2.7:
amount = 1 MB; time taken = 0.02 secs; rate = 66.67 MB/s
amount = 2 MB; time taken = 0.03 secs; rate = 68.97 MB/s
amount = 4 MB; time taken = 0.05 secs; rate = 76.92 MB/s
amount = 8 MB; time taken = 0.10 secs; rate = 82.47 MB/s
amount = 16 MB; time taken = 0.27 secs; rate = 60.38 MB/s
amount = 32 MB; time taken = 0.88 secs; rate = 36.36 MB/s
amount = 64 MB; time taken = 3.20 secs; rate = 20.03 MB/s
amount = 128 MB; time taken = 12.36 secs; rate = 10.35 MB/s

For Python 3.3 this looks like O(n^2) complexity to me.  2.7 is better but still struggles for large amounts.

Changing Popen._readerthread() to read in chunks rather than using FileIO.readall() produces a huge speed up:

Python 3.3 with patch:
amount = 1 MB; time taken = 0.01 secs; rate = 76.92 MB/s
amount = 2 MB; time taken = 0.03 secs; rate = 76.92 MB/s
amount = 4 MB; time taken = 0.04 secs; rate = 111.10 MB/s
amount = 8 MB; time taken = 0.05 secs; rate = 148.14 MB/s
amount = 16 MB; time taken = 0.10 secs; rate = 156.85 MB/s
amount = 32 MB; time taken = 0.16 secs; rate = 198.75 MB/s
amount = 64 MB; time taken = 0.31 secs; rate = 205.78 MB/s
amount = 128 MB; time taken = 0.61 secs; rate = 209.82 MB/s

Maybe FileIO.readall() should do something similar for files whose size cannot be determined by stat().
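For reference, the chunked strategy suggested here is essentially what io.RawIOBase.readall() already does. A minimal sketch (the helper name and default chunk size are illustrative, not taken from any patch on this issue):

```python
import io

def readall_chunked(f, chunk_size=io.DEFAULT_BUFFER_SIZE):
    """Accumulate fixed-size chunks and join once at the end,
    roughly what io.RawIOBase.readall() does for unsized streams."""
    chunks = []
    while True:
        chunk = f.read(chunk_size)
        if not chunk:            # b'' signals EOF on a blocking stream
            break
        chunks.append(chunk)
    # One O(n) join instead of repeatedly growing a single buffer.
    return b"".join(chunks)
```

Because each chunk is copied exactly once into the final result, the total work is linear in the amount of data read, regardless of how many reads it takes.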
msg168819 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-21 22:59
Yes, I think FileIO.readall() should be fixed rather than avoided.
msg168825 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-21 23:45
RawIOBase.readall() does the sensible thing already.  Maybe FileIO should be allowed to inherit it.

The alternative patch (which probably only works for the raw unbuffered case):

diff -r ca54c27a9045 Lib/subprocess.py
--- a/Lib/subprocess.py Tue Aug 21 14:54:22 2012 +0100
+++ b/Lib/subprocess.py Wed Aug 22 00:39:32 2012 +0100
@@ -1152,7 +1152,7 @@


         def _readerthread(self, fh, buffer):
-            buffer.append(fh.read())
+            buffer.append(io.RawIOBase.readall(fh))
             fh.close()

produces

amount = 1 MB; time taken = 0.01 secs; rate = 71.42 MB/s
amount = 2 MB; time taken = 0.02 secs; rate = 83.33 MB/s
amount = 4 MB; time taken = 0.04 secs; rate = 105.26 MB/s
amount = 8 MB; time taken = 0.06 secs; rate = 135.59 MB/s
amount = 16 MB; time taken = 0.09 secs; rate = 170.20 MB/s
amount = 32 MB; time taken = 0.17 secs; rate = 190.46 MB/s
amount = 64 MB; time taken = 0.32 secs; rate = 202.52 MB/s
amount = 128 MB; time taken = 0.61 secs; rate = 211.56 MB/s
msg168826 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-21 23:58
FileIO.readall() already has an overallocation mechanism which should yield linear complexity. Perhaps it needs to be tweaked a bit?
(look at new_buffersize in Modules/_io/fileio.c)

By the way, your results are a bit weird. Why does the data rate increase with the transferred amount?
msg168829 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-22 00:09
> FileIO.readall() already has an overallocation mechanism which should 
> yield linear complexity. Perhaps it needs to be tweaked a bit?
> (look at new_buffersize in Modules/_io/fileio.c)

I think it needs a bit more than a tweak ;-)  It looks like it increases by 12.5% at a time, but it does not seem to work as intended.

> By the way, your results are bit weird. Why does the data rate increase 
> with the transfered amount?

The rates are completely unreliable for those benchmarks which take less than 0.1 seconds.  They level off after 0.1 secs when the overheads become insignificant.
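The growth rule under discussion can be sketched in Python (the small additive constant approximates the C code in Modules/_io/fileio.c rather than being copied from it):

```python
def new_buffersize(currentsize):
    # Approximation of new_buffersize() in Modules/_io/fileio.c:
    # grow by about 12.5% (currentsize >> 3) plus a small constant.
    # Growth proportional to the current size gives amortized O(n)
    # copying -- but only if a resize is requested solely when the
    # buffer is actually full.
    return currentsize + (currentsize >> 3) + 6

# Sizes grow geometrically, so reaching n bytes takes O(log n) resizes.
size, resizes = 256, 0
while size < (1 << 20):
    size = new_buffersize(size)
    resizes += 1
```

With this rule, growing from 256 bytes to 1 MB takes on the order of 70 resizes, not hundreds of thousands; the pathology reported above comes from resizing on every read rather than from the growth factor itself.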
msg168862 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-22 09:41
In each loop before calling read() the buffer size is recalculated based on the amount of space used, causing a realloc *regardless* of how much empty space is left in the buffer.  And each read is only producing a smallish chunk (5120 bytes).

So assuming realloc fails to do an in-place realloc (and Windows is reputedly very bad at that), the amount of data copied will be O(n^2).
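A back-of-the-envelope model of that worst case (the chunk size is the 5120-byte pipe read observed above; the helper is illustrative only):

```python
CHUNK = 5120  # bytes returned per pipe read, as observed above

def bytes_copied(total, chunk=CHUNK):
    """Bytes moved if every read triggers a realloc that copies the
    whole buffer, i.e. realloc never succeeds in place."""
    copied = size = 0
    while size < total:
        size += chunk
        copied += size  # the realloc copies everything read so far
    return copied

# Summing 1..(n/chunk) copies of up to n bytes moves about
# n**2 / (2 * chunk) bytes in total: quadratic in n.
```

Doubling the input roughly quadruples the bytes copied, which matches the benchmark above where 32 MB takes about four times as long as 16 MB.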
msg168863 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-22 09:58
The attached patch (readall-resize.patch) makes the resizes only happen when the buffer is full.  It produces

amount = 1 MB; time taken = 0.02 secs; rate = 64.10 MB/s
amount = 2 MB; time taken = 0.03 secs; rate = 64.10 MB/s
amount = 4 MB; time taken = 0.06 secs; rate = 64.10 MB/s
amount = 8 MB; time taken = 0.08 secs; rate = 102.56 MB/s
amount = 16 MB; time taken = 0.17 secs; rate = 93.24 MB/s
amount = 32 MB; time taken = 0.31 secs; rate = 102.56 MB/s
amount = 64 MB; time taken = 0.70 secs; rate = 91.17 MB/s
amount = 128 MB; time taken = 1.81 secs; rate = 70.73 MB/s
msg168865 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-22 10:08
Alternative patch (readall-chunks.patch) which delegates to RawIOBase.readall() if the file cannot be stat-ed or appears to have size zero.

amount = 1 MB; time taken = 0.02 secs; rate = 64.10 MB/s
amount = 2 MB; time taken = 0.02 secs; rate = 128.21 MB/s
amount = 4 MB; time taken = 0.03 secs; rate = 128.20 MB/s
amount = 8 MB; time taken = 0.05 secs; rate = 170.94 MB/s
amount = 16 MB; time taken = 0.09 secs; rate = 170.94 MB/s
amount = 32 MB; time taken = 0.16 secs; rate = 205.13 MB/s
amount = 64 MB; time taken = 0.30 secs; rate = 215.92 MB/s
amount = 128 MB; time taken = 0.58 secs; rate = 221.76 MB/s
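The decision logic behind this fallback can be sketched as follows (a hypothetical pure-Python helper illustrating the strategy, not the C implementation in the patch):

```python
import io
import os

def readall_hybrid(f):
    """Pre-size a single read when fstat() reports a usable size;
    otherwise fall back to io.RawIOBase.readall()'s chunked loop."""
    try:
        size = os.fstat(f.fileno()).st_size
        pos = f.tell()                    # raises OSError on pipes
    except OSError:
        size = pos = 0
    if size > pos:
        # Regular file with a known size: request the rest in one go.
        # (A raw read may still return less; real code has to loop.)
        return f.read(size - pos) or b""
    # Pipes, sockets, devices: st_size is zero or meaningless.
    return io.RawIOBase.readall(f)
```

For regular files this keeps the one-allocation fast path, while anything that cannot be stat-ed usefully (the subprocess pipe case that started this issue) gets the linear chunked behaviour.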
msg168867 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-22 10:57
> Alternative patch (readall-chunks.patch) which delegates to
> RawIOBase.readall() if the file cannot be stat-ed or appears to have
> size zero.

It seems we could do both (fix the resizing logic, and fallback on
chunks if the size is unknown).
msg168876 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-22 12:39
> It seems we could do both (fix the resizing logic, and fallback on
> chunks if the size is unknown).

Combined patch attached.
msg168882 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-22 12:57
If new_buffersize() is only used with HAVE_FSTAT, then its code can probably be simplified.
msg168884 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-08-22 14:05
I'm lost, there are too many patches. Could you please remove the old/useless patches? Patching the subprocess module is not a good solution: if there is an issue in the io module, the io module should be fixed, and then this patch can be removed.
msg168944 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-23 14:07
Here is the patch (with the old ones removed).

Note that the old code mishandled the case where _PyBytes_Resize() failed by assuming that the old bytes object would still be valid.

I have assumed that stream pseudo-files will never claim to have a size greater than zero.  The code will still work if this assumption is false.
msg168945 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-23 14:50
> Here is the patch (with the old ones removed).
> 
> Note that the old code mishandled the case where _PyBytes_Resize()
> failed by assuming that the old bytes object would still be valid.
> 
> I have assumed that stream pseudo-files will never claim to have a
> size greater than zero.  The code will still work if this assumption
> is false.

I haven't measured under Windows, but this looks ok to me.
msg169020 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-24 14:00
After playing with the patch on Linux it seems that Linux much prefers the realloc() scheme to the list-of-chunks scheme.

This new patch only does list-of-chunks on Windows.
msg169021 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-24 14:02
Attached is an alternative benchmark program which does not use subprocess.
msg169022 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-24 14:11
> After playing with the patch on Linux it seems that Linux much prefers
> the realloc() scheme to the list-of-chunks scheme.

What about the method call overhead in RawIO.readall(), and the
different progression of buffer sizes? (the realloc scheme uses larger
and larger read() sizes, while RawIO.readall() uses a constant read()
size).

By the way, not every non-Windows OS is Linux, so the patch is wrong.
msg169044 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-08-24 16:42
> What about the method call overhead in RawIO.readall(), and the
> different progression of buffer sizes? (the realloc scheme uses larger
> and larger read() sizes, while RawIO.readall() uses a constant read()
> size).

For this benchmark the call overhead does not seem to be noticeable, and using larger or adaptive read buffers does not seem to help either.  (I have tried both on Linux.)

> By the way, not every non-Windows OS is Linux, so the patch is wrong.

Wrong in the sense of not necessarily optimal for unknown platforms?  Well, the patch retains the old (intended) behaviour on other platforms, so I would call that conservative rather than wrong.  Are you suggesting switching behaviour depending on whether some macro is defined?


64-bit Linux with current patch (and using new benchmark):
amount = 1 MB; time taken = 0.003 secs; rate = 317.22 MB/s
amount = 2 MB; time taken = 0.007 secs; rate = 283.74 MB/s
amount = 4 MB; time taken = 0.011 secs; rate = 359.58 MB/s
amount = 8 MB; time taken = 0.020 secs; rate = 395.58 MB/s
amount = 16 MB; time taken = 0.030 secs; rate = 528.18 MB/s
amount = 32 MB; time taken = 0.051 secs; rate = 627.72 MB/s
amount = 64 MB; time taken = 0.088 secs; rate = 726.36 MB/s
amount = 128 MB; time taken = 0.133 secs; rate = 960.23 MB/s
amount = 256 MB; time taken = 0.258 secs; rate = 992.32 MB/s
amount = 512 MB; time taken = 0.482 secs; rate = 1062.30 MB/s

On 64-bit Linux with previous patch:
amount = 1 MB; time taken = 0.006 secs; rate = 158.07 MB/s
amount = 2 MB; time taken = 0.011 secs; rate = 177.23 MB/s
amount = 4 MB; time taken = 0.024 secs; rate = 169.32 MB/s
amount = 8 MB; time taken = 0.047 secs; rate = 170.39 MB/s
amount = 16 MB; time taken = 0.098 secs; rate = 163.65 MB/s
amount = 32 MB; time taken = 0.220 secs; rate = 145.19 MB/s
amount = 64 MB; time taken = 0.253 secs; rate = 253.32 MB/s
amount = 128 MB; time taken = 0.724 secs; rate = 176.80 MB/s
amount = 256 MB; time taken = 0.874 secs; rate = 293.02 MB/s
amount = 512 MB; time taken = 2.292 secs; rate = 223.38 MB/s
msg169047 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-08-24 16:49
> For this benchmark the call overhead does not seem to be noticeable,
> and using larger or adaptive read buffers does not seem to help
> either.  (I have tried both on Linux.)

Ok, thank you.

> > By the way, not every non-Windows OS is Linux, so the patch is wrong.
> 
> Wrong in the sense of not necessarily optimal for unknown platforms?
> Well, the patch retains the old (intended) behaviour on other
> platforms, so I would call that conservative rather than wrong.

Hmm, you are right, there is no regression indeed.
I guess I don't like very much the idea of switching code paths based on
the platform for pure optimization reasons, but in this case it seems
useful (and simple enough).

> Are you suggesting switching behaviour depending on whether some macro
> is defined?

No, that would definitely be overkill.
msg189384 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-05-16 15:21
I have done an updated patch.  It no longer special-cases Windows, so realloc() is always used for enlarging the buffer (except when fstat() is missing).

Antoine, do you think this is ready to commit?
msg189408 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-16 20:31
I posted a couple of review comments.
msg189450 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-05-17 12:34
Updated patch addressing Antoine's comments.
msg189459 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2013-05-17 14:55
New patch.
msg189487 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-17 21:54
Looks fine to me.
msg189489 - (view) Author: Roundup Robot (python-dev) Date: 2013-05-17 22:36
New changeset e4c303d23d01 by Richard Oudkerk in branch 'default':
Issue #15758: Fix FileIO.readall() so it no longer has O(n**2) complexity.
http://hg.python.org/cpython/rev/e4c303d23d01
History
Date                 User        Action  Args
2013-05-18 14:19:37  sbt         set     status: open -> closed
                                         resolution: fixed
                                         stage: patch review -> resolved
2013-05-17 22:36:43  python-dev  set     nosy: + python-dev
                                         messages: + msg189489
2013-05-17 21:54:29  pitrou      set     messages: + msg189487
2013-05-17 15:10:15  sbt         set     files: - readall.patch
2013-05-17 15:10:09  sbt         set     files: + readall.patch
2013-05-17 14:55:07  sbt         set     messages: + msg189459
2013-05-17 14:48:46  sbt         set     files: - readall.patch
2013-05-17 14:48:40  sbt         set     files: + readall.patch
2013-05-17 12:34:32  sbt         set     files: + readall-benchmark.py
2013-05-17 12:34:24  sbt         set     files: - readall-combined.patch
2013-05-17 12:34:18  sbt         set     files: - readall.patch
2013-05-17 12:34:12  sbt         set     files: + readall.patch
                                         messages: + msg189450
2013-05-16 20:31:00  pitrou      set     messages: + msg189408
2013-05-16 15:21:16  sbt         set     files: - readall-benchmark.py
2013-05-16 15:21:03  sbt         set     files: + readall.patch
                                         messages: + msg189384
2012-08-24 16:49:10  pitrou      set     messages: + msg169047
2012-08-24 16:42:37  sbt         set     messages: + msg169044
2012-08-24 14:11:49  pitrou      set     messages: + msg169022
2012-08-24 14:02:38  sbt         set     files: - readall-combined.patch
2012-08-24 14:02:30  sbt         set     files: - push-thru-cat.py
2012-08-24 14:02:16  sbt         set     files: + readall-benchmark.py
                                         messages: + msg169021
2012-08-24 14:00:36  sbt         set     files: + readall-combined.patch
                                         messages: + msg169020
2012-08-23 14:50:11  pitrou      set     messages: + msg168945
2012-08-23 14:08:02  sbt         set     files: + readall-combined.patch
                                         messages: + msg168944
                                         title: 500x speed up for Popen.communicate() on Windows -> FileIO.readall() has worst case O(n^2) complexity
2012-08-23 14:02:09  sbt         set     files: - readall-combined.patch
2012-08-23 14:02:02  sbt         set     files: - readall-chunks.patch
2012-08-23 14:01:53  sbt         set     files: - readall-resize.patch
2012-08-23 14:01:44  sbt         set     files: - popen_communicate.patch
2012-08-22 14:21:05  schmir      set     nosy: + schmir
2012-08-22 14:05:29  haypo       set     messages: + msg168884
2012-08-22 14:02:59  haypo       set     nosy: + haypo
2012-08-22 12:57:50  pitrou      set     messages: + msg168882
2012-08-22 12:39:36  sbt         set     files: + readall-combined.patch
                                         messages: + msg168876
2012-08-22 10:57:01  pitrou      set     messages: + msg168867
2012-08-22 10:08:11  sbt         set     files: + readall-chunks.patch
                                         messages: + msg168865
2012-08-22 09:58:03  sbt         set     files: + readall-resize.patch
                                         messages: + msg168863
2012-08-22 09:41:31  sbt         set     messages: + msg168862
2012-08-22 00:09:03  sbt         set     messages: + msg168829
2012-08-21 23:58:42  pitrou      set     messages: + msg168826
2012-08-21 23:45:56  sbt         set     nosy: + pitrou
                                         messages: + msg168819
                                         versions: + Python 3.4
2012-08-21 22:26:34  sbt         set     files: + popen_communicate.patch
                                         keywords: + patch
2012-08-21 22:25:09  sbt         create