msg168813 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-21 22:25 |
Piping significant amounts of data through a subprocess using Popen.communicate() is crazily slow on Windows.
The attached program just pushes data through mingw's cat.exe.
Python 3.3:
amount = 1 MB; time taken = 0.07 secs; rate = 13.51 MB/s
amount = 2 MB; time taken = 0.31 secs; rate = 6.51 MB/s
amount = 4 MB; time taken = 1.30 secs; rate = 3.08 MB/s
amount = 8 MB; time taken = 5.43 secs; rate = 1.47 MB/s
amount = 16 MB; time taken = 21.64 secs; rate = 0.74 MB/s
amount = 32 MB; time taken = 87.36 secs; rate = 0.37 MB/s
Python 2.7:
amount = 1 MB; time taken = 0.02 secs; rate = 66.67 MB/s
amount = 2 MB; time taken = 0.03 secs; rate = 68.97 MB/s
amount = 4 MB; time taken = 0.05 secs; rate = 76.92 MB/s
amount = 8 MB; time taken = 0.10 secs; rate = 82.47 MB/s
amount = 16 MB; time taken = 0.27 secs; rate = 60.38 MB/s
amount = 32 MB; time taken = 0.88 secs; rate = 36.36 MB/s
amount = 64 MB; time taken = 3.20 secs; rate = 20.03 MB/s
amount = 128 MB; time taken = 12.36 secs; rate = 10.35 MB/s
For Python 3.3 this looks like O(n^2) complexity to me. 2.7 is better but still struggles for large amounts.
Changing Popen._readerthread() to read in chunks rather than using FileIO.readall() produces a huge speed up:
Python 3.3 with patch:
amount = 1 MB; time taken = 0.01 secs; rate = 76.92 MB/s
amount = 2 MB; time taken = 0.03 secs; rate = 76.92 MB/s
amount = 4 MB; time taken = 0.04 secs; rate = 111.10 MB/s
amount = 8 MB; time taken = 0.05 secs; rate = 148.14 MB/s
amount = 16 MB; time taken = 0.10 secs; rate = 156.85 MB/s
amount = 32 MB; time taken = 0.16 secs; rate = 198.75 MB/s
amount = 64 MB; time taken = 0.31 secs; rate = 205.78 MB/s
amount = 128 MB; time taken = 0.61 secs; rate = 209.82 MB/s
Maybe FileIO.readall() should do something similar for files whose size cannot be determined by stat().
|
msg168819 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-21 22:59 |
Yes, I think FileIO.readall() should be fixed rather than avoided.
|
msg168825 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-21 23:45 |
RawIOBase.readall() does the sensible thing already. Maybe FileIO should be allowed to inherit it.
The alternative patch (which probably only works for raw unbuffered case)
diff -r ca54c27a9045 Lib/subprocess.py
--- a/Lib/subprocess.py Tue Aug 21 14:54:22 2012 +0100
+++ b/Lib/subprocess.py Wed Aug 22 00:39:32 2012 +0100
@@ -1152,7 +1152,7 @@
def _readerthread(self, fh, buffer):
- buffer.append(fh.read())
+ buffer.append(io.RawIOBase.readall(fh))
fh.close()
produces
amount = 1 MB; time taken = 0.01 secs; rate = 71.42 MB/s
amount = 2 MB; time taken = 0.02 secs; rate = 83.33 MB/s
amount = 4 MB; time taken = 0.04 secs; rate = 105.26 MB/s
amount = 8 MB; time taken = 0.06 secs; rate = 135.59 MB/s
amount = 16 MB; time taken = 0.09 secs; rate = 170.20 MB/s
amount = 32 MB; time taken = 0.17 secs; rate = 190.46 MB/s
amount = 64 MB; time taken = 0.32 secs; rate = 202.52 MB/s
amount = 128 MB; time taken = 0.61 secs; rate = 211.56 MB/s
|
msg168826 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-21 23:58 |
FileIO.readall() already has an overallocation mechanism which should yield linear complexity. Perhaps it needs to be tweaked a bit?
(look at new_buffersize in Modules/_io/fileio.c)
By the way, your results are bit weird. Why does the data rate increase with the transfered amount?
|
msg168829 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-22 00:09 |
> FileIO.readall() already has an overallocation mechanism which should
> yield linear complexity. Perhaps it needs to be tweaked a bit?
> (look at new_buffersize in Modules/_io/fileio.c)
I think it needs a bit more than a tweak;-) Looks like it increases by 12.5% at time. Does not seem to work as intended though.
> By the way, your results are bit weird. Why does the data rate increase
> with the transfered amount?
The rates are completely unreliable for those benchmarks which take less than 0.1 seconds. They level off after 0.1 secs when the overheads become insignificant.
|
msg168862 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-22 09:41 |
In each loop before calling read() the buffer size is recalculated based on the amount of space used, causing a realloc *regardless* of how much empty space is left in the buffer. And each read is only producing a smallish chunk (5120 bytes).
So assuming realloc fails to do an inplace realloc (and Windows is reputedly very bad at that) the amount of data copied will be O(n^2).
|
msg168863 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-22 09:58 |
The attached patch (readall-resize.patch) makes the resizes only happen when the buffer is full. It produces
amount = 1 MB; time taken = 0.02 secs; rate = 64.10 MB/s
amount = 2 MB; time taken = 0.03 secs; rate = 64.10 MB/s
amount = 4 MB; time taken = 0.06 secs; rate = 64.10 MB/s
amount = 8 MB; time taken = 0.08 secs; rate = 102.56 MB/s
amount = 16 MB; time taken = 0.17 secs; rate = 93.24 MB/s
amount = 32 MB; time taken = 0.31 secs; rate = 102.56 MB/s
amount = 64 MB; time taken = 0.70 secs; rate = 91.17 MB/s
amount = 128 MB; time taken = 1.81 secs; rate = 70.73 MB/s
|
msg168865 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-22 10:08 |
Alternative patch (readall-chunks.patch) which delegates to RawIOBase.readall() if the file cannot be stat-ed or appears to have size zero.
amount = 1 MB; time taken = 0.02 secs; rate = 64.10 MB/s
amount = 2 MB; time taken = 0.02 secs; rate = 128.21 MB/s
amount = 4 MB; time taken = 0.03 secs; rate = 128.20 MB/s
amount = 8 MB; time taken = 0.05 secs; rate = 170.94 MB/s
amount = 16 MB; time taken = 0.09 secs; rate = 170.94 MB/s
amount = 32 MB; time taken = 0.16 secs; rate = 205.13 MB/s
amount = 64 MB; time taken = 0.30 secs; rate = 215.92 MB/s
amount = 128 MB; time taken = 0.58 secs; rate = 221.76 MB/s
|
msg168867 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-22 10:57 |
> Alternative patch (readall-chunks.patch) which delegates to
> RawIOBase.readall() if the file cannot be stat-ed or appears to have
> size zero.
It seems we could do both (fix the resizing logic, and fallback on
chunks if the size is unknown).
|
msg168876 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-22 12:39 |
> It seems we could do both (fix the resizing logic, and fallback on
> chunks if the size is unknown).
Combined patch attached.
|
msg168882 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-22 12:57 |
If new_buffersize() is only used with HAVE_FSTAT, then its code can probably be simplified.
|
msg168884 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2012-08-22 14:05 |
I'm lost, there are too much patches. Could you please remove old/useless patches? Patching the subprocess module is not a good solution: if there is an issue in the io module, the io module should be fixed, and so this patch can be removed.
|
msg168944 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-23 14:07 |
Here is the patch (with the old ones removed).
Note that the old code mishandled the case where _PyBytes_Resize() failed by assuming that the old bytes object would still be valid.
I have assumed that stream psuedo-files will never claim to have a size greater than zero. The code will still work if this assumption is false.
|
msg168945 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-23 14:50 |
> Here is the patch (with the old ones removed).
>
> Note that the old code mishandled the case where _PyBytes_Resize()
> failed by assuming that the old bytes object would still be valid.
>
> I have assumed that stream psuedo-files will never claim to have a
> size greater than zero. The code will still work if this assumption
> is false.
I haven't measured under Windows, but this looks ok to me.
|
msg169020 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-24 14:00 |
After playing with the patch on Linux it seems that Linux much prefers the realloc() scheme to the list-of-chunks scheme.
This new patch only does list-of-chunks on Windows.
|
msg169021 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-24 14:02 |
Attached is an alternative benchmark program which does not use subprocess.
|
msg169022 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-24 14:11 |
> After playing with the patch on Linux it seems that Linux much prefers
> the realloc() scheme to the list-of-chunks scheme.
What about the method call overhead in RawIO.readall(), and the
different progression of buffer sizes? (the realloc scheme uses larger
and larger read() sizes, while RawIO.readall() uses a constant read()
size).
By the way, not every non-Windows OS is Linux, so the patch is wrong.
|
msg169044 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2012-08-24 16:42 |
> What about the method call overhead in RawIO.readall(), and the
> different progression of buffer sizes? (the realloc scheme uses larger
> and larger read() sizes, while RawIO.readall() uses a constant read()
> size).
For this benchmark the call overhead does not seem to be noticeable, and using larger or adaptive read buffers does not seem to help either. (I have tried both on Linux.)
> By the way, not every non-Windows OS is Linux, so the patch is wrong.
Wrong in the sense of not necessarily optimal for unknown platforms? Well, the patch retains the old (intended) behaviour on other platforms, so I would call that conservative rather than wrong. Are you suggesting switching behaviour depending on whether some macro is defined?
64-bit Linux with current patch (and using new benchmark):
amount = 1 MB; time taken = 0.003 secs; rate = 317.22 MB/s
amount = 2 MB; time taken = 0.007 secs; rate = 283.74 MB/s
amount = 4 MB; time taken = 0.011 secs; rate = 359.58 MB/s
amount = 8 MB; time taken = 0.020 secs; rate = 395.58 MB/s
amount = 16 MB; time taken = 0.030 secs; rate = 528.18 MB/s
amount = 32 MB; time taken = 0.051 secs; rate = 627.72 MB/s
amount = 64 MB; time taken = 0.088 secs; rate = 726.36 MB/s
amount = 128 MB; time taken = 0.133 secs; rate = 960.23 MB/s
amount = 256 MB; time taken = 0.258 secs; rate = 992.32 MB/s
amount = 512 MB; time taken = 0.482 secs; rate = 1062.30 MB/s
On 64-bit Linux with previous patch:
amount = 1 MB; time taken = 0.006 secs; rate = 158.07 MB/s
amount = 2 MB; time taken = 0.011 secs; rate = 177.23 MB/s
amount = 4 MB; time taken = 0.024 secs; rate = 169.32 MB/s
amount = 8 MB; time taken = 0.047 secs; rate = 170.39 MB/s
amount = 16 MB; time taken = 0.098 secs; rate = 163.65 MB/s
amount = 32 MB; time taken = 0.220 secs; rate = 145.19 MB/s
amount = 64 MB; time taken = 0.253 secs; rate = 253.32 MB/s
amount = 128 MB; time taken = 0.724 secs; rate = 176.80 MB/s
amount = 256 MB; time taken = 0.874 secs; rate = 293.02 MB/s
amount = 512 MB; time taken = 2.292 secs; rate = 223.38 MB/s
|
msg169047 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2012-08-24 16:49 |
> For this benchmark the call overhead does not seem to be noticeable,
> and using larger or adaptive read buffers does not seem to help
> either. (I have tried both on Linux.)
Ok, thank you.
> > By the way, not every non-Windows OS is Linux, so the patch is wrong.
>
> Wrong in the sense of not necessarily optimal for unknown platforms?
> Well, the patch retains the old (intended) behaviour on other
> platforms, so I would call that conservative rather than wrong.
Hmm, you are right, there is no regression indeed.
I guess I don't like very much the idea of switching code paths based on
the platform for pure optimization reasons, but in this case it seems
useful (and simple enough).
> Are you suggesting switching behaviour depending on whether some macro
> is defined?
No, that would definitely be overkill.
|
msg189384 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2013-05-16 15:21 |
I have done an updated patch. It no longer special cases Windows, so realloc() is always used for enlarging the buffer (except when fstat() is missing).
Antoine, do you think this is ready to commit?
|
msg189408 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-05-16 20:31 |
I posted a couple of review comments.
|
msg189450 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2013-05-17 12:34 |
Updated patch adressing Antoine's comments.
|
msg189459 - (view) |
Author: Richard Oudkerk (sbt) * |
Date: 2013-05-17 14:55 |
New patch.
|
msg189487 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2013-05-17 21:54 |
Looks fine to me.
|
msg189489 - (view) |
Author: Roundup Robot (python-dev) |
Date: 2013-05-17 22:36 |
New changeset e4c303d23d01 by Richard Oudkerk in branch 'default':
Issue #15758: Fix FileIO.readall() so it no longer has O(n**2) complexity.
http://hg.python.org/cpython/rev/e4c303d23d01
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:35 | admin | set | github: 59962 |
2013-05-18 14:19:37 | sbt | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
2013-05-17 22:36:43 | python-dev | set | nosy:
+ python-dev messages:
+ msg189489
|
2013-05-17 21:54:29 | pitrou | set | messages:
+ msg189487 |
2013-05-17 15:10:15 | sbt | set | files:
- readall.patch |
2013-05-17 15:10:09 | sbt | set | files:
+ readall.patch |
2013-05-17 14:55:07 | sbt | set | messages:
+ msg189459 |
2013-05-17 14:48:46 | sbt | set | files:
- readall.patch |
2013-05-17 14:48:40 | sbt | set | files:
+ readall.patch |
2013-05-17 12:34:32 | sbt | set | files:
+ readall-benchmark.py |
2013-05-17 12:34:24 | sbt | set | files:
- readall-combined.patch |
2013-05-17 12:34:18 | sbt | set | files:
- readall.patch |
2013-05-17 12:34:12 | sbt | set | files:
+ readall.patch
messages:
+ msg189450 |
2013-05-16 20:31:00 | pitrou | set | messages:
+ msg189408 |
2013-05-16 15:21:16 | sbt | set | files:
- readall-benchmark.py |
2013-05-16 15:21:03 | sbt | set | files:
+ readall.patch
messages:
+ msg189384 |
2012-08-24 16:49:10 | pitrou | set | messages:
+ msg169047 |
2012-08-24 16:42:37 | sbt | set | messages:
+ msg169044 |
2012-08-24 14:11:49 | pitrou | set | messages:
+ msg169022 |
2012-08-24 14:02:38 | sbt | set | files:
- readall-combined.patch |
2012-08-24 14:02:30 | sbt | set | files:
- push-thru-cat.py |
2012-08-24 14:02:16 | sbt | set | files:
+ readall-benchmark.py
messages:
+ msg169021 |
2012-08-24 14:00:36 | sbt | set | files:
+ readall-combined.patch
messages:
+ msg169020 |
2012-08-23 14:50:11 | pitrou | set | messages:
+ msg168945 |
2012-08-23 14:08:02 | sbt | set | files:
+ readall-combined.patch
messages:
+ msg168944 title: 500x speed up for Popen.communicate() on Windows -> FileIO.readall() has worst case O(n^2) complexity |
2012-08-23 14:02:09 | sbt | set | files:
- readall-combined.patch |
2012-08-23 14:02:02 | sbt | set | files:
- readall-chunks.patch |
2012-08-23 14:01:53 | sbt | set | files:
- readall-resize.patch |
2012-08-23 14:01:44 | sbt | set | files:
- popen_communicate.patch |
2012-08-22 14:21:05 | schmir | set | nosy:
+ schmir
|
2012-08-22 14:05:29 | vstinner | set | messages:
+ msg168884 |
2012-08-22 14:02:59 | vstinner | set | nosy:
+ vstinner
|
2012-08-22 12:57:50 | pitrou | set | messages:
+ msg168882 |
2012-08-22 12:39:36 | sbt | set | files:
+ readall-combined.patch
messages:
+ msg168876 |
2012-08-22 10:57:01 | pitrou | set | messages:
+ msg168867 |
2012-08-22 10:08:11 | sbt | set | files:
+ readall-chunks.patch
messages:
+ msg168865 |
2012-08-22 09:58:03 | sbt | set | files:
+ readall-resize.patch
messages:
+ msg168863 |
2012-08-22 09:41:31 | sbt | set | messages:
+ msg168862 |
2012-08-22 00:09:03 | sbt | set | messages:
+ msg168829 |
2012-08-21 23:58:42 | pitrou | set | messages:
+ msg168826 |
2012-08-21 23:45:56 | sbt | set | messages:
+ msg168825 |
2012-08-21 22:59:42 | pitrou | set | nosy:
+ pitrou
messages:
+ msg168819 versions:
+ Python 3.4 |
2012-08-21 22:26:34 | sbt | set | files:
+ popen_communicate.patch keywords:
+ patch |
2012-08-21 22:25:09 | sbt | create | |