classification
Title: Subprocess suffers 40% process creation overhead penalty
Type: resource usage    Stage:
Components: Library (Lib)    Versions: Python 2.7, Python 2.6
Status: closed    Resolution: fixed
Dependencies:    Superseder:
Assigned To:    Nosy List: Aaron.Sherman, giampaolo.rodola, gregory.p.smith, neologix, pitrou, vstinner
Priority: normal    Keywords: patch

Created on 2011-02-24 23:22 by Aaron.Sherman, last changed 2011-03-02 22:15 by pitrou. This issue is now closed.

Files
File name Uploaded Description
subprocess_errpipe_buffer.patch vstinner, 2011-03-02 11:53
bench_subprocess.py vstinner, 2011-03-02 12:11
Messages (24)
msg129319 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-02-24 23:22
I wrote some code a while back which used os.popen. I recently got a warning about popen being deprecated so I tried a test with the new subprocess module. In that test, subprocess.Popen appears to have a 40% process creation overhead penalty over os.popen, which really isn't small. It seems that the difference may be made up of some heavy mmap-related work that's happening in my version of python, and that might be highly platform specific, but the mmap/mremap/munmap calls being made in my sample subprocess code aren't being made at all in the os.popen equivalent.

Now, before someone says, "process creation is trivial, so a 40% hit is acceptable because it's 40% of a trivial part of your execution time," keep in mind that many Python applications are heavily process-creation focused. In my case that means monitoring, but I could also imagine this having a substantial impact on Web services and other applications that spend almost all of their time creating child processes. For a trivial script, subprocess is fine as is, but for these demanding applications, subprocess represents a significant source of pain.

Anyway my testing, results and conclusions are written up here:

http://essays.ajs.com/2011/02/python-subprocess-vs-ospopen-overhead.html
msg129320 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-02-24 23:43
Python 3.2 has a _posixsubprocess: some parts of subprocess are implemented in C. Can you try it?

Python 3.2 uses also pipe2(), if available, to avoid the extra fcntl(4, F_GETFD)+fcntl(4, F_SETFD, FD_CLOEXEC).

I suppose that the pipe and mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) is used by subprocess to transfer a Python exception from the child process. Python 2.7 encodes the exception value, type and traceback using pickle, and the parent process calls os.read(1048576). Python 3.2 only encodes the exception value and type using a simple string, the parent process uses bytearray with chunks of 50,000 bytes (stop when the total size is bigger than 50,000 bytes). So I suppose that Python 3.2 allocates less memory in the parent process to read the child exception (if any).

You may also try to change Popen buffer size, but it should not change anything if you test "exit 0".
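The chunked-read strategy described above (a bytearray filled in 50,000-byte chunks, stopping once the total passes 50,000 bytes) can be sketched as follows; read_errpipe and the sizes are illustrative names for this sketch, not the actual subprocess internals:

```python
import os

def read_errpipe(fd, chunk_size=50000, limit=50000):
    # Read the error pipe in fixed-size chunks into a bytearray,
    # stopping when the child closes the pipe or the total size
    # exceeds the limit -- the Python 3.2 strategy described above.
    data = bytearray()
    while len(data) <= limit:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # pipe closed: no more data from the child
            break
        data += chunk
    return bytes(data)
```

Compared to a single os.read(1048576), this never allocates more memory than the child actually wrote.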
msg129321 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-24 23:44
I think your analysis is wrong. These mmap() calls are for anonymous memory, most likely they are emitted by the libc's malloc() to get some memory from the kernel. In other words they will be blazingly fast.

I would suggest you try to dig deeper. For example, how much CPU time does the parent process take (excluding its children)?

Of course, I also disagree with the idea that spawning "exit 0" subprocesses is a performance critical operation ;) Therefore, it would be useful to know the absolute overhead difference (in milliseconds) between subprocess and os.popen(), to decide if there is really a problem.
msg129333 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-02-25 06:01
"Python 3.2 has a _posixsubprocess: some parts of subprocess are implemented in C. Can you try it?"

I don't have a Python 3 installation handy, but I can see what I can do tomorrow evening to get one set up and try it out.

"disagree with the idea that spawning "exit 0" subprocesses is a performance critical operation ;)"

How else would you performance test process creation overhead? By introducing as little additional overhead as possible, it's possible for me to get fairly close to measuring just the subprocess module's overhead.

If you stop to think about it, though, this is actually a shockingly huge percent increase. In any process creation scenario I'm familiar with, its overhead should be so small that you could bump it up several orders of magnitude and still not compete with executing a shell and asking it to do anything, even just exit.

And yet, here we are: 40%.

I understand that most applications won't be running massive numbers of external commands in parallel, and that's the only way this overhead will really matter (at least that I can think of). But in the two scenarios I mentioned (monitoring and Web services such as CGI, neither of which is particularly rare), this is going to make quite a lot of difference, and if you're going to deprecate os.popen, I would think that making sure your proposed replacement was at least nearly as performant would be standard procedure, no?

"I think your analysis is wrong. These mmap() calls are for anonymous memory, most likely they are emitted by the libc's malloc() to get some memory from the kernel. In other words they will be blazingly fast."

The mremap might be a bit of a performance hit, but it's true that these calls should not be substantially slowing execution... then again, they might indicate that there's substantial amounts of work being done for which memory allocation is required, and as such may simply be a symptom of the actual problem.
msg129350 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-02-25 10:48
> If you stop to think about it, though, this is actually a shockingly huge percent increase. In any process creation scenario I'm familiar with, its overhead should be so small that you could bump it up several orders of magnitude and still not compete with executing a shell and asking it to do anything, even just exit.

os.popen just calls the popen(3) library call, which just performs a fork/execve and some dup/close in between.
subprocess.Popen is implemented in Python, so it doesn't come as a surprise that it's slower in your example.
But I think there's a point you're missing: you're not just creating a child process, but a shell.
subprocess is actually faster than popen to spawn executables (not scripts), because you don't have to spawn a shell.

For example:

$ cat /tmp/test.c 
int main(int argc, char *argv[])
{
        return 0;
}

$ cat /tmp/test_subprocess.py 
import subprocess

for i in range(10000):
    f = subprocess.Popen('/tmp/test')
    f.wait()

$ cat /tmp/test_popen.py 
import os

for i in range(10000):
    f = os.popen('/tmp/test')
    f.close()

$ time ./python /tmp/test_subprocess.py

real    0m13.933s
user    0m3.083s
sys     0m12.441s

$ time ./python /tmp/test_popen.py 

real    0m18.235s
user    0m4.293s
sys     0m15.176s

Given how important the subprocess overhead seems to be to you, I guess that the processes you're launching are not shell scripts, and thus you're probably better off using subprocess.
msg129377 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-25 15:08
> But in the two scenarios I mentioned (monitoring and Web services such 
> as CGI, neither of which is particularly rare), this is going to make
> quite a lot of difference

That's why I asked for absolute numbers for the overhead difference. A percentage doesn't tell me if launching a process costs 50µs more or 500ms more.

The only question is whether process launch overhead is significant compared to total process runtime (for non-trivial work, that is, not "echo 0" ;-)).
msg129379 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-25 15:12
Okay, thanks to Charles-François' measurements, we can deduce that each subprocess launch is at most 0.3ms of user CPU time and 1.2ms of system CPU time. IMO that's not a real problem.
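These per-launch figures follow from dividing the CPU totals of the 10,000-iteration test_subprocess.py run above by the iteration count:

```python
# CPU totals from the test_subprocess.py timing posted above (msg129350)
user_s, sys_s = 3.083, 12.441   # user and system CPU time, in seconds
launches = 10000

user_ms_per_launch = user_s / launches * 1000   # ~0.31 ms user CPU
sys_ms_per_launch = sys_s / launches * 1000     # ~1.24 ms system CPU
print(user_ms_per_launch, sys_ms_per_launch)
```

Note these totals include the cost of the child shells themselves, so the subprocess machinery's own share is at most this much.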
msg129397 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-02-25 18:28
fyi - while I still need to update it now that 3.2 final has been released, the http://code.google.com/p/python-subprocess32/ project has a backport of the _posixsubprocess stuff from python 3.2 for use on Python 2.x.
msg129450 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-02-25 22:50
"That's why I asked for absolute numbers for the overhead difference."

Did you not follow the link in my first post? I got pretty detailed, there.

"os.popen just calls the popen(3) library call, which just performs a fork/execve and some dup/close in between. subprocess.Popen is implemented in Python, so it doesn't come as a surprise that it's slower in your example."

Well, of course. I don't think I was ever trying to claim that os.popen vs. subprocess without a shell was going to compare favorably. I'm not advocating os.popen, here, I'm just trying to figure out where this massive overhead is coming from. I think the answer is just, "pure Python is fundamentally slower, and that's not a surprise."

Now, if the 3.x subprocess work that was described here, gets back-ported into 2.x and is included with future releases, that will definitely serve to improve the situation, and might well render much of this moot (testing will tell).

However, I do think that doing the performance testing before deprecating the previous interface would have been a good idea...
msg129633 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-02-27 15:05
> Aaron Sherman <ajs@ajs.com> added the comment:
>
> "That's why I asked for absolute numbers for the overhead difference."
>
> Did you not follow the link in my first post? I got pretty detailed, there.
>

By the way, strace has a '-T' flag that can be used to measure the
time spent in syscalls (but for a mmap/mremap/munmap it won't be
significant).

> However, I do think that doing the performance testing before deprecating
> the previous interface would have been a good idea...
>

Once again: subprocess.Popen is actually faster than os.popen to create
a subprocess. Just don't pass shell=True. Unless you're spawning shell
scripts, but in that case subprocess' overhead is definitely not the
biggest problem.
msg129878 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 11:53
subprocess_errpipe_buffer.patch: Patch for 2.7 which improves the dummy benchmark (run /bin/false 10,000 times) by 30%. It avoids the creation of the 1 MB buffer: start with a buffer of a single byte (just to check whether there is data or the pipe was closed).
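The buffering idea behind the patch can be sketched like this (an illustrative sketch, not the patch itself; read_child_error stands in for the code that reads the pipe the child's exec path writes errors to):

```python
import os

def read_child_error(fd, bufsize=1048576):
    # Probe the error pipe with a single-byte read first.  If the exec
    # succeeded, the child closed the pipe without writing anything, so
    # the large buffer is never allocated; only on failure is the rest
    # of the error report read.
    first = os.read(fd, 1)
    if not first:
        return b""                            # no data: exec succeeded
    return first + os.read(fd, bufsize - 1)   # child reported an error
```

In the common case (exec succeeds) this does one 1-byte read instead of one 1 MB allocation per launch.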
msg129879 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:09
> subprocess_errpipe_buffer.patch: Patch for 2.7 which improve by 30%
> the dummy benchmark (run /bin/false 10,000 times). It avoids the
> creation of the 1 MB buffer: start with a buffer of a single byte
> (just to check if there is data or if the pipe was closed).

Does it improve on non-dummy benchmarks? Otherwise I don't think it's
worth the trouble.
msg129880 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 12:11
Quick benchmark:
 - fork(), execv(), waitpid(): 19081.1 ms
 - os.popen(): 25769.8 ms
 - subprocess.Popen(): 40025.8 ms
 - subprocess.Popen() patched: 26155.2 ms

I tested Python 2.7 (debug build) on Linux (Debian Sid) with a Pentium 4 (Hyperthreading, 2 fake cores), 2 GB of memory.

Without the patch, subprocess is 55% slower than popen. With the patch, subprocess is 1% slower than popen. It is not good news: subprocess should be faster, because my benchmark tries Popen() without stdout=PIPE ;-)

pitrou> I think your analysis is wrong. These mmap() calls are for
pitrou> anonymous memory, most likely they are emitted by the libc's
pitrou> malloc() to get some memory from the kernel. In other words
pitrou> they will be blazingly fast.

Are you sure? :-)

Note that popen() is 35% slower than fork()+execv()+waitpid(). Patched subprocess is 37% slower than fork()+execv()+waitpid().

Try attached bench_subprocess.py script to get your own results.
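For reference, the fork+execv+waitpid baseline that bench_subprocess.py measures might look roughly like this (an illustrative reconstruction, not the attached file):

```python
import os
import time

def bench_fork_exec(cmd, args, n=10000):
    # Time n bare fork()+execv()+waitpid() launches -- the baseline the
    # subprocess.Popen numbers above are compared against.
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            try:
                os.execv(cmd, [cmd] + args)
            finally:
                os._exit(127)   # never fall through into the parent's code
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) * 1000.0  # milliseconds
```

Running it against /bin/false, as in the benchmark above, gives the first line of the numbers quoted.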
msg129881 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:17
Victor, your patch doesn't even apply on 3.x. That code doesn't exist anymore...
msg129883 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:27
Benchmark under 3.x (without obsolete patch):
- fork + execv + waitpid: 4794.4 ms
- os.popen: 19792.9 ms
- subprocess.popen: 10152.1 ms

Benchmark under 2.x (without patch:)
- fork + execv + waitpid: 4292.7 ms
- os.popen: 12697.6 ms
- subprocess.popen: 5496.3 ms

Looks like there's a regression on both os.popen and subprocess.popen.
msg129884 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:34
Here is a quick profile under 3.x:

Time: 10178.0 ms
         320812 function calls (320809 primitive calls) in 10.182 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    5.937    0.001    5.937    0.001 {built-in method read}
    10000    1.462    0.000    1.462    0.000 {built-in method waitpid}
    10000    1.042    0.000    1.042    0.000 {built-in method fork_exec}
    10000    0.752    0.000    8.131    0.001 subprocess.py:1119(_execute_child)
        1    0.228    0.228   10.182   10.182 bench_subprocess.py:1(<module>)
    10000    0.177    0.000    8.345    0.001 subprocess.py:631(__init__)
    20000    0.079    0.000    0.079    0.000 {built-in method close}
    10000    0.061    0.000    1.584    0.000 subprocess.py:1368(wait)


So it looks like our parent Python process spends its time (read(), waitpid()) waiting for its child...
msg129885 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:38
Interestingly, reducing from the max open file descriptors from 8192 (my default) to 512 halves the runtime and solves the regression:

$ ulimit -n
512
$ ./python bench_subprocess.py 
pid: 31631
Time: 4903.8 ms


So, even though implemented in C, the file descriptor closing logic is still quite costly!
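The cost is easy to see from what close_fds=True has to do: iterate over every possible descriptor up to the limit, whether open or not. A simplified sketch (fds_to_close is an illustrative name, not the real internals):

```python
import os

def fds_to_close(keep=()):
    # Enumerate the descriptors that close_fds=True would close: every
    # fd from 3 up to the SC_OPEN_MAX limit, minus any in `keep`.  The
    # real loop calls close() on each one, so its cost is linear in the
    # ulimit regardless of how many descriptors are actually open.
    max_fd = os.sysconf("SC_OPEN_MAX")
    return [fd for fd in range(3, max_fd) if fd not in keep]
```

With the 8192 limit above, that is 8189 close() attempts per launch; dropping the ulimit to 512 shrinks the loop 16-fold, which matches the halved runtime.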
msg129887 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 12:52
pitrou> Victor, your patch doesn't even apply on 3.x. 
pitrou> That code doesn't exist anymore...

subprocess.Popen() does still read errpipe_read, but using a buffer of 50,000 bytes instead of 1 MB (the traceback is no longer sent to the parent process).


Benchmark on Python 3.2 (debug build, same computer as msg129880):

- fork + execv + waitpid: 20052.0 ms
- os.popen: 40241.7 ms
- subprocess.Popen (C): 28467.2 ms
- subprocess.Popen (C, close_fds=False): 22145.4 ms
- subprocess.Popen (Python): 40351.5 ms


Bad:

- os.popen is 41% slower than subprocess: I suppose that it is the usage of stdout=PIPE (creation of the pipe) which makes it slower. But 41% is huge just to create a pipe (without writing into it)!
- subprocess(close_fds=True) (default) is 22% slower than subprocess(close_fds=False)
- os.popen of Python 3 is 56% slower than os.popen of Python 2


Good:

- subprocess of Python 3 is 29% faster than subprocess of Python 2.


Other results:

- subprocess of Python 3 is 9% slower than patched subprocess of Python 2.
- subprocess (default options) is 42% slower than fork+execv+waitpid (this is close to the Python 2 overhead).
- subprocess implemented in Python is 42% slower than the C implementation of subprocess.


pitrou> Looks like there's a regression on both os.popen and subprocess.popen.

os.popen() uses subprocess in Python 3. The worst regression is "os.popen of Python 3 is 56% slower than os.popen of Python 2". I don't think that it is related to Unicode because my benchmark doesn't write or read any data.
msg129888 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 13:00
> os.popen is 41% is slower than subprocess: I suppose that it 
> is the usage of stdout=PIPE (creation of the pipe) which make 
> it slower

Oh no, it's because os.popen() calls subprocess.Popen() with shell=True: the overhead is the shell. Nothing wrong here.
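In Python 3, os.popen is indeed a thin wrapper over subprocess that always passes shell=True; roughly (a simplified sketch -- the real wrapper also wraps the stream so that close() returns the exit status):

```python
import subprocess

def popen_like(cmd):
    # Roughly what os.popen(cmd, "r") does in Python 3: run the command
    # through a shell with stdout piped -- so every os.popen() call pays
    # for a /bin/sh launch on top of the subprocess machinery.
    return subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, text=True)

proc = popen_like("echo hello")
output = proc.stdout.read()
proc.wait()
```

This is why the os.popen rows in the benchmarks above carry the shell's startup cost while the plain subprocess.Popen rows do not.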
msg129889 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-02 13:01
> pitrou> I think your analysis is wrong. These mmap() calls are for
> pitrou> anonymous memory, most likely they are emitted by the libc's
> pitrou> malloc() to get some memory from the kernel. In other words
> pitrou> they will be blazingly fast.
>
> Are you sure? :-)

Well, it is fast. It's true that mmap is slower than brk, since the
kernel zero-fills the pages, but the overhead probably doesn't come
from this, but more likely from pymalloc or malloc, and also from the
call to _PyBytes_resize in posix_read when the number of bytes read is
smaller than what has been requested.
msg129890 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-02 13:13
> So, even though implemented in C, the file descriptor closing logic is still quite costly!

Yes, see this recent issue: http://bugs.python.org/issue11284

In the reporter's case, it's much worse, because FreeBSD (at least the version he's using) has a SC_OPEN_MAX of 655000, so that passing close_fds=True bumps the Popen runtime to 3 seconds!
Some Unix systems offer a closefrom() or similar function to close all open file descriptors (and only open ones), which should be much faster.
I'm not aware of anything equivalent under Linux, but it might be worth looking into.
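On Linux, the closest equivalent is to list /proc/self/fd and close only the descriptors that are actually open, instead of looping all the way to SC_OPEN_MAX (a sketch of the technique; open_fds is an illustrative helper name):

```python
import os

def open_fds():
    # List the descriptors currently open in this process via the
    # Linux-specific /proc/self/fd directory.  Closing only these,
    # rather than every fd up to the ulimit, is the closefrom()-style
    # optimization discussed above.
    return {int(name) for name in os.listdir("/proc/self/fd")}
```

A process with a handful of open descriptors then does a handful of close() calls, independent of the 8192 or 655000 ulimit.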
msg129893 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 13:43
Benchmark on subprocess with a less trivial example. Run 100x python -c pass: 8.63 sec without my patch, 8.53 sec with my patch => only 1% faster, so the patch is just useless on a real world example.

Finally, I think that there is just nothing to do on Python 2: the overhead between fork(), os.popen and subprocess doesn't impact real world programs. On Python 3: the most critical issue is that close_fds=True is much slower than close_fds=False, but there is already an issue for that (#11284).

So let's close this issue. You can still comment on it if you disagree. But if you would like to reopen it, please come with numbers from a benchmark on real programs (not on "exit 0" with a shell or /bin/false).
msg129919 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-03-02 21:57
I think it's still safe to say that high performance applications which need to create many hundreds or thousands of children (e.g. large monitoring systems) will still need another solution that isn't subprocess. That being said, you're right that no one is going to care about the extra overhead of subprocess in a vacuum, and most applications fork one or a very small number of processes at a time and typically end up waiting for orders of magnitude more time for their output than they spend creating the process in the first place.

As I said when I opened this issue, I wasn't terribly concerned with most applications.

That being said, I can't really submit a full-blown monitoring system against this bug, so the best I could do would be something that did lots of os.popens or subprocess.Popens in parallel in a contrived way and see how the performance scales as I tune "lots" to different values. Sadly, the time I have for doing that needs to be spent writing other code, so I'll leave this closed and let someone else raise the issue in the future if they run into it.

I can always build a dispatcher in C and communicate with it via IPC to get around the immediate concern of scalability.
msg129920 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 22:15
> That being said, I can't really submit a full-blown monitoring system
> against this bug,

Perhaps you can take a look at Shinken, it is a Nagios-lookalike written
in Python.
History
Date User Action Args
2011-03-02 22:15:26 pitrou set messages: + msg129920
2011-03-02 21:57:42 Aaron.Sherman set messages: + msg129919
2011-03-02 13:43:18 vstinner set status: open -> closed; resolution: fixed; messages: + msg129893
2011-03-02 13:13:38 neologix set messages: + msg129890
2011-03-02 13:01:27 neologix set messages: + msg129889
2011-03-02 13:00:47 vstinner set messages: + msg129888
2011-03-02 12:52:35 vstinner set messages: + msg129887
2011-03-02 12:38:21 pitrou set messages: + msg129885
2011-03-02 12:34:56 pitrou set messages: + msg129884
2011-03-02 12:27:54 pitrou set messages: + msg129883
2011-03-02 12:17:47 pitrou set messages: + msg129881
2011-03-02 12:11:00 vstinner set files: + bench_subprocess.py; messages: + msg129880
2011-03-02 12:09:12 pitrou set messages: + msg129879
2011-03-02 11:53:05 vstinner set files: + subprocess_errpipe_buffer.patch; keywords: + patch; messages: + msg129878
2011-02-27 15:05:22 neologix set messages: + msg129633
2011-02-25 22:50:26 Aaron.Sherman set messages: + msg129450
2011-02-25 18:28:45 gregory.p.smith set messages: + msg129397
2011-02-25 15:12:57 pitrou set messages: + msg129379
2011-02-25 15:11:25 giampaolo.rodola set nosy: + giampaolo.rodola
2011-02-25 15:08:54 pitrou set messages: + msg129377
2011-02-25 10:48:37 neologix set nosy: + neologix; messages: + msg129350
2011-02-25 06:01:49 Aaron.Sherman set messages: + msg129333
2011-02-24 23:44:44 pitrou set nosy: + pitrou; messages: + msg129321
2011-02-24 23:43:11 vstinner set nosy: + vstinner; messages: + msg129320
2011-02-24 23:38:23 pitrou set nosy: + gregory.p.smith
2011-02-24 23:22:42 Aaron.Sherman create