classification
Title: Subprocess suffers 40% process creation overhead penalty
Type: resource usage    Stage:
Components: Library (Lib)    Versions: Python 2.7, Python 2.6
Status: closed    Resolution: fixed
Dependencies:    Superseder:
Assigned To:    Nosy List: Aaron.Sherman, giampaolo.rodola, gregory.p.smith, neologix, pitrou, vstinner
Priority: normal    Keywords: patch

Created on 2011-02-24 23:22 by Aaron.Sherman, last changed 2011-03-02 22:15 by pitrou. This issue is now closed.

Files
File name Uploaded Description
subprocess_errpipe_buffer.patch vstinner, 2011-03-02 11:53
bench_subprocess.py vstinner, 2011-03-02 12:11
Messages (24)
msg129319 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-02-24 23:22
I wrote some code a while back which used os.popen. I recently got a warning about popen being deprecated so I tried a test with the new subprocess module. In that test, subprocess.Popen appears to have a 40% process creation overhead penalty over os.popen, which really isn't small. It seems that the difference may be made up of some heavy mmap-related work that's happening in my version of python, and that might be highly platform specific, but the mmap/mremap/munmap calls being made in my sample subprocess code aren't being made at all in the os.popen equivalent.

Now, before someone says, "process creation is trivial, so a 40% hit is acceptable because it's 40% of a trivial part of your execution time," keep in mind that many Python applications are heavily process-creation focused. In my case that means monitoring, but I could also imagine this having a substantial impact on Web services and other applications that spend almost all of their time creating child processes. For a trivial script, subprocess is fine as is, but for these demanding applications, subprocess represents a significant source of pain.

Anyway my testing, results and conclusions are written up here:

http://essays.ajs.com/2011/02/python-subprocess-vs-ospopen-overhead.html
msg129320 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-02-24 23:43
Python 3.2 has a _posixsubprocess: some parts of subprocess are implemented in C. Can you try it?

Python 3.2 uses also pipe2(), if available, to avoid the extra fcntl(4, F_GETFD)+fcntl(4, F_SETFD, FD_CLOEXEC).

I suppose that the pipe and mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) is used by subprocess to transfer a Python exception from the child process. Python 2.7 encodes the exception value, type and traceback using pickle, and the parent process calls os.read(1048576). Python 3.2 only encodes the exception value and type using a simple string, the parent process uses bytearray with chunks of 50,000 bytes (stop when the total size is bigger than 50,000 bytes). So I suppose that Python 3.2 allocates less memory in the parent process to read the child exception (if any).

You may also try to change Popen buffer size, but it should not change anything if you test "exit 0".
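The chunked-read strategy described above (a bytearray filled in 50,000-byte chunks, stopping once the total passes 50,000 bytes) can be sketched as follows; read_errpipe and the sizes are illustrative names for this sketch, not the actual subprocess internals:

```python
import os

def read_errpipe(fd, chunk_size=50000, limit=50000):
    # Read the error pipe in fixed-size chunks into a bytearray,
    # stopping when the child closes the pipe or the total size
    # exceeds the limit -- the Python 3.2 strategy described above.
    data = bytearray()
    while len(data) <= limit:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # pipe closed: no more data from the child
            break
        data += chunk
    return bytes(data)
```

Compared to a single os.read(1048576), this never allocates more memory than the child actually wrote.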
msg129321 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-24 23:44
I think your analysis is wrong. These mmap() calls are for anonymous memory, most likely they are emitted by the libc's malloc() to get some memory from the kernel. In other words they will be blazingly fast.

I would suggest you try to dig deeper. For example, how much CPU time does the parent process take (excluding its children)?

Of course, I also disagree with the idea that spawning "exit 0" subprocesses is a performance critical operation ;) Therefore, it would be useful to know the absolute overhead difference (in milliseconds) between subprocess and os.popen(), to decide if there is really a problem.
msg129333 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-02-25 06:01
"Python 3.2 has a _posixsubprocess: some parts of subprocess are implemented in C. Can you try it?"

I don't have a Python 3 installation handy, but I can see what I can do tomorrow evening to get one set up and try it out.

"disagree with the idea that spawning "exit 0" subprocesses is a performance critical operation ;)"

How else would you performance test process creation overhead? By introducing as little additional overhead as possible, it's possible for me to get fairly close to measuring just the subprocess module's overhead.

If you stop to think about it, though, this is actually a shockingly huge percent increase. In any process creation scenario I'm familiar with, its overhead should be so small that you could bump it up several orders of magnitude and still not compete with executing a shell and asking it to do anything, even just exit.

And yet, here we are: 40%.

I understand that most applications won't be running massive numbers of external commands in parallel, and that's the only way this overhead will really matter (at least that I can think of). But in the two scenarios I mentioned (monitoring and Web services such as CGI, neither of which is particularly rare), this is going to make quite a lot of difference, and if you're going to deprecate os.popen, I would think that making sure your proposed replacement was at least nearly as performant would be standard procedure, no?

"I think your analysis is wrong. These mmap() calls are for anonymous memory, most likely they are emitted by the libc's malloc() to get some memory from the kernel. In other words they will be blazingly fast."

The mremap might be a bit of a performance hit, but it's true that these calls should not be substantially slowing execution... then again, they might indicate that there's substantial amounts of work being done for which memory allocation is required, and as such may simply be a symptom of the actual problem.
msg129350 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-02-25 10:48
> If you stop to think about it, though, this is actually a shockingly huge percent increase. In any process creation scenario I'm familiar with, its overhead should be so small that you could bump it up several orders of magnitude and still not compete with executing a shell and asking it to do anything, even just exit.

os.popen just calls the popen(3) library call, which just performs a fork/execve and some dup/close in between.
subprocess.Popen is implemented in Python, so it doesn't come as a surprise that it's slower in your example.
But I think there's a point you're missing: you're not just creating a child process, but a shell.
subprocess is actually faster than popen to spawn executables (not scripts), because you don't have to spawn a shell.

For example:

$ cat /tmp/test.c 
int main(int argc, char *argv[])
{
        return 0;
}

$ cat /tmp/test_subprocess.py 
import subprocess

for i in range(10000):
    f = subprocess.Popen('/tmp/test')
    f.wait()

$ cat /tmp/test_popen.py 
import os

for i in range(10000):
    f = os.popen('/tmp/test')
    f.close()

$ time ./python /tmp/test_subprocess.py

real    0m13.933s
user    0m3.083s
sys     0m12.441s

$ time ./python /tmp/test_popen.py 

real    0m18.235s
user    0m4.293s
sys     0m15.176s

Given how important the subprocess overhead seems to be to you, I guess that the processes you're launching are not shell scripts, and thus you're probably better off using subprocess.
msg129377 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-25 15:08
> But in the two scenarios I mentioned (monitoring and Web services such 
> as CGI, neither of which is particularly rare), this is going to make
> quite a lot of difference

That's why I asked for absolute numbers for the overhead difference. A percentage doesn't tell me if launching a process costs 50µs more or 500ms more.

The only question is whether process launch overhead is significant compared to total process runtime (for non-trivial work, that is, not "echo 0" ;-)).
msg129379 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-25 15:12
Okay, thanks to Charles-François' measurements, we can deduce that each subprocess launch is at most 0.3ms of user CPU time and 1.2ms of system CPU time. IMO that's not a real problem.
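These per-launch figures follow from dividing the CPU totals of the 10,000-iteration test_subprocess.py run above by the iteration count:

```python
# CPU totals from the test_subprocess.py timing posted above (msg129350)
user_s, sys_s = 3.083, 12.441   # user and system CPU time, in seconds
launches = 10000

user_ms_per_launch = user_s / launches * 1000   # ~0.31 ms user CPU
sys_ms_per_launch = sys_s / launches * 1000     # ~1.24 ms system CPU
print(user_ms_per_launch, sys_ms_per_launch)
```

Note these totals include the cost of the child shells themselves, so the subprocess machinery's own share is at most this much.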
msg129397 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-02-25 18:28
fyi - while I still need to update it now that 3.2 final has been released, the http://code.google.com/p/python-subprocess32/ project has a backport of the _posixsubprocess stuff from python 3.2 for use on Python 2.x.
msg129450 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-02-25 22:50
"That's why I asked for absolute numbers for the overhead difference."

Did you not follow the link in my first post? I got pretty detailed, there.

"os.popen just calls the popen(3) library call, which just performs a fork/execve and some dup/close in between. subprocess.Popen is implemented in Python, so it doesn't come as a surprise that it's slower in your example."

Well, of course. I don't think I was ever trying to claim that os.popen vs. subprocess without a shell was going to compare favorably. I'm not advocating os.popen, here, I'm just trying to figure out where this massive overhead is coming from. I think the answer is just, "pure Python is fundamentally slower, and that's not a surprise."

Now, if the 3.x subprocess work that was described here, gets back-ported into 2.x and is included with future releases, that will definitely serve to improve the situation, and might well render much of this moot (testing will tell).

However, I do think that doing the performance testing before deprecating the previous interface would have been a good idea...
msg129633 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-02-27 15:05
> Aaron Sherman <ajs@ajs.com> added the comment:
>
> "That's why I asked for absolute numbers for the overhead difference."
>
> Did you not follow the link in my first post? I got pretty detailed, there.
>

By the way, strace has a '-T' flag that can be used to measure the
time spent in syscalls (but for a mmap/mremap/munmap it won't be
significant).

> However, I do think that doing the performance testing before deprecating
> the previous interface would have been a good idea...
>

Once again: subprocess.Popen is actually faster than os.popen to create
a subprocess. Just don't pass shell=True. Unless you're spawning shell
scripts, but in that case subprocess' overhead is definitely not the
biggest problem.
msg129878 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 11:53
subprocess_errpipe_buffer.patch: Patch for 2.7 which improves the dummy benchmark (run /bin/false 10,000 times) by 30%. It avoids the creation of the 1 MB buffer: start with a buffer of a single byte (just to check whether there is data or the pipe was closed).
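The buffering idea behind the patch can be sketched like this (an illustrative sketch, not the patch itself; read_child_error stands in for the code that reads the pipe the child's exec path writes errors to):

```python
import os

def read_child_error(fd, bufsize=1048576):
    # Probe the error pipe with a single-byte read first.  If the exec
    # succeeded, the child closed the pipe without writing anything, so
    # the large buffer is never allocated; only on failure is the rest
    # of the error report read.
    first = os.read(fd, 1)
    if not first:
        return b""                            # no data: exec succeeded
    return first + os.read(fd, bufsize - 1)   # child reported an error
```

In the common case (exec succeeds) this does one 1-byte read instead of one 1 MB allocation per launch.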
msg129879 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:09
> subprocess_errpipe_buffer.patch: Patch for 2.7 which improve by 30%
> the dummy benchmark (run /bin/false 10,000 times). It avoids the
> creation of the 1 MB buffer: start with a buffer of a single byte
> (just to check if there is data or if the pipe was closed).

Does it improve on non-dummy benchmarks? Otherwise I don't think it's
worth the trouble.
msg129880 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 12:11
Quick benchmark:
 - fork(), execv(), waitpid(): 19081.1 ms
 - os.popen(): 25769.8 ms
 - subprocess.Popen(): 40025.8 ms
 - subprocess.Popen() patched: 26155.2 ms

I tested Python 2.7 (debug build) on Linux (Debian Sid) with a Pentium 4 (Hyperthreading, 2 fake cores), 2 GB of memory.

Without the patch, subprocess is 55% slower than popen. With the patch, subprocess is 1% slower than popen. It is not good news: subprocess should be faster, because my benchmark tries Popen() without stdout=PIPE ;-)

pitrou> I think your analysis is wrong. These mmap() calls are for
pitrou> anonymous memory, most likely they are emitted by the libc's
pitrou> malloc() to get some memory from the kernel. In other words
pitrou> they will be blazingly fast.

Are you sure? :-)

Note that popen() is 35% slower than fork()+execv()+waitpid(). Patched subprocess is 37% slower than fork()+execv()+waitpid().

Try attached bench_subprocess.py script to get your own results.
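For reference, the fork+execv+waitpid baseline that bench_subprocess.py measures might look roughly like this (an illustrative reconstruction, not the attached file):

```python
import os
import time

def bench_fork_exec(cmd, args, n=10000):
    # Time n bare fork()+execv()+waitpid() launches -- the baseline the
    # subprocess.Popen numbers above are compared against.
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            try:
                os.execv(cmd, [cmd] + args)
            finally:
                os._exit(127)   # never fall through into the parent's code
        os.waitpid(pid, 0)
    return (time.perf_counter() - start) * 1000.0  # milliseconds
```

Running it against /bin/false, as in the benchmark above, gives the first line of the numbers quoted.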
msg129881 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:17
Victor, your patch doesn't even apply on 3.x. That code doesn't exist anymore...
msg129883 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:27
Benchmark under 3.x (without obsolete patch):
- fork + execv + waitpid: 4794.4 ms
- os.popen: 19792.9 ms
- subprocess.popen: 10152.1 ms

Benchmark under 2.x (without patch:)
- fork + execv + waitpid: 4292.7 ms
- os.popen: 12697.6 ms
- subprocess.popen: 5496.3 ms

Looks like there's a regression on both os.popen and subprocess.popen.
msg129884 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:34
Here is a quick profile under 3.x:

Time: 10178.0 ms
         320812 function calls (320809 primitive calls) in 10.182 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    5.937    0.001    5.937    0.001 {built-in method read}
    10000    1.462    0.000    1.462    0.000 {built-in method waitpid}
    10000    1.042    0.000    1.042    0.000 {built-in method fork_exec}
    10000    0.752    0.000    8.131    0.001 subprocess.py:1119(_execute_child)
        1    0.228    0.228   10.182   10.182 bench_subprocess.py:1(<module>)
    10000    0.177    0.000    8.345    0.001 subprocess.py:631(__init__)
    20000    0.079    0.000    0.079    0.000 {built-in method close}
    10000    0.061    0.000    1.584    0.000 subprocess.py:1368(wait)


So it looks like our parent Python process spends its time (read(), waitpid()) waiting for its child...
msg129885 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 12:38
Interestingly, reducing from the max open file descriptors from 8192 (my default) to 512 halves the runtime and solves the regression:

$ ulimit -n
512
$ ./python bench_subprocess.py 
pid: 31631
Time: 4903.8 ms


So, even though implemented in C, the file descriptor closing logic is still quite costly!
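The cost is easy to see from what close_fds=True has to do: iterate over every possible descriptor up to the limit, whether open or not. A simplified sketch (fds_to_close is an illustrative name, not the real internals):

```python
import os

def fds_to_close(keep=()):
    # Enumerate the descriptors that close_fds=True would close: every
    # fd from 3 up to the SC_OPEN_MAX limit, minus any in `keep`.  The
    # real loop calls close() on each one, so its cost is linear in the
    # ulimit regardless of how many descriptors are actually open.
    max_fd = os.sysconf("SC_OPEN_MAX")
    return [fd for fd in range(3, max_fd) if fd not in keep]
```

With the 8192 limit above, that is 8189 close() attempts per launch; dropping the ulimit to 512 shrinks the loop 16-fold, which matches the halved runtime.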
msg129887 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 12:52
pitrou> Victor, your patch doesn't even apply on 3.x. 
pitrou> That code doesn't exist anymore...

subprocess.Popen() does still read errpipe_read, but using a buffer of 50,000 bytes instead of 1 MB (the traceback is no longer sent to the parent process).


Benchmark on Python 3.2 (debug build, same computer as msg129880):

- fork + execv + waitpid: 20052.0 ms
- os.popen: 40241.7 ms
- subprocess.Popen (C): 28467.2 ms
- subprocess.Popen (C, close_fds=False): 22145.4 ms
- subprocess.Popen (Python): 40351.5 ms


Bad:

- os.popen is 41% slower than subprocess: I suppose that it is the usage of stdout=PIPE (creation of the pipe) which makes it slower. But 41% is huge just to create a pipe (without writing into it)!
- subprocess(close_fds=True) (default) is 22% slower than subprocess(close_fds=False)
- os.popen of Python 3 is 56% slower than os.popen of Python 2


Good:

- subprocess of Python 3 is 29% faster than subprocess of Python 2.


Other results:

- subprocess of Python 3 is 9% slower than patched subprocess of Python 2.
- subprocess (default options) is 42% slower than fork+execv+waitpid (this is close to the Python 2 overhead).
- subprocess implemented in Python is 42% slower than the C implementation of subprocess.


pitrou> Looks like there's a regression on both os.popen and subprocess.popen.

os.popen() uses subprocess in Python 3. The worst regression is "os.popen of Python 3 is 56% slower than os.popen of Python 2". I don't think that it is related to Unicode because my benchmark doesn't write or read any data.
msg129888 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 13:00
> os.popen is 41% is slower than subprocess: I suppose that it 
> is the usage of stdout=PIPE (creation of the pipe) which make 
> it slower

Oh no, it's because os.popen() calls subprocess.Popen() with shell=True: the overhead is the shell. Nothing wrong here.
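In Python 3, os.popen is indeed a thin wrapper over subprocess that always passes shell=True; roughly (a simplified sketch -- the real wrapper also wraps the stream so that close() returns the exit status):

```python
import subprocess

def popen_like(cmd):
    # Roughly what os.popen(cmd, "r") does in Python 3: run the command
    # through a shell with stdout piped -- so every os.popen() call pays
    # for a /bin/sh launch on top of the subprocess machinery.
    return subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, text=True)

proc = popen_like("echo hello")
output = proc.stdout.read()
proc.wait()
```

This is why the os.popen rows in the benchmarks above carry the shell's startup cost while the plain subprocess.Popen rows do not.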
msg129889 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-02 13:01
> pitrou> I think your analysis is wrong. These mmap() calls are for
> pitrou> anonymous memory, most likely they are emitted by the libc's
> pitrou> malloc() to get some memory from the kernel. In other words
> pitrou> they will be blazingly fast.
>
> Are you sure? :-)

Well, it is fast. It's true that mmap is slower than brk, since the
kernel zero-fills the pages, but the overhead probably doesn't come
from this, but more likely from pymalloc or malloc, and also from the
call to _PyBytes_resize in posix_read when the number of bytes read is
smaller than what has been requested.
msg129890 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-02 13:13
> So, even though implemented in C, the file descriptor closing logic is still quite costly!

Yes, see this recent issue: http://bugs.python.org/issue11284

In the reporter's case, it's much worse, because FreeBSD (at least the version he's using) has a SC_OPEN_MAX of 655000, so that passing close_fds=True bumps the Popen runtime to 3 seconds!
Some Unix systems offer a closefrom() or similar function to close all open file descriptors (and only open ones), which should be much faster.
I'm not aware of anything equivalent under Linux, but it might be worth looking into.
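On Linux, the closest equivalent is to list /proc/self/fd and close only the descriptors that are actually open, instead of looping all the way to SC_OPEN_MAX (a sketch of the technique; open_fds is an illustrative helper name):

```python
import os

def open_fds():
    # List the descriptors currently open in this process via the
    # Linux-specific /proc/self/fd directory.  Closing only these,
    # rather than every fd up to the ulimit, is the closefrom()-style
    # optimization discussed above.
    return {int(name) for name in os.listdir("/proc/self/fd")}
```

A process with a handful of open descriptors then does a handful of close() calls, independent of the 8192 or 655000 ulimit.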
msg129893 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-03-02 13:43
Benchmark on subprocess with a less trivial example. Run 100x python -c pass: 8.63 sec without my patch, 8.53 sec with my patch => only 1% faster, so the patch is just useless on a real world example.

Finally, I think that there is just nothing to do on Python 2: the overhead between fork(), os.popen and subprocess doesn't impact real world programs. On Python 3: the most critical issue is that close_fds=True is much slower than close_fds=False, but there is already an issue for that (#11284).

So let's close this issue. You can still comment on it if you disagree. But if you would like to reopen it, please come with numbers from a benchmark on real programs (not on "exit 0" with a shell or /bin/false).
msg129919 - (view) Author: Aaron Sherman (Aaron.Sherman) Date: 2011-03-02 21:57
I think it's still safe to say that high performance applications which need to create many hundreds or thousands of children (e.g. large monitoring systems) will still need another solution that isn't subprocess. That being said, you're right that no one is going to care about the extra overhead of subprocess in a vacuum, and most applications fork one or a very small number of processes at a time and typically end up waiting for orders of magnitude more time for their output than they spend creating the process in the first place.

As I said when I opened this issue, I wasn't terribly concerned with most applications.

That being said, I can't really submit a full-blown monitoring system against this bug, so the best I could do would be something that did lots of os.popens or subprocess.Popens in parallel in a contrived way and see how the performance scales as I tune "lots" to different values. Sadly, the time I have for doing that needs to be spent writing other code, so I'll leave this closed and let someone else raise the issue in the future if they run into it.

I can always build a dispatcher in C and communicate with it via IPC to get around the immediate concern of scalability.
msg129920 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-02 22:15
> That being said, I can't really submit a full-blown monitoring system
> against this bug,

Perhaps you can take a look at Shinken, it is a Nagios-lookalike written
in Python.
History
Date User Action Args
2011-03-02 22:15:26 pitrou set messages: + msg129920
2011-03-02 21:57:42 Aaron.Sherman set messages: + msg129919
2011-03-02 13:43:18 vstinner set status: open -> closed; resolution: fixed; messages: + msg129893
2011-03-02 13:13:38 neologix set messages: + msg129890
2011-03-02 13:01:27 neologix set messages: + msg129889
2011-03-02 13:00:47 vstinner set messages: + msg129888
2011-03-02 12:52:35 vstinner set messages: + msg129887
2011-03-02 12:38:21 pitrou set messages: + msg129885
2011-03-02 12:34:56 pitrou set messages: + msg129884
2011-03-02 12:27:54 pitrou set messages: + msg129883
2011-03-02 12:17:47 pitrou set messages: + msg129881
2011-03-02 12:11:00 vstinner set files: + bench_subprocess.py; messages: + msg129880
2011-03-02 12:09:12 pitrou set messages: + msg129879
2011-03-02 11:53:05 vstinner set files: + subprocess_errpipe_buffer.patch; keywords: + patch; messages: + msg129878
2011-02-27 15:05:22 neologix set messages: + msg129633
2011-02-25 22:50:26 Aaron.Sherman set messages: + msg129450
2011-02-25 18:28:45 gregory.p.smith set messages: + msg129397
2011-02-25 15:12:57 pitrou set messages: + msg129379
2011-02-25 15:11:25 giampaolo.rodola set nosy: + giampaolo.rodola
2011-02-25 15:08:54 pitrou set messages: + msg129377
2011-02-25 10:48:37 neologix set nosy: + neologix; messages: + msg129350
2011-02-25 06:01:49 Aaron.Sherman set messages: + msg129333
2011-02-24 23:44:44 pitrou set nosy: + pitrou; messages: + msg129321
2011-02-24 23:43:11 vstinner set nosy: + vstinner; messages: + msg129320
2011-02-24 23:38:23 pitrou set nosy: + gregory.p.smith
2011-02-24 23:22:42 Aaron.Sherman create