
classification
Title: race condition in subprocess module
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Krishna Oza, abo, astrand, benjamin.peterson, christian.heimes, collinwinter, djc, dsagal, exarkun, flox, giampaolo.rodola, gjb1002, gregory.p.smith, grossetti, jlamanna, jonash, jyasskin, kanaka, matejcik, nnorwitz, oefe, pitrou, r.david.murray, santoni, shaphiro, siemer, tom_culliton, vitaly, vstinner, yonas
Priority: normal Keywords:

Created on 2007-06-05 22:19 by dsagal, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
mylib.c yonas, 2009-05-27 08:09
main.c yonas, 2009-05-27 08:10
exim_local_scan2.py yonas, 2009-05-31 23:28
Messages (59)
msg32210 - (view) Author: dsagal (dsagal) Date: 2007-06-05 22:19
Python's subprocess module has a race condition: the Popen() constructor calls the module-global _cleanup() function whenever a Popen object gets created, and that call checks every pending Popen object to see whether its subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread and then call foo.wait(), and meanwhile I create another Popen object "bar" in another thread, then foo.poll() might get called by _cleanup() right at the time when my first thread is running foo.wait(). But those calls are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and then calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistently with Python 2.4. Changes to subprocess in Python 2.5 seem to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
  # A thread that starts itself as soon as it is constructed.
  def __init__(self, *args, **kwargs):
    super(X, self).__init__(*args, **kwargs)
    self.start()

def tt():
  # Spawn a short-lived child and wait for it; many of these run concurrently.
  s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
      universal_newlines=True)
  # time.sleep(1)
  s.communicate()[0]

for i in xrange(1000):
  X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "z.py", line 21, in tt
    s.communicate()[0]
  File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
    self.wait()
  File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
    pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess poll() and wait() calls with a lock (a minimal sketch follows below), and so does removing the call to the global _cleanup() in the Popen constructor.
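A minimal sketch of the lock-based workaround mentioned above (illustrative only; the helper names locked_popen and locked_wait are invented here). Since _cleanup() only runs inside the Popen() constructor, taking one lock around construction and around wait()/poll() keeps the unsynchronized returncode/os.waitpid() sequences from colliding:

import subprocess
import threading

_popen_lock = threading.Lock()

def locked_popen(*args, **kwargs):
    # Popen.__init__ triggers the global _cleanup(), so it must hold the lock too.
    with _popen_lock:
        return subprocess.Popen(*args, **kwargs)

def locked_wait(proc):
    # Serialize the "check returncode, then os.waitpid()" critical section.
    with _popen_lock:
        return proc.wait()

p = locked_popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE)
locked_wait(p)

Holding the lock across wait() also blocks new Popen() calls for the duration, so this is a workaround sketch rather than a general fix.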
msg32211 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2007-06-07 05:26
Peter, could you take a look at this?  Thanks.
msg32212 - (view) Author: Donovan Baarda (abo) * Date: 2007-08-01 17:05
It appears that subprocess calls a module-global _cleanup() whenever a new subprocess is opened. This method is meant to reap any child processes that have terminated and have not been explicitly cleaned up. These are processes you would expect to be cleaned up by GC; however, subprocess keeps a list of all spawned subprocesses in _active until they are reaped explicitly, so it can clean up any that are no longer referenced anywhere else.

The problem is that lots of methods, including poll() and wait(), check self.returncode and then modify it. Any non-atomic read/modify action is inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped child processes. If two threads happen to try to spawn subprocesses at once, these _cleanup() calls collide.

The way to fix this depends on how thread-safe you want to make it. If you want to share Popen objects between threads and call wait()/poll() with impunity from any thread, you should add a recursive lock attribute to the Popen instance and have every method acquire it at the start and release it at the end of the call.

If you only care about using Popen objects in one thread at a time, then all you need to fix is the nasty "every Popen created calls poll() on every other living Popen object regardless of which thread started them, and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes that you del'ed all references to (except the one in subprocess._active) before you checked they were done.

Probably another good idea is to not append and remove Popen objects to _active directly, but instead add a Popen.__del__() method that defers GC'ing of non-finished Popen objects by adding them to _active. That way, _active only contains un-reaped child processes that were due to be GC'ed, and _cleanup() is then responsible for polling these Popen objects and removing them from _active when they are done.

However, this alone will not fix things, because you are still calling _cleanup() from different threads, it is still calling poll() on all these un-reaped processes, and poll() is not threadsafe. So you either have to make poll() threadsafe (lock/unlock at the beginning/end of the method), or make _cleanup() threadsafe. The reason you can get away with making only _cleanup() threadsafe this way is that _active will contain only processes that are not referenced anywhere else, so you know the only thing that will call poll() on them is the _cleanup() method.
msg32213 - (view) Author: Donovan Baarda (abo) * Date: 2007-08-01 17:37
Having just gone through that waffly description of the problems and the various levels of fix, I think there are really only two fixes worth considering:

1) Make Popen instances fully threadsafe. Give them a recursive lock attribute and have every method acquire the lock at the start and release it at the end (a rough sketch follows below).

2) Decide the "try to reap abandoned children at each Popen" idea was an ugly hack and abandon it. Remove _active and _cleanup(), and document that any child process not explicitly handled to completion will result in zombie child processes.
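A rough sketch of what option 1 could look like (an illustration of the idea only, not the actual subprocess implementation):

import subprocess
import threading

class LockedPopen(subprocess.Popen):
    # A Popen whose status-checking methods share one per-instance lock.
    def __init__(self, *args, **kwargs):
        self._lock = threading.RLock()  # recursive, in case one locked method calls another
        subprocess.Popen.__init__(self, *args, **kwargs)

    def poll(self, *args, **kwargs):
        with self._lock:
            return subprocess.Popen.poll(self, *args, **kwargs)

    def wait(self, *args, **kwargs):
        with self._lock:
            return subprocess.Popen.wait(self, *args, **kwargs)

This only serializes callers that go through the instance's own methods; the module-level _cleanup() scan discussed above is a separate problem.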
msg32214 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-08-02 00:45
I like #2.  I don't see any use for threadsafe Popen instances, and I think that any self-respecting long-running server using Popen should properly manage its subprocesses anyway.  (And for short-running processes it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later, but I don't think it's worth it, and __del__ has some serious issues of its own.  (If you really want to do this, use a weak reference callback instead of __del__ to do the zombie registration.)
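For illustration, a hedged sketch of the weak-reference idea (one reading of the suggestion, with invented names; not anything subprocess actually does). The callback fires when the wrapper object is collected and records the pid so the zombie can be reaped later:

import os
import weakref

_refs = set()        # keep the weakrefs alive so their callbacks can run
_abandoned = []      # pids whose Popen wrapper was collected before being waited on

def register(proc):
    pid = proc.pid
    def collected(ref):
        _refs.discard(ref)
        _abandoned.append(pid)      # remember the zombie for later reaping
    _refs.add(weakref.ref(proc, collected))

def reap_abandoned():
    for pid in list(_abandoned):
        try:
            wpid, status = os.waitpid(pid, os.WNOHANG)
        except OSError:
            _abandoned.remove(pid)  # already reaped elsewhere
        else:
            if wpid == pid:         # child has exited and is now reaped
                _abandoned.remove(pid)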
msg32215 - (view) Author: Peter Åstrand (astrand) * (Python committer) Date: 2007-08-02 09:15
Suddenly starting to leave Zombies is NOT an option for me. To prevent zombies, all applications using the subprocess module would need to be changed. This is a major breakage of backwards compatibility, IMHO. subprocess (as well as the popen2 module) has prevented zombies automatically from the beginning and I believe they should continue to do so (at least by default). 

A little bit of history: when I wrote the subprocess module, I stole the idea of calling _cleanup() whenever a new instance is created from the popen2 module, since I thought it was nice, lightweight and avoided having a __del__ method (which I have some bad experience with, GC-related). This worked great for a long time. On 2006-04-10, however, martin.v.loewis committed patch 1467770 (revision r45234) to solve bug 1460493. It introduces a __del__ method and some more complexity. I must admit that I didn't review this patch in detail, but it looked like thread safety was being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem? Or does the popen2 module suffer from this bug as well? 

I think that we need to fix this once and for all, this time. We should probably also consider adding the option of not having subprocess auto-wait while we are at it, for those who would like to wait() themselves (there should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right now...
msg32216 - (view) Author: Donovan Baarda (abo) * Date: 2007-08-03 19:14
I just looked at popen2 and it has the same bug. I don't think the __del__()-related changes have anything to do with this. Popen.poll() and _cleanup() are non-threadsafe. When __init__() is called to create subprocesses simultaneously in two different threads, they both call _cleanup() and trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are correctly handled to completion. Using methods like communicate() and wait() will handle the process to completion and reap the child. 

Unfortunately, I think a fairly common use-case is people using popen to fire-and-forget subprocesses without using communicate() or wait(). These will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned Popen instances that is thread safe without using locks. At times like this I wish that the GIL were exposed as a recursive lock... we could have _cleanup() acquire/release the GIL and make it atomic... no more race condition :-)
msg32217 - (view) Author: Donovan Baarda (abo) * Date: 2007-08-03 19:26
Actually, after thinking about it, there may be a way to make _cleanup() threadsafe without using explicit locks by leveraging the GIL. If _active.pop() and _active.append() are atomic because of the GIL, then _cleanup() can be made threadsafe. To be truly threadsafe it also needs to make sure that _active contains only abandoned Popen instances, so that _cleanup() doesn't have its inst.poll() calls collide with other threads that are still using those Popen instances. This can be done by having Popen instances added to _active only by Popen.__del__(), and removed only by _cleanup() when they are done.
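A sketch of that scheme (the names mirror the subprocess internals, but this is only an illustration of the idea, not the stdlib code):

_active = []    # abandoned, still-running Popen instances; list.append() and
                # list.remove() are single bytecode operations, hence atomic under the GIL

def _cleanup():
    for inst in _active[:]:            # iterate over a snapshot
        if inst.poll() is not None:    # child exited; returncode is now recorded
            try:
                _active.remove(inst)
            except ValueError:
                pass                   # a concurrent _cleanup() got there first

# Conceptually, in Popen.__del__:
#     if self.returncode is None:
#         _active.append(self)         # defer reaping to a later _cleanup()

Two threads can still end up polling the same abandoned instance at once; that is the residual race the _deadstate workaround discussed below papers over.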
msg32218 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-08-03 20:11
If you want this fixed in 2.5.x, a new release is just around the corner, and time is very tight.  Otherwise, 2.6a1 is half a year off.
msg32219 - (view) Author: Donovan Baarda (abo) * Date: 2007-08-03 21:29
I just looked at subprocess in svn trunk and it looks like it might already be fixed in both subprocess.py and popen2.py. The part where __del__() adds abandoned Popen instances to _active and _cleanup() removes them is already there. _cleanup() has been made more thread-resistant by catching and ignoring the exceptions that arise when two _cleanup() calls try to remove the same instances. Popen.poll() has been made more robust by having it set self.returncode to an optional _deadstate argument value when poll() gets an os.error from os.waitpid() on a child that was cleaned up by another thread. I think there are still a few minor race conditions around Popen.poll(), but they will only affect which non-zero returncode you get... it's not so much "thread safe" as "thread robust".

I think the _deadstate argument approach to making Popen.poll() thread-robust is a bit hacky. I would rather see _cleanup() be properly threadsafe by having it remove the inst from _active before processing it and re-adding it if it is still not finished. However, as it is now it looks like it is fixed...
msg32220 - (view) Author: Donovan Baarda (abo) * Date: 2007-08-06 19:45
I can create a patch to make current head a bit cleaner, if anyone is interested...
msg32221 - (view) Author: Peter Åstrand (astrand) * (Python committer) Date: 2007-08-06 20:02
>I can create a patch to make current head a bit cleaner, if anyone is
>interested...

Sure, this would be great. 

I would also love a test case that reproduces this problem. 
msg32222 - (view) Author: Geoffrey Bache (gjb1002) Date: 2007-08-22 17:53
There should be test cases aplenty - this bug was first reported in March 2006 and this is one of four incarnations of it. See also issues 1753891, 1754642 and 1183780.

I think all of these contain testcases (though two roughly coincide) and any fix would be wise to try them all out...
msg56245 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-10-05 18:31
See http://bugs.python.org/issue1236 for a good repeatable testcase.
msg57716 - (view) Author: Tom Culliton (tom_culliton) Date: 2007-11-20 22:40
This or some variant also shows up with scons
(http://scons.tigris.org/issues/show_bug.cgi?id=1839) leading to some
nasty intermittent build failures.  Neal may remember how we addressed
this for a Process class in a past life.  Basically it's OK to collect
all the child exit codes if you record the results and return them when
requested. This would mean that waitpid and the like would have to check
a cached list of PIDs and exit statuses before actually waiting.  Don't
know if that's possible/any help in this case or not...

Traceback (most recent call last):
  File "/usr/lib/scons-0.97.0d20070918/SCons/Taskmaster.py", line 194, in execute
    self.targets[0].build()
  File "/usr/lib/scons-0.97.0d20070918/SCons/Node/__init__.py", line 365, in build
    stat = apply(executor, (self,), kw)
  File "/usr/lib/scons-0.97.0d20070918/SCons/Executor.py", line 139, in __call__
    return self.do_execute(target, kw)
  File "/usr/lib/scons-0.97.0d20070918/SCons/Executor.py", line 129, in do_execute
    status = apply(act, (self.targets, self.sources, env), kw)
  File "/usr/lib/scons-0.97.0d20070918/SCons/Action.py", line 332, in __call__
    stat = self.execute(target, source, env)
  File "/usr/lib/scons-0.97.0d20070918/SCons/Action.py", line 479, in execute
    result = spawn(shell, escape, cmd_line[0], cmd_line, ENV)
  File "/usr/lib/scons-0.97.0d20070918/SCons/Platform/posix.py", line 104, in spawnvpe_spawn
    return exec_spawnvpe([sh, '-c', string.join(args)], env)
  File "/usr/lib/scons-0.97.0d20070918/SCons/Platform/posix.py", line 68, in exec_spawnvpe
    stat = os.spawnvpe(os.P_WAIT, l[0], l, env)
  File "/sw/pkgs/python-2.3.5_00/lib/python2.3/os.py", line 553, in spawnvpe
    return _spawnvef(mode, file, args, env, execvpe)
  File "/sw/pkgs/python-2.3.5_00/lib/python2.3/os.py", line 504, in _spawnvef
    wpid, sts = waitpid(pid, 0)
OSError: [Errno 10] No child processes
scons: building terminated because of errors.
msg57888 - (view) Author: Tom Culliton (tom_culliton) Date: 2007-11-27 21:30
Looking at the subprocess.py code, it occurred to me that it never checks
whether the value of self.pid returned by os.fork is -1; this would mean that
later it runs waitpid with -1 as the first argument, putting it into
promiscuous (wait for any process in the group) mode.  This seems like a
bad idea...
msg57890 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-11-27 22:24
> Looking at the subprocess.py code it occurred to me that it never
> checks if the value of self.pid returned by os.fork is -1

What makes you think os.fork() can return -1? It's not C, you know...
msg57892 - (view) Author: Tom Culliton (tom_culliton) Date: 2007-11-27 22:33
Good question.  The documentation I was reading was mute on the subject 
so I made a "reasonable" guess.  Does it throw an exception instead?

msg57894 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2007-11-27 22:39
Yes, like all system calls in the os module.
msg64180 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-20 16:40
"""Basically it's OK to collect
all the child exit codes if you record the results and return them when
requested. This would mean that waitpid and the like would have to check
a cached list of PIDs and exit statuses before actually waiting."""

Note that this would still have problems.  PIDs are recycled by the OS,
so as soon as a process dies and you've waited on it, the OS is free
to launch a new process using the same pid.  If the new process happens to be
another one of ours launched by subprocess, that would be a problem.
Make the cached map of pids -> exit codes be a map of pids -> [list of
exit codes] instead?
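A small illustration of the cached-status idea (invented helper names, not a real subprocess API). Because pids can be recycled, each pid maps to a list of statuses rather than a single value:

import os
from collections import defaultdict

_reaped = defaultdict(list)     # pid -> [exit statuses, oldest first]

def reap_any_nonblocking():
    # Opportunistically collect any exited child and remember its status.
    try:
        pid, status = os.waitpid(-1, os.WNOHANG)
    except OSError:
        return                  # no children at all
    if pid > 0:
        _reaped[pid].append(status)

def cached_waitpid(pid, options=0):
    # Hand back a previously collected status before actually waiting.
    if _reaped[pid]:
        return pid, _reaped[pid].pop(0)
    return os.waitpid(pid, options)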
msg64181 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-20 16:45
I am unable to reproduce this problem in today's trunk (2.6a) on OSX 10.4.
msg64421 - (view) Author: Tom Culliton (tom_culliton) Date: 2008-03-24 16:56
I'm not sure what the POSIX standards say on this (and MS-Windows may go 
its own contrary way), but for most real systems the PID is a running 
count (generally 32 bit or larger today) which would have to cycle all 
the way around to repeat.  It's actually done this way on purpose to 
provide the longest possible time between IDs getting reused.  I suspect 
that having it cycle and reuse an ID isn't an issue in practice, and 
keeping a list of results leaves you with the problem of figuring out 
which PID 55367 they're talking about...  Not to mention that if you're 
leaving child process results unchecked for long enough for the PID 
counter to cycle, there are other problems with the application. ;-)

msg64422 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2008-03-24 17:03
Which system uses a 32 bit PID?  Not Linux, or FreeBSD, or OS X - they
all use 16 bit PIDs.
msg64424 - (view) Author: Tom Culliton (tom_culliton) Date: 2008-03-24 17:24
AIX, HP-UX, Solaris, 64 bit Linux, ...  Even in the Linux x86 header 
files there's a mix of int and short.  The last time I had to do the 
math on how long it would take the PID to cycle was probably on an AIX 
box and it was a very long time.

msg64426 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2008-03-24 17:35
PIDs cycle quite frequently.  You must be thinking of something else. 
Even 64 bit Linux has a maximum PID of 2 ** 15 - 1.
msg64428 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-24 18:36
FWIW, I see pids cycle in a reasonable amount of time on modern
production linux systems today.  It's a fact of life; we should deal with
it. :)
msg64660 - (view) Author: Robert Siemer (siemer) Date: 2008-03-28 23:51
Bad design stays bad design: the hope that pids don't get reused soon
breaks two assumptions:
1) I don't have to wait() for a child process soon. It's the program's
business.
2) Pids cycle: there are security patches to make pids of future
processes hard to predict by assigning random free pids.

I actually was about to report a bug on subprocess when I bumped into
this one. My problem is pretty thread-less but related:
How do I kill() a process if its pid got recycled behind my back??

In my opinion the module should work the "Python" way of doing things,
otherwise programmers will be surprised, as I was:
a) don't do my work (no wait() for things I care of/have reference to)
b) do your work (clean up things I don't care of/have no reference to)

If that's not possible, how does subprocess actually "intend to replace
os.spawn*"? [Library Reference]

It is still possible to extend the Popen() call to reflect whether the caller
wants P_NOWAITO or P_NOWAIT behavior, but the closer we get to the OS, the
less sense it makes to replace some other modules...
msg64737 - (view) Author: Robert Siemer (siemer) Date: 2008-03-29 23:40
Update for my comment: python2.5.2 does not show that problem.
python2.4.5 did. I repeat, 2.5.2 does not clean up processes which are
still referenced, but it does clean unreferenced ones.
msg65610 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-04-18 11:42
This popped up on the mailing list today.

Please give this bug a proper review. I *believe* it's fixed but I'm not
sure.
msg70471 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-07-31 02:06
Any more information on this?
msg82663 - (view) Author: Gabriel (grossetti) Date: 2009-02-24 10:09
I had this happen to me today: I used Popen and os.waitpid(p.pid, 0) and
I got an "OSError: [Errno 10] No child processes". I haven't tried
adding a sleep, since it's unreliable. I'm using python 2.5.2, Windows
XP. I haven't tried this on linux yet as the problem arose while testing
my app for windows.
msg88406 - (view) Author: Yonas (yonas) Date: 2009-05-27 08:09
I always get a subprocess error when using embedded python 2.6.2:

File "/usr/lib/python2.6/subprocess.py", line 1123, in wait:     pid,
sts = os.waitpid(self.pid, 0): OSError: [Errno 10] No child processes

Example library and main program are attached.
msg88617 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-05-31 22:58
I have confirmed that none of the original test cases referenced here or
in the referenced issues fails on python 2.5.4, 2.6.2, trunk, or py3k on
linux.

The test case attached to this ticket I cannot test as I don't
understand how one would run it (and exim appears to be involved).

Given the discussion (specifically msg32219) I think this ticket should
be closed as fixed; unless, Yonas, you can provide a reproducible test
case that does not involve third party software, or someone else can
reproduce your exim case.

If someone wants to propose a patch to refactor subprocess, IMO that
should be opened as a new feature request ticket.
msg88619 - (view) Author: Yonas (yonas) Date: 2009-05-31 23:22
To test with exim4 is easy. To reproduce on Ubuntu Jaunty:

1. apt-get install exim4-daemon-heavy

2. echo "local_scan = /usr/lib/exim4/local_scan/libpyexim.so" >
/etc/exim4/conf.d/main/15_py-exim_plugin_path

3. cd /usr/lib/exim4/local_scan

4. Compile mylib, output will be "libpyexim.so":

gcc `python2.6-config --cflags` -c -fPIC mylib.c -o mylib.o 
gcc -Xlinker -export-dynamic -lpython2.6 -lnsl -lpthread -ldl -lutil -lm
-lz -shared -Wl,-soname,libpyexim.so -o libpyexim.so  mylib.o

5. Restart server:
/etc/init.d/exim4 restart

6. Send some mail:
cat mail.txt | sendmail -t

(contents of mail.txt):
Content-type: text/plain
To: your_username@localhost
From: foobar@example.com
Subject: test

Hello world.
msg88621 - (view) Author: Yonas (yonas) Date: 2009-05-31 23:28
Also, copy exim_local_scan2.py to /usr/lib/python2.6/
msg88623 - (view) Author: Yonas (yonas) Date: 2009-05-31 23:45
I'm assuming that exim4 is reading config files from /etc/exim4/conf.d/*.

To make sure, you can enforce "split file mode" by running `sudo
dpkg-reconfigure exim4-config`. It should be one of the last questions
presented to you.
msg88625 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-06-01 00:17
I'm afraid I'm not going to be installing exim in order to test this. 
Perhaps someone else will be interested enough to do so, perhaps not :)

Copying files into /usr/lib/pythonx.x is a very odd thing to do, by the
way (though it should have no impact on the bug).
msg92772 - (view) Author: Joel Martin (kanaka) Date: 2009-09-17 15:37
I can reproduce the problem (or at least get the same symptom) by doing
this (in 2.4.6, 2.5.4 and 2.6.2):

import subprocess, signal
signal.signal(signal.SIGCLD, signal.SIG_IGN)
subprocess.Popen(['echo','foo']).wait()

The echo command completes, but the subprocess command throws the "no
child" exception. It seems like the subprocess command should either:

- detect that SIGCLD is not set correctly and throw a more informative
exception before running the command.
- override SIGCLD during the running of the sub-command. Not sure what
the UNIX correctness implications of this are.
- or allow the child to zombie without throwing an exception. The fact
that the programmer has overridden SIGCLD sort of implies that reaping
of zombie children has been switched off.

I don't have good enough understanding of the underlying implementation
to know if this is a reproducer as requested or if this should be a new
bug. Please advise and I will file a new bug if requested.

This is a follow-up to trying to resolve this problem in the PEP 3143
python-daemon module. See this thread:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/9a853d0308c8e55a
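For what it's worth, one way to sidestep this symptom on interpreters that predate a proper fix is to restore the default SIGCHLD disposition around the spawn-and-wait. This is an assumption-laden sketch, not the fix that later went into subprocess, and signal handlers can only be changed from the main thread:

import signal
import subprocess
from contextlib import contextmanager

@contextmanager
def default_sigchld():
    # Temporarily undo SIG_IGN so the kernel keeps the exit status for waitpid().
    old = signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    try:
        yield
    finally:
        signal.signal(signal.SIGCHLD, old)

signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # e.g. set by a daemon framework

with default_sigchld():
    rc = subprocess.Popen(['echo', 'foo']).wait()
print(rc)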
msg100430 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-04 22:56
Some buildbot hit this bug:
http://bugs.python.org/issue7805#msg100428
msg100523 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-06 09:42
Sometimes IA64 Ubuntu bot fails on this one.
http://www.python.org/dev/buildbot/all/builders/ia64%20Ubuntu%20trunk/builds/571
msg100643 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-08 14:18
Previous buildbot failures were in test_multiprocessing:
http://bugs.python.org/issue1731717#msg100430

Now it should be fixed:
 - r78777, r78787, r78790 on 2.x
 - r78798 on 3.x
msg100646 - (view) Author: Yonas (yonas) Date: 2010-03-08 14:30
Florent,

Have you tested any of the sample test programs mentioned in this bug report? For example, the one by Joel Martin (kanaka).

I'd suggest testing those first before marking this issue as fixed.

- Yonas
msg100647 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-08 14:52
>>> import subprocess, signal
>>> signal.signal(signal.SIGCLD, signal.SIG_IGN)
0
>>> subprocess.Popen(['echo','foo']).wait()
foo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./Lib/subprocess.py", line 1229, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "./Lib/subprocess.py", line 482, in _eintr_retry_call
    return func(*args)
OSError: [Errno 10] No child processes


You're right, I fixed something giving the same ECHILD error in multiprocessing. But it was not related.
msg100648 - (view) Author: Yonas (yonas) Date: 2010-03-08 15:02
http://groups.google.com/group/comp.lang.python/browse_thread/thread/9a853d0308c8e55a


"I'm also glad to see a test case that causes exactly the same error with or without the presence of a ‘daemon.DaemonContext’.

Further research shows that handling of ‘SIGCHLD’ (or ‘SIGCLD’) is fairly
OS-specific, with “ignore it” or “handle it specifically” being correct
on different systems. I think Python's default handling of this signal
is already good (modulo bug #1731717 to be addressed in ‘subprocess’).

So I will apply a change similar to Joel Martin's suggestion, to default
to avoid touching the ‘SIGCLD’ signal at all, and with extra notes in
the documentation that anyone using child processes needs to be wary of
signal handling.

This causes the above test case to succeed; the output file contains:: 
=====
Child process via os.system.
Child process via 'subprocess.Popen'.
Parent daemon process.
Parent daemon process done.
====="

  -- By Ben Finney
msg100650 - (view) Author: Yonas (yonas) Date: 2010-03-08 15:06
Ben Finney's comment suggests to me that this bug is being ignored. Am I wrong?

"with extra notes in the documentation that anyone using child processes needs to be wary of signal handling."

Why should they be wary? We should just fix this bug.
msg100651 - (view) Author: Yonas (yonas) Date: 2010-03-08 15:08
By the way, three months from today this bug will be three years old...
msg100657 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-03-08 17:12
I really don't care about the age of a bug.  This bug is young.  I've fixed many bugs over twice its age in the past.

Regardless, I've got some serious subprocess internal refactoring changes coming in the very near future to explicitly deal with thread safety issues.  I'm adding this one to my list of things to tackle.
msg100659 - (view) Author: Yonas (yonas) Date: 2010-03-08 18:03
Gregory,

Awesome! Approx. how long until we hear back from you in this report?
msg106519 - (view) Author: Stefan (santoni) Date: 2010-05-26 09:58
I have exactly the same problem. Is there a thread-safe alternative to execute subprocesses in threads?
msg115210 - (view) Author: red (shaphiro) Date: 2010-08-30 12:12
I'm using an old Plone/Zope Product, PHParser, that uses the popen2 call ... same problem for me.
Is there a thread-safe alternative to execute subprocesses in threads? I need a patch!!! thanks in advance!!!
msg117985 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-10-04 21:50
A workaround for those still having problems with this:

 stub out subprocess._cleanup with a no-op method.

The original _cleanup() is only useful if your app ever uses subprocess and forgets to call wait() on Popen objects before they are deleted.  If it does, you can keep a reference to the old _cleanup() method and periodically call it on your own.

http://bugs.python.org/issue1236 effectively demonstrates this.
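As a concrete sketch of that workaround (it relies on the private subprocess._cleanup attribute, so treat it as fragile):

import subprocess

_original_cleanup = subprocess._cleanup
subprocess._cleanup = lambda: None     # stop Popen() from polling unrelated objects

def reap_forgotten_children():
    # Only needed if Popen objects are ever abandoned without a wait();
    # call this from a single housekeeping thread or timer.
    _original_cleanup()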
msg122556 - (view) Author: James Lamanna (jlamanna) Date: 2010-11-27 20:54
Stubbing out subprocess._cleanup does not work around the problem, as this example on 2.6.5 shows:

import subprocess, signal
subprocess._cleanup = lambda: None

signal.signal(signal.SIGCLD, signal.SIG_IGN)
subprocess.Popen(['echo','foo']).wait()

james@hyla:~$ python tt.py
foo
Traceback (most recent call last):
  File "tt.py", line 5, in <module>
    subprocess.Popen(['echo','foo']).wait()
  File "/usr/lib/python2.6/subprocess.py", line 1170, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.6/subprocess.py", line 465, in _eintr_retry_call
    return func(*args)
OSError: [Errno 10] No child processes

This bug still prevents subprocess from being used inside of a daemon where SIGCLD is being caught to reap zombie processes.
msg123946 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-12-14 14:39
r87233 fixes the OSError escaping from wait() issue when SIGCLD is set to be ignored.  (to appear in 3.2beta1; it is a candidate for backporting to 3.1 and 2.7)
msg123951 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-12-14 15:17
sorry, i meant 3.2beta2 above.
release27-maint: r87234 targeting 2.7.2
release31-maint: r87235 targeting 3.1.4
msg123957 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-12-14 15:58
It seems the canonical spelling is SIGCHLD. SIGCLD doesn't exist everywhere and it produces test failures under OS X:

http://www.python.org/dev/buildbot/all/builders/AMD64%20Leopard%203.x
http://www.python.org/dev/buildbot/all/builders/AMD64%20Snow%20Leopard%203.x

FWIW, this is what POSIX says:

“Some implementations, including System V, have a signal named SIGCLD, which is similar to SIGCHLD in 4.2 BSD. POSIX.1 permits implementations to have a single signal with both names. POSIX.1 carefully specifies ways in which conforming applications can avoid the semantic differences between the two different implementations. The name SIGCHLD was chosen for POSIX.1 because most current application usages of it can remain unchanged in conforming applications. SIGCLD in System V has more cases of semantics that POSIX.1 does not specify, and thus applications using it are more likely to require changes in addition to the name change.”

http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
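In code, the portable spelling looks something like this (a trivial sketch of the point above):

import signal

# Prefer the POSIX name; fall back to the System V alias where only it exists.
SIGCHLD = getattr(signal, 'SIGCHLD', None) or getattr(signal, 'SIGCLD', None)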
msg142215 - (view) Author: jan matejek (matejcik) * Date: 2011-08-16 18:21
Please check my logic here, but the patched code seems to throw away perfectly valid return codes:
in wait(), self._handle_exitstatus(sts) gets called unconditionally, and it resets self.returncode unconditionally as well.
Now, if a _cleanup() already did an _internal_poll() and set self.returncode that way, that value is lost when wait() catches the ECHILD - in the one place where it actually matters - because sts is set to 0 for the _handle_exitstatus() call.

IMHO it could be fixed by moving _handle_exitstatus() into the try: section and returning "self.returncode or 0", or something like that.
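To make the suggestion concrete, here is a hedged sketch of the wait() shape being proposed (simplified, and not the actual patched code): only fall back to a synthesized status when ECHILD is raised and no returncode was already recorded by an earlier poll or _cleanup().

import errno
import os

def wait(self):
    # Sketch of a Popen.wait() that preserves a returncode set elsewhere.
    if self.returncode is None:
        try:
            pid, sts = os.waitpid(self.pid, 0)
            self._handle_exitstatus(sts)
        except OSError as e:
            if e.errno != errno.ECHILD:
                raise
            # The child was already reaped (SIGCHLD ignored, or another
            # thread's _cleanup()); keep any recorded returncode, else use 0.
            if self.returncode is None:
                self.returncode = 0
    return self.returncode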
msg251268 - (view) Author: Vitaly (vitaly) Date: 2015-09-21 22:06
Is this issue fixed in python 2.7.10? I am experiencing strange race conditions in code that combines threads with multiprocessing pool, whereby a thread is spawned by each pool worker. Running on latest Mac OS X.
msg260455 - (view) Author: Krishna Oza (Krishna Oza) * Date: 2016-02-18 12:53
Hi, Could anyone help here to identify in which Python release the bug is fixed. I am unable to deduce from the bug tracker interface in which release this issue is fixed.

Regards.
msg260457 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-18 13:05
"Hi, Could anyone help here to identify in which Python release the bug is fixed. I am unable to deduce from the bug tracker interface in which release this issue is fixed."

Oh, this is an old issue :-)

msg123946: "r87233 fixes the OSError escaping from wait() issue when SIGCLD is set to be ignored.  (to appear in 3.2beta1; it is a candidate for backporting to 3.1 and 2.7)"

msg123951: "sorry, i meant 3.2beta2 above.
release27-maint: r87234 targeting 2.7.2
release31-maint: r87235 targeting 3.1.4"

=> the fix is in Python 2 >= 2.7.2, Python 3.1 >= 3.1.4, Python 3 >= 3.2.4.

I hope that you are using a more recent Python version than these versions :-)
History
Date  User  Action  Args
2022-04-11 14:56:24  admin  set  github: 45043
2016-02-18 13:05:37  vstinner  set  nosy: + vstinner; messages: + msg260457
2016-02-18 12:53:45  Krishna Oza  set  nosy: + Krishna Oza; messages: + msg260455
2015-09-21 22:06:32  vitaly  set  nosy: + vitaly; messages: + msg251268
2012-02-14 19:46:49  gregory.p.smith  link  issue9127 dependencies
2011-08-16 18:21:51  matejcik  set  nosy: + matejcik; messages: + msg142215
2010-12-14 15:58:58  pitrou  set  nosy: + pitrou; messages: + msg123957
2010-12-14 15:17:05  gregory.p.smith  set  status: open -> closed; resolution: fixed; messages: + msg123951
2010-12-14 14:39:59  gregory.p.smith  set  messages: + msg123946
2010-11-27 20:54:00  jlamanna  set  nosy: + jlamanna; messages: + msg122556
2010-10-04 21:50:30  gregory.p.smith  set  messages: + msg117985
2010-09-07 04:06:58  gvanrossum  set  nosy: - gvanrossum
2010-09-06 15:21:26  jonash  set  nosy: + jonash
2010-08-30 12:12:53  shaphiro  set  nosy: + shaphiro; messages: + msg115210
2010-07-31 22:42:43  collinwinter  set  nosy: + collinwinter, jyasskin
2010-06-18 00:14:23  giampaolo.rodola  set  nosy: + giampaolo.rodola
2010-05-26 09:58:06  santoni  set  nosy: + santoni; messages: + msg106519
2010-03-08 18:03:18  yonas  set  messages: + msg100659
2010-03-08 17:12:51  gregory.p.smith  set  assignee: astrand -> gregory.p.smith; messages: + msg100657
2010-03-08 15:08:05  yonas  set  messages: + msg100651
2010-03-08 15:06:26  yonas  set  messages: + msg100650
2010-03-08 15:02:34  yonas  set  messages: + msg100648
2010-03-08 14:53:34  flox  set  keywords: - buildbot
2010-03-08 14:52:37  flox  set  resolution: fixed -> (no value); messages: + msg100647; stage: resolved -> needs patch
2010-03-08 14:30:50  yonas  set  status: pending -> open; messages: + msg100646
2010-03-08 14:18:59  flox  set  status: open -> pending; resolution: fixed; messages: + msg100643; stage: test needed -> resolved
2010-03-06 09:42:15  flox  set  keywords: + buildbot; messages: + msg100523
2010-03-04 22:58:08  flox  link  issue6122 superseder
2010-03-04 22:56:26  flox  set  versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.5, Python 3.0; nosy: + flox; messages: + msg100430; resolution: fixed -> (no value); stage: test needed
2010-03-04 22:49:01  flox  link  issue1753891 superseder
2009-09-17 15:37:28  kanaka  set  status: pending -> open; nosy: + kanaka; messages: + msg92772
2009-06-01 00:17:03  r.david.murray  set  status: open -> pending; priority: critical -> normal; messages: + msg88625
2009-05-31 23:45:59  yonas  set  messages: + msg88623
2009-05-31 23:28:34  yonas  set  files: + exim_local_scan2.py; messages: + msg88621
2009-05-31 23:22:56  yonas  set  status: pending -> open; messages: + msg88619
2009-05-31 22:58:03  r.david.murray  set  status: open -> pending; nosy: + r.david.murray; messages: + msg88617; resolution: fixed
2009-05-27 08:10:28  yonas  set  files: + main.c
2009-05-27 08:09:54  yonas  set  files: + mylib.c; nosy: + yonas; messages: + msg88406
2009-05-07 07:44:09  djc  set  nosy: + djc
2009-02-24 10:09:47  grossetti  set  nosy: + grossetti; messages: + msg82663
2009-01-11 16:45:03  oefe  set  nosy: + oefe
2008-07-31 02:06:37  benjamin.peterson  set  nosy: + benjamin.peterson; messages: + msg70471
2008-04-18 11:42:11  christian.heimes  set  priority: normal -> critical; nosy: + christian.heimes; type: behavior; messages: + msg65610; versions: + Python 2.6, Python 2.5, Python 3.0, - Python 2.4
2008-03-29 23:40:09  siemer  set  messages: + msg64737
2008-03-28 23:51:29  siemer  set  nosy: + siemer; messages: + msg64660
2008-03-24 18:36:36  gregory.p.smith  set  messages: + msg64428
2008-03-24 17:35:19  exarkun  set  messages: + msg64426
2008-03-24 17:24:30  tom_culliton  set  messages: + msg64424
2008-03-24 17:03:37  exarkun  set  nosy: + exarkun; messages: + msg64422
2008-03-24 16:56:59  tom_culliton  set  messages: + msg64421
2008-03-20 16:45:23  gregory.p.smith  set  messages: + msg64181
2008-03-20 16:40:10  gregory.p.smith  set  messages: + msg64180
2008-01-16 21:23:40  gregory.p.smith  set  nosy: + gregory.p.smith
2007-11-27 22:39:52  gvanrossum  set  messages: + msg57894
2007-11-27 22:33:43  tom_culliton  set  messages: + msg57892
2007-11-27 22:24:22  gvanrossum  set  messages: + msg57890
2007-11-27 21:30:55  tom_culliton  set  messages: + msg57888
2007-11-20 22:40:04  tom_culliton  set  nosy: + tom_culliton; messages: + msg57716
2007-10-05 18:32:05  gvanrossum  link  issue1236 superseder
2007-10-05 18:31:57  gvanrossum  set  messages: + msg56245
2007-06-05 22:19:54  dsagal  create