Classification
Title: Multiprocess Process does not always exit when run from a thread.
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.1, Python 3.2
Status: closed
Resolution:
Dependencies:
Superseder: Locks in the standard library should be sanitized on fork (view: 6721)
Assigned To: jnoller
Nosy List: asksol, avian, jnoller, neologix, pajs@fodder.org.uk, pitrou
Priority: normal
Keywords:

Created on 2009-10-14 12:14 by pajs@fodder.org.uk, last changed 2011-07-03 09:41 by neologix. This issue is now closed.

Files
- testing.py (pajs@fodder.org.uk, 2009-10-14 12:14)
- dtrace.txt (pajs@fodder.org.uk, 2009-10-14 14:17)
- testing.py (pajs@fodder.org.uk, 2009-10-14 15:47)
- results1.txt (pajs@fodder.org.uk, 2009-10-15 10:00)
- testing.py (avian, 2011-06-25 14:17) Cleaned up test case
Messages (14)
msg93978 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 12:14
I have example code that works fine on Python 2.6.3, but when run on
Python 3.1.1 it goes wrong after a very short period of time.

Summary:

We have a queue, and when the queue has something in it (a list), we
start a thread to deal with that entry.

The thread then runs a Process for every entry in the list we popped off
the queue. The Process target takes a Pipe, and it sends "True" back
down the pipe, and then exits. However, this Process never goes defunct,
and thus the Thread that started the Process never reaps it. When doing
a truss on the Process, it sits there in:

lwp_park(0x00000000, 0)         (sleeping...)
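The pattern described above can be reduced to roughly the following shape. This is a reconstruction for reference, not the attached testing.py; the names worker and check_alive are illustrative, and the fork start method is selected explicitly to match the original platforms (POSIX-only):

```python
import multiprocessing as mp

def worker(conn):
    # Child process: report success down the pipe and exit.
    conn.send(True)
    conn.close()

def check_alive(arg):
    ctx = mp.get_context("fork")      # fork, as on the original platforms
    parent_end, child_end = ctx.Pipe()
    p = ctx.Process(target=worker, args=(child_end,))
    p.start()
    result = parent_end.recv()        # blocks until the child sends True
    p.join()                          # reap the child; the report is that
                                      # this sometimes never completes
    return result

if __name__ == "__main__":
    print([check_alive(x) for x in ("data1", "data2")])
```

In the report, check_alive is called from worker threads, which is what exposes the hang.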
msg93981 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 12:51
Further information: it doesn't fail every time in Python 3.1 - usually 1
in 4, or 1 in 5 times. It never fails with Python 2.6.3.

Example output from the script when it's failing (Python 3.1):

Starting data1
Starting data2
Started subproc: PID: 20209 : args: data1
poll
Started subproc: PID: 20210 : args: data2
poll
Child has sent:True
poll
Child has sent:True
poll
isalive: False
isalive: True
Reaping PID: 20210
isalive: True
poll
Finishing: data2
isalive: True
poll
isalive: True
poll
isalive: True
poll
isalive: True
poll
<repeat isalive and poll lines until>:
isalive: True
poll
isalive: True
terminating
Child had to be killed due to a timeout
Finishing: data1
True

Working output (consistently works on Python 2.6.3):
Starting data1
Starting data2
Started subproc: PID: 20252 : args: data1
poll
Started subproc: PID: 20253 : args: data2
poll
Child has sent:True
Child has sent:True
poll
Child has sent:True
poll
isalive: False
isalive: False
isalive: False
Reaping PID: 20252
Reaping PID: 20253
Finishing: data1
Finishing: data2
True
msg93987 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-14 14:03
According to http://blogs.sun.com/chrisg/entry/lwp_park_and_lwp_unpark,
the lwp_park() call could mean the thread is parked waiting to acquire a mutex.
msg93988 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 14:17
Well, if it helps, here is the output of the dtrace script from starting
of a loop with the failure, and stopping during the failure.
msg93990 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-14 14:19
Does it not happen if you call your checkAlive() function directly from
your main() function?
msg93995 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-14 15:12
As for the dtrace output: I'm not a Solaris expert unfortunately, I was
just trying to suggest a possible direction for diagnosing this problem.
msg93996 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 15:18
If you mean, in main() instead of doing:

while True:
    q.put(["data1", "data2"])
    t = Process(target=popJobs, args=(q, ))
    t.start()
    t.join()

and doing:

while True:
    q.put(["data1", "data2"])
    popJobs(q)

instead, then the bug does indeed occur the same way. It did take more
iterations before it occurred, however.

If however, you meant:

while True:
   fail=failureObject()
   for x in ["data1", "data2"]:
       checkAlive(fail, x)
   print(fail.status())

Then the bug never occurs.
msg93997 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 15:21
Same thing occurs, as you would expect, when I do:

while True:
    fail=failureObject()
    tlist = []
    for x in ["data1", "data2"]:
        t = threading.Thread(target=checkAlive, args = (fail, x), name=x)
        t.start()
        tlist.append(t)

    for x in tlist:
        x.join()
msg93998 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 15:47
Further oddness:

When running the script (I've reduced its size further now; see attached):

I get the following output:

Reaping PID: 23215
True
Started subproc: PID: 23216 : args: data1
Started subproc: PID: 23216 : args: data1
Started subproc: PID: 23217 : args: data2
poll
isalive: True
isalive: True
poll


Notice the "Started subproc" line with the SAME PID and the same args
twice, yet this print occurs only once in the code, and I can't see how
this can happen.
msg94000 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-10-14 15:54
> Notice the Started subprod of the SAME PID and the Same args twice, yet
> this print only occurs once in the code, and I can't see how this should
> happen?

This is a thread-safety issue in sys.stdout/stderr, it will be fixed in
3.1.2 (see issue6750).
msg94002 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-14 16:04
Sorry for the spam on the updates :) but it's the same thread printing
this out too. I changed the print line to:

curThread = threading.current_thread()
print("Started subproc: PID: %d : args: %s Thread ID: %s"
      % (newJob.pid, str(args), str(curThread.ident)))

and got the output:

Reaping PID: 23707
True
Started subproc: PID: 23709 : args: data1 Thread ID: 12
Started subproc: PID: 23709 : args: data1 Thread ID: 12
poll
isalive: True
Started subproc: PID: 23710 : args: data2 Thread ID: 13
poll
isalive: True

Again, I can't see how this could print it out twice. Could this be
related to the problem we are having? I do seem to get the line twice on
2.6.3 as well, although nothing like as often.
msg94081 - (view) Author: Peter Saunders (pajs@fodder.org.uk) Date: 2009-10-15 10:00
Tested on Linux, and reproduced the issue - just to confirm this isn't a
Solaris-specific issue.

Tested using dummy_threading - bug does not appear when that is used.

Added _verbose=True to threads, and log_to_stderr(SUBDEBUG) set for Process.

Attached is the output of the script with those set. 

Another odd thing to notice: at some points, more than 2 children
exist.

isalive PID: 7646 : True
isalive PID: 7652 : True
isalive PID: 7653 : True
isalive PID: 7646 : True
isalive PID: 7652 : True
isalive PID: 7653 : True
isalive PID: 7646 : True
isalive PID: 7653 : True

The code starts 2 threads, which start 2 processes. Both threads are
joined before new ones are started. So, in theory, only 2 should be
able to exist at once?
msg139068 - (view) Author: Tomaž Šolc (avian) Date: 2011-06-25 14:17
I am pretty sure this is another instance of issue 6721.

multiprocessing is forking a new process, and if that happens while another thread is holding the lock for stdout, the child deadlocks: the lock's state is replicated in the subprocess, but the thread that is supposed to unlock it isn't. (Even if you remove all prints, the deadlock will happen on a stdout flush.)

Attaching a gdb to a hung subprocess confirms that it is waiting for a lock in the bufferedio module.

I've reproduced this on Python trunk running on Linux (using pthreads) and can confirm that the patch attached to issue 6721 fixes it.

I'm also attaching a shorter test case that omits unused code from the original example.
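The failure mode described here can be sketched in a few lines (POSIX-only; this is a minimal illustration of the mechanism, not code from the attached testing.py). A plain threading.Lock stands in for the internal stdout/bufferedio lock:

```python
import os
import threading
import time

lock = threading.Lock()   # stands in for the internal stdout lock

def hold_lock():
    # Worker thread takes the lock and holds it across the fork.
    with lock:
        time.sleep(1.0)

t = threading.Thread(target=hold_lock)
t.start()
time.sleep(0.1)           # make sure the thread owns the lock first

pid = os.fork()
if pid == 0:
    # Child: the lock was copied in its locked state, but the thread
    # that owns it does not exist here, so it will never be released.
    # A blocking acquire() would hang forever; probe non-blockingly:
    os._exit(0 if not lock.acquire(blocking=False) else 1)

_, status = os.waitpid(pid, 0)
inherited_lock_held = (os.WEXITSTATUS(status) == 0)
print("child inherited a lock it can never release:", inherited_lock_held)
t.join()
```

A real multiprocessing.Process hits the same situation when the fork lands while another thread holds an I/O lock, which is exactly when the child hangs on a stdout flush.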
msg139679 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-07-03 09:41
> I am pretty sure this is another instance of issue 6721.

Indeed.
Closing as duplicate.
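The patch from issue 6721 is not reproduced in this thread, but the general mitigation it describes (re-initializing lock state in the child after a fork) was later exposed publicly as os.register_at_fork() in Python 3.7. A minimal sketch of the idea, using an illustrative application-level lock rather than the stdlib's internal ones (POSIX-only):

```python
import os
import threading

state = {"lock": threading.Lock()}

def _reinit_lock():
    # Runs in the child immediately after fork: replace the inherited
    # (possibly locked) lock with a fresh one.
    state["lock"] = threading.Lock()

os.register_at_fork(after_in_child=_reinit_lock)

state["lock"].acquire()          # parent holds the lock across the fork
pid = os.fork()
if pid == 0:
    # Without the at-fork hook the inherited lock would still be held
    # and this non-blocking acquire would fail.
    os._exit(0 if state["lock"].acquire(blocking=False) else 1)
_, status = os.waitpid(pid, 0)
state["lock"].release()
child_ok = (os.WEXITSTATUS(status) == 0)
print("child got a usable lock after fork:", child_ok)
```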
History
Date | User | Action | Args
2011-07-03 09:41:58 | neologix | set | status: open -> closed; superseder: Locks in the standard library should be sanitized on fork; messages: + msg139679
2011-06-30 18:37:01 | pitrou | set | nosy: + neologix
2011-06-25 14:17:17 | avian | set | files: + testing.py; nosy: + avian; messages: + msg139068
2010-08-27 14:11:51 | asksol | set | nosy: + asksol
2009-10-15 10:00:18 | pajs@fodder.org.uk | set | files: + results1.txt; messages: + msg94081
2009-10-14 16:04:27 | pajs@fodder.org.uk | set | messages: + msg94002
2009-10-14 15:54:44 | pitrou | set | messages: + msg94000
2009-10-14 15:47:23 | pajs@fodder.org.uk | set | files: + testing.py; messages: + msg93998
2009-10-14 15:21:23 | pajs@fodder.org.uk | set | messages: + msg93997
2009-10-14 15:18:26 | pajs@fodder.org.uk | set | messages: + msg93996
2009-10-14 15:12:41 | pitrou | set | messages: + msg93995
2009-10-14 14:19:56 | pitrou | set | messages: + msg93990
2009-10-14 14:17:17 | pajs@fodder.org.uk | set | files: + dtrace.txt; messages: + msg93988
2009-10-14 14:03:42 | pitrou | set | nosy: + pitrou; messages: + msg93987
2009-10-14 13:50:10 | pitrou | set | nosy: + jnoller; versions: + Python 3.2; priority: normal; assignee: jnoller; components: + Library (Lib); type: behavior
2009-10-14 12:51:19 | pajs@fodder.org.uk | set | messages: + msg93981
2009-10-14 12:14:51 | pajs@fodder.org.uk | create |