
Author pitrou
Recipients christian.heimes, pablogsal, pitrou, rhettinger, vtsozik
Date 2019-02-06.14:40:28
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
It is actually quite an intricate problem.  What happens is that the child process's *main thread* ends, but not its background sleeping thread (the `lambda: time.sleep(3600)`).
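The situation can be reproduced with a minimal sketch (this is my own hypothetical reduction, not the original script; the 3600-second sleep matches the lambda above):

```python
import os
import signal
import sys
import threading
import time

pid = os.fork()
if pid == 0:
    # Child: the main thread returns right away, but the background
    # thread keeps sleeping, so the interpreter waits for it at exit
    # and the child process lingers.
    threading.Thread(target=lambda: time.sleep(3600)).start()
    sys.exit(0)  # does not terminate the sleeping thread

# Parent: after a moment, the child still cannot be reaped.
time.sleep(0.5)
reaped, _ = os.waitpid(pid, os.WNOHANG)
print("child still running:", reaped == 0)  # prints: child still running: True

# Clean up the stuck child so it does not outlive this script.
os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)
```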

To diagnose it, you can display the process tree:
$ ps fu
antoine  12634  0.0  0.0  28308  9208 pts/0    Ss   15:21   0:00 bash
antoine   2520  0.0  0.0 179072 10684 pts/0    Sl+  15:29   0:00  \_ ./python
antoine   2522  0.0  0.0      0     0 pts/0    Zl+  15:29   0:00      \_ [python] <defunct>

Then you can display all threads for the child process (here with pid 2522):
$ ps -T -p 2522
 2522  2522 pts/0    00:00:00 python <defunct>
 2522  2525 pts/0    00:00:00 python

The main thread is marked zombie ("defunct") but thread 2525 is still running... What is it doing?  Let's attach gdb:
$ gdb ./python --pid 2525

And display the call stack:
(gdb) bt
#0  0x00007f1fb3ca503f in __GI___select (nfds=nfds@entry=0, readfds=readfds@entry=0x0, writefds=writefds@entry=0x0, 
    exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7f1fb23553c0) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x000055e6fc4fcf7e in pysleep (secs=<optimized out>) at ./Modules/timemodule.c:1864
#2  0x000055e6fc4fd022 in time_sleep (self=self@entry=<module at remote 0x7f1fb4a03398>, obj=<optimized out>)
    at ./Modules/timemodule.c:366
#3  0x000055e6fc3a02e7 in _PyMethodDef_RawFastCallKeywords (method=0x55e6fc887ee0 <time_methods+288>, 
    self=<module at remote 0x7f1fb4a03398>, args=args@entry=0x7f1fb336a8f8, nargs=nargs@entry=1, kwnames=0x0) at Objects/call.c:646
#4  0x000055e6fc3a04c7 in _PyCFunction_FastCallKeywords (
    func=func@entry=<built-in method sleep of module object at remote 0x7f1fb4a03398>, args=args@entry=0x7f1fb336a8f8, 
    nargs=nargs@entry=1, kwnames=kwnames@entry=0x0) at Objects/call.c:732
#5  0x000055e6fc4506e9 in call_function (pp_stack=pp_stack@entry=0x7f1fb2355570, oparg=oparg@entry=1, kwnames=kwnames@entry=0x0)
    at Python/ceval.c:4607
#6  0x000055e6fc45c678 in _PyEval_EvalFrameDefault (f=Frame 0x7f1fb336a770, for file, line 36, in <lambda> (), 
    throwflag=<optimized out>) at Python/ceval.c:3195
#7  0x000055e6fc451110 in PyEval_EvalFrameEx (f=f@entry=Frame 0x7f1fb336a770, for file, line 36, in <lambda> (), 
    throwflag=throwflag@entry=0) at Python/ceval.c:581
#8  0x000055e6fc451d21 in _PyEval_EvalCodeWithName (_co=_co@entry=<code at remote 0x7f1fb4989700>, 
    globals=globals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='') at remote 0x7f1fb49d4710>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7f1fb4adf8c0>, '__file__': '', '__cached__': None, 'threading': <module at remote 0x7f1fb36ca668>, 'time': <module at remote 0x7f1fb4a03398>, 'os': <module at remote 0x7f1fb49e5050>, 'atexit': <module at remote 0x7f1fb36d3aa0>, 'signal': <module at remote 0x7f1fb36cc500>, 'run': <function at remote 0x7f1fb4a93e10>, 'start': <function at remote 0x7f1fb33699f0>, 'join': <function at remote 0x7f1fb3369aa0>, 'runFork': <function at remote 0x7f1fb3369b50>, 'handleExit': <function at remote 0x7f1fb3369c00>, 'handleChildExit': <function at remote 0x7f1fb3369cb0>, 'main': <function at remote 0x7f1fb3369d60>}, locals=locals@entry=0x0, 
    args=args@entry=0x7f1fb4aec078, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0, kwargs=0x0, kwcount=0, kwstep=2, 
    defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name='<lambda>', qualname='runFork.<locals>.<lambda>') at Python/ceval.c:3969


So basically the sleep() call was not interrupted by the main thread's death, even though we might have expected it to be.  This is indeed a case of weird interaction between threads and processes.  The only reference I could find is a single comment on a StackOverflow question:
Be aware that infinite waits on semaphores, handles etc can cause your process to become a zombie in both Windows and Linux.

The reason I'm posting this detailed explanation is that I hit the exact same issue when trying to debug the PEP 556 implementation, and it took me quite some time (and Pablo's help) to finally understand and work around it.
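For what it's worth, one general way to avoid this kind of hang in a forked child (a common technique, not necessarily the exact fix used for PEP 556) is to have the child exit through os._exit(), which terminates the whole process immediately without joining background threads:

```python
import os
import threading
import time

pid = os.fork()
if pid == 0:
    threading.Thread(target=lambda: time.sleep(3600)).start()
    # os._exit() tears down the whole child process at once,
    # without waiting for the background sleeping thread.
    os._exit(0)

# Parent: the child can now be reaped promptly.
reaped_pid, status = os.waitpid(pid, 0)
print("reaped:", reaped_pid == pid, "status:", os.WEXITSTATUS(status))
```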

In the end, I would recommend not calling fork() directly, but using multiprocessing with the "forkserver" start method, which eliminates such problems.
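A minimal sketch of that recommendation (hypothetical worker function; the "forkserver" start method is only available on Unix):

```python
import multiprocessing as mp
import time

def worker():
    # A forkserver child does not inherit the parent's threads,
    # so it exits cleanly when this function returns.
    time.sleep(0.1)

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")
    p = ctx.Process(target=worker)
    p.start()
    p.join()
    print("exitcode:", p.exitcode)  # 0 on clean exit
```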