classification
Title: pool worker can't be terminated
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: davin, huangzhesi, pitrou, ppperry, terry.reedy
Priority: normal Keywords:

Created on 2021-01-07 04:00 by huangzhesi, last changed 2021-01-15 21:53 by ppperry.

Messages (3)
msg384560 - (view) Author: Zhesi Huang (huangzhesi) Date: 2021-01-07 04:00
i see a case, the worker proc of pool can't be terminated, i try kill -SIGTERM or -SIGINT, but it still can't terminated the worker proc, then the pool __exit__ will hang forever

```
class NonDaemonProcess(multiprocessing.Process):
    # https://stackoverflow.com/questions/6974695/python-process-pool-non-daemonic

    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False

    def _set_daemon(self, value):
        pass

    daemon = property(_get_daemon, _set_daemon)


def wrapper_copy_parallel(output_name, local_output_path, obs_path):
    log_thread('upload the content of [%s] outputs' % output_name)
    try:
        if os.path.exists(local_output_path):
            log_thread('%s has %d files to be uploaded' % (output_name, len(file_list)))
            
            xxxxxx

            log_thread('upload the content of [%s] outputs successfully' % output_name)
            log_thread('it can be accessed at obs path [%s]' % obs_path)
        else:
            log_thread('local output path is not found, skip upload the content of [%s] outputs' % output_name)
    except Exception as upload_exception:
        err_thread('upload the content of [%s] outputs failed: %s' % (output_name, str(upload_exception)))
        return 255

    return 0


def upload_to_s3():
    """
    upload the content of local path to s3, handle action [on_completed]
    :return:
    """
    outputs = []
    for local_output_path, (output_name, obs_path, action, _, _) in local_to_target.items():
        if action == ACTION_ON_COMPLETED:
            outputs.append((output_name, local_output_path, obs_path))

    if len(outputs) == 0:
        return 0

    with NonDaemonPool(processes=len(outputs)) as pool:
        results = pool.starmap(wrapper_copy_parallel, outputs)

    for result in results:
        if result != 0:
            return result

    return 0
```

stdout/stderr log

[ma-test Service Log][INFO][2021/01/05 03:07:49,218]: registered signal handler
[ma-test Service Log][INFO][2021/01/05 03:09:40,299]: output-handler finalizing
[ma-test Service Log][INFO][2021-01-05 03:09:40,309][NonDaemonPoolWorker-1]: child process calling self.run()
[ma-test Service Log][INFO][2021-01-05 03:09:40,311][NonDaemonPoolWorker-1]: upload the content of [] outputs
[ma-test Service Log][INFO][2021-01-05 03:09:41,331][Process-1:1]: child process calling self.run()
[ma-test Service Log][INFO][2021-01-05 03:09:41,333][Process-1:2]: child process calling self.run()
[ma-test Service Log][INFO][2021-01-05 03:09:41,338][Process-1:3]: child process calling self.run()
[ma-test Service Log][INFO][2021-01-05 03:09:41,351][Process-1:1]: process shutting down
[ma-test Service Log][INFO][2021-01-05 03:09:41,351][Process-1:1]: process exiting with exitcode 0
[ma-test Service Log][INFO][2021-01-05 03:09:41,386][Process-1:2]: process shutting down
[ma-test Service Log][INFO][2021-01-05 03:09:41,386][Process-1:2]: process exiting with exitcode 0
[ma-test Service Log][INFO][2021-01-05 03:09:41,410][Process-1:3]: process shutting down
[ma-test Service Log][INFO][2021-01-05 03:09:41,410][Process-1:3]: process exiting with exitcode 0
[ma-test Service Log][INFO][2021-01-05 03:09:41,415][NonDaemonPoolWorker-1]: upload the content of [] outputs successfully
[ma-test Service Log][INFO][2021-01-05 03:09:41,415][NonDaemonPoolWorker-1]: it can be accessed at obs path [s3://ma-test-algorancher-intel/model_evaluation/6e5746ff-2839-400a-ba93-df38311415f4/dac957b0-b43b-43e2-ab19-0b45672a7ea0/]

18 (pid) proc python stacktrace

>>>
Interrupting process at following point:
  File "/home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py", line 396, in <module>
    ret_code = upload_to_obs()
  File "/home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py", line 287, in upload_to_obs
    results = pool.starmap(wrapper_copy_parallel, outputs)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/pool.py", line 623, in __exit__
    self.terminate()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/pool.py", line 548, in terminate
    self._terminate()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/util.py", line 201, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/pool.py", line 617, in _terminate_pool
    p.join()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)

18 (pid) proc gdb bt

#0  0x00007f15c2c5ff7b in waitpid () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055f54a8a5180 in os_waitpid_impl.isra.78 (options=0, pid=28) at /tmp/build/80754af9/python_1588882889832/work/Modules/posixmodule.c:7066
#2  os_waitpid () at /tmp/build/80754af9/python_1588882889832/work/Modules/clinic/posixmodule.c.h:3049
#3  0x000055f54a7f1b19 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:655
#4  0x000055f54a7f1db1 in _PyCFunction_FastCallKeywords (func=0x7f15c1f175f0, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>)
    at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:734

28 (pid) proc python stacktrace

Interrupting process at following point:
  File "/home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py", line 396, in <module>
    ret_code = upload_to_obs()
  File "/home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py", line 286, in upload_to_obs
    with NonDaemonPool(processes=len(outputs)) as pool:
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/pool.py", line 176, in __init__
    self._repopulate_pool()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/pool.py", line 241, in _repopulate_pool
    w.start()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 74, in _launch
    code = process_obj._bootstrap()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "/home/ma-user/miniconda3/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()

28 (pid) proc gdb bt

#0  0x00007f15c2c5e827 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f15c2c5e8d4 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f15c2c5e97a in sem_wait@@GLIBC_2.2.5 () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f15a98058b1 in semlock_acquire () at /usr/local/src/conda/python-3.7.7/Modules/_multiprocessing/semaphore.c:319
#4  0x000055f54a7f1c94 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/call.c:694
#5  0x000055f54a7f8aef in _PyMethodDescr_FastCallKeywords () at /tmp/build/80754af9/python_1588882889832/work/Objects/descrobject.c:288

proc list

ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
ma-user      1     0  0 Jan05 ?        00:00:00 /bin/bash /home/ma-user/sidecar.sh
ma-user     16     1  0 Jan05 ?        00:00:00 tail -n 0 -F /home/ma-user/ma-test/log/-0---ma-platform-sidecar.log
ma-user     18     1  0 Jan05 ?        00:00:00 python /home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py
ma-user     28    18  0 Jan05 ?        00:00:00 python /home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py
ma-user    102     0  0 Jan05 pts/0    00:00:00 bash
ma-user    911     0  0 14:56 pts/1    00:00:00 bash
ma-user    931   911  0 14:57 pts/1    00:00:00 ps -ef
ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ 

28 fd

ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ ls -alh /proc/28/fd
total 0
dr-x------ 2 ma-user ma-group  0 Jan  6 20:23 .
dr-xr-xr-x 9 ma-user ma-group  0 Jan  5 17:27 ..
lr-x------ 1 ma-user ma-group 64 Jan  6 20:23 0 -> /dev/null
l-wx------ 1 ma-user ma-group 64 Jan  6 20:23 1 -> /home/ma-user/ma-test/log/-0---ma-platform-sidecar.log
l-wx------ 1 ma-user ma-group 64 Jan  6 20:23 2 -> /home/ma-user/ma-test/log/-0---ma-platform-sidecar.log
lr-x------ 1 ma-user ma-group 64 Jan  6 20:23 3 -> /dev/urandom
lr-x------ 1 ma-user ma-group 64 Jan  6 20:23 4 -> pipe:[952852668]
lrwx------ 1 ma-user ma-group 64 Jan  6 20:23 5 -> socket:[952857499]
lrwx------ 1 ma-user ma-group 64 Jan  6 20:23 6 -> /home/ma-user/conf/scc_logger.conf
l-wx------ 1 ma-user ma-group 64 Jan  6 20:23 7 -> pipe:[952852671]
lr-x------ 1 ma-user ma-group 64 Jan  6 20:23 8 -> /dev/null
l-wx------ 1 ma-user ma-group 64 Jan  6 20:23 9 -> pipe:[952852675]

18 fd

ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ ls -alh /proc/18/fd
total 0
dr-x------ 2 ma-user ma-group  0 Jan  6 20:26 .
dr-xr-xr-x 9 ma-user ma-group  0 Jan  5 03:07 ..
lr-x------ 1 ma-user ma-group 64 Jan  6 20:26 0 -> /dev/null
l-wx------ 1 ma-user ma-group 64 Jan  6 20:26 1 -> /home/ma-user/ma-test/log/-0---ma-platform-sidecar.log
l-wx------ 1 ma-user ma-group 64 Jan  6 20:26 2 -> /home/ma-user/ma-test/log/-0---ma-platform-sidecar.log
lr-x------ 1 ma-user ma-group 64 Jan  6 20:26 3 -> /dev/urandom
lr-x------ 1 ma-user ma-group 64 Jan  6 20:26 4 -> pipe:[952852668]
l-wx------ 1 ma-user ma-group 64 Jan  6 20:26 5 -> pipe:[952852668]
lr-x------ 1 ma-user ma-group 64 Jan  6 20:26 6 -> pipe:[952852671]
l-wx------ 1 ma-user ma-group 64 Jan  6 20:26 7 -> pipe:[952852671]
lr-x------ 1 ma-user ma-group 64 Jan  6 20:26 8 -> pipe:[952852675]

SIGTERM / SIGINT not work

ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
ma-user      1     0  0 Jan05 ?        00:00:00 /bin/bash /home/ma-user/sidecar.sh
ma-user     16     1  0 Jan05 ?        00:00:00 tail -n 0 -F /home/ma-user/ma-test/log/-0---ma-platform-sidecar.log
ma-user     18     1  0 Jan05 ?        00:00:00 python /home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py
ma-user     28    18  0 Jan05 ?        00:00:00 python /home/ma-user/runtime-scripts-v2/init-container/outputs-handler.py
ma-user    102     0  0 Jan05 pts/0    00:00:00 bash
ma-user   1166     0  0 Jan06 pts/1    00:00:00 bash
ma-user   1216  1166  0 10:51 pts/1    00:00:00 ps -ef
ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ kill -SIGTERM 28
ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ 
ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ kill -SIGINT 28
ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ 
ma-user@ma-test-eval-dac957b0-b43b-43e2-ab19-0b45672a7ea0-worker-0:/$ 

finally, pool hang
msg384698 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2021-01-08 23:42
3.7 has not gotten bug fixes for a couple of years.  This needs to be verified on a current release.
msg385126 - (view) Author: (ppperry) Date: 2021-01-15 21:53
duplicate of issue22393?
History
Date User Action Args
2021-01-15 21:53:53ppperrysetnosy: + ppperry
messages: + msg385126
2021-01-08 23:42:29terry.reedysetnosy: + terry.reedy, pitrou, davin
messages: + msg384698
2021-01-07 04:00:06huangzhesicreate