This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Worker stall in multiprocessing.Pool
Type: behavior
Stage:
Components:
Versions: Python 3.3, Python 3.4, Python 3.5, Python 2.7

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: chroxvi, davin, sbt
Priority: normal
Keywords:

Created on 2015-12-18 14:40 by chroxvi, last changed 2022-04-11 14:58 by admin.

Messages (1)
msg256684 - (view) Author: Christian Schou Oxvig (chroxvi) Date: 2015-12-18 14:40
I am experiencing seemingly random stalls in my scientific simulations, which use a multiprocessing.Pool for parallelization. It has been incredibly difficult to come up with an example that consistently reproduces the problem; it seems more or less random if and when it occurs. The snippet below is my best shot at something that has a good chance of hitting the problem. I know it is unfortunate to have PyTables in the mix, but it is the only example I have been able to come up with that almost always hits the problem. I have been able to reproduce the problem (once!) by simply removing the with-statement (and thus PyTables) from the work function. However, by doing so (at least in my runs), the chance of hitting the problem almost vanishes. Also, judging from the output of the script, the cause of the problem appears to lie in Python and not in PyTables.


import os
import multiprocessing as mp
import tables

_hdf_db_name = 'join_crash_test.hdf'
_lock = mp.Lock()


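# Context manager that serializes access to the HDF5 file across
# processes via the module-level lock (shared with the forked workers).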
class File(object):

    def __init__(self, *args, **kwargs):
        self._args = args
        self._kwargs = kwargs

        if len(args) > 0:
            self._filename = args[0]
        else:
            self._filename = kwargs['filename']

    def __enter__(self):
        _lock.acquire()
        self._file = tables.open_file(*self._args, **self._kwargs)
        return self._file

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self._file.close()
        _lock.release()


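# Task executed in a pool worker: append one array to the shared HDF5 file.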
def work(task):
    worker_num, iteration = task

    with File(_hdf_db_name, mode='a') as h5_file:
        h5_file.create_array('/', 'a{}_{}'.format(worker_num, iteration),
                             obj=task)
    print('Worker {} finished writing to HDF table at iteration {}'.format(
        worker_num, iteration))

    return (worker_num, iteration)

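# With maxtasksperchild=1, every task is handled by a freshly forked worker.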
iterations = 10
num_workers = 24
maxtasks = 1

if os.path.exists(_hdf_db_name):
    os.remove(_hdf_db_name)

for iteration in range(iterations):
    print('Now processing iteration: {}'.format(iteration))
    tasks = zip(range(num_workers), num_workers * [iteration])
    print('Spawning worker pool')
    # Create the pool before the try block so that 'workers' is defined
    # in the finally clause even if Pool() itself raises.
    workers = mp.Pool(num_workers, maxtasksperchild=maxtasks)
    try:
        print('Mapping tasks')
        results = workers.map(work, tasks, chunksize=1)
    finally:
        print('Cleaning up')
        workers.close()
        print('Workers closed - joining')
        workers.join()
        print('Process terminated')


In most of my test runs, this example stalls at "Workers closed - joining" in one of the iterations. Hitting Ctrl-C and inspecting the stack shows that the main process is waiting for a single worker that never stops executing. I have tested the example on various combinations of the operating systems and Python versions listed below.

Ubuntu 14.04.1 LTS
Ubuntu 14.04.3 LTS
ArchLinux (updated as of December 14, 2015)

Python 2.7.10 :: Anaconda 2.2.0 (64-bit)
Python 2.7.11 :: Anaconda 2.4.0 (64-bit)
Python 2.7.11 (Arch Linux 64-bit build)
Python 3.3.5 :: Anaconda 2.1.0 (64-bit)
Python 3.4.3 :: Anaconda 2.3.0 (64-bit)
Python 3.5.0 :: Anaconda 2.4.0 (64-bit)
Python 3.5.1 (Arch Linux 64-bit build)
Python 3.5.1 :: Anaconda 2.4.0 (64-bit)

It seems that some combinations are more likely to reproduce the problem than others. In particular, all of the Python 3 builds reproduce the problem on almost every run, whereas I have not been able to reproduce it with the above example on any version of Python 2. I have, however, seen what appears to be the same problem in one of my simulations using Python 2.7.11: after 5 hours it stalled very close to the point of closing a Pool. Inspecting the HDF database holding the results showed that all but one of the 4000 tasks submitted to the Pool had finished. To me, this suggests that a single worker never finished executing.
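For pinpointing where such a worker is blocked without killing the script, one option on Python 3 (Unix only) is the standard faulthandler module. A minimal sketch, added at the top of the reproducer:


import faulthandler
import signal

# Dump the traceback of every thread to stderr whenever this process
# receives SIGUSR1 (faulthandler.register() is Unix-only, Python >= 3.3).
faulthandler.register(signal.SIGUSR1)


With this in place, sending SIGUSR1 to the stalled worker (kill -USR1 <pid>) prints its current Python stack.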

The problem I am describing here might very well be related to issue9205 as well as issue22393. However, I am not sure how to verify whether this is indeed the case.
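
In the meantime, a possible stopgap (not a fix) is to replace the blocking map() call with map_async() and a timeout, so the parent can terminate a stalled pool instead of hanging forever in join(). A minimal sketch of the per-iteration logic, assuming an arbitrary 600-second budget:


print('Spawning worker pool')
workers = mp.Pool(num_workers, maxtasksperchild=maxtasks)
async_result = workers.map_async(work, tasks, chunksize=1)
try:
    # 600 s is an illustrative budget, not a measured value.
    results = async_result.get(timeout=600)
except mp.TimeoutError:
    workers.terminate()  # forcibly kill all workers, including the stalled one
    raise
else:
    workers.close()
finally:
    workers.join()


Note that terminate() kills the workers abruptly, so any partially written HDF5 data from the stalled iteration may be lost.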
History
Date                 User            Action  Args
2022-04-11 14:58:25  admin           set     github: 70094
2015-12-18 16:38:10  r.david.murray  set     nosy: + sbt, davin
2015-12-18 14:40:44  chroxvi         create