Title: concurrent.futures.ProcessPoolExecutor freezes depending on complexity
Components: macOS
Versions: Python 3.7
Status: open
Nosy List: DanilZ, bquinlan, kj, ned.deily, pitrou, ronaldoussoren
Priority: normal

Created on 2020-11-02 13:33 by DanilZ, last changed 2020-11-19 15:43 by ronaldoussoren.

Files
concur_fut_freeze.py (uploaded by DanilZ, 2020-11-02 13:33): code to reproduce.
Messages (13)
msg380220 - (view) Author: DanilZ (DanilZ) Date: 2020-11-02 13:33
Note: problem occurs only after performing the RandomizedSearchCV...

When a function is submitted to worker processes through concurrent.futures and it does anything other than call print(), it is never executed and the process freezes.

Here is the code to reproduce.

from xgboost import XGBRegressor
from sklearn.model_selection import KFold
import concurrent.futures
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# STEP 1
# ----------------------------------------------------------------------------
# simulate RandomizedSearchCV

data = make_regression(n_samples=500, n_features=100, n_informative=10, n_targets=1, random_state=5)
X = pd.DataFrame(data[0])
y = pd.Series(data[1])
kf = KFold(n_splits = 3, shuffle = True, random_state = 5)
model = XGBRegressor(n_jobs = -1)
params = {
        'min_child_weight':     [0.1, 1, 5],
        'subsample':            [0.5, 0.7, 1.0],
        'colsample_bytree':     [0.5, 0.7, 1.0],
        'eta':                  [0.005, 0.01, 0.1],
        'n_jobs':               [-1]
        }
random_search = RandomizedSearchCV(
        model,
        param_distributions =   params,
        n_iter =                50,
        n_jobs =                -1,
        refit =                 True, # necessary for random_search.best_estimator_
        cv =                    kf.split(X,y),
        verbose =               1,
        random_state =          5
        )
random_search.fit(X, np.array(y))

# STEP 2.0
# ----------------------------------------------------------------------------
# test if multiprocessing is working in the first place

def just_print():
    print('Just printing')

with concurrent.futures.ProcessPoolExecutor() as executor:
    results_temp = [executor.submit(just_print) for i in range(0,12)]
# ----------------------------------------------------------------------------


# STEP 2.1
# ----------------------------------------------------------------------------
# test on a slightly more complex function

def fit_model():
    # JUST CREATING A DATASET, NOT EVEN FITTING ANY MODEL!!! AND IT FREEZES
    data = make_regression(n_samples=500, n_features=100, n_informative=10, n_targets=1, random_state=5)
    # model = XGBRegressor(n_jobs = -1)
    # model.fit(data[0],data[1])
    print('Fit complete')

with concurrent.futures.ProcessPoolExecutor() as executor:
    results_temp = [executor.submit(fit_model) for i in range(0,12)]
# ----------------------------------------------------------------------------


I have attached this code as a .py file.
msg380225 - (view) Author: Ken Jin (kj) * Date: 2020-11-02 14:34
Hello, it would be great if you could provide more details, such as your operating system and version, how many logical CPU cores your machine has, and the exact Python version including major and minor versions (e.g. Python 3.8.2). Multiprocessing behaves differently depending on those factors.
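A minimal sketch of how the details requested above could be collected in one go, using only the standard library. The helper name `environment_report` is hypothetical and introduced here purely for illustration.

```python
import os
import platform


def environment_report():
    """Collect OS, Python version and logical CPU count as strings."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        # os.cpu_count() reports logical cores and may return None
        "logical_cpus": str(os.cpu_count()),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a report gives the operating system, the exact Python version and the logical core count in a single step.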

FWIW I reduced your code down to make it easier to read, and removed all the unused variables:

import concurrent.futures
from sklearn.datasets import make_regression

def just_print():
    print('Just printing')

def fit_model():
    data = make_regression(n_samples=500, n_features=100, n_informative=10, n_targets=1, random_state=5)
    print('Fit complete')

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results_temp = [executor.submit(just_print) for i in range(0,12)]

    with concurrent.futures.ProcessPoolExecutor() as executor:
        results_temp = [executor.submit(fit_model) for i in range(0,12)]

The problem is that I am *unable* to reproduce the bug you are reporting on Windows 10 64-bit, Python 3.7.6. The code runs till completion for both examples. I have a hunch that your problem lies elsewhere in one of the many libraries you imported.

>>> Note: problem occurs only after performing the RandomizedSearchCV...

As you noted, I skimmed through RandomizedSearchCV's source code and docs. RandomizedSearchCV is purportedly able to use a multiprocessing backend for parallel tasks. By setting `n_jobs=-1` in your params, you're telling it to use all logical CPU cores. I'm unsure how many additional processes and pools RandomizedSearchCV spawns when it is called, but this sounds suspicious. The concurrent.futures docs specifically warn that this may exhaust available workers and cause tasks to never complete. See https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor (the docs here are for ThreadPoolExecutor, but they still apply).

A temporary workaround might be to reduce n_jobs OR even better: use scikit-learn's multiprocessing parallel backend that's dedicated for that, and should have the necessary protections in place against such behavior. https://joblib.readthedocs.io/en/latest/parallel.html#joblib.parallel_backend 
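One way to act on the "reduce n_jobs" suggestion is to divide the available cores between the outer pool and the work done inside each worker. The sketch below assumes oversubscription is the concern; `inner_n_jobs` is a hypothetical helper, not part of scikit-learn, joblib or concurrent.futures, and it does not address other possible causes of a hang.

```python
import os


def inner_n_jobs(outer_workers, total_cpus=None):
    """Return how many jobs each outer worker may use.

    The result is total_cpus // outer_workers, floored at 1, so the
    product stays within total_cpus whenever outer_workers <= total_cpus.
    """
    if outer_workers < 1:
        raise ValueError("outer_workers must be at least 1")
    if total_cpus is None:
        # os.cpu_count() may return None; fall back to a single core
        total_cpus = os.cpu_count() or 1
    return max(1, total_cpus // outer_workers)


if __name__ == "__main__":
    workers = 4
    print(f"{workers} outer workers, n_jobs={inner_n_jobs(workers)} inside each")
```

The returned value could then be passed as `n_jobs` to the estimator or search object in place of `-1`.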


TLDR: I don't think this is a Python bug and I'm in favor of this bug being closed as `not a bug`.
msg380228 - (view) Author: DanilZ (DanilZ) Date: 2020-11-02 15:06
Hi Ken, thanks for a quick reply.

Here are the requested specs.
System:
Python 3.7.6
OS X 10.15.7

Packages:
XGBoost 1.2.0
sklearn 0.22.2
pandas 1.0.5
numpy 1.18.1

I can see that you have reduced the code, which now excludes the RandomizedSearchCV part. This reduced code runs without any problems on my side as well, but when it runs after RandomizedSearchCV, the last function, fit_model(), freezes in the worker processes.

I will read through the docs, but at first glance it looks as though the actual problem is in the concurrent.futures module, because the simple function just_print() runs without issues. So the freeze is triggered by adding minor complexity to the fit_model() function running in the worker processes.
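The reproduction script never inspects the futures it creates, so an exception raised in a worker would go unnoticed and a hang cannot be told apart from slow work. A minimal sketch of a bounded wait follows; the helper name `wait_and_report` and the timeout values are assumptions chosen for illustration.

```python
import concurrent.futures
import time


def wait_and_report(futures, timeout):
    """Wait up to `timeout` seconds; return (finished, pending) counts."""
    done, not_done = concurrent.futures.wait(futures, timeout=timeout)
    for future in done:
        # result() re-raises any exception that occurred in the worker
        future.result()
    return len(done), len(not_done)


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        # time.sleep stands in for real work
        futures = [executor.submit(time.sleep, 0.1) for _ in range(4)]
        finished, pending = wait_and_report(futures, timeout=30)
        print(f"{finished} finished, {pending} still pending after the timeout")
```

Note that leaving the `with` block still waits for the workers, so a worker that is truly stuck will keep the script from exiting; the counts only make the hang visible.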

msg380229 - (view) Author: DanilZ (DanilZ) Date: 2020-11-02 15:14
Here is a gif of what’s going on in Activity Monitor on my Mac while this code is executed:
https://gfycat.com/unselfishthatgraysquirrel
msg380233 - (view) Author: DanilZ (DanilZ) Date: 2020-11-02 15:48
FYI: I’ve tried all three of the possible backends: ‘loky’ (default), ‘threading’ and ‘multiprocessing’. None of them solved the problem.

msg380236 - (view) Author: Ken Jin (kj) * Date: 2020-11-02 16:03
Hmm, apologies, I'm stumped then. The only thing I managed to surmise from xgboost's and scikit-learn's GitHub issues is that this is a recurring issue, specifically when using GridSearchCV:

Threads with discussions on workarounds:
https://github.com/scikit-learn/scikit-learn/issues/6627
https://github.com/scikit-learn/scikit-learn/issues/5115

Issues reported:
https://github.com/dmlc/xgboost/issues/2163
https://github.com/scikit-learn/scikit-learn/issues/10533
https://github.com/scikit-learn/scikit-learn/issues/10538 (this looks quite similar to your issue)

Some quick workarounds I saw were:
1. Remove n_jobs argument from GridSearchCV
2. Use parallel_backend from sklearn.externals.joblib rather than concurrent.futures so that the pools from both libraries don't have weird interactions.

I recommend opening an issue on scikit-learn/XGBoost's GitHub. This seems like a common problem that they face.
msg380240 - (view) Author: DanilZ (DanilZ) Date: 2020-11-02 17:28
Thank you so much for the input! I will study all the links you have sent.

Here is a screen recording of some additional experiments:
https://vimeo.com/user50681456/review/474733642/b712c12c2c
msg380764 - (view) Author: DanilZ (DanilZ) Date: 2020-11-11 14:59
I have managed to solve the problem by inserting the following at the beginning of my program:

import multiprocessing
multiprocessing.set_start_method('forkserver')

This is explained here: https://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux
It works, but the shell loses some interactivity, as intermediate results don't get printed while the program executes.
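As an alternative to changing the start method for the whole program, ProcessPoolExecutor accepts an `mp_context` argument (added in Python 3.7), so the choice can be limited to a single pool. This is a sketch under that assumption; the helpers `pick_start_method` and `make_executor` are hypothetical names, and the fallback to "spawn" is an illustrative choice for platforms without "forkserver".

```python
import concurrent.futures
import multiprocessing
import os


def pick_start_method(preferred="forkserver"):
    """Return `preferred` if this platform supports it, else "spawn"."""
    available = multiprocessing.get_all_start_methods()
    return preferred if preferred in available else "spawn"


def make_executor(max_workers=2, preferred="forkserver"):
    """Build a process pool whose workers use the chosen start method."""
    context = multiprocessing.get_context(pick_start_method(preferred))
    return concurrent.futures.ProcessPoolExecutor(
        max_workers=max_workers, mp_context=context
    )


if __name__ == "__main__":
    with make_executor() as executor:
        # os.getpid stands in for real work
        futures = [executor.submit(os.getpid) for _ in range(4)]
        for future in futures:
            print("ran in worker pid", future.result())
```

With "forkserver" or "spawn" the workers import the main module, so the code that creates the pool is kept under the `if __name__ == "__main__":` guard.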

msg380766 - (view) Author: Ken Jin (kj) * Date: 2020-11-11 15:12
Danil, thanks for finding the cause behind this. Could you check whether the new behavior in Python 3.8 and higher has the same problem on your machine (without your fix)? multiprocessing on macOS started using spawn by default in 3.8, and I was wondering if that fixed it.

What's new entry for 3.8 : 
https://docs.python.org/3/whatsnew/3.8.html#multiprocessing

The bug tracked:
https://bugs.python.org/issue33725

The PR for that
https://github.com/python/cpython/pull/13603/files
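A quick way to see which start method a given interpreter uses by default is sketched below. The helper name `start_method_report` is hypothetical; per the What's New entry linked above, macOS should report "spawn" from 3.8 onward.

```python
import multiprocessing
import sys


def start_method_report():
    """Describe the start methods this interpreter offers."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "platform": sys.platform,
        "available": multiprocessing.get_all_start_methods(),
        # Side effect: if no start method was chosen yet, this call
        # fixes it to the platform default for the rest of the process.
        "default": multiprocessing.get_start_method(),
    }


if __name__ == "__main__":
    for key, value in start_method_report().items():
        print(f"{key}: {value}")
```

Because `get_start_method()` fixes the default as a side effect, a later `multiprocessing.set_start_method(...)` in the same process would need `force=True`.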
msg380768 - (view) Author: DanilZ (DanilZ) Date: 2020-11-11 15:23
Hi Ken, 

Thanks for your comment.

Unfortunately, I cannot upgrade to 3.8 at this time to run this test. My whole system depends on 3.7, and some peculiarities of 3.8 need to be dealt with first.

It would be great if someone with OS X and 3.8 could test this out; otherwise I will dig into this later by creating a new environment.

msg380793 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-11 19:54
The script as-is doesn't work with 3.8 because 3.8 uses the "spawn" strategy. I haven't tried to tweak the script to get it to work on 3.8 because the script works fine for me with 3.7.

The smaller script in msg380225 works for me on both Python 3.7.4 and 3.8.3.

Pip list says:

Package         Version
--------------- -------
joblib          0.17.0 
numpy           1.19.4 
pandas          1.1.4  
pip             19.0.3 
python-dateutil 2.8.1  
pytz            2020.4 
scikit-learn    0.23.2 
scipy           1.5.4  
setuptools      40.8.0 
six             1.15.0 
sklearn         0.0    
threadpoolctl   2.1.0  
xgboost         1.2.1
msg381414 - (view) Author: DanilZ (DanilZ) Date: 2020-11-19 13:51
Dear All,

Thanks for the great input. As described above, it appears to be a macOS problem.
msg381438 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2020-11-19 15:43
Could someone who runs into this issue with Python 3.7 please test whether the issue is still present in 3.8 or 3.9?

BTW, I'm not convinced this is a macOS-specific problem; see issue40379, which claims that the fork-without-exec strategy is inherently broken.
History
Date User Action Args
2020-11-19 15:43:32 ronaldoussoren set messages: + msg381438
2020-11-19 13:51:58 DanilZ set messages: + msg381414
2020-11-11 19:54:00 ronaldoussoren set messages: + msg380793
2020-11-11 15:23:24 DanilZ set messages: + msg380768
2020-11-11 15:12:14 kj set messages: + msg380766
2020-11-11 14:59:08 DanilZ set messages: + msg380764
2020-11-02 17:28:06 DanilZ set messages: + msg380240
2020-11-02 16:03:10 kj set messages: + msg380236
2020-11-02 15:48:29 DanilZ set messages: + msg380233
2020-11-02 15:14:56 DanilZ set messages: + msg380229
2020-11-02 15:06:01 DanilZ set messages: + msg380228
2020-11-02 14:34:21 kj set nosy: + kj; messages: + msg380225
2020-11-02 13:33:55 DanilZ create