classification
Title: Exception in multiprocessing/context.py under load
Type: crash
Stage: resolved
Components:
Versions: Python 3.7
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Multiprocessing docs don't describe thread-safety (view: 40815)
Assigned To:
Nosy List: Arkady M, davin, iritkatriel, pitrou
Priority: normal
Keywords:

Created on 2020-06-04 03:06 by Arkady M, last changed 2020-10-20 11:16 by iritkatriel. This issue is now closed.

Messages (12)
msg370695 - (view) Author: Arkady (Arkady M) Date: 2020-06-04 03:06
I am running an HTTP server (socketserver.ThreadingMixIn, http.server.HTTPServer) in a Docker container (FROM ubuntu:19.10)

Occasionally I get an exception:

Exception happened during processing of request from ('172.17.0.1', 35756)
Traceback (most recent call last):
  File "/usr/lib/python3.7/socketserver.py", line 650, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.7/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "service.py", line 221, in __init__
    super(UrlExtractorServer, self).__init__(*args, **kwargs)
  File "/usr/lib/python3.7/socketserver.py", line 720, in __init__
    self.handle()
  File "/usr/lib/python3.7/http/server.py", line 426, in handle
    self.handle_one_request()
  File "/usr/lib/python3.7/http/server.py", line 414, in handle_one_request
    method()
  File "service.py", line 488, in do_POST
    self._post_extract(url)
  File "service.py", line 459, in _post_extract
    extracted_links, err_msg = self._extract_links(transaction_id, attachment_id, zip_password, data)
  File "service.py", line 403, in _extract_links
    error, results = call_timeout(process_deadline, extractor.extract_links_binary_multiprocess, args=data)
  File "service.py", line 175, in call_timeout
    manager = multiprocessing.Manager()
  File "/usr/lib/python3.7/multiprocessing/context.py", line 56, in Manager
    m.start()
  File "/usr/lib/python3.7/multiprocessing/managers.py", line 563, in start
    self._process.start()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 111, in start
    _cleanup()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 56, in _cleanup
    if p._popen.poll() is not None:
AttributeError: 'NoneType' object has no attribute 'poll'


I am in the process of preparing a reasonably simple piece of code that demonstrates the problem.

Meanwhile, the following may be important. In the code below I am getting elapsed < timeout (about 20 times out of 70K runs). In every such case psutil.Process() raised psutil.NoSuchProcess.

    time_start = time.time()
    job = multiprocessing.Process(target=func, args=(args, results), kwargs=kwargs)
    job.start()
    job.join(timeout)
    elapsed = time.time() - time_start
    if job.is_alive():
        try:
            process = psutil.Process(job.pid)
            # status is a method in psutil >= 2.0
            process_error = f"pid {job.pid} status {process.status()} {process}"
        except Exception as e:
            process_error = f"psutil.Process() failed {e}"
        if elapsed < timeout:
            print("elapsed < timeout")
msg370700 - (view) Author: Arkady (Arkady M) Date: 2020-06-04 08:17
This code reproduces the problem https://github.com/larytet-py/multiprocess
I assume that my use of join() is not correct.
msg370708 - (view) Author: Arkady (Arkady M) Date: 2020-06-04 13:27
The problem is likely in the call to multiprocessing.Process.join() with timeout. If I use timeout=None the code works.
msg370889 - (view) Author: Arkady (Arkady M) Date: 2020-06-07 11:40
Is there any news about this?

This 50-line sample reproduces the problem: https://github.com/larytet-py/multiprocess
Please let me know if more information is needed.
msg371166 - (view) Author: Arkady (Arkady M) Date: 2020-06-10 06:25
Update: I have reproduced the problem in code that does not call Process.join() at all. It requires more time and more load, but eventually Process.start() crashes.

Posted a question at https://stackoverflow.com/questions/62276345/call-to-pythons-mutliprocessing-process-join-fails
The problem reproduces on macOS as well.
msg371184 - (view) Author: Arkady (Arkady M) Date: 2020-06-10 12:13
A workaround is to synchronize the call to Process.start()

diff --git a/main.py b/main.py
index d09dc53..49d68f0 100644
--- a/main.py
+++ b/main.py
@@ -26,17 +26,24 @@ def load_cpu(deadline):
     while time.time() - start < 0.2*deadline:
         math.pow(random.randint(0, 1), random.randint(0, 1))

+def join_process(job, timeout):
+    time_start = time.time()
+    while time.time()-time_start < timeout and job.is_alive():
+        time.sleep(0.1 * timeout)
+        continue
+
 job_counter = 0
+lock = threading.Lock()
 def spawn_job(deadline):
     '''
     Creat a new Process, call join(), process errors
     '''    
     global job_counter
     time_start = time.time()
-    job = multiprocessing.Process(target=load_cpu, args=(deadline, ))
-    job.start()
-    # timeout=None in the call to join() solves the problem
-    job.join(deadline)
+    with lock:
+        job = multiprocessing.Process(target=load_cpu, args=(deadline, ))
+        job.start()
+    join_process(job, deadline)
msg371859 - (view) Author: Arkady (Arkady M) Date: 2020-06-19 09:26
There is a memory leak every time a call to join() fails (is https://bugs.python.org/issue37788 relevant?)
msg375087 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-08-09 23:08
The source code for the Process class is here: https://github.com/python/cpython/blob/master/Lib/multiprocessing/process.py

You can see that join and start both modify the global, non-thread-safe _children set. I'm guessing this is where you're seeing interference between threads.

I'm not sure a lock on start is enough - I think you need to get the lock for the join_process(job, deadline) call as well, because join can modify _children too. (Or, alternatively, manage all processes from a single thread.)
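Following this advice, one way to guard both calls can be sketched roughly as below. The helper names and the polling interval are ours, not stdlib API; treating is_alive() as safe to call without the lock is an assumption of this sketch.

```python
import multiprocessing
import threading
import time

# One lock guarding every call that mutates multiprocessing's global
# _children set (start() and the reaping join()).
_proc_lock = threading.Lock()

def start_process(target, args=()):
    # Hypothetical wrapper: create and start the child under the lock.
    with _proc_lock:
        p = multiprocessing.Process(target=target, args=args)
        p.start()
    return p

def join_process(p, timeout):
    # Poll outside the lock so one slow child cannot hold it for the
    # whole timeout; only the final, reaping join() runs under the lock.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline and p.is_alive():
        time.sleep(0.05)
    with _proc_lock:
        if not p.is_alive():
            p.join()  # mutates _children, so it happens under the lock
```

Usage would look like `p = start_process(print, ("child",)); join_process(p, 5)`. Polling trades a little latency (up to one sleep interval) for not serializing all joins behind a single mutex.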
msg375092 - (view) Author: Arkady (Arkady M) Date: 2020-08-10 03:04
I have switched to os.fork(). I am doing something like this: https://gist.github.com/larytet/3ca9f9a32b1dc089a24cb7011455141f
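The gist is the authoritative version of this change; as a rough, POSIX-only sketch of the fork()-based approach (the function name and the kill-on-deadline policy are ours, not taken from the gist):

```python
import os
import signal
import time

def run_with_deadline(func, deadline):
    """Fork a child, run func in it, and reap it within `deadline` seconds.

    Returns the child's exit status, or None if it had to be killed.
    POSIX-only: os.fork() is unavailable on Windows.
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the work, then exit without running parent cleanup.
        try:
            func()
        finally:
            os._exit(0)
    # Parent: poll non-blockingly until the deadline passes.
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return os.WEXITSTATUS(status)
        time.sleep(0.05)
    os.kill(pid, signal.SIGKILL)  # deadline exceeded: kill, then reap
    os.waitpid(pid, 0)
    return None
```

Since fork() and waitpid() never touch multiprocessing's module-level state, the _children race discussed above cannot occur here.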
msg375102 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-08-10 08:44
There is an open ticket to improve documentation of thread safety of the multiprocessing module: https://bugs.python.org/issue40815

Is there anything remaining to do on this ticket?
msg375104 - (view) Author: Arkady (Arkady M) Date: 2020-08-10 09:04
"documentation of thread safety" 

I find it surprising that a module called "multiprocessing" does not have a thread-safe API.

If this is inevitable, I guess that's life, but then I would expect nothing less than a bold, bright red warning at the top of the documentation page.

Protecting a call to join() with a mutex would impact the latency of the API: the slowest subprocess can win the race.
msg375294 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-08-13 08:34
I agree that it would be better not to lock join, and instead manage each process from one thread. 

I think this ticket can be closed as a duplicate of issue40815.
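The "manage each process from one thread" alternative can be sketched roughly like this (class and method names are ours; this minimal version also serializes jobs, running one child at a time, which a production version would need to relax):

```python
import multiprocessing
import queue
import threading

class ProcessManager:
    """A single thread that owns every Process object it creates."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, target, args=(), timeout=None):
        """Ask the manager thread to start, join, and reap one process.

        Blocks the caller until the child has been reaped; returns its
        exit code.
        """
        done = threading.Event()
        result = {}
        self._requests.put((target, args, timeout, done, result))
        done.wait()
        return result.get("exitcode")

    def _run(self):
        # Only this thread ever calls start()/join(), so multiprocessing's
        # module-level _children set is never mutated concurrently.
        while True:
            target, args, timeout, done, result = self._requests.get()
            p = multiprocessing.Process(target=target, args=args)
            p.start()
            p.join(timeout)
            if p.is_alive():
                p.terminate()  # deadline exceeded
                p.join()
            result["exitcode"] = p.exitcode
            done.set()
```

No lock is needed anywhere, and join() latency is unaffected by other callers' children, at the cost of funneling all process management through one queue.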
History
Date User Action Args
2020-10-20 11:16:26  iritkatriel  set     status: open -> closed
                                          superseder: Multiprocessing docs don't describe thread-safety
                                          resolution: duplicate
                                          stage: resolved
2020-08-13 08:34:48  iritkatriel  set     messages: + msg375294
2020-08-10 09:04:14  Arkady M     set     messages: + msg375104
2020-08-10 08:44:05  iritkatriel  set     messages: + msg375102
2020-08-10 03:04:08  Arkady M     set     messages: + msg375092
2020-08-09 23:08:48  iritkatriel  set     nosy: + iritkatriel
                                          messages: + msg375087
2020-06-19 09:26:17  Arkady M     set     messages: + msg371859
2020-06-10 12:13:49  Arkady M     set     messages: + msg371184
2020-06-10 06:56:22  ned.deily    set     nosy: + pitrou, davin
2020-06-10 06:25:33  Arkady M     set     messages: + msg371166
2020-06-07 11:40:48  Arkady M     set     messages: + msg370889
2020-06-04 13:27:42  Arkady M     set     messages: + msg370708
2020-06-04 08:17:55  Arkady M     set     messages: + msg370700
2020-06-04 03:06:13  Arkady M     create