Classification
Title: Make semaphore_tracker track other system resources
Type: resource usage          Stage: resolved
Components: Library (Lib)     Versions: Python 3.8

Process
Status: closed                Resolution: fixed
Dependencies:                 Superseder:
Assigned To:                  Nosy List: Olivier.Grisel, pablogsal, pierreglaser, pitrou, vstinner, xtreak
Priority: normal              Keywords: patch

Created on 2019-05-09 17:36 by pierreglaser, last changed 2019-06-12 11:47 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked
PR 13222 merged pierreglaser, 2019-05-09 18:12
PR 13276 merged pierreglaser, 2019-05-13 11:27
PR 13290 merged pitrou, 2019-05-13 17:09
PR 13292 merged pierreglaser, 2019-05-13 17:25
Messages (15)
msg341987 - (view) Author: Pierre Glaser (pierreglaser) * Date: 2019-05-09 17:36
Hi all,

Olivier Grisel, Thomas Moreau and I are currently working on widening the
scope of the semaphore_tracker in Python.

multiprocessing.semaphore_tracker is a little-known module that launches a
server process used to track the life cycle of semaphores created in a Python
session, and to clean up those semaphores once all Python processes of the
session have terminated. Normally, Python processes clean up the semaphores
they create. This is, however, not guaranteed if the processes are violently
interrupted (for example, with the bash command "killall python").

A note on why the semaphore_tracker was introduced: cleaning up semaphores
after termination is important because the system supports only a limited
number of named semaphores, and they are not automatically removed until the
next reboot.

Now, Python 3.8 introduces the creation of shared memory segments. Shared
memory is another sensitive, global system resource. Currently, unexpected
termination of processes that created memory segments results in leaking
those segments. This can be problematic for large compute clusters with many
users that are rarely rebooted.

For this reason, we expanded the semaphore_tracker to also track shared memory
segments, and renamed it resource_tracker. Shared memory segments get
automatically tracked by the resource tracker when they are created. This is a
first, self-contained fix. (1)
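
For illustration, here is a minimal usage sketch of the Python 3.8 shared
memory API (the size and contents are made up; per the change described
above, creation implicitly registers the segment with the tracker, and, on
POSIX, unlinking unregisters it):

    from multiprocessing import shared_memory

    # Creating the segment implicitly registers it with the resource_tracker.
    shm = shared_memory.SharedMemory(create=True, size=1024)
    shm.buf[:5] = b"hello"   # the segment is usable as a regular buffer
    shm.close()              # close this process's view of the segment
    shm.unlink()             # destroy the segment and unregister it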

Additionally, supporting shared memory tracking led to a more generic design
for the resource_tracker: it can now easily be extended to track arbitrary
resource types.
A public API could potentially be exposed for users wishing to track other
resource types. For example, one may want to add tracking for temporary
folders created during Python sessions. Another use case is joblib, a widely
used parallel-computing package that is also the backend of scikit-learn.
Joblib relies heavily on memory mapping (memmapping). A public API could
extend the resource_tracker to track memmapped objects with very little code.

Therefore, this issue serves two purposes:
- referencing the semaphore_tracker enhancement mentioned in (1)
- discussing a potential public resource_tracker API.
msg342133 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-05-10 20:59
New changeset f22cc69b012f52882d434a5c44a004bc3aa5c33c by Antoine Pitrou (Pierre Glaser) in branch 'master':
bpo-36867: Make semaphore_tracker track other system resources (GH-13222)
https://github.com/python/cpython/commit/f22cc69b012f52882d434a5c44a004bc3aa5c33c
msg342201 - (view) Author: Pierre Glaser (pierreglaser) * Date: 2019-05-11 15:58
Shared memory segments are now tracked by the brand new resource_tracker!
Thanks Antoine for the review.

Does anyone have an opinion on introducing a public API for users to make the
resource_tracker track resources of their choice?

What we have in mind is:
- making the existing resource_tracker.register/unregister functions public
- adding a new function, resource_tracker.make_trackable(resource_type,
  cleanup_func), where:

    * resource_type is a string identifier that will be passed each time a
      resource of type resource_type needs tracking, via the call
      resource_tracker.register(resource_name, resource_type)
    * cleanup_func must be a callable taking a single string as argument (the
      name of the resource to clean up). This function will be called after
      the end of a process for each resource of type resource_type that the
      process did not clean up properly

Under the hood, make_trackable simply populates resource_tracker._CLEANUP_FUNCS
with new items.
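
A minimal sketch of what make_trackable itself could look like (this function
is the proposal, not an existing API; _CLEANUP_FUNCS is the internal registry
that resource_tracker already uses for semaphores and shared memory):

    def make_trackable(resource_type, cleanup_func):
        """Register a cleanup function for a new type of tracked resource."""
        if not callable(cleanup_func):
            raise TypeError("cleanup_func must be a callable")
        if resource_type in _CLEANUP_FUNCS:
            raise ValueError(
                "a cleanup function is already registered for %r"
                % resource_type)
        _CLEANUP_FUNCS[resource_type] = cleanup_func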

Here is a simple example:
    import os
    import shutil
    from multiprocessing import resource_tracker, util

    class ClassCreatingAFolder:
        """Class where each instance creates a temporary folder.

        Each temporary folder is supposed to exist for the duration of the
        instance that created it.
        """
        def __init__(self, folder_name):
            self.folder_name = folder_name
            os.mkdir(folder_name)

            # Any instance that is garbage-collected normally should remove
            # its folder, and notify the resource_tracker that its folder
            # was correctly removed.
            util.Finalize(self, ClassCreatingAFolder.cleanup,
                          args=(folder_name,))

            # If this session quits abruptly, the finalizer will not be
            # called for the instances of ClassCreatingAFolder that were
            # still alive before the shutdown. The resource_tracker comes
            # into play and removes the folders associated with each of
            # these resources.
            resource_tracker.register(
                folder_name,  # argument to shutil.rmtree
                "ClassCreatingAFolder")

        @staticmethod
        def cleanup(folder_name):
            resource_tracker.unregister(folder_name, "ClassCreatingAFolder")
            shutil.rmtree(folder_name)

    # Tell the resource_tracker how to clean up resources created by
    # ClassCreatingAFolder instances
    resource_tracker.make_trackable("ClassCreatingAFolder", shutil.rmtree)


Typical resources that can be made trackable include memmapped objects and
temporary folders. Our use case is joblib, which has its own mmap type that
we would like to track using the resource_tracker.
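
As a hypothetical sketch of the joblib use case with the proposed API (the
"joblib_mmap" type name, the path and make_trackable itself are all
illustrative, not existing APIs):

    import os
    from multiprocessing import resource_tracker

    # Deleting the backing file is all the cleanup a leaked memmap needs.
    resource_tracker.make_trackable("joblib_mmap", os.unlink)

    mmap_path = "/tmp/example.mmap"  # hypothetical memmap backing file
    resource_tracker.register(mmap_path, "joblib_mmap")
    # ... use the memmap; then, on normal cleanup:
    resource_tracker.unregister(mmap_path, "joblib_mmap")
    os.unlink(mmap_path)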

Any thoughts?
msg342246 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2019-05-12 08:27
Since the commit, there is a warning in CI and locally while running the tests.

Travis CI : 

* https://travis-ci.org/python/cpython/jobs/531304357#L2099 (test_multiprocessing_forkserver)
* https://travis-ci.org/python/cpython/jobs/531304357#L2296 (test_multiprocessing_fork)

Before commit : 

./python.exe -X tracemalloc -m test --fail-env-changed test_multiprocessing_forkserver
Run tests sequentially
0:00:00 load avg: 4.71 [1/1] test_multiprocessing_forkserver
test_multiprocessing_forkserver passed in 2 min 21 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2 min 21 sec
Tests result: SUCCESS

After commit f22cc69b012f52882d434a5c44a004bc3aa5c33c : 

./python.exe -X tracemalloc -m test --fail-env-changed test_multiprocessing_forkserver
Run tests sequentially
0:00:00 load avg: 4.08 [1/1] test_multiprocessing_forkserver
/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/multiprocessing/resource_tracker.py:198: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/unittest/case.py:683: ResourceWarning: unclosed file <_io.BufferedReader name=7>
  testMethod()
Object allocated at (most recent call last):
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/subprocess.py", lineno 819
    self.stdout = io.open(c2pread, 'rb', bufsize)
test_multiprocessing_forkserver passed in 2 min 20 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2 min 20 sec
Tests result: SUCCESS

msg342307 - (view) Author: Pierre Glaser (pierreglaser) * Date: 2019-05-13 10:42
Yes, one test I wrote in an unrelated commit does not unlink a memory segment. Now the ResourceTracker complains. Fixing it now.
msg342308 - (view) Author: Pierre Glaser (pierreglaser) * Date: 2019-05-13 10:47
Actually, I was properly unlinking the shared_memory segments. The warning messages are due to bad interactions between the ResourceTracker and the SharedMemoryManager object. In this particular case, it is easy to slightly change the problematic test to avoid the warnings. I will focus on fixing those bad interactions right after.
msg342309 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-13 10:58
test_shared_memory_cleaned_after_process_termination() uses time as a weak synchronization primitive:


        # killing abruptly processes holding reference to a shared memory
        # segment should not leak the given memory segment.
        p.terminate()
        p.wait()
        time.sleep(1.0)  # wait for the OS to collect the segment

        with self.assertRaises(FileNotFoundError):
            smm = shared_memory.SharedMemory(name, create=False)

Would it be possible to use a more reliable synchronization? Such tests usually fail randomly.
https://pythondev.readthedocs.io/unstable_tests.html
msg342319 - (view) Author: Olivier Grisel (Olivier.Grisel) * Date: 2019-05-13 12:50
As Victor said, the `time.sleep(1.0)` might lead to Heisen failures. I am not sure how to write proper strong synchronization in this case, but we could instead go for something intermediate, such as the following pattern:


        ...
        p.terminate()
        p.wait()
        for i in range(60):
            try:
                shared_memory.SharedMemory(name, create=False)
            except FileNotFoundError:
                # the OS successfully collected the segment as expected
                break

            time.sleep(1.0)  # wait for the OS to collect the segment

        else:
            raise AssertionError(
                f"Failed to collect shared_memory segment {name}")


What do you think?
msg342321 - (view) Author: Pierre Glaser (pierreglaser) * Date: 2019-05-13 12:54
We can do that, or maybe we can try to wait on the `resource_tracker`'s pid?
msg342326 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-13 13:09
I like Olivier's pattern. Maybe we can increase the sleep progressively, so that the loop stops quickly if the resource goes away quickly.

deadline = time.monotonic() + 60.0
sleep = 0.010
while ...:
   if ...: break
   if time.monotonic() > deadline: ... assert error ...
   time.sleep(sleep)
   sleep = min(sleep * 2, 5.0)

It's kind of a common pattern. Maybe it should be a helper in the test.support module.
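
A sketch of what such a helper could look like (wait_until is a hypothetical
name, not an existing test.support function):

    import time

    def wait_until(predicate, timeout=60.0, first_sleep=0.010, max_sleep=5.0):
        """Call predicate() until it returns true, sleeping longer each time."""
        deadline = time.monotonic() + timeout
        sleep = first_sleep
        while not predicate():
            if time.monotonic() > deadline:
                raise AssertionError("condition not met within timeout")
            time.sleep(sleep)
            sleep = min(sleep * 2, max_sleep)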


> We can do that, or maybe we can try to wait on the `resource_tracker's` pid?

I prefer to make sure that the resource goes away without inspecting multiprocessing internals.
msg342365 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-05-13 17:20
New changeset 50466c66509de556a8579172f82d1abb1560d8e4 by Antoine Pitrou (Pierre Glaser) in branch 'master':
bpo-36867: DOC update multiprocessing.rst (GH-13289)
https://github.com/python/cpython/commit/50466c66509de556a8579172f82d1abb1560d8e4
msg342374 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2019-05-13 19:15
New changeset b1dfcad6f0d3a52c9ac31fb9763fc7962a84b27c by Antoine Pitrou (Pierre Glaser) in branch 'master':
bpo-36867: Create the resource_tracker before launching SharedMemoryManagers (GH-13276)
https://github.com/python/cpython/commit/b1dfcad6f0d3a52c9ac31fb9763fc7962a84b27c
msg342749 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-17 18:20
New changeset cbe72d842646ded2454784679231e3d1e6252e72 by Victor Stinner (Pierre Glaser) in branch 'master':
bpo-36867: _test_multiprocessing: avoid weak sync primitive (GH-13292)
https://github.com/python/cpython/commit/cbe72d842646ded2454784679231e3d1e6252e72
msg342752 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-05-17 18:22
It seems like all known bugs are fixed, so I am closing the issue again. Thanks!
msg345322 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-12 11:47
The new test is not reliable, see: bpo-37244 "test_multiprocessing_forkserver: test_resource_tracker() failed on  x86 Gentoo Refleaks 3.8".
History
Date                 User            Action  Args
2019-06-12 11:47:54  vstinner        set     messages: + msg345322
2019-05-17 18:22:48  vstinner        set     status: open -> closed
                                             resolution: fixed
                                             messages: + msg342752
                                             stage: patch review -> resolved
2019-05-17 18:20:30  vstinner        set     messages: + msg342749
2019-05-13 19:15:51  pitrou          set     messages: + msg342374
2019-05-13 17:25:47  pierreglaser    set     pull_requests: + pull_request13203
2019-05-13 17:20:53  pitrou          set     messages: + msg342365
2019-05-13 17:09:06  pitrou          set     pull_requests: + pull_request13200
2019-05-13 13:09:26  vstinner        set     messages: + msg342326
2019-05-13 12:54:01  pierreglaser    set     messages: + msg342321
2019-05-13 12:50:35  Olivier.Grisel  set     nosy: + Olivier.Grisel
                                             messages: + msg342319
2019-05-13 11:27:10  pierreglaser    set     pull_requests: + pull_request13182
2019-05-13 10:58:46  vstinner        set     nosy: + vstinner
                                             messages: + msg342309
2019-05-13 10:47:12  pierreglaser    set     messages: + msg342308
2019-05-13 10:42:26  pierreglaser    set     messages: + msg342307
2019-05-12 08:27:47  xtreak          set     nosy: + xtreak
                                             messages: + msg342246
2019-05-11 15:58:49  pierreglaser    set     messages: + msg342201
2019-05-10 20:59:12  pitrou          set     messages: + msg342133
2019-05-09 18:12:28  pierreglaser    set     keywords: + patch
                                             stage: patch review
                                             pull_requests: + pull_request13132
2019-05-09 17:36:01  pierreglaser    create