
Support different contexts in multiprocessing #63199

Closed
lars mannequin opened this issue Sep 10, 2013 · 26 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

lars mannequin commented Sep 10, 2013

BPO 18999
Nosy @vstinner, @ogrisel
Files
  • mp_getset_start_method.patch
  • context.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-04-27.15:45:22.897>
    created_at = <Date 2013-09-10.14:53:41.227>
    labels = ['type-feature', 'library']
    title = 'Support different contexts in multiprocessing'
    updated_at = <Date 2020-04-27.15:45:22.896>
    user = 'https://bugs.python.org/lars'

    bugs.python.org fields:

    activity = <Date 2020-04-27.15:45:22.896>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-04-27.15:45:22.897>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2013-09-10.14:53:41.227>
    creator = 'lars'
    dependencies = []
    files = ['31722', '32034']
    hgrepos = []
    issue_num = 18999
    keywords = ['patch']
    message_count = 26.0
    messages = ['197444', '197447', '197450', '197453', '197467', '197468', '197486', '197518', '197523', '197524', '197532', '197534', '197562', '197669', '199389', '199390', '199706', '199709', '199710', '200060', '200061', '200497', '200499', '200507', '213100', '367431']
    nosy_count = 6.0
    nosy_names = ['vstinner', 'jnoller', 'python-dev', 'sbt', 'lars', 'Olivier.Grisel']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18999'
    versions = ['Python 3.4']

    lars mannequin commented Sep 10, 2013

    The new multiprocessing based on forkserver (bpo-8713) looks great, but it has two problems. The first:

    "set_start_method() should not be used more than once in the program."

    The documentation does not explain what the effect of calling it twice would be. Judging from the documentation, it should be possible to do

        start_method = get_start_method()
        if start_method is None:
            set_start_method('forkserver')

    but this code exposes the second problem: get_start_method() always succeeds, with the (undocumented!) side effect of fixing the start method to the system default if it hasn't been set already.

    I was about to write a patch for joblib (http://pythonhosted.org/joblib/) to set the start method to forkserver at import time. But as things stand, it would be impossible for the user to safely override the start method before importing joblib, because joblib cannot find out whether it has been set without setting it.

    The enclosed patch solves the problem by making the new functions more robust:

    • get_start_method no longer sets anything, but returns None if the start method has not been set already;
    • set_start_method raises a RuntimeError (for want of a better exception type) when resetting the start method is attempted.

    Unfortunately, I had to hack up the tests a bit, because they were violating the set_start_method contract. There is a test for the new set_start_method behavior, though, and all {fork,forkserver,spawn} tests pass on Linux.
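
    For illustration, a minimal sketch of the proposed reset behavior (this matches what CPython ultimately shipped: a second call raises RuntimeError unless force=True is passed):

        import multiprocessing as mp

        mp.set_start_method('spawn')
        try:
            mp.set_start_method('forkserver')  # resetting is rejected
        except RuntimeError:
            print('start method has already been fixed')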

    @lars lars mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Sep 10, 2013
    @lars lars mannequin changed the title Allow multiple calls to multiprocessing.set_start_method Robustness issues in multiprocessing.{get,set}_start_method Sep 10, 2013
    ogrisel mannequin commented Sep 10, 2013

    Related question: is there any good reason not to pass a custom start_method kwarg to the Pool constructor to make it use an alternative Popen instance (i.e. an instance different from the multiprocessing._Popen singleton)?

    This would let libraries such as joblib keep their side effects minimal by touching the default multiprocessing runtime as little as possible.
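
    A sketch of this proposal (hypothetical API; this kwarg was never adopted, and the thread later converges on context objects instead):

        import multiprocessing

        # Hypothetical: a per-Pool start method, leaving module-level state alone.
        pool = multiprocessing.Pool(processes=4, start_method='forkserver')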

    lars mannequin commented Sep 10, 2013

    Cleaned up the patch.

    sbt mannequin commented Sep 10, 2013

    With your patch, I think if you call get_start_method() without later calling set_start_method() then the helper process(es) will never be started.

    With the current code, popen.Popen() automatically starts the helper processes if they have not already been started.

    Also, set_start_method() can have the side-effect of starting helper process(es). I do not really approve of new processes being started as a side-effect of importing a library. But it is reasonable for a library to want a specific start method unless the user demands otherwise.

    I will have to think this over.

    BTW, the reason for discouraging using set_start_method() more than once is because some shared resources are created differently depending on what the current start method is.

    For instance, with the fork method, semaphores are created and then immediately unlinked. But with the other start methods we must not unlink a semaphore until we are finished with it (while being paranoid about cleanup).

    Maybe it would be better to have separate contexts for each start method. That way joblib could use the forkserver context without interfering with the rest of the user's program.

        from multiprocessing import forkserver_context as mp
        l = mp.Lock()
        p = mp.Process(...)
        with mp.Pool() as pool:
            ...
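
    For reference, the spelling that eventually shipped (see the patch later in this thread) obtains contexts from a factory function rather than a module attribute:

        ctx = multiprocessing.get_context('forkserver')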

    lars mannequin commented Sep 10, 2013

    In my patched version, the private popen.get_start_method gets a kwarg set_if_needed=True. popen.Popen calls that as before, so its behavior should not change, while the public get_start_method sets the kwarg to False.

    I realise now that this has the side effect that get_start_method's output changes once multiprocessing has been used, but that simply reflects how the library works. Maybe this should be documented.

    As for the contexts, those would be great.

    sbt mannequin commented Sep 10, 2013

    In my patched version, the private popen.get_start_method gets a kwarg
    set_if_needed=True. popen.Popen calls that as before, so its behavior
    should not change, while the public get_start_method sets the kwarg to
    False.

    My mistake.

    ogrisel mannequin commented Sep 11, 2013

    Maybe it would be better to have separate contexts for each start method. That way joblib could use the forkserver context without interfering with the rest of the user's program.

    Yes, in general it would be great if libraries could customize the default multiprocessing options without touching any of the module singletons. That also includes the ForkingPickler registry for custom reducers, which is currently a class attribute. It would be great to be able to scope custom reducer registration to a given multiprocessing.Pool or multiprocessing.Process instance.

    As for how to implement that kind of isolation: it could be done either by adding new constructor parameters or public methods to the Process and Pool classes, customizing their behavior while sticking to the OOP paradigm where possible, or by using a context manager as you suggested.

    I am not sure which option is best. Prototyping both is probably the best way to feel the tradeoffs.
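
    To make the ForkingPickler point concrete, here is a minimal sketch using the register() classmethod as it exists in modern multiprocessing.reduction; Payload and reduce_payload are names invented for illustration:

        from multiprocessing.reduction import ForkingPickler

        class Payload:
            def __init__(self, data):
                self.data = data

        def reduce_payload(obj):
            # Rebuild a Payload from its data on the receiving side.
            return (Payload, (obj.data,))

        # register() mutates a class-level registry, so this reducer is
        # seen by every Process/Pool in the interpreter, not just ours.
        ForkingPickler.register(Payload, reduce_payload)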

    lars mannequin commented Sep 12, 2013

    I don't really see the benefit of a context manager over an argument. It's a power user feature anyway, and context managers (at least to me) signal cleanup actions, rather than construction options.

    sbt mannequin commented Sep 12, 2013

    By "context" I did not really mean a context manager. I just meant an object (possibly a singleton or module) which implements the same interface as multiprocessing.

    (However, it may be a good idea to also make it a context manager whose __enter__() method starts the helper processes, and whose __exit__() method shuts them down.)
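
    A purely hypothetical sketch of that shape (no such class was added; the contexts that eventually shipped are plain singletons with no shutdown API):

        import multiprocessing

        class StartMethodScope:
            """Hypothetical context manager wrapping a start-method context."""

            def __init__(self, method):
                self.method = method

            def __enter__(self):
                # Here the helper process(es) for self.method would be started.
                self.ctx = multiprocessing.get_context(self.method)
                return self.ctx

            def __exit__(self, *exc_info):
                # Here the helper process(es) would be shut down.
                return False

        with StartMethodScope('forkserver') as ctx:
            p = ctx.Process(target=print, args=('hello',))
            p.start()
            p.join()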

    ogrisel mannequin commented Sep 12, 2013

    The process pool executor [1] from the concurrent.futures API would be suitable for explicitly starting and stopping the helper process in forkserver mode.

    [1] http://docs.python.org/3.4/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor

    The point would be to keep as little state as possible in the multiprocessing module (and its singletons) and move that state into the multiprocessing Process and Pool instances themselves, so that libraries could customize the behavior (start_method, executable, ForkingPickler reducer registry and so on) without mutating the state of the multiprocessing module singletons.

    sbt mannequin commented Sep 12, 2013

    There are lots of things that behave differently depending on the currently set start method: Lock(), Semaphore(), Queue(), Value(), ... It is not just when creating a Process or Pool that you need to know the start method.

    Passing a context or start_method argument to all of these constructors would be very awkward, which is why I think it is better to treat the context as an object with methods Process(), Pool(), Lock(), Semaphore(), etc.

    Unfortunately, I do not have time to work on this just now...
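
    To make that concrete, this is how it looks with the context objects that later shipped: every factory is a method on the context rather than a module-level function.

        import multiprocessing

        ctx = multiprocessing.get_context('spawn')
        lock = ctx.Lock()        # created according to ctx's start method
        sem = ctx.Semaphore(2)
        val = ctx.Value('i', 0)
        q = ctx.Queue()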

    ogrisel mannequin commented Sep 12, 2013

    Richard Oudkerk: thanks for the clarification, that makes sense. I don't have the time either in the coming month, maybe later.

    lars mannequin commented Sep 13, 2013

    Ok. Do you (or jnoller?) have time to review my proposed patch, at least before 3.4 is released? I didn't see it in the release schedule, so it's probably not planned soon, but I wouldn't want the API to change *again* in 3.5.

    sbt mannequin commented Sep 13, 2013

    I'll review the patch. (According to http://www.python.org/dev/peps/pep-0429/, feature freeze is expected in late November, so there is not too much of a rush.)

    sbt mannequin commented Oct 10, 2013

    Attached is a patch which allows the use of separate contexts. For example

    try:
        ctx = multiprocessing.get_context('forkserver')
    except ValueError:
        ctx = multiprocessing.get_context('spawn')

    q = ctx.Queue()
    p = ctx.Process(target=foo, args=(q,))
    p.start()
    ...

    Also, get_start_method(allow_none=True) will return None if the start method has not yet been fixed.
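
    A short sketch of the allow_none behavior as committed; it lets a library defer to a start method the user may already have chosen:

        import multiprocessing

        # Returns None instead of silently fixing the default start method.
        if multiprocessing.get_start_method(allow_none=True) is None:
            multiprocessing.set_start_method('forkserver')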

    sbt mannequin commented Oct 10, 2013

    BTW, the context objects are singletons.

    I could not see a sensible way to make ctx.Process be a picklable class (rather than a method) if there can be multiple instances of a context type. This means that the helper processes survive until the program closes down.

    lars mannequin commented Oct 13, 2013

    BTW, the context objects are singletons.

    I haven't read all of your patch yet, but does this mean a forkserver will be started regardless of whether it is later used?

    That would be a good thing, since starting the fork server after reading in large data sets would mean the fork server holds on to large swaths of memory even after the data set is deallocated in the master process.

    sbt mannequin commented Oct 13, 2013

    I haven't read all of your patch yet, but does this mean a forkserver
    will be started regardless of whether it is later used?

    No, it is started on demand. But since it is started using _posixsubprocess.fork_exec(), nothing is inherited from the main process.
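
    For completeness, the shipped implementation also lets a program start the forkserver explicitly, before loading any large data, so that the server process inherits as little as possible. A sketch using the real 3.4+ APIs (Unix only; 'numpy' is just an illustrative preload):

        import multiprocessing as mp
        from multiprocessing import forkserver

        mp.set_start_method('forkserver')      # Unix-only start method
        mp.set_forkserver_preload(['numpy'])   # modules the server imports up front
        forkserver.ensure_running()            # start the server eagerly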

    lars mannequin commented Oct 13, 2013

    Ok, great.

    python-dev mannequin commented Oct 16, 2013

    New changeset 72a5ac909c7a by Richard Oudkerk in branch 'default':
    Issue bpo-18999: Make multiprocessing use context objects.
    http://hg.python.org/cpython/rev/72a5ac909c7a

    @sbt sbt mannequin changed the title Robustness issues in multiprocessing.{get,set}_start_method Support different contexts in multiprocessing Oct 16, 2013
    @sbt sbt mannequin added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Oct 16, 2013
    lars mannequin commented Oct 16, 2013

    Thanks, much better than my solution!

    @sbt sbt mannequin closed this as completed Oct 16, 2013
    lars mannequin commented Oct 19, 2013

    Strange, I can't actually get it to work:

    >>> from multiprocessing import Pool, get_context
    >>> forkserver = get_context('forkserver')
    >>> Pool(context=forkserver)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: Pool() got an unexpected keyword argument 'context'

    lars mannequin commented Oct 19, 2013

    I also tried

    from multiprocessing.pool import Pool

    but that died with

    ImportError: cannot import name get_context

    sbt mannequin commented Oct 19, 2013

    I guess this should be clarified in the docs, but multiprocessing.pool.Pool is a *class* whose constructor takes a context argument, whereas multiprocessing.Pool() is a *bound method* of the default context. (In previous versions multiprocessing.Pool was a *function*.)

    The only reason you might need the context argument is if you have subclassed multiprocessing.pool.Pool.

    >>> from multiprocessing import pool, get_context
    >>> forkserver = get_context('forkserver')
    >>> p = forkserver.Pool()
    >>> q = pool.Pool(context=forkserver)
    >>> p, q
    (<multiprocessing.pool.Pool object at 0xb71f3eec>, <multiprocessing.pool.Pool object at 0xb6edb06c>)

    I suppose we could just make the bound methods accept a context argument which (if not None) is used instead of self.
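
    A sketch of the subclassing case mentioned above (MyPool is an invented name); a subclass has to go through the class constructor, so the context is passed explicitly:

        from multiprocessing import get_context, pool

        class MyPool(pool.Pool):
            """Hypothetical subclass; override task-handling hooks as needed."""

        p = MyPool(processes=2, context=get_context('forkserver'))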

    @sbt sbt mannequin reopened this Oct 19, 2013
    python-dev mannequin commented Mar 10, 2014

    New changeset b941a320601a by R David Murray in branch 'default':
    whatsnew: multiprocessing start methods and context (bpo-8713 and bpo-18999)
    http://hg.python.org/cpython/rev/b941a320601a

    vstinner commented Apr 27, 2020

    It seems like this issue has been fixed, so I set its status to closed.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022