Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

threading module can deadlock after fork #39802

Closed
mikemccand mannequin opened this issue Jan 11, 2004 · 54 comments
Closed

threading module can deadlock after fork #39802

mikemccand mannequin opened this issue Jan 11, 2004 · 54 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error

Comments

@mikemccand
Copy link
Mannequin

mikemccand mannequin commented Jan 11, 2004

BPO 874900
Nosy @gpshead, @jcea, @amauryfa, @pitrou, @benjaminp
Files
  • fork_threading.py: demo script
  • fork-and-threading5-gps01.patch: fixes fork-and-thread4, pitrou's fork_threading.py works with this patch.
  • threading-874900-improvement-gps01.diff: is this an improvement? is it needed?
  • forkthread.patch
  • forkthread2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-09-06.23:04:57.802>
    created_at = <Date 2004-01-11.14:28:06.000>
    labels = ['interpreter-core', 'type-bug', 'release-blocker']
    title = 'threading module can deadlock after fork'
    updated_at = <Date 2011-07-11.06:04:45.798>
    user = 'https://bugs.python.org/mikemccand'

    bugs.python.org fields:

    activity = <Date 2011-07-11.06:04:45.798>
    actor = 'nirai'
    assignee = 'none'
    closed = True
    closed_date = <Date 2008-09-06.23:04:57.802>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2004-01-11.14:28:06.000>
    creator = 'mikemccand'
    dependencies = []
    files = ['9288', '10889', '10932', '11378', '11379']
    hgrepos = []
    issue_num = 874900
    keywords = ['patch', 'needs review']
    message_count = 54.0
    messages = ['60453', '60454', '60455', '61693', '61695', '69423', '69437', '69466', '69467', '69468', '69470', '69476', '69477', '69478', '69479', '69492', '69512', '69549', '69550', '69569', '69601', '69603', '69607', '69696', '69702', '69784', '69793', '69795', '69825', '69867', '69868', '69878', '69888', '69889', '69893', '69894', '69928', '70174', '71297', '71298', '71301', '71315', '71818', '71824', '72546', '72549', '72550', '72551', '72552', '72591', '72716', '72717', '72721', '72727']
    nosy_count = 10.0
    nosy_names = ['gregory.p.smith', 'jcea', 'mikemccand', 'tzot', 'amaury.forgeotdarc', 'Rhamphoryncus', 'pitrou', 'benjamin.peterson', 'jnoller', 'bobbyi']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue874900'
    versions = ['Python 3.0']

    @mikemccand
    Copy link
    Mannequin Author

    mikemccand mannequin commented Jan 11, 2004

    We have a Python HTTP server that, in the parent
    process, uses os.fork to spawn new children, but at the
    same time the parent could be spawning new threads (in
    threads other than the main thread -- only the main
    thread does forking).

    Anwyay, it very rarely causes deadlock in a spawned
    child when that child tries to start a new thread.

    I believe I've tracked this down to the
    _active_limbo_lock in the threading module.
    Specifically, if this lock is held by parent (because
    another thread is spawning a thread), just as os.fork
    happens, the child inherits the held lock and will then
    block trying to do any threading.* operations.

    Just like the global interp. lock is overwritten in the
    child after fork, I think something similar should
    happen to threading._active_limbo_lock? (And more
    generally the user needs to be aware of locks passing
    through fork; but I think at least threading.py should
    "do the right thing").

    This thread looks quite relevant:

    groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=38E6F2BA.E66CAC90%40ensim.com&rnum=5&prev=/groups%3Fq%3Dpython%2Bfork%2Bthreading%2Bmodule%2B%2Block%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26sa%3DN%26scoring%3Dd

    @mikemccand mikemccand mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jan 11, 2004
    @tzot
    Copy link
    Mannequin

    tzot mannequin commented Mar 20, 2005

    Logged In: YES
    user_id=539787

    See some more typical info about mixing forks and threads:
    http://mail.python.org/pipermail/python-list/2001-September/066048.html

    This seems not to be Python-related, but platform-related.

    @mikemccand
    Copy link
    Mannequin Author

    mikemccand mannequin commented Mar 20, 2005

    Logged In: YES
    user_id=323786

    True -- there are many platform specific issues on the
    interaction of forking and threading.

    However the _active_limbo_lock is entirely a Python issue (I
    think it can deadlock on any platform, unless the platform
    releases locks in child process after fork).

    After forking, python already resets the GIL, and it should
    also reset any other locks that have global impact like
    _active_limbo_lock.

    @tiran tiran added type-bug An unexpected behavior, bug, or error labels Jan 20, 2008
    @pitrou
    Copy link
    Member

    pitrou commented Jan 25, 2008

    It's not only the _active_limbo_lock. All global structures of the
    threading module should be reinitialized (including the _MainThread
    instance); for that purpose, reload() can be used. I attach an example
    which exercises this problem. Normally the script should hang in the
    last step (using os.fork() and launching threads while some other
    threads call threading.enumerate() in a loop).

    The safe_fork() function in the example is a replacement for os.fork()
    which tries to avoid any deadlock in the threading module. If it's
    deemed useful and bug-free, it could perhaps be added somewhere in the
    stdlib (in the threading module itself perhaps).

    @pitrou
    Copy link
    Member

    pitrou commented Jan 25, 2008

    (or perhaps we should provide an API to hook into PyOS_AfterFork)

    @amauryfa
    Copy link
    Member

    amauryfa commented Jul 8, 2008

    Attached patch releases the _active_limbo_lock after a fork(). It is not
    a complete solution, since existing Thread objects don't correspond to
    anything, but it corrects a problem in test_multiprocessing.

    @amauryfa
    Copy link
    Member

    amauryfa commented Jul 8, 2008

    A slightly better patch, with tests.

    @jnoller
    Copy link
    Mannequin

    jnoller mannequin commented Jul 9, 2008

    Amaury - your latest patch doesn't seem to apply cleanly to trunk, it's
    choking on the Python/ceval.c changes

    @amauryfa
    Copy link
    Member

    amauryfa commented Jul 9, 2008

    Here is a third patch with latest trunk.
    Did you apply it to a clean checkout?

    @jnoller
    Copy link
    Mannequin

    jnoller mannequin commented Jul 9, 2008

    I had to flip on ignore whitespace to patch:
    patch -p0 -l <~/Desktop/fork-and-thread3.patch

    Worked. Unfortunately, test_threading is locking up during a make test.
    Here's the verbose regrtest.py output:

    woot:python-trunk jesse$ ./python.exe Lib/test/regrtest.py -v
    test_threading
    ...snip
    <thread 9> done
    <thread 9> is finished. 0 tasks are running
    all tasks done
    ok
    test_join_in_forked_from_thread
    (test.test_threading.ThreadJoinOnShutdown) ...

    At this point it hangs (OS/X 10.5 latest)

    @jnoller
    Copy link
    Mannequin

    jnoller mannequin commented Jul 9, 2008

    Amaury, it looks like it's hanging in subprocess:

    /Users/jesse/open_source/subversion/python-
    trunk/Lib/test/test_threading.py(338)_run_and_join()
    -> p = subprocess.Popen([sys.executable, "-c", script],
    stdout=subprocess.PIPE)
    (Pdb) step
    ...snip...

    (Pdb) step
    --Call--

    /Users/jesse/open_source/subversion/python-
    trunk/Lib/threading.py(788)current_thread()
    -> def current_thread():
    (Pdb) step
    /Users/jesse/open_source/subversion/python-
    trunk/Lib/threading.py(789)current_thread()
    -> try:
    (Pdb) step
    /Users/jesse/open_source/subversion/python-
    trunk/Lib/threading.py(790)current_thread()
    -> return _active[_get_ident()]
    (Pdb) step
    /Users/jesse/open_source/subversion/python-
    trunk/Lib/subprocess.py(1097)_execute_child()
    -> data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB
    (Pdb) step
    ... lockup

    @jnoller
    Copy link
    Mannequin

    jnoller mannequin commented Jul 9, 2008

    Here's the good news, with the patch applied to trunk, I'm not seeing
    hangs in the multiprocessing test suite. I'm running it on a tight loop
    excluding the threading tests to confirm.

    @amauryfa
    Copy link
    Member

    amauryfa commented Jul 9, 2008

    Well I was a bit presumptuous that my thread+fork tests would pass on
    all platforms out of the box.
    We should disable the tests, or have someone with better Unix expertise
    examine and correct them.
    At least I feel that the actual correction in threading.py goes in the
    right direction.

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jul 9, 2008

    In general I suggest replacing the lock with a new lock, rather than
    trying to release the existing one. Releasing *might* work in this
    case, only because it's really a semaphore underneath, but it's still
    easier to think about by just replacing.

    I also suggest deleting _active and recreating it with only the current
    thread.

    I don't understand how test_join_on_shutdown could succeed. The main
    thread shouldn't be marked as done.. well, ever. The test should hang.

    I suspect test_join_in_forked_process should call os.waitpid(childpid)
    so it doesn't exit early, which would cause the original Popen.wait()
    call to exit before the output is produced. The same problem of
    test_join_on_shutdown also applies.

    Ditto for test_join_in_forked_from_thread.

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jul 9, 2008

    Looking over some of the other platforms for thread_*.h, I'm sure
    replacing the lock is the right thing.

    @amauryfa
    Copy link
    Member

    amauryfa commented Jul 9, 2008

    A new patch:

    • I replaced "_active_limbo_lock.release()" by
      "_active_limbo_lock=_allocate_lock()"

    • I replaced the successive deletions in _active by a re-creation with
      only the current thread. There is no difference in the result, but I
      agree that the intent is more clear.

    • yes, the main thread is marked as done when the interpreter exits
      (hence the convoluted tests with subprocesses): in Modules/main.c,
      WaitForThreadShutdown() calls threading._shutdown().

    Also, I hope the tests make more sense now.

    @jnoller
    Copy link
    Mannequin

    jnoller mannequin commented Jul 10, 2008

    FWIW: the threading tests still hang for me with the latest patch. I'm
    confirming that the patch itself fixes the hanging MP tests though

    @jnoller
    Copy link
    Mannequin

    jnoller mannequin commented Jul 11, 2008

    I can confirm that this seems to clear up the test_multiprocessing hangs
    on my mac, I ran make test in a loop almost all day yesterday without
    hangs. I would really suggest we get this in, minus the new test_threading
    tests for now.

    @amauryfa
    Copy link
    Member

    I agree. My attempt to describe the behaviour of fork+threads in a
    platform-independent test is another issue.

    Just one more thing: looking at Module/posixmodule.c, os.fork1() calls
    PyOS_AfterFork() in both parent and child processes. Is there a "if (pid
    == 0)" test missing, just like os.fork()?

    @amauryfa
    Copy link
    Member

    I am leaving for the whole next week, so anyone, feel free to commit
    (part of) my patch if it helps.

    @gpshead
    Copy link
    Member

    gpshead commented Jul 12, 2008

    The existing fork-and-thread4.patch doesn't actually reset
    _active_limbo_lock. Its missing a "global _active_limbo_lock" statement
    in the threading._after_fork() function.

    @gpshead
    Copy link
    Member

    gpshead commented Jul 12, 2008

    and a few more bugs. a new patch is attached. With this applied,
    pitrou's fork_threading.py bug demonstration script finally does the
    right thing.

    test_threading, including the new deadlock tests, and
    test_multiprocessing both pass.

    Tested on x86 MacOS X 10.4 & x86 Ubuntu 8.04.

    Would someone please try this on other platforms?

    (The new threading test's use of subprocess with [sys.executable, '-c',
    """long script"""] makes me slightly nervous about portability outside
    of Linux and BSD.)

    @gpshead
    Copy link
    Member

    gpshead commented Jul 13, 2008

    I still don't like the _after_fork() implementation. Its O(n) where n
    == number of threads the parent process had.

    Very wasteful when the fork() was done in the most common case of being
    followed by an exec and calling os._exit(). It won't scale nicely with
    system load (forks will start taking longer and longer the more threads
    exist).

    Could os.fork() be extended to have an optional will_exec_or_die
    parameter that determines if _after_fork() is even called at all?
    Things like subprocess should pass in True. The default should be False
    for compatiblity.

    @jnoller jnoller mannequin added the release-blocker label Jul 15, 2008
    @benjaminp
    Copy link
    Contributor

    One of tests is hanging in 3.0.

    @benjaminp benjaminp reopened this Aug 17, 2008
    @benjaminp benjaminp reopened this Aug 17, 2008
    @gpshead
    Copy link
    Member

    gpshead commented Aug 17, 2008

    specifically, test_3_join_in_forked_from_thread hangs in py3k and is
    currently disabled.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 18, 2008

    If you remove the print() call from joining_func(), does it stop
    hanging? It may be due to differences between the io library in py3k and
    the builtin file objects in 2.x.
    (looking at the date of the commit disabling the test, it is not related
    to the thread-safety changes to BufferedWriter, though)

    @benjaminp
    Copy link
    Contributor

    This is also causing hangs in 2.5.

    @gpshead
    Copy link
    Member

    gpshead commented Aug 24, 2008

    it passes on release25-maint for me on linux. i don't see it hanging in
    any of the 2.5 buildbots. looks like your r66003 change was a decent
    fix for windows judging by the buildbot.

    (fwiw, that test you patched in 66003 should probably use readlines()
    and assertListEqual to be cross platform rather than the code thats
    there now but i'm not motivated to change that unless some other
    platform fails on it)

    @pitrou
    Copy link
    Member

    pitrou commented Sep 4, 2008

    Ok, with this patch the test passes under py3k.

    Apart from the trivial typo (_Thread__stopped -> _stopped), I had to
    change print("...") to os.write(1, b"...\n") in the tests as otherwise
    subprocess wouldn't receive any output from the third test (buffering
    problem? I don't know really).

    I also added the ident fix I had already suggested in _after_fork(). If
    you put a print statement at this point you'll see the old and the new
    value are not the same.

    @benjaminp
    Copy link
    Contributor

    I feel like that patch sort of avoids the problem by changing the test.
    The test is hanging for some reason, so we should try to fix that, not
    the test. :) I wonder if it has something to do with the various
    deadlocks we are discovering in the io library.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 4, 2008

    Benjamin, if you don't change the test, the deadlock problem is still
    solved, it's just that the third test fails because the subprocess
    stdout is empty instead of containing the desired value. It is *not*
    because the subprocess doesn't print anything (if you launch an
    equivalent program on the command line, everything is printed), rather
    it seems that subprocess doesn't get what is printed from the child
    process of the subprocess.

    @benjaminp
    Copy link
    Contributor

    On Thu, Sep 4, 2008 at 6:08 PM, Antoine Pitrou <report@bugs.python.org> wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Benjamin, if you don't change the test, the deadlock problem is still
    solved, it's just that the third test fails because the subprocess
    stdout is empty instead of containing the desired value. It is *not*
    because the subprocess doesn't print anything (if you launch an
    equivalent program on the command line, everything is printed), rather
    it seems that subprocess doesn't get what is printed from the child
    process of the subprocess.

    Ah! My apologies for the giant misunderstanding.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 4, 2008

    Instead of os.write(), it is actually sufficient to sys.stdout.flush()
    at the end of the subprocess. Patch attached.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2008

    Just checked that the latest patch works on Windows as well.

    @gpshead
    Copy link
    Member

    gpshead commented Sep 6, 2008

    forkthread2.patch looks good to me. to be consistent shouldn't we also
    apply that fix to 2.6?

    @pitrou
    Copy link
    Member

    pitrou commented Sep 6, 2008

    Le samedi 06 septembre 2008 à 22:27 +0000, Gregory P. Smith a écrit :

    Gregory P. Smith <greg@krypto.org> added the comment:

    forkthread2.patch looks good to me. to be consistent shouldn't we also
    apply that fix to 2.6?

    Only the ident-resetting part needs to be backported, the rest is
    py3k-specific (especially _stopped vs. _Thread__stopped :-)).

    @pitrou
    Copy link
    Member

    pitrou commented Sep 6, 2008

    Committed in r66274, r66275. Thanks!

    @pitrou pitrou closed this as completed Sep 6, 2008
    @pitrou pitrou closed this as completed Sep 6, 2008
    @gpshead
    Copy link
    Member

    gpshead commented Sep 7, 2008

    I backported the last bit from r66275 to release25-maint in r66279.

    @nirai nirai mannequin changed the title threading module can deadlock after fork malloc Jul 11, 2011
    @nirai nirai mannequin changed the title threading module can deadlock after fork malloc Jul 11, 2011
    @nirai nirai mannequin changed the title malloc threading module can deadlock after fork Jul 11, 2011
    @nirai nirai mannequin changed the title malloc threading module can deadlock after fork Jul 11, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) release-blocker type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants