This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Deadlock in thread.join on Python 2.7/macOS
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: aeros, mark.dickinson, navytux, ned.deily, pitrou, serhiy.storchaka, vstinner, zach.ware
Priority: normal Keywords:

Created on 2016-02-14 11:38 by mark.dickinson, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
hang_backtrace.txt mark.dickinson, 2016-02-14 11:38
threading_hang.py mark.dickinson, 2016-02-14 11:39
debug-build-stacktrace.txt mark.dickinson, 2016-02-14 14:32
Messages (25)
msg260273 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 11:38
On OS X (10.9.5), I'm getting an apparent deadlock in the following simple Python script:

#--------------------------------------------------------
import itertools
import threading

def is_prime(n):
    return n >= 2 and all(n % d for d in xrange(2, n))

def count_primes_in_range(start, stop):
    return sum(is_prime(n) for n in xrange(start, stop))

def main():
    threads = [
        threading.Thread(
            target=count_primes_in_range,
            args=(12500*i, 12500*(i+1))
        )
        for i in xrange(8)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == '__main__':
    for i in itertools.count():
        print "Iteration: ", i
        main()
#--------------------------------------------------------

Each iteration takes around 60 seconds, and I typically get a deadlock within the first 5 iterations. It looks as though the deadlock happens during the "thread.join", at a stage where some of the threads have already completed and been joined. The code hangs with no CPU activity, and a backtrace (attached) shows that all the background threads are waiting for the GIL, while the main thread is in a blocking `thread.join` call.

I've attached a gdb-generated stack trace.

I was unable to reproduce this with a debug build of Python. I *have* reproduced with a normal build of Python, and on various Python 2.7 executables from 3rd party sources (Apple, Macports, Enthought Canopy).

I've also not yet managed to reproduce on Python 3, but I haven't tried that hard. I suspect it's a Python 2-only problem, though.

(And yes, this is horrible code that doesn't make much sense. It's actually a cut-down example from a "how not to do it" part of a course on concurrency. Nevertheless, it shouldn't be deadlocking.)
msg260274 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 11:39
Here's the hanging code as an attached file.
msg260275 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 11:44
Build details: this was a version of Python built with a simple `./configure && make` from commit 32893d8a52a9.

taniyama:cpython mdickinson$ ./python.exe
Python 2.7.10+ (2.7:32893d8a52a9, Feb 14 2016, 11:08:14) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
msg260277 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 12:16
One thing that looks a bit odd to my eyes is the different values for `lock` in the PyThread_acquire_lock calls (see below). All of threads 2-5 should be waiting on the GIL, so I'd expect to see the same value for `lock` in all those calls. Instead, threads 2 and 5 seem to be waiting on the same lock, but threads 3 and 4 seem to have their own lock. But I don't know this code well enough to know whether it's supposed to be this way.

Thread 5 (Thread 0x1903 of process 14075):
#2  0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed500, waitflag=<optimized out>) at Python/thread_pthread.h:429

Thread 4 (Thread 0x1803 of process 14075):
#2  0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed300, waitflag=<optimized out>) at Python/thread_pthread.h:429

Thread 3 (Thread 0x1703 of process 14075):
#2  0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed600, waitflag=<optimized out>) at Python/thread_pthread.h:429

Thread 2 (Thread 0x1603 of process 14075):
#2  0x00000001000e7e21 in PyThread_acquire_lock (lock=0xed3ed300, waitflag=<optimized out>) at Python/thread_pthread.h:429

Thread 1 (Thread 0x1503 of process 14075):
#2  0x00000001000e7e21 in PyThread_acquire_lock (lock=0x0, waitflag=<optimized out>) at Python/thread_pthread.h:429
msg260278 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 12:17
> threads 2 and 5

Sorry, that should have said "threads 2 and 4".
msg260279 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-14 12:22
lock=0x0 looks suspicious.
msg260281 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 12:43
> lock=0x0 looks suspicious.

Agreed. I think that *could* possibly be a consequence of the earlier multiple GILs, though I haven't figured out exactly how yet.
msg260283 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-14 14:32
Update: I *have* been able to reproduce in a debug build. Here's the stack trace.
msg260372 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 01:26
debug-build-stacktrace.txt:
* Thread 1 is waiting on threading_hang.py:21: thread.join()
* Thread 2-6 (5 threads) are waiting for the GIL

The thread 1 is not supposed to hold the GIL: thread.join() is implemented by lock_PyThread_acquire_lock() at Modules/threadmodule.c:52:

    Py_BEGIN_ALLOW_THREADS
    i = PyThread_acquire_lock(self->lock_lock, i);
    Py_END_ALLOW_THREADS

Note: I backported enhancements of python-gdb.py from Python 3, but I'm not sure that "Waiting for the GIL" line is reliable :-/ I recall vaguely a technical issue specific to Python 2 to check if a thread is waiting on the GIL or not.
msg260374 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 01:33
> I've also not yet managed to reproduce on Python 3, but I haven't tried that hard. I suspect it's a Python 2-only problem, though.

Python 3.2 got a new GIL and many bugs related to the GIL were fixed in Python 3 too.

The "workaround" is to use Python 3, until this Python 2 issue is fixed (if we are able to fix it!).
msg260375 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 01:46
>> lock=0x0 looks suspicious.

> Agreed. I think that *could* possibly be a consequence of the earlier multiple GILs, though I haven't figured out exactly how yet.

Don't trust gdb output when Python is compiled in release mode (with -O3).

gdb output is only reliable when Python is compiled with -O0 (./configure --with-pydebug is enough for Python 2.7).
msg260378 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-17 07:14
> I'm not sure that "Waiting for the GIL" line is reliable

Right; I was interpreting that line more as "I don't have the GIL at the moment", which I think applies to all threads, including the main thread. It's pretty clear that threads 2-5 really are waiting for the GIL, and (as you point out), for thread 1 to be where it is, it must have also released the GIL by that point.

Agreed about the gdb results for the -O3 build.

So it looks as though something's gone wrong with the "I've just released the GIL; now someone else can pick it up" signaling mechanism.
msg260386 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 10:40
"Each iteration takes around 60 seconds, and I typically get a deadlock within the first 5 iterations."

I ran a modified version of threading_hang.py (I added some debug print) a whole night: it's still running after 118 iterations.

Each iteration takes ~5 minutes, I'm using a debug build of Python 2.7 on Linux (Fedora 23).
msg260406 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-17 17:29
Thanks for testing! It wouldn't entirely surprise me if this turned out to be an Apple bug somewhere.
msg260407 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-17 17:39
> Thanks for testing! It wouldn't entirely surprise me if this turned out to be an Apple bug somewhere.

I'm ran threading_hang.py on Mac OS X 10.10 Yosemite (darwin 14.5) on Python 2.7 compiled in release mode.

After 28 iterations (~4 hours, 9 minutes per iteration), the bug was still not reproduced.

It looks like my OS X version is more recent than yours. You should run your test on various OS X versions to try to identify the bug.

Since it's likely a race condition, I would not be surprised that the bug is very hard to reproduce and may exist on other platforms, but may be less likely on other platforms or even less likely on some computers for various reasons...
msg260422 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-02-18 02:07
I'm not able to test it myself at the moment but you could also try ensuring your 2.7 build is with the most up-to-date ABI for your system:

./configure MACOSX_DEPLOYMENT_TARGET=10.9

2.7 builds do not set this automatically; 3.x builds do.
msg260446 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-18 09:39
Thanks, Ned. I tried with:

   ./configure --with-pydebug MACOSX_DEPLOYMENT_TARGET=10.9 && make

I'm still able to reproduce. I plan to find some time this weekend to dig into the source and see if I can narrow down the cause (or at least make the hang easier / quicker to reproduce).
msg260471 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-18 17:31
A minor update: I did manage to verify that the GIL is indeed not held by any threads at the moment of the hang, so it does look very much as though a `pthread_cond_signal` signal has gone astray somehow.
msg260742 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2016-02-23 19:55
I've found OS X 10.10 and OS X 10.11 machines to test on. I'm able to reproduce on 10.10, but not on 10.11. This adds to the plausibility of the "Apple bug" explanation.
msg352002 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-11 17:45
Maybe issue38106 related.
msg352423 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-09-14 09:37
> Maybe issue38106 related.

That looks plausible, but unfortunately I'm still able to reproduce the hang with your PR (commit 9b135c02aa1edab4c99c915c43cd62d988f1f9c1, macOS 10.14.6).
msg352439 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-09-14 16:01
FWIW, I've confirmed again that the exact same script on the same machine seems fine under Python 3.x. Given the imminent demise of Python 2, perhaps this issue is just destined to be an unsolved historical oddity.
msg352459 - (view) Author: Kyle Stanley (aeros) * (Python committer) Date: 2019-09-15 04:06
> FWIW, I've confirmed again that the exact same script on the same machine seems fine under Python 3.x. Given the imminent demise of Python 2, perhaps this issue is just destined to be an unsolved historical oddity.

Since it doesn't seem to be occurring across every macOS platform, is specific to 2.7, and isn't proving to be an easy fix, it's probably not worthwhile to invest a significant amount of further time into the issue. Thanks for working on it though, at least we tried. (:
msg352469 - (view) Author: Kirill Smelkov (navytux) * Date: 2019-09-15 09:06
> > Maybe issue38106 related.
>
> That looks plausible, but unfortunately I'm still able to reproduce the hang with your PR (commit 9b135c02aa1edab4c99c915c43cd62d988f1f9c1, macOS 10.14.6).

Thanks for feedback. Then hereby bug is probably deadlock of another kind.
msg367349 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2020-04-27 01:56
As 2.7 is now EOL, I'm closing the issue.
History
Date User Action Args
2022-04-11 14:58:27adminsetgithub: 70548
2020-04-27 01:56:26zach.waresetstatus: open -> closed

nosy: + zach.ware
messages: + msg367349

resolution: out of date
stage: resolved
2019-09-15 09:06:46navytuxsetmessages: + msg352469
2019-09-15 04:06:22aerossetnosy: + aeros
messages: + msg352459
2019-09-14 16:01:01mark.dickinsonsetmessages: + msg352439
2019-09-14 09:46:29mark.dickinsonsettitle: Deadlock in thread.join on Python 2.7/Mac OS X 10.9, 10.10 -> Deadlock in thread.join on Python 2.7/macOS
2019-09-14 09:37:11mark.dickinsonsetmessages: + msg352423
2019-09-11 17:45:33navytuxsetnosy: + navytux
messages: + msg352002
2016-02-23 19:55:48mark.dickinsonsettitle: Deadlock in thread.join on Python 2.7/Mac OS X 10.9 -> Deadlock in thread.join on Python 2.7/Mac OS X 10.9, 10.10
2016-02-23 19:55:39mark.dickinsonsetmessages: + msg260742
2016-02-18 17:31:24mark.dickinsonsetmessages: + msg260471
2016-02-18 09:39:26mark.dickinsonsetmessages: + msg260446
2016-02-18 02:07:19ned.deilysetnosy: + ned.deily
messages: + msg260422
2016-02-17 17:39:37vstinnersetmessages: + msg260407
2016-02-17 17:29:03mark.dickinsonsetmessages: + msg260406
2016-02-17 10:42:59vstinnersettitle: Deadlock in thread.join -> Deadlock in thread.join on Python 2.7/Mac OS X 10.9
2016-02-17 10:40:50vstinnersetmessages: + msg260386
2016-02-17 07:14:03mark.dickinsonsetmessages: + msg260378
2016-02-17 01:46:41vstinnersetmessages: + msg260375
2016-02-17 01:33:15vstinnersetmessages: + msg260374
2016-02-17 01:26:35vstinnersetmessages: + msg260372
2016-02-17 01:19:41vstinnersetnosy: + vstinner
2016-02-14 14:32:14mark.dickinsonsetfiles: + debug-build-stacktrace.txt

messages: + msg260283
2016-02-14 12:43:15mark.dickinsonsetmessages: + msg260281
2016-02-14 12:22:01serhiy.storchakasetmessages: + msg260279
2016-02-14 12:17:08mark.dickinsonsetmessages: + msg260278
2016-02-14 12:16:13mark.dickinsonsetmessages: + msg260277
2016-02-14 11:57:10serhiy.storchakasetnosy: + pitrou, serhiy.storchaka
2016-02-14 11:44:47mark.dickinsonsetmessages: + msg260275
2016-02-14 11:39:23mark.dickinsonsetfiles: + threading_hang.py

messages: + msg260274
2016-02-14 11:38:44mark.dickinsoncreate