Thread.join returns before PyThreadState is destroyed #63008

Closed
TamasK mannequin opened this issue Aug 22, 2013 · 60 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), stdlib (Python modules in the Lib dir), type-crash (A hard crash of the interpreter, possibly with a core dump)

Comments


TamasK mannequin commented Aug 22, 2013

BPO 18808
Nosy @tim-one, @jcea, @csernazs, @ncoghlan, @pitrou, @vstinner, @koobs
Files
  • subinterp_threads.patch
  • threadstate_join.patch
  • threadstate_join_2.patch
  • threadstate_join_3.patch
  • threadstate_join_4.patch
  • threadstate_join_5.patch: added more is_alive() testing
  • threadstate_join_6.patch
  • subinterp_test.patch
  • blind.patch: noble attempt ;-) to repair buildbot failures
    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-09-09.20:08:46.070>
    created_at = <Date 2013-08-22.13:56:37.656>
    labels = ['interpreter-core', 'library', 'type-crash']
    title = 'Thread.join returns before PyThreadState is destroyed'
    updated_at = <Date 2019-06-10.23:19:38.831>
    user = 'https://bugs.python.org/TamasK'

    bugs.python.org fields:

    activity = <Date 2019-06-10.23:19:38.831>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-09-09.20:08:46.070>
    closer = 'tim.peters'
    components = ['Interpreter Core', 'Library (Lib)']
    creation = <Date 2013-08-22.13:56:37.656>
    creator = 'Tamas.K'
    dependencies = []
    files = ['31445', '31455', '31537', '31635', '31640', '31650', '31652', '31653', '31659']
    hgrepos = []
    issue_num = 18808
    keywords = ['patch']
    message_count = 60.0
    messages = ['195891', '195895', '195906', '195909', '195910', '195911', '195999', '196076', '196077', '196085', '196115', '196154', '196156', '196611', '196627', '196663', '196672', '196923', '196956', '196962', '196977', '196983', '197038', '197052', '197125', '197137', '197177', '197178', '197179', '197181', '197182', '197183', '197184', '197185', '197188', '197190', '197191', '197193', '197195', '197197', '197202', '197207', '197213', '197214', '197215', '197217', '197221', '197247', '197248', '197251', '197253', '197291', '197294', '197302', '197304', '197306', '197307', '197310', '197338', '345152']
    nosy_count = 12.0
    nosy_names = ['tim.peters', 'jcea', 'csernazs', 'ncoghlan', 'pitrou', 'vstinner', 'grahamd', 'neologix', 'python-dev', 'bkabrda', 'koobs', 'Tamas.K']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue18808'
    versions = ['Python 3.4']


    TamasK mannequin commented Aug 22, 2013

    When a thread is started in CPython, t_bootstrap [Modules/threadmodule.c] creates a PyThreadState object and then calls Thread.__bootstrap_inner [Lib/threading.py], which calls Thread.run, protected by self.__block, a Condition. Thread.join uses the same __block to block until Thread.run has finished.

    When Thread.run has finished, __bootstrap_inner notifies on __block, so join will return. Here lies a race condition: if a thread switch to Thread.join occurs before __bootstrap_inner returns to t_bootstrap, join will return before the PyThreadState for the thread is destroyed by t_bootstrap.

    It is mostly harmless for general use, as __bootstrap_inner eventually gets scheduled again and the PyThreadState is taken care of.

    However, Py_EndInterpreter [Python/pythonrun.c] may only be called when the main interpreter thread is the last one still running. So when we want to call Py_EndInterpreter, we signal every other thread to stop and join them, and when Thread.join returns, we call Py_EndInterpreter. Py_EndInterpreter checks whether any other PyThreadStates are still around and, if so, does a Py_FatalError.

    As a workaround, we now do a sleep after join.
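    For illustration, here is a minimal sketch of the shutdown sequence described above, including the sleep-after-join workaround (names and timings are hypothetical, not taken from the reporter's code):

        import time

        def stop_workers(threads, stop_event):
            stop_event.set()           # ask every worker thread to finish
            for t in threads:
                t.join()               # may return before the C-level
                                       # PyThreadState is actually destroyed
            # Workaround from the report: give t_bootstrap a moment to delete
            # the thread states before the host application calls
            # Py_EndInterpreter(), which would otherwise hit Py_FatalError().
            time.sleep(0.5)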

    TamasK mannequin added the interpreter-core, stdlib, and type-crash labels Aug 22, 2013

    TamasK mannequin commented Aug 22, 2013

    The bug was found in 2.6.5, but after a quick eyeballing, current HEAD has the same problem.


    pitrou commented Aug 22, 2013

    I *think* we could remove that limitation in Py_EndInterpreter(). After all, Py_Finalize() is able to join() non-daemon threads, so there's no reason Py_EndInterpreter() shouldn't do the same. We must keep the fatal error for daemon threads, though.

    As for ensuring the thread state is destroyed before Thread.join() returns, I don't know how to do it: the thread state isn't a PyObject, we can't access it from Python code.


    pitrou commented Aug 22, 2013

    (Actually, it would almost be possible to combine weakrefs and thread locals to ensure that join() only returns when the thread state is cleared. However, the thread state is cleared before it is untied from the interpreter, and there would still be a small window for a race condition if some destructor called by clearing the thread state released the GIL.)


    pitrou commented Aug 22, 2013

    (Of course, the latter problem can be solved by having a dedicated sentinel in the thread state that gets DECREF'ed only *after* the thread state is removed from the interpreter...)


    TamasK mannequin commented Aug 22, 2013

    Removing the last-thread check sounds good; what we actually need is a Py_EndInterpreter that does not casually abort() on us. We don't really care about PyThreadStates that much.


    pitrou commented Aug 23, 2013

    Here is a patch for 3.3/3.4.


    TamasK mannequin commented Aug 24, 2013

    After a quick glance, I can't see how this patch would fix the problem. It still depends on threading's Thread.join, which is affected by the race condition in __bootstrap_inner.

    We already did a Thread.join before calling Py_EndInterpreter and still got bitten by the race.


    pitrou commented Aug 24, 2013

    Well, that's a good point. It does bring subinterpreters in line with the main interpreter when it comes to automatically joining non-daemon threads, but it doesn't solve the race condition you talked about. I forgot about it a bit too fast :-)


    pitrou commented Aug 24, 2013

    Here is a patch to remove the race condition. The patch is sufficiently delicate that I'd rather reserve this for 3.4, though.


    neologix mannequin commented Aug 25, 2013

    The first patch looks good; as for the second one, it'll take some time :-)


    python-dev mannequin commented Aug 25, 2013

    New changeset becbb65074e1 by Antoine Pitrou in branch 'default':
    Issue bpo-18808: Non-daemon threads are now automatically joined when a sub-interpreter is shutdown (it would previously dump a fatal error).
    http://hg.python.org/cpython/rev/becbb65074e1


    pitrou commented Aug 25, 2013

    Ok, I've committed the first patch.


    tim-one commented Aug 31, 2013

    Nasty problem ;-)

    I don't understand the need for all the indirections in the second patch. Like, why use a weakref? It's not like we have to worry about an immortal tstate keeping a measly little lock object alive forever, right? Seems to me that if the tstate object held the new lock directly, the tstate destructor could release the lock directly (and so also skip the new tstate->on_delete and tstate->on_delete_data indirections too).

    Then again, I'd settle for Py_EndInterpreter simply sleeping for a second and trying again when it finds "another thread" hanging around (effectively moving Tamas's sleep() into Py_EndInterpreter, but only sleeping if needed). Yes, that's theoretically insecure.

    But if we're worried about wildly improbable timing problems, then the patched code can appear not to honor a non-None Thread.join() timeout argument too. That is, the first call to the new pred() can theoretically consume any amount of time, due to its self._tstate_lock.acquire(). The same kinds of timing implausibilities that could cause a short sleep to fail could also cause that acquire() to take "a long time" (maybe even 2 seconds - LOL).

    But, ya, it's better to really fix it.


    pitrou commented Aug 31, 2013

    > I don't understand the need for all the indirections in the second
    > patch. Like, why use a weakref? It's not like we have to worry about
    > an immortal tstate keeping a measly little lock object alive forever,
    > right? Seems to me that if the tstate object held the new lock
    > directly, the tstate destructor could release the lock directly (and
    > so also skip the new tstate->on_delete and tstate->on_delete_data
    > indirections too).

    The problem is if the tstate holds the last reference to the lock. It
    will destroy the lock at the end, but the lock could have weakrefs
    attached to it with arbitrary callbacks. Those callbacks will run
    without a current thread state and may crash. OTOH, this can't happen
    with a private weakref.

    You may suggest we only keep the Py_thread_lock in the tstate, rather
    than the enclosing PyObject. But it has lifetime problems of its own
    (the Py_thread_lock doesn't have a refcount, and it is shared with a
    PyObject who thinks it owns it and will destroy it on dealloc... adding
    the necessary logic to circumvent this would make the final solution
    more complicated than the weakref one).

    > Then again, I'd settle for Py_EndInterpreter simply sleeping for a
    > second and trying again when it finds "another thread" hanging around
    > (effectively moving Tamas's sleep() into Py_EndInterpreter, but only
    > sleeping if needed). Yes, that's theoretically insecure.

    Well, that sounds less predictable. Depending on machine load,
    Py_EndInterpreter might succeed or crash with a fatal error. Users may
    not like this :-)

    > But if we're worried about wildly improbable timing problems, then the
    > patched code can appear not to honor a non-None Thread.join()
    > timeout argument too. That is, the first call to the new pred()
    > can theoretically consume any amount of time, due to its
    > self._tstate_lock.acquire().

    Ah, well, good point. It's weird it didn't get caught by the unit
    tests... I'll take a look.


    pitrou commented Aug 31, 2013

    Ok, here is a new patch observing the timeout parameter. I've changed the implementation strategy a bit; it's hopefully sane.

    There's a remaining (slightly borderline) issue: the thread state lock isn't reinitialized on fork(). Should it be?


    tim-one commented Aug 31, 2013

    I'm getting a headache now - LOL ;-) Thanks for the explanation!

    What I still don't understand: the new lock is an internal implementation detail. How would it gain a weakref with a callback? Users aren't going to mess with this lock, and if you want to stop Python maintainers from giving it a weakref with a callback, simply say they shouldn't do that (in the code comments) - you could even add code verifying it doesn't have any weakrefs outstanding (although that would likely be a waste of time and code: no maintainer is going to _want_ to make a weakref to it, let alone a weakref with a callback).

    My concern is the bulk and obscurity of this code, all to plug such a minor hole. I call it "minor" because it's been reported once in the history of the project, and Tamas wormed around it with a 1-liner (added a sleep).

    Granted, it's much harder to fix "for real" and when most of the interpreter has been destroyed ;-)


    pitrou commented Sep 4, 2013

    > What I still don't understand: the new lock is an internal
    > implementation detail. How would it gain a weakref with a callback?
    > Users aren't going to mess with this lock, and if you want to stop
    > Python maintainers from giving it a weakref with a callback, simply
    > say they shouldn't do that (in the code comments) - you could even
    > add code verifying it doesn't have any weakrefs outstanding
    > (although that would likely be a waste of time and code: no
    > maintainer is going to _want_ to make a weakref to it, let alone a
    > weakref with a callback).

    Well... perhaps I'm being too perfectionist, but I don't want Python
    to be crashable merely by manipulating a Thread object's private
    attributes.
    If I add some check code, it will quickly become more complicated than
    the current weakref version, which works by design :-)

    > My concern is the bulk and obscurity of this code, all to plug such a
    > minor hole. I call it "minor" because it's been reported once in
    > the history of the project, and Tamas wormed around it with a
    > 1-liner (added a sleep).

    Yeah, the overall concern is a bit obscure, but still: right now, if
    you use threads inside a subinterpreter, your code can work or crash
    depending on subtle timing conditions (where "crash" doesn't even mean
    "raise an exception" but "abort the whole process"). So I think it's
    worth trying to fix, even though the complication can look a bit
    disproportionate.


    tim-one commented Sep 4, 2013

    Oh, I'm not opposed, I'm just complaining ;-)

    It would be much nicer to have an approach that worked for all thread users, not just threading.Thread users. For example, a user can easily (well, plausibly) get into the same kinds of troubles here by calling _thread.start_new_thread() directly, then waiting for their threads "to end" before letting the program finish - they have no idea either when their tstates are actually destroyed.

    A high-probability way to "appear to fix" this for everyone could change Py_EndInterpreter's

        if (tstate != interp->tstate_head || tstate->next != NULL)
            Py_FatalError("Py_EndInterpreter: not the last thread");

    to something like

        int count = 0;
        while (tstate != interp->tstate_head || tstate->next != NULL) {
            ++count;
            if (count > SOME_MAGIC_VALUE)
                Py_FatalError("Py_EndInterpreter: not the last thread");
            sleep(SOME_SHORT_TIME);
        }

    In the meantime ;-), you should change this part of the new .join() code:

            if endtime is not None:
                waittime = endtime - _time()
                if not lock.acquire(timeout=waittime):
                    return

    The problem here is that we have no idea how much time may have elapsed before computing the new waittime. So the new waittime may be negative, in which case we've already timed out (but passing a negative waittime to acquire() means "wait as long as it takes to acquire the lock"). So this block should return if waittime < 0.
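    A hedged sketch of the guard being asked for here (based on the snippet quoted above, not on the committed patch):

        if endtime is not None:
            waittime = endtime - _time()
            # Time may already have run out while waiting on _stopped above;
            # a negative timeout must never be passed to acquire().
            if waittime <= 0 or not lock.acquire(timeout=waittime):
                return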


    tim-one commented Sep 4, 2013

    Oops! The docs are wrong - a negative timeout actually raises:

    ValueError: timeout value must be strictly positive

    unless the timeout is exactly -1. All the more reason to ensure that a negative waittime isn't passed.

    I opened a different issue about the doc glitch.


    tim-one commented Sep 5, 2013

    Fudge - there's another unlikely problem here. For example: main program creates a threading.Thread t, runs it, and does t.join(5) (whatever - any timeout value). When t.join() returns, the main program has no idea whether t is done or not. Suppose t isn't done, but finishes _bootstrap_inner a microsecond (whatever) later.

    Now the user follows the doc's advice, and checks t.is_alive() to see whether it still needs to join t. t.is_alive() returns False! _bootstrap_inner() already finished, so the t._stopped event is already set.

    So the main program skips doing another t.join(), and lets the program exit. There's no guarantee then that t's tstate has been cleared, and we can be back to same fatal error that started this.

    Of course this has nothing to do with the patch switching from a Condition to an Event to manage the stopped state (good change!) - it has to do with that, no matter how it's coded, _bootstrap_inner() says "this thread is done" before the tstate is unlinked from the chain.

    For a related reason, note that Python's threading._shutdown() doesn't prevent the fatal shutdown error even after the patch! _pickSomeNonDaemonThread() also ignores threads for which is_alive() returns False. As above, that can return False while the tstate is still in the chain, and then threading._shutdown() never joins such a thread. join() can't fix the problem if it's not called. I suppose that one can be repaired by removing the is_alive() check in _pickSomeNonDaemonThread() (i.e., always join() every non-daemon thread).

    In their favor, the probabilistic "sleep" approaches would almost always "fix" these too ;-)
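    For context, the shutdown helper being discussed looked roughly like this at the time (reconstructed from the description above, so treat it as a sketch):

        def _pickSomeNonDaemonThread():
            for t in enumerate():           # threading.enumerate()
                if not t.daemon and t.is_alive():
                    # is_alive() can already be False while the thread's
                    # tstate is still linked, so such a thread never gets
                    # join()ed by threading._shutdown().
                    return t
            return None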


    pitrou commented Sep 5, 2013

    > It would be much nicer to have an approach that worked for all thread users,
    > not just threading.Thread users. For example, a user can easily (well,
    > plausibly) get into the same kinds of troubles here by calling
    > _thread.start_new_thread() directly, then waiting for their threads "to end"
    > before letting the program finish - they have no idea either when their
    > tstates are actually destroyed.

    The _thread module is a private API these days; I wouldn't worry too much about that. Are there reasons *not* to use the threading module (except CPython bootstrap issues that normal users shouldn't have to bother about)? Also, how would they wait for the thread to end, except by triggering their own Event object?

    > So the new waittime may be negative, in which case we've already timed
    > out (but passing a negative waittime to acquire() means "wait as long as
    > it takes to acquire the lock"). So this block should return if waittime < 0.

    Thanks, good point :-o

    As for the is_alive() method, mmh, ok, it needs to be tweaked too. That will make the patch more interesting.


    tim-one commented Sep 6, 2013

    So you're not concerned about a now-private API (which used to be advertised), but are concerned about a user mucking with a new private lock in an exceedingly unlikely (in the absence of malice) way. That clarifies things ;-)

    I'm not really concerned about either. User perversity knows no limits, though, so I wouldn't be surprised if some people are rolling their own higher-level threads in Python for reasons they think are compelling.

    Details don't really matter to that, right? Like:

    > Also, how would they wait for the thread to end, except by
    > triggering their own Event object?

    Any number of ways. Roll their own Event, roll their own Barrier, roll their own Condition, or even something as simple as keeping an integer count of the number of threads they created, and then (e.g.)

    while nthreads:
        time.sleep(1)

    at the end, with each thread doing a

    global nthreads
    with nthreads_lock:
        nthreads -= 1
    

    in its end-of-life code. Essentially rolling their own clumsy variant of a Semaphore.

    I've seen code like that "in the wild". But, no, I'll join you in not worrying about it ;-)
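    A self-contained version of the hand-rolled pattern Tim sketches above might look like this (purely illustrative, not from any real code base):

        import _thread
        import time

        nthreads = 0
        nthreads_lock = _thread.allocate_lock()

        def worker():
            global nthreads
            time.sleep(0.5)              # pretend to do some work
            with nthreads_lock:          # end-of-life bookkeeping
                nthreads -= 1

        for _ in range(4):
            with nthreads_lock:
                nthreads += 1
            _thread.start_new_thread(worker, ())

        # The clumsy "semaphore": poll until every worker has checked out.
        # Even when this loop exits, the workers' tstates may not be gone yet.
        while nthreads:
            time.sleep(1)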


    pitrou commented Sep 6, 2013

    On Friday 06 September 2013 at 00:19 +0000, Tim Peters wrote:

    > Tim Peters added the comment:
    >
    > So you're not concerned about a now-private API (which used to be
    > advertised), but are concerned about a user mucking with a new private
    > lock in an exceedingly unlikely (in the absence of malice) way. That
    > clarifies things ;-)

    :-)
    The only reason I'm concerned about the user mucking with the private
    lock is that it's a new opportunity that's opened. But, yeah, I could
    remove the weakref and only keep the lock. The code would only be ~10
    lines shorter, though. What do other people think?

    > in its end-of-life code. Essentially rolling their own clumsy variant
    > of a Semaphore.

    I guess they spell it like:

      import clumsy_threading as threading

    > I've seen code like that "in the wild".

    Well, I've sure seen my share of sleep() calls as a synchronization
    measure too (and removed a number of them)... :-)
    That's one of the reasons I added the timeout arguments, actually.


    tim-one commented Sep 7, 2013

    All the timeout args are great! I wish Python had always had them :-)

    Back to the pain at hand, it's not the number of lines of code that rubs me the wrong way, but the sheer obscurity of it all. This handful of lines is - of necessity - sprayed across two C code files, a C header file, and a Python module. That makes it very hard (for anyone but you - LOL) to reconstruct the _intent_ of it all.

    I'm adding another patch, which is your threadstate_join_2.patch but with a new comment block (in pystate.h) spelling out the rationale behind it all. I can live with that ;-)


    tim-one commented Sep 7, 2013

    New patch (threadstate_join_4.patch) refactors so that is_alive() returns True until the tstate is cleared. This simplifies join() a lot (it doesn't have to roll its own timeout implementation anymore), but complicates is_alive().

    Caution: I don't know anything about the dummy threading implementation. So if I broke that, sue me ;-)


    tim-one commented Sep 7, 2013

    New patch threadstate_join_5.patch adds more testing of is_alive().

    An inelegance I don't care about (but someone might): if join() is called with a timeout, and the Python part of the thread ends before the timeout expires (stopped gets set), then a non-blocking attempt to acquire _tstate_lock is made, and join() returns regardless of the outcome. So, with a timeout, it's possible for join() to return before the C part of the thread is done even if the timeout isn't yet exhausted.

    That's unlikely, and I don't see that it makes any difference. Anyone doing a join() with a timeout has to be aware that they have no idea whether the thread is done when join() returns, and do another join() or check is_alive() later.

    I'd rather not complicate the code to wait longer for _tstate_lock in this case.
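    In code, the behaviour described above amounts to something like this (a sketch of the logic, not the patch itself):

        # Inside Thread.join(timeout), after the Python-level "stopped" event
        # has been seen: only *try* to take the tstate lock, and return
        # whether or not the C part of the thread has finished.
        lock = self._tstate_lock
        if lock is not None and lock.acquire(blocking=False):
            lock.release()
            self._tstate_lock = None
        return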


    python-dev mannequin commented Sep 7, 2013

    New changeset d52b68edbca6 by Antoine Pitrou in branch 'default':
    Issue bpo-18808: Thread.join() now waits for the underlying thread state to be destroyed before returning.
    http://hg.python.org/cpython/rev/d52b68edbca6


    pitrou commented Sep 7, 2013

    Ok, this should be finally fixed!

    @pitrou pitrou closed this as completed Sep 7, 2013

    pitrou commented Sep 7, 2013

    @pitrou pitrou reopened this Sep 7, 2013

    tim-one commented Sep 7, 2013

    Figures. That's why I wanted your name on the blamelist ;-)


    ncoghlan commented Sep 7, 2013

    You mean I actually need to pay proper attention now? :)


    ncoghlan commented Sep 8, 2013

    Slavek - this is one you may want to take a look at. Antoine and Tim are trying to fix a race condition with the destruction of thread state objects, but the current fix isn't playing nice on older Linux kernels (including the one in RHEL 6).


    python-dev mannequin commented Sep 8, 2013

    New changeset 5cfd7b2eb994 by Tim Peters in branch 'default':
    bpo-18808: blind attempt to repair some buildbot failures.
    http://hg.python.org/cpython/rev/5cfd7b2eb994


    tim-one commented Sep 8, 2013

    Just pushed 5cfd7b2eb994 in a poke-and-hope blind attempt to repair the annoying ;-) buildbot failures.


    tim-one commented Sep 8, 2013

    Doesn't look like 5cfd7b2eb994 is going to fix it :-( So I'll revert it. Attaching the patch as blind.patch. After that patch, is_alive() only looks at Thread data members, where ._is_stopped "is obviously" True, and ._tstate_lock "is obviously" None, both forced by threading.py's _after_fork() function. No theory as to how that can be screwed up.


    tim-one commented Sep 8, 2013

    Weird! The Ubuntu box passed test_is_alive_after_fork on its /second/ run with the patch:

    http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8564/steps/test/logs/stdio

    The other box passed all tests:

    http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2667

    So I won't revert it after all - but the Ubuntu box's behavior is still a major mystery :-(


    tim-one commented Sep 8, 2013

    Well, the next time the Ubuntu box ran the tests, it was clean all the way. So it's fixed! Despite that it isn't ;-)


    pitrou commented Sep 8, 2013

    Indeed the Ubuntu Shared buildbot started failing again. There's probably a timing-dependent behaviour here (which is why test_is_alive_after_fork() tries several times, after all).


    pitrou commented Sep 8, 2013

    I think I've found the answer: the thread is sometimes already stopped by the time the child is forked, so it doesn't appear in _enumerate() anymore (it left the _active dict). Therefore its locks are not reset in _after_fork().

    Oh, I also get the following sporadic failure, which is triggered by a slight change in semantics of Thread.join(timeout) :-)

    ======================================================================
    FAIL: test_various_ops (test.test_threading.ThreadTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
        self.assertTrue(not t.is_alive())
    AssertionError: False is not true


    python-dev mannequin commented Sep 8, 2013

    New changeset 74dc664ad699 by Antoine Pitrou in branch 'default':
    Issue bpo-18808 again: fix the after-fork logic for not-yet-started or already-stopped threads.
    http://hg.python.org/cpython/rev/74dc664ad699


    koobs commented Sep 8, 2013

    Adding reference to failing tests on koobs-freebsd9 and koobs-freebsd10 buildbots:

    ======================================================================
    FAIL: test_is_alive_after_fork (test.test_threading.ThreadTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/usr/home/buildbot/koobs-freebsd10/3.x.koobs-freebsd10/build/Lib/test/test_threading.py", line 478, in test_is_alive_after_fork
        self.assertEqual(0, status)
    AssertionError: 0 != 256


    tim-one commented Sep 8, 2013

    [Antoine]

    > Oh, I also get the following sporadic failure
    > which is triggered by slight change in semantics
    > with Thread.join(timeout) :-)
    > ======================================================================
    > FAIL: test_various_ops (test.test_threading.ThreadTests)
    > ----------------------------------------------------------------------

    > Traceback (most recent call last):
    >  File "/home/antoine/cpython/default/Lib/test/test_threading.py", line 113, in test_various_ops
    >    self.assertTrue(not t.is_alive())
    > AssertionError: False is not true

    Really! In context, the test does:

                t.join()
                self.assertTrue(not t.is_alive())

    (BTW, that would be clearer as self.assertFalse(t.is_alive()) ;-) )

    It was the intent that this continue to work - the only intended change in Python-visible semantics had to do with join'ing with a timeout.

    Without a timeout, I confess I don't see how this can fail. join() is join(timeout=None), which does:

            self._stopped.wait(timeout)
            if self._stopped.is_set():
                self._wait_for_tstate_lock(timeout is None)

    which is

            self._stopped.wait(None)
            if self._stopped.is_set():
                self._wait_for_tstate_lock(True)

    which should be the same as

            self._stopped.wait()
            self._wait_for_tstate_lock(True)

    after which _stopped should be set and _tstate_lock should be None. The subsequent is_alive() should then return False, via its

            return self._tstate_lock is not None

    What's going wrong?


    pitrou commented Sep 8, 2013

    On Sunday 08 September 2013 at 17:30 +0000, Tim Peters wrote:

    > Really! In context, the test does:
    >
    >         t.join()
    >         self.assertTrue(not t.is_alive())
    

    Ah, no, the failing test did t.join(something). I removed the timeout
    to remove the failure :-)

    > (BTW, that would be clearer as self.assertFalse(t.is_alive()) ;-) )

    Yes, old coding style.


    tim-one commented Sep 8, 2013

    Ah - the test used to do t.join(NUMTASKS)! That's just bizarre ;-)

    I believe I can repair that too (well - there was never a _guarantee_ that waiting 10 seconds would be long enough), but I'll wait until this all settles down.

    join() and is_alive() are too complicated now, because of the 2-step dance to check whether the thread is done: we have both an Event (_stopped) and a lock (tstate_lock) to check now. The Event doesn't serve a purpose anymore: it's almost always uninteresting to know _just that the Python part of the thread has ended. The only exception I can see is the perverse case of joining the main thread done in some of the tests (in that case we have to claim the main thread is done even though its tstate is still active).

    Anyway, after getting rid of the Event it should be dead easy to make join(10) "appear to work the same as before, despite that it never really worked ;-)".


    pitrou commented Sep 8, 2013

    > join() and is_alive() are too complicated now, because of the 2-step
    > dance to check whether the thread is done: we have both an Event
    > (_stopped) and a lock (tstate_lock) to check now. The Event doesn't
    > serve a purpose anymore: it's almost always uninteresting to know
    > _just that the Python part of the thread has ended.

    Yes, that crossed my mind too. The difficulty is that only plain lock
    objects are available from C code, not Events. But if the first join()er
    releases the lock just after taking it, will that be enough to make the
    code correct?

    (after all, it's an event that's never reset, which simplifies things)

    (also, why is the current Event implementation based on Condition? isn't
    an Event actually simpler than a Condition?)


    tim-one commented Sep 8, 2013

    Without _stopped, join() can simply wait to acquire _tstate_lock (with or without a timeout, and skipping this if _tstate_lock is already None). Etc ;-) Of course details matter, but it's easy. I did it once, but the tests joining the main thread failed and I put the code on hold. I'll dust it off when the buildbots are all happy with the current changes.

    > (also, why is the current Event implementation based
    > on Condition?

    We'd have to ask Guido ;-) Best guess is that Condition supplied all the machinery to make Event.wait() work correctly, including waking all waiters up when the Event gets set.

    > isn't an Event actually simpler than a Condition?)

    Events are indeed simple :-) There are many ways to implement them, but "ain't broke, don't fix" seems the right approach to me here. In effect, if we get rid of _stopped, the code remaining will be much like an Event implementation built on the plain _tstate_lock lock.
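    A hedged sketch of the simplified join() Tim describes at the start of this comment (waiting directly on _tstate_lock once the _stopped Event is gone); this is an illustration of the idea, not the committed code:

        def join(self, timeout=None):
            lock = self._tstate_lock
            if lock is None:
                # Already joined, or no tstate lock was ever set up.
                return
            if lock.acquire(True, -1 if timeout is None else timeout):
                # The interpreter released the lock, so the PyThreadState
                # is gone; drop our reference so later joins return at once.
                lock.release()
                self._tstate_lock = None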


    pitrou commented Sep 8, 2013

    > Without _stopped, join() can simply wait to acquire _tstate_lock (with
    > or without a timeout, and skipping this if _tstate_lock is already
    > None). Etc ;-) Of course details matter, but it's easy. I did it
    > once, but the tests joining the main thread failed and I put the code
    > on hold.

    Ah, of course. The main thread needs the event, since the thread state
    will only be deleted at the end of Py_Finalize().
    The MainThread class could override is_alive() and join(), then.


    tim-one commented Sep 8, 2013

    > The MainThread class could override is_alive() and join(), then.

    I think it will be easier than that, but we'll see ;-)


    python-dev mannequin commented Sep 8, 2013

    New changeset aff959a3ba13 by Tim Peters in branch 'default':
    bpo-18984: Remove ._stopped Event from Thread internals.
    http://hg.python.org/cpython/rev/aff959a3ba13

    @tim-one tim-one closed this as completed Sep 9, 2013

    vstinner commented Jun 10, 2019

    _thread._set_sentinel() and threading.Thread._tstate_lock are a great enhancement, as is Py_EndInterpreter(), which now calls threading._shutdown().

    FYI I found yet another race condition, this time in threading._shutdown(). See bpo-36402 follow-up: "threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly".
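    For the record, a small demonstration of the sentinel mechanism mentioned above. It uses the private _thread._set_sentinel() API (CPython 3.4+), so treat it as a sketch only:

        import _thread
        import time

        sentinels = []

        def worker():
            # The returned lock is released by the interpreter when this
            # thread's PyThreadState is destroyed (private CPython API).
            lock = _thread._set_sentinel()
            lock.acquire()
            sentinels.append(lock)

        _thread.start_new_thread(worker, ())
        while not sentinels:
            time.sleep(0.01)

        # Blocks until the worker's thread state has really been deleted,
        # which is essentially what Thread.join() now relies on.
        sentinels[0].acquire()
        print("worker's PyThreadState is gone")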

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022