Oh, I'm not opposed, I'm just complaining ;-)

It would be much nicer to have an approach that worked for all thread users, not just threading.Thread users.  For example, a user can easily (well, plausibly) get into the same kinds of troubles here by calling _thread.start_new_thread() directly, then waiting for their threads "to end" before letting the program finish - they have no idea either when their tstates are actually destroyed.
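To make that scenario concrete, here's a minimal sketch of the kind of user code I mean.  The names `run_workers` and `worker` are made up for illustration.  Each thread signals "I'm done" by releasing a lock, but that only says the Python-level work finished, nothing about when the thread's tstate goes away:

```python
import _thread


def run_workers(n):
    """Start n raw threads and wait for each one to signal completion."""
    done = [_thread.allocate_lock() for _ in range(n)]
    results = []

    def worker(i, lock):
        results.append(i * i)
        # Signals "ended" to the waiter, but the thread is still running
        # C-level teardown after this point; its tstate still exists.
        lock.release()

    for i, lock in enumerate(done):
        lock.acquire()  # held until the worker releases it
        _thread.start_new_thread(worker, (i, lock))

    for lock in done:
        lock.acquire()  # blocks until the matching worker released it

    return sorted(results)
```

When `run_workers()` returns, every worker has released its lock, yet nothing here guarantees that any of those threads' tstates have been destroyed.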

A high-probability way to "appear to fix" this for everyone would be to change Py_EndInterpreter's

    if (tstate != interp->tstate_head || tstate->next != NULL)
        Py_FatalError("Py_EndInterpreter: not the last thread");

to something like

    int count = 0;
    while (tstate != interp->tstate_head || tstate->next != NULL) {
        ++count;
        if (count > SOME_MAGIC_VALUE)
            Py_FatalError("Py_EndInterpreter: not the last thread");
        /* give the other threads a brief chance to finish, then retry */
    }

In the meantime ;-), you should change this part of the new .join() code:

        if endtime is not None:
            waittime = endtime - _time()
            if not lock.acquire(timeout=waittime):

The problem here is that we have no idea how much time may have elapsed before computing the new `waittime`.  So the new `waittime` _may_ be negative, in which case we've already timed out.  But passing a negative `waittime` to acquire() doesn't say that:  -1 means "wait as long as it takes to acquire the lock", and any other negative value raises ValueError.  So this block should return if waittime < 0.
