Author aeros
Recipients aeros, eric.snow, vstinner
Date 2020-01-15.07:20:48
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1579072848.56.0.254568201766.issue37224@roundup.psfhosted.org>
In-reply-to
Content
I just made a rather interesting discovery. Instead of specifically focusing my efforts on the logic with interp_destroy(), I decided to take a closer look at the failing unit test itself. 

The main test within DestroyTests that's failing is the following:

```
    def test_still_running(self):
        main, = interpreters.list_all()
        interp = interpreters.create()
        with _running(interp):
            with self.assertRaises(RuntimeError):
                interpreters.destroy(interp)
            self.assertTrue(interpreters.is_running(interp))
```

(Specifically, the `with self.assertRaises(RuntimeError):` block around `interpreters.destroy(interp)` is the main point of failure.)

In order to be 100% certain that the issue was actually coming from interpreters.destroy(), I added a "sanity check" to verify that the interpreter was running in the first place before destroying it (along with some extra debugging info):

```
    def test_still_running(self):
        main, = interpreters.list_all()
        interp = interpreters.create()
        with _running(interp):
            self.assertTrue(interpreters.is_running(interp),
                msg=f"Interp {interp} should be running before destruction.")

            with self.assertRaises(RuntimeError,
                    msg=f"Should not be able to destroy interp {interp}"
                        " while it's still running."):
                interpreters.destroy(interp)

            self.assertTrue(interpreters.is_running(interp))
```

The results were very interesting...

```
OK
0:00:49 load avg: 135.49 [306/1] test__xxsubinterpreters failed
test_all (test.test__xxsubinterpreters.DestroyTests) ... ok
test_already_destroyed (test.test__xxsubinterpreters.DestroyTests) ... ok
test_bad_id (test.test__xxsubinterpreters.DestroyTests) ... ok
test_does_not_exist (test.test__xxsubinterpreters.DestroyTests) ... ok
test_from_current (test.test__xxsubinterpreters.DestroyTests) ... ok
test_from_other_thread (test.test__xxsubinterpreters.DestroyTests) ... ok
test_from_sibling (test.test__xxsubinterpreters.DestroyTests) ... ok
test_main (test.test__xxsubinterpreters.DestroyTests) ... ok
test_one (test.test__xxsubinterpreters.DestroyTests) ... ok
test_still_running (test.test__xxsubinterpreters.DestroyTests) ... FAIL

======================================================================
FAIL: test_still_running (test.test__xxsubinterpreters.DestroyTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/aeros/repos/aeros-cpython/Lib/test/test__xxsubinterpreters.py", line 763, in test_still_running
    self.assertTrue(interpreters.is_running(interp),
AssertionError: False is not true : Interp 12 should be running before destruction.

----------------------------------------------------------------------
```

As can be seen from the results above, the interpreter is not even running in the first place before it's destroyed, so of course destroy() won't raise a RuntimeError. I think this proves that interpreters.destroy() is _not_ where we should be focusing our efforts (at least for now). Instead, we should first investigate why the interpreter isn't running at that point. I suspect the issue _might_ be a race condition within the "_running()" context manager that prevents the interpreter from actually being run before control returns to the test, but I'll have to do some further investigation.
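To illustrate the general kind of race I have in mind (this is a minimal, self-contained sketch using plain threads, *not* the actual _running() implementation from test__xxsubinterpreters): if a context manager starts a worker in a background thread and yields immediately, the caller can observe the worker as "not running" simply because the thread hasn't been scheduled yet. Synchronizing on an explicit "ready" event before yielding removes that window. The names here (run_with_ready_signal, etc.) are hypothetical:

```python
import threading

def run_with_ready_signal():
    """Start a worker thread and report whether it was observed running."""
    ready = threading.Event()
    stop = threading.Event()
    state = {"running": False}

    def work():
        state["running"] = True
        ready.set()   # tell the launcher the worker has actually started
        stop.wait()   # stand-in for a long-running interpreter script
        state["running"] = False

    t = threading.Thread(target=work)
    t.start()
    # Without this wait, checking state["running"] right after start()
    # races with the worker thread and can observe False.
    ready.wait(timeout=5)
    observed = state["running"]
    stop.set()
    t.join()
    return observed

print(run_with_ready_signal())  # True
```

If _running() yields before the subinterpreter's script has actually begun executing, is_running() could plausibly return False in exactly the way the failing assertion shows; a similar "wait until actually started" handshake would be the corresponding fix.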

I also ran this ~20 times to be 100% certain, and every single time the point of failure was the new assertion check I added before destroy().

Notably, a hard-to-explain side effect came from adding the new assertion: the average number of iterations before failure increased significantly. In the run above, the test passed 306 iterations before failing, and in one of my earlier runs it reached over 1000.

That never happened before on the 3.8 branch, where it would very consistently fail in the first set of parallel workers, if not very soon after. I can say that with some degree of certainty, since I've run this set of tests countless times while trying to debug the failure. But I have no explanation for this.

Do you have any potential ideas, Eric? Also, do you think it might be worth adopting the changes I made to DestroyTests.test_still_running above? They don't directly address the core failure, but I think they improve the test significantly, both in functionality and in debugging info. I would be glad to open a PR if you think the test changes might be useful, and to make any needed adjustments to them.