> Out of curiosity, is the failure deterministic in environments where it fails? If not, what is the source of the indeterminism -- some kind of race condition or something else?

The GC uses counters and thresholds to decide which generation to collect and when. I tried playing with gc.set_threshold(), but I failed to reproduce the issue on Linux.
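For anyone who wants to try the same kind of experiment, here is a minimal sketch of how the counters and thresholds can be inspected and temporarily changed. The helper name `with_threshold` is made up for illustration, and the values that actually trigger the bug (if any) are unknown.

```python
import gc

def with_threshold(threshold0, func):
    """Call func() with a temporary generation-0 threshold (hypothetical helper)."""
    old = gc.get_threshold()
    # Only the first threshold is changed; the other two keep their values.
    gc.set_threshold(threshold0, *old[1:])
    try:
        return func()
    finally:
        # Always restore the original thresholds.
        gc.set_threshold(*old)

# Current thresholds and per-generation counters.
print("thresholds:", gc.get_threshold())
print("counters:", gc.get_count())
```

Running a test under several different values of `threshold0` is one way to shift where collections happen, but it gives no guarantee of hitting the exact line.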

Maybe it's a very precise threshold which triggers the issue. Between Linux and macOS, the site module executes different code paths which produce different GC counters.

Sometimes the GC must run at a very precise line; one line later is too late. See bpo-44422 for a case involving threading.enumerate() (a different kind of bug, but also related to the GC). See bpo-44184 for another example of a GC bug, which only occurred on Windows and only if you typed an exact command line.
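To see when collections actually happen while a test runs, something like the following sketch with gc.callbacks may help. The names `events` and `record_collection` are only illustrative, and this shows which generation was collected, not the exact line that triggered it.

```python
import gc

events = []

def record_collection(phase, info):
    # The GC calls this with phase "start" or "stop" and an info dict.
    if phase == "start":
        events.append(info["generation"])

gc.callbacks.append(record_collection)
try:
    # Stand-in for the code under test; an explicit full collection here.
    gc.collect()
finally:
    gc.callbacks.remove(record_collection)

print("generations collected:", events)
```

Comparing the recorded events between Linux and macOS could show whether the collections fall at different points on the two platforms.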
