classification
Title: importlib: _find_and_load() race condition on sys.modules[name] check
Type: behavior Stage: needs patch
Components: Tests, Windows Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: brett.cannon, eric.snow, haypo, ncoghlan, paul.moore, serhiy.storchaka, steve.dower, tim.golden, zach.ware
Priority: normal Keywords: patch

Created on 2017-07-10 09:44 by haypo, last changed 2017-07-21 11:28 by haypo.

Files
File name Uploaded Description Edit
importlib_sleep.patch haypo, 2017-07-10 13:59
Pull Requests
URL Status Linked Edit
PR 2645 serhiy.storchaka, 2017-07-10 15:13
PR 2646 merged haypo, 2017-07-10 16:12
PR 2651 merged haypo, 2017-07-10 20:56
PR 2665 merged haypo, 2017-07-11 08:03
PR 2801 merged haypo, 2017-07-21 11:05
Messages (21)
msg298044 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 09:44
1:36:32 [224/405/1] test_import failed -- running: test_mmap (2236 sec), test_io (2529 sec), test_venv (1437 sec)
beginning 6 repetitions
123456
.....test test_import failed -- Traceback (most recent call last):
  File "D:\buildarea\3.6.ware-win81-release.refleak\build\lib\test\test_import\__init__.py", line 377, in test_concurrency
    raise exc
  File "D:\buildarea\3.6.ware-win81-release.refleak\build\lib\test\test_import\__init__.py", line 362, in run
    import package
  File "D:\buildarea\3.6.ware-win81-release.refleak\build\lib\test\test_import\data\package\__init__.py", line 2, in <module>
    package.submodule
AttributeError: module 'package' has no attribute 'submodule'
msg298046 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 09:56
I was able to *easily* (100% failure rate) reproduce the failure with Python 3.6 and 3.7 (master) on Windows using this command:

./python -m test -R 3:100 -m test_concurrency test_import
msg298047 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 09:59
I don't understand if a test_concurrency() failure is a bug in importlib, or a bug in the test :-)
msg298048 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 10:02
Note: "./python -m test -R 3:100 -m test_concurrency test_import" is just a convenient way to run test_concurrency() "in a loop", in the same process, 100 times.

I'm unable to reproduce the bug on Linux, even when I write 3 into /proc/sys/vm/drop_caches while the test is running to drop filesystem caches.

I also tried to use a temporary directory in my home (TMPDIR=$HOME/tmp) to use temporary files in ext4 rather than tmpfs. But I don't know if importlib uses temporary files or not :-)
msg298049 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 10:04
Oh, the unit test was added 4 days ago:

commit b4baacee1adc06edbe30ac7574d17a8cd168e2e0
Author: Serhiy Storchaka <storchaka@gmail.com>
Date:   Thu Jul 6 08:09:03 2017 +0300

    bpo-30814: Fixed a race condition when import a submodule from a package. (#2580)
msg298053 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-07-10 10:40
Indeed, Serhiy's been stress-testing some bits of the import system, and finding a few holes made possible by the shift away from a single global import lock to per-module locks. (Issue 30876 is another example of that)

Nothing immediately comes to mind as a platform-dependent code path that could explain why that error is showing up on Windows and not on Linux, though.
msg298058 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-10 12:02
Some parts of import machinery are platform depended (see issue30873).

I don't have idea what is wrong on Windows.

I'm puzzled. test_concurrency itself runs a test in a loop. The only difference between one and several consequent runs of test_concurrency is running setUp() and tearDown() between tests. Victor, try to add importlib.invalidate_caches() inside a loop. Does it cause a failure in a single test_concurrency?
msg298063 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 13:09
> Victor, try to add importlib.invalidate_caches() inside a loop. Does it cause a failure in a single test_concurrency?

I added a "importlib.invalidate_caches()" call in the "for i in range(10):" loop. With this change, a single run doesn't fail. It doesn't fix the test neither: with -R 3:100, the command still fails.

> Some parts of import machinery are platform depended (see issue30873).

I don't think that this issue30873 or issue30626 is platform dependent.


> I don't have idea what is wrong on Windows.

IMHO it's just that my Windows VM is slow. It's a race condition. If I run "./python -m test -m test_concurrency test_import test_import": the command pass, but when again, it fails. It's pure random :-)
msg298064 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 13:59
I added print(id(package)) in Lib/test/test_import/data/package/__init__.py: when the bug occurs on my Windows VM, I see two print with two different identifiers. Moreover, when the bug occurs, first id(sys.modules['package']) is the first identifier, and then package is replaced with the new 'package' package in sys.modules (new identifier).

I don't understand something: sys.modules isn't supposed to be protected by a lock? I checked if _installed_safely.__enter__() if the _imp is held: it isn't held when _bootstrap._setup() is called, nor when _imp._install_external_importers() is called. But even if I patch these two functions to acquire the lock, Python startup fails on importing a codec module, because the _imp lock is not held in _installed_safely.__enter__(). So maybe sys.modules dict is supposed to be protected by a different lock, maybe indirectly?

I hack _bootstrap._find_and_load_unlocked() to sleep 10 ms if name == 'package'. With this change, test_concurrency() fails immediatly on my Linux box.

* Apply attached importlib_sleep.patch (written for the current master branch)
* Recompile: make regen-importlib && make
* Run test_concurrency() just once: ./python -m test -m test_concurrency -v test_import
* You get the bug

By the way, importlib_sleep.patch changes test_concurrency() to run the test exactly once (it changes to loop to a single iteration, instead of 10).

Is it normal that a sleep makes import failing? Imports are not supposed to be atomic?

--

I tried to similar hack before the commit b4baacee1adc06edbe30ac7574d17a8cd168e2e0:

* add a sleep in importlib
* copy test_concurrency() from master
* Run test_concurrency()
* The test pass

While I'm not sure of my hack, it *seems* like sys.modules was somehow atomic before this commit?
msg298066 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 14:25
FYI test_concurrency() added by commit b4baacee1adc06edbe30ac7574d17a8cd168e2e0 doesn't really test bpo-30814. I backported the unit test on 1ccbad9c95cf5782a1329eeaecba6e3eb0c37cb8 (commit before b4baacee1adc06edbe30ac7574d17a8cd168e2e0), and the test pass, whereas I expected it to fail before the fix.
msg298067 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-07-10 14:44
No, there's no global lock on the import system these days.

Instead, imports are supposed to be atomic on a per-module basis, but issue 30814 showed that some of the accesses to the shared state management (which *is* still supposed to be protected by the import lock) could get into trouble given concurrent imports from different threads.

However, the new test exercises an interesting sequence of per-module lock acquisition, since each thread effectively runs through this sequence:

- import package # Test case
- import package.submodule # Package init
- import package # Parent package import

That moves back-and-forth between C & Python in a few different places, and hence may have been relying on the GIL to keep the parts executed in C atomic.

So now I'm wondering whether or not this new problem might be due to moving the _imp.acquire_lock() calls into Python code as part of issue 30814, as that creates a window near the start of _find_and_load() where the thread may end up holding neither the GIL *nor* the import lock. By contrast, when the import lock was acquired in C, then by the time the Python code starts running (potentially releasing the GIL), the import lock has already been acquired.
msg298068 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 14:47
> FYI test_concurrency() added by commit b4baacee1adc06edbe30ac7574d17a8cd168e2e0 doesn't really test bpo-30814. I backported the unit test on 1ccbad9c95cf5782a1329eeaecba6e3eb0c37cb8 (commit before b4baacee1adc06edbe30ac7574d17a8cd168e2e0), and the test pass, whereas I expected it to fail before the fix.

Oops, please ignore this comment, it was a mistake. After clearing my repository (removing __pycache__ directories), the test failed. I forgot to copy the Lib/test/test_import/data/package/ directory.
msg298069 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-10 14:52
test_concurrency is failed before b4baacee1adc06edbe30ac7574d17a8cd168e2e0 on my computer. I think it tests the issue30814 bug even if it doesn't fail on all computers. At the end, this is a race condition, and the failure is probabilistic.
msg298070 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 15:09
Using attached importlib_sleep.patch, you can *see* the race condition with:

diff --git a/Lib/test/test_import/data/package/__init__.py b/Lib/test/test_import/data/package/__init__.py
index a4f2bc3..f07eb11 100644
--- a/Lib/test/test_import/data/package/__init__.py
+++ b/Lib/test/test_import/data/package/__init__.py
@@ -1,2 +1,3 @@
 import package.submodule
+print("package: id=%#x" % id(package))
 package.submodule


Example of output:
---
haypo@selma$ ./python -m test -m test_concurrency -v test_import
package: id=0x7fb497400d58
package: id=0x7fb49744d458
test test_import failed -- (...)
---

Lib/test/test_import/data/package/__init__.py is run twice instead of being run once. A lock is missing somewhere on the package module.
msg298071 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-07-10 15:17
Victor tested test_concurrency on 3.5, and it is failed on slow Windows VM. Therefore the bug exists on 3.5 and is not related to recent changes.
msg298072 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-07-10 15:28
I'm about to head to bed so I'll get back to this discussion tomorrow, but in the meantime: Victor, maybe it would make sense to reformulate your reproducer as a monkeypatch for _find_and_load_unlocked() that injects the 10 ms sleep from the test case if it finds _frozen_importlib in sys.modules?

That is, temporarily replacing it with something like:

    orig_find_and_load_unlocked = _frozen_importlib._find_and_load_unlocked
    def _slow_find_and_load_unlocked(name, import_):
        time.sleep(0.010)
        return orig_find_and_load_unlocked(name, import_)

That way, we'd hopefully be able to reproduce the failure independently of the slower buildbot.
msg298081 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 20:52
New changeset 4f9a446f3fb42f800e73cd9414dd1eccb3ca4fa7 by Victor Stinner in branch 'master':
bpo-30891: Fix importlib _find_and_load() race condition (#2646)
https://github.com/python/cpython/commit/4f9a446f3fb42f800e73cd9414dd1eccb3ca4fa7
msg298087 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 21:16
New changeset fe6e686c27ce3a3ecdf03803cff7e230dee5530d by Victor Stinner in branch '3.6':
bpo-30891: Fix importlib _find_and_load() race condition (#2646) (#2651)
https://github.com/python/cpython/commit/fe6e686c27ce3a3ecdf03803cff7e230dee5530d
msg298088 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-10 21:23
Backporting the fix ot Python 3.5 doesn't seem trivial, 3.5 lacks many changes.
msg298798 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-21 11:00
New changeset e72b1359f81d1dd42bd8a5c5cc2b3928b74f8023 by Victor Stinner in branch 'master':
bpo-30891: Fix again importlib _find_and_load() (#2665)
https://github.com/python/cpython/commit/e72b1359f81d1dd42bd8a5c5cc2b3928b74f8023
msg298801 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-07-21 11:28
New changeset bb323b261d0f41ad6998d3a16c41b29b3bb8e982 by Victor Stinner in branch '3.6':
bpo-30891: Fix again importlib _find_and_load() (#2665) (#2801)
https://github.com/python/cpython/commit/bb323b261d0f41ad6998d3a16c41b29b3bb8e982
History
Date User Action Args
2017-07-21 11:28:58hayposetmessages: + msg298801
2017-07-21 11:05:08hayposetpull_requests: + pull_request2852
2017-07-21 11:00:48hayposetmessages: + msg298798
2017-07-11 19:53:47hayposetpull_requests: - pull_request2734
2017-07-11 16:27:05cooperleessetpull_requests: + pull_request2734
2017-07-11 08:03:34hayposetpull_requests: + pull_request2730
2017-07-10 21:23:57hayposetmessages: + msg298088
2017-07-10 21:16:29hayposetmessages: + msg298087
2017-07-10 20:56:36hayposetpull_requests: + pull_request2717
2017-07-10 20:52:34hayposetmessages: + msg298081
2017-07-10 19:11:26brett.cannonlinkissue30893 dependencies
2017-07-10 16:24:05hayposettitle: Sometimes, test_concurrency() of test_import fails with AttributeError: module 'package' has no attribute 'submodule' on AMD64 Windows8.1 Refleaks 3.6 -> importlib: _find_and_load() race condition on sys.modules[name] check
versions: + Python 3.5, Python 3.6
2017-07-10 16:12:05hayposetpull_requests: + pull_request2712
2017-07-10 15:28:52ncoghlansetmessages: + msg298072
2017-07-10 15:17:44serhiy.storchakasetmessages: + msg298071
2017-07-10 15:13:34serhiy.storchakasetpull_requests: + pull_request2711
2017-07-10 15:09:27hayposetmessages: + msg298070
2017-07-10 14:52:56serhiy.storchakasetmessages: + msg298069
2017-07-10 14:47:33hayposetmessages: + msg298068
2017-07-10 14:44:11ncoghlansetmessages: + msg298067
2017-07-10 14:25:32hayposetmessages: + msg298066
2017-07-10 13:59:46hayposetfiles: + importlib_sleep.patch
keywords: + patch
messages: + msg298064
2017-07-10 13:09:31hayposetmessages: + msg298063
2017-07-10 12:02:13serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg298058
2017-07-10 10:40:24ncoghlansettype: behavior
messages: + msg298053
stage: needs patch
2017-07-10 10:05:00hayposetmessages: + msg298049
2017-07-10 10:02:10hayposetmessages: + msg298048
2017-07-10 09:59:54hayposetnosy: + brett.cannon, ncoghlan, eric.snow
messages: + msg298047
2017-07-10 09:56:36hayposetmessages: + msg298046
2017-07-10 09:44:24haypocreate