Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importlib: _find_and_load() race condition on sys.modules[name] check #75074

Closed
vstinner opened this issue Jul 10, 2017 · 35 comments
Closed

importlib: _find_and_load() race condition on sys.modules[name] check #75074

vstinner opened this issue Jul 10, 2017 · 35 comments
Labels
3.7 (EOL) end of life OS-windows tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

BPO 30891
Nosy @gvanrossum, @Yhg1s, @warsaw, @brettcannon, @pfmoore, @ncoghlan, @vstinner, @tjguk, @ambv, @ericsnowcurrently, @zware, @serhiy-storchaka, @zooba
PRs
  • [3.5] bpo-30814: Add a test for concurrent relative import. (GH-2580). #2645
  • bpo-30891: Fix importlib _find_and_load() race condition #2646
  • [3.6] bpo-30891: Fix importlib _find_and_load() race condition (#2646) #2651
  • bpo-30891: importlib _find_and_load() try/except #2665
  • [3.6] bpo-30891: Fix again importlib _find_and_load() (#2665) #2801
  • Files
  • importlib_sleep.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-07-27.10:25:47.039>
    created_at = <Date 2017-07-10.09:44:24.364>
    labels = ['3.7', 'type-bug', 'tests', 'OS-windows']
    title = 'importlib: _find_and_load() race condition on sys.modules[name] check'
    updated_at = <Date 2018-04-04.19:19:31.431>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-04-04.19:19:31.431>
    actor = 'gvanrossum'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-07-27.10:25:47.039>
    closer = 'vstinner'
    components = ['Tests', 'Windows']
    creation = <Date 2017-07-10.09:44:24.364>
    creator = 'vstinner'
    dependencies = []
    files = ['47000']
    hgrepos = []
    issue_num = 30891
    keywords = ['patch']
    message_count = 35.0
    messages = ['298044', '298046', '298047', '298048', '298049', '298053', '298058', '298063', '298064', '298066', '298067', '298068', '298069', '298070', '298071', '298072', '298081', '298087', '298088', '298798', '298801', '299184', '299305', '302106', '302124', '314594', '314615', '314818', '314819', '314879', '314887', '314889', '314894', '314948', '314949']
    nosy_count = 13.0
    nosy_names = ['gvanrossum', 'twouters', 'barry', 'brett.cannon', 'paul.moore', 'ncoghlan', 'vstinner', 'tim.golden', 'lukasz.langa', 'eric.snow', 'zach.ware', 'serhiy.storchaka', 'steve.dower']
    pr_nums = ['2645', '2646', '2651', '2665', '2801']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue30891'
    versions = ['Python 3.6', 'Python 3.7']

    @vstinner
    Copy link
    Member Author

    1:36:32 [224/405/1] test_import failed -- running: test_mmap (2236 sec), test_io (2529 sec), test_venv (1437 sec)
    beginning 6 repetitions
    123456
    .....test test_import failed -- Traceback (most recent call last):
      File "D:\buildarea\3.6.ware-win81-release.refleak\build\lib\test\test_import\__init__.py", line 377, in test_concurrency
        raise exc
      File "D:\buildarea\3.6.ware-win81-release.refleak\build\lib\test\test_import\__init__.py", line 362, in run
        import package
      File "D:\buildarea\3.6.ware-win81-release.refleak\build\lib\test\test_import\data\package\__init__.py", line 2, in <module>
        package.submodule
    AttributeError: module 'package' has no attribute 'submodule'

    @vstinner vstinner added 3.7 (EOL) end of life tests Tests in the Lib/test dir OS-windows labels Jul 10, 2017
    @vstinner
    Copy link
    Member Author

    I was able to *easily* (100% failure rate) reproduce the failure with Python 3.6 and 3.7 (master) on Windows using this command:

    ./python -m test -R 3:100 -m test_concurrency test_import

    @vstinner
    Copy link
    Member Author

    I don't understand if a test_concurrency() failure is a bug in importlib, or a bug in the test :-)

    @vstinner
    Copy link
    Member Author

    Note: "./python -m test -R 3:100 -m test_concurrency test_import" is just a convenient way to run test_concurrency() "in a loop", in the same process, 100 times.

    I'm unable to reproduce the bug on Linux, even when I write 3 into /proc/sys/vm/drop_caches while the test is running to drop filesystem caches.

    I also tried to use a temporary directory in my home (TMPDIR=$HOME/tmp) to use temporary files in ext4 rather than tmpfs. But I don't know if importlib uses temporary files or not :-)

    @vstinner
    Copy link
    Member Author

    Oh, the unit test was added 4 days ago:

    commit b4baace
    Author: Serhiy Storchaka <storchaka@gmail.com>
    Date: Thu Jul 6 08:09:03 2017 +0300

    bpo-30814: Fixed a race condition when import a submodule from a package. (bpo-2580)
    

    @ncoghlan
    Copy link
    Contributor

    Indeed, Serhiy's been stress-testing some bits of the import system, and finding a few holes made possible by the shift away from a single global import lock to per-module locks. (bpo-30876 is another example of that)

    Nothing immediately comes to mind as a platform-dependent code path that could explain why that error is showing up on Windows and not on Linux, though.

    @ncoghlan ncoghlan added the type-bug An unexpected behavior, bug, or error label Jul 10, 2017
    @serhiy-storchaka
    Copy link
    Member

    Some parts of import machinery are platform depended (see bpo-30873).

    I don't have idea what is wrong on Windows.

    I'm puzzled. test_concurrency itself runs a test in a loop. The only difference between one and several consequent runs of test_concurrency is running setUp() and tearDown() between tests. Victor, try to add importlib.invalidate_caches() inside a loop. Does it cause a failure in a single test_concurrency?

    @vstinner
    Copy link
    Member Author

    Victor, try to add importlib.invalidate_caches() inside a loop. Does it cause a failure in a single test_concurrency?

    I added a "importlib.invalidate_caches()" call in the "for i in range(10):" loop. With this change, a single run doesn't fail. It doesn't fix the test neither: with -R 3:100, the command still fails.

    Some parts of import machinery are platform depended (see bpo-30873).

    I don't think that this bpo-30873 or bpo-30626 is platform dependent.

    I don't have idea what is wrong on Windows.

    IMHO it's just that my Windows VM is slow. It's a race condition. If I run "./python -m test -m test_concurrency test_import test_import": the command pass, but when again, it fails. It's pure random :-)

    @vstinner
    Copy link
    Member Author

    I added print(id(package)) in Lib/test/test_import/data/package/init.py: when the bug occurs on my Windows VM, I see two print with two different identifiers. Moreover, when the bug occurs, first id(sys.modules['package']) is the first identifier, and then package is replaced with the new 'package' package in sys.modules (new identifier).

    I don't understand something: sys.modules isn't supposed to be protected by a lock? I checked if _installed_safely.__enter__() if the _imp is held: it isn't held when _bootstrap._setup() is called, nor when _imp._install_external_importers() is called. But even if I patch these two functions to acquire the lock, Python startup fails on importing a codec module, because the _imp lock is not held in _installed_safely.__enter__(). So maybe sys.modules dict is supposed to be protected by a different lock, maybe indirectly?

    I hack _bootstrap._find_and_load_unlocked() to sleep 10 ms if name == 'package'. With this change, test_concurrency() fails immediatly on my Linux box.

    • Apply attached importlib_sleep.patch (written for the current master branch)
    • Recompile: make regen-importlib && make
    • Run test_concurrency() just once: ./python -m test -m test_concurrency -v test_import
    • You get the bug

    By the way, importlib_sleep.patch changes test_concurrency() to run the test exactly once (it changes to loop to a single iteration, instead of 10).

    Is it normal that a sleep makes import failing? Imports are not supposed to be atomic?

    --

    I tried to similar hack before the commit b4baace:

    • add a sleep in importlib
    • copy test_concurrency() from master
    • Run test_concurrency()
    • The test pass

    While I'm not sure of my hack, it *seems* like sys.modules was somehow atomic before this commit?

    @vstinner
    Copy link
    Member Author

    FYI test_concurrency() added by commit b4baace doesn't really test bpo-30814. I backported the unit test on 1ccbad9 (commit before b4baace), and the test pass, whereas I expected it to fail before the fix.

    @ncoghlan
    Copy link
    Contributor

    No, there's no global lock on the import system these days.

    Instead, imports are supposed to be atomic on a per-module basis, but bpo-30814 showed that some of the accesses to the shared state management (which *is* still supposed to be protected by the import lock) could get into trouble given concurrent imports from different threads.

    However, the new test exercises an interesting sequence of per-module lock acquisition, since each thread effectively runs through this sequence:

    • import package # Test case
    • import package.submodule # Package init
    • import package # Parent package import

    That moves back-and-forth between C & Python in a few different places, and hence may have been relying on the GIL to keep the parts executed in C atomic.

    So now I'm wondering whether or not this new problem might be due to moving the _imp.acquire_lock() calls into Python code as part of bpo-30814, as that creates a window near the start of _find_and_load() where the thread may end up holding neither the GIL *nor* the import lock. By contrast, when the import lock was acquired in C, then by the time the Python code starts running (potentially releasing the GIL), the import lock has already been acquired.

    @vstinner
    Copy link
    Member Author

    FYI test_concurrency() added by commit b4baace doesn't really test bpo-30814. I backported the unit test on 1ccbad9 (commit before b4baace), and the test pass, whereas I expected it to fail before the fix.

    Oops, please ignore this comment, it was a mistake. After clearing my repository (removing __pycache__ directories), the test failed. I forgot to copy the Lib/test/test_import/data/package/ directory.

    @serhiy-storchaka
    Copy link
    Member

    test_concurrency is failed before b4baace on my computer. I think it tests the bpo-30814 bug even if it doesn't fail on all computers. At the end, this is a race condition, and the failure is probabilistic.

    @vstinner
    Copy link
    Member Author

    Using attached importlib_sleep.patch, you can *see* the race condition with:

    diff --git a/Lib/test/test_import/data/package/__init__.py b/Lib/test/test_import/data/package/__init__.py
    index a4f2bc3..f07eb11 100644
    --- a/Lib/test/test_import/data/package/__init__.py
    +++ b/Lib/test/test_import/data/package/__init__.py
    @@ -1,2 +1,3 @@
     import package.submodule
    +print("package: id=%#x" % id(package))
     package.submodule

    Example of output:
    ---
    haypo@selma$ ./python -m test -m test_concurrency -v test_import
    package: id=0x7fb497400d58
    package: id=0x7fb49744d458
    test test_import failed -- (...)
    ---

    Lib/test/test_import/data/package/init.py is run twice instead of being run once. A lock is missing somewhere on the package module.

    @serhiy-storchaka
    Copy link
    Member

    Victor tested test_concurrency on 3.5, and it is failed on slow Windows VM. Therefore the bug exists on 3.5 and is not related to recent changes.

    @ncoghlan
    Copy link
    Contributor

    I'm about to head to bed so I'll get back to this discussion tomorrow, but in the meantime: Victor, maybe it would make sense to reformulate your reproducer as a monkeypatch for _find_and_load_unlocked() that injects the 10 ms sleep from the test case if it finds _frozen_importlib in sys.modules?

    That is, temporarily replacing it with something like:

        orig_find_and_load_unlocked = _frozen_importlib._find_and_load_unlocked
        def _slow_find_and_load_unlocked(name, import_):
            time.sleep(0.010)
            return orig_find_and_load_unlocked(name, import_)

    That way, we'd hopefully be able to reproduce the failure independently of the slower buildbot.

    @vstinner vstinner changed the title Sometimes, test_concurrency() of test_import fails with AttributeError: module 'package' has no attribute 'submodule' on AMD64 Windows8.1 Refleaks 3.6 importlib: _find_and_load() race condition on sys.modules[name] check Jul 10, 2017
    @vstinner
    Copy link
    Member Author

    New changeset 4f9a446 by Victor Stinner in branch 'master':
    bpo-30891: Fix importlib _find_and_load() race condition (bpo-2646)
    4f9a446

    @vstinner
    Copy link
    Member Author

    New changeset fe6e686 by Victor Stinner in branch '3.6':
    bpo-30891: Fix importlib _find_and_load() race condition (bpo-2646) (bpo-2651)
    fe6e686

    @vstinner
    Copy link
    Member Author

    Backporting the fix ot Python 3.5 doesn't seem trivial, 3.5 lacks many changes.

    @vstinner
    Copy link
    Member Author

    New changeset e72b135 by Victor Stinner in branch 'master':
    bpo-30891: Fix again importlib _find_and_load() (bpo-2665)
    e72b135

    @vstinner
    Copy link
    Member Author

    New changeset bb323b2 by Victor Stinner in branch '3.6':
    bpo-30891: Fix again importlib _find_and_load() (bpo-2665) (bpo-2801)
    bb323b2

    @vstinner
    Copy link
    Member Author

    With Python 3.5.4rc1 released (3.5.4 will be the last 3.5 release accepting bug fixes), I'm not sure that it's still worth it to try to backport such changes to Python 3.5. The importlib is very important, any regression can be very important.

    I'm not sure that backporting the unit test is interesting. If the test fails, we will be unable to fix it.

    @vstinner
    Copy link
    Member Author

    Victor:

    I'm not sure that backporting the unit test is interesting. If the test fails, we will be unable to fix it.

    Serhiy closed his PR:
    #2645 (comment)

    So now I close this issue.

    @ambv
    Copy link
    Contributor

    ambv commented Sep 13, 2017

    If you're looking at this issue because it still happens to you, you are probably using zipimport. Stop using zipimport.

    @vstinner
    Copy link
    Member Author

    Łukasz Langa: "If you're looking at this issue because it still happens to you, you are probably using zipimport. Stop using zipimport."

    Hum, do you mean that zipimport also has an issue? Maybe open a new issue?

    @gvanrossum
    Copy link
    Member

    Yeah, what's wrong with zipimport?

    @brettcannon
    Copy link
    Member

    There's a bunch of things wrong with zipimport, just ask Thomas. ;)

    @ambv
    Copy link
    Contributor

    ambv commented Apr 2, 2018

    In the context of this issue: zipimport doesn't properly lock on imports happening in multiple threads at the same time. This causes one thread to receive a partially imported module while the other thread is still performing the import.

    zipimport also doesn't properly check __pycache__ directories but *does* look for old-style .pyc files next to the corresponding .py file.

    IIRC Brett chooses not to deal with zipimport's warts and has no time to rewrite it in Python. Since the rest of the core team is less qualified, they steer clear, too.

    So, don't use zipimport.

    @gvanrossum
    Copy link
    Member

    That's rather disturbing, and none of that is reflected in the zipimport docs. Is there at least a bpo issue about it? I may be the last person in the world to still recommend putting a zipfile on sys.path -- I used to believe it's faster than directory searches, is that no longer the case?

    @zooba
    Copy link
    Member

    zooba commented Apr 3, 2018

    I recommend and use it without trouble, but then I rarely defer imports to worker threads.

    Barry may have some ideas, but I think this really shouldn't be difficult to do. importlib is pretty extensible, and it's not difficult to read from a zip file.

    @brettcannon
    Copy link
    Member

    The other problem is zipimport is written in C and no one wants to deal with that headache if possible. ;) We actually tweaked the importlib.resources API specifically to avoid having to muck with zipimport's C code as it would have required major surgery otherwise. Thomas has also been saying he was planning to upstream the improvements he has made to zipimport internally at Google which also demotivates touching the code knowing that changes are planned.

    As for Steve's comment, the tricky bit is zipfile has a bunch of dependencies, so it's not so simple to just freeze that one module and wrap it with importlib to be able to load the stdlib from a zip file (which zipimport supports). Basically zip-reading code needs to be written that can be bootstrapped (C or Python), and then use that in importlib to be able to import the stdlib from a zip file.

    I have also been toying with the idea of a zippath module that would take pathlib's API and apply it to zip files acting like a self-contained file system, but I have never bothered as zipfile's API is messy. The "all the time in the world" project would be making all of this come together in a release of Python by building off of each other: new code to read zip files written in freezable Python, use importlib to replace zipimport, and provide a pathlib API to work with zip files (which would probably be partially built for use by importlib).

    If I drop my idea of putting a package on PyPI to do lazy importing I could look at tackling this for 3.8.

    @warsaw
    Copy link
    Member

    warsaw commented Apr 3, 2018

    Take a look at https://bugs.python.org/issue25711 which wants to rewrite zipimport from scratch. I took Serhiy's C->Python patch and turned it into a PR, but it would need a lot of love to get it into shape.

    (Let's not even talk about https://bugs.python.org/issue32866 which caused me some pain during importlib.resources work.)

    Pretty sure Thomas's rumored rewrite is still in C, and it solves problems like support for zip64 and comments, which we've run into also.

    zipimport is crucial for single-file executable solutions like pex and our own soon-to-be-open-sourced tool, so it really can't go away. Both of those still unpack wheels/eggs/sdists to a file system cache, not so much for speed but because shared libraries generally can't be imported from anything but a physical path (thanks to dlopen()). I think Google has a patch for glibc that would solve that but there's some hesitation from the glibc maintainers to merge it (and then what about other platforms?).

    @gvanrossum
    Copy link
    Member

    OK, this sounds much better: zipimport has problems and we should fix them,
    presumably by getting Thomas's fixes (sans Google-isms, which may require
    some effort). I don't believe Łukasz's recommendation "don't use it" is the
    right attitude (though what he puts in Facebook's internal style guide is
    his problem :-), and I don't believe that we're better off deprecating it.

    @brettcannon
    Copy link
    Member

    And mostly for completeness, I know Thomas has maligned the fact that zipimport has tests for some crazy things in its semantics which made it difficult for him to improve the module.

    @gvanrossum
    Copy link
    Member

    If there are overspecified tests we can debate them of course.

    On Wed, Apr 4, 2018, 11:55 Brett Cannon <report@bugs.python.org> wrote:

    Brett Cannon <brett@python.org> added the comment:

    And mostly for completeness, I know Thomas has maligned the fact that
    zipimport has tests for some crazy things in its semantics which made it
    difficult for him to improve the module.

    ----------


    Python tracker <report@bugs.python.org>
    <https://bugs.python.org/issue30891\>


    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life OS-windows tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    8 participants