Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarfile.open(mode="r") race condition when importing lzma #83611

Closed
MaciejGol mannequin opened this issue Jan 23, 2020 · 18 comments
Closed

tarfile.open(mode="r") race condition when importing lzma #83611

MaciejGol mannequin opened this issue Jan 23, 2020 · 18 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@MaciejGol
Copy link
Mannequin

MaciejGol mannequin commented Jan 23, 2020

BPO 39430
Nosy @brettcannon, @ncoghlan, @gustaebel, @ethanfurman, @xdegaye, @ericsnowcurrently, @serhiy-storchaka, @miss-islington
PRs
  • bpo-39430: Fix race condition in lazy imports in tarfile. #18161
  • [3.8] bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161) #18171
  • [3.7] bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161) #18172
  • Files
  • test.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-01-23.11:15:38.684>
    labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
    title = 'tarfile.open(mode="r") race condition when importing lzma'
    updated_at = <Date 2020-01-27.21:29:15.029>
    user = 'https://bugs.python.org/MaciejGol'

    bugs.python.org fields:

    activity = <Date 2020-01-27.21:29:15.029>
    actor = 'ethan.furman'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2020-01-23.11:15:38.684>
    creator = 'Maciej Gol'
    dependencies = []
    files = ['48861']
    hgrepos = []
    issue_num = 39430
    keywords = ['patch']
    message_count = 18.0
    messages = ['360551', '360552', '360598', '360618', '360619', '360620', '360629', '360634', '360635', '360636', '360637', '360649', '360650', '360653', '360708', '360714', '360793', '360805']
    nosy_count = 9.0
    nosy_names = ['brett.cannon', 'ncoghlan', 'lars.gustaebel', 'ethan.furman', 'xdegaye', 'eric.snow', 'serhiy.storchaka', 'miss-islington', 'Maciej Gol']
    pr_nums = ['18161', '18171', '18172']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue39430'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @MaciejGol
    Copy link
    Mannequin Author

    MaciejGol mannequin commented Jan 23, 2020

    Hey guys,

    We have a component that archives and unarchives multiple files in separate threads that started to
    misbehave recently.

    We have noticed a bunch of AttributeError: module 'lzma' has no attribute 'LZMAFile' errors, which are
    unexpected because our python is not compiled with LZMA support.

    What is unfortunate, is that given the traceback:

        Traceback (most recent call last):
          File "test.py", line 18, in <module>
            list(pool.map(test_lzma, range(100)))
          File "/opt/lang/python37/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
            yield fs.pop().result()
          File "/opt/lang/python37/lib/python3.7/concurrent/futures/_base.py", line 428, in result
            return self.__get_result()
          File "/opt/lang/python37/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
            raise self._exception
          File "/opt/lang/python37/lib/python3.7/concurrent/futures/thread.py", line 57, in run
            result = self.fn(*self.args, **self.kwargs)
          File "test.py", line 14, in test_lzma
            tarfile.open(fileobj=buf, mode="r")
          File "/opt/lang/python37/lib/python3.7/tarfile.py", line 1573, in open
            return func(name, "r", fileobj, **kwargs)
          File "/opt/lang/python37/lib/python3.7/tarfile.py", line 1699, in xzopen
            fileobj = lzma.LZMAFile(fileobj or name, mode, preset=preset)
        AttributeError: module 'lzma' has no attribute 'LZMAFile'

    the last line of the traceback is right AFTER this block (tarfile.py:1694):

    try:
        import lzma
    except ImportError:
        raise CompressionError("lzma module is not available")
    

    Importing lzma in ipython fails properly:

    In [2]: import lzma                                                                                                                               
    \---------------------------------------------------------------------------
    
        ModuleNotFoundError                       Traceback (most recent call last)
        <ipython-input-2-c0255b54beb9> in <module>
        ----> 1 import lzma
    /opt/lang/python37/lib/python3.7/lzma.py in <module>
        25 import io
        26 import os
    ---> 27 from _lzma import *
        28 from _lzma import _encode_filter_properties, _decode_filter_properties
        29 import _compression
    
    ModuleNotFoundError: No module named '_lzma'
    

    When trying to debug the problem, we have noticed it's not deterministic. In order to reproduce it,
    we have created a test python that repeatedly writes an archive to BytesIO and then reads from it.
    Using it with 5 threads and 100 calls, gives very good chances of reproducing the issue. For us it
    was almost every time.

    Race condition occurs both on Python 3.7.3 and 3.7.6.
    Test script used to reproduce it attached.

    I know that the test script writes uncompressed archives and during opening tries to guess the compression.
    But I guess this is a legitimate scenario and should not matter in this case.

    @MaciejGol MaciejGol mannequin added type-crash A hard crash of the interpreter, possibly with a core dump 3.7 (EOL) end of life labels Jan 23, 2020
    @MaciejGol
    Copy link
    Mannequin Author

    MaciejGol mannequin commented Jan 23, 2020

    Uploading fixed file (the former had a typo)

    @serhiy-storchaka
    Copy link
    Member

    PR 18161 fixes race condition by using "from ... import ..." which waits until the module be completely initialized if the specified names are not set.

    @serhiy-storchaka serhiy-storchaka added stdlib Python modules in the Lib dir 3.8 only security fixes 3.9 only security fixes type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Jan 24, 2020
    @MaciejGol
    Copy link
    Mannequin Author

    MaciejGol mannequin commented Jan 24, 2020

    PR 18161 fixes race condition by using "from ... import ..."
    which waits until the module be completely initialized if the specified
    names are not set.

    Correct me if I'm wrong, but I think the behavior of 'import lzma' in
    this case (vulnerable to race conditions) is not as intended? Shouldn't we also fix
    the 'import' statement itself?

    In general, I understand that due to how dynamic Python is, it might not be
    possible to load every single name at import time using import and using
    from x import y brings more determinism (because we have a safeguard now).

    But, looking at the stacktrace from ipython the problem lies in a sequence of
    import statements, not dynamic python coding. Shouldn't the importing mechanism
    be more deterministic in such case? For sure, it should not return an empty module
    (this is the case when race condition occurs).

    I think a race condition caused by simply using import statements
    (not from x import y) is a big caveat in the statement itself and how python
    imports work.

    Haven't checked if the cause is isolated to how tarfile works, or works in general, though :-(

    @MaciejGol
    Copy link
    Mannequin Author

    MaciejGol mannequin commented Jan 24, 2020

    By the way, thanks a lot for the fix <3

    @serhiy-storchaka
    Copy link
    Member

    It is intended to support circular imports. Let foo.py contains "import bar" and bar.py contains "import foo". When you execute "import foo", the import machinery first creates an empty module foo, adds it to sys.modules, reads foo.py and executes it in the namespace of module foo. When the interpreter encounters "import bar" in foo.py, the import machinery creates an empty module bar, adds it to sys.modules, reads bar.py and executes it in the namespace of module bar. When the interpreter encounters "import foo" in bar.py, the import machinery takes the module foo from sys.modules. So you break an infinite cycle and can import modules with cyclic dependencies.

    You can argue that cyclic import does not look as a good practice, but actually it is pretty common case when you import a submodule in a package. If foo/init.py contains "from .bar import Bar", the foo module must be imported before you import foo.bar, but is not completely initialized at that time yet.

    @MaciejGol
    Copy link
    Mannequin Author

    MaciejGol mannequin commented Jan 24, 2020

    This is a HUGE eye opener! Didn't know of that 'import' vs 'from x import
    y' difference. Thanks a lot! Is it documented somewhere ?

    pt., 24 sty 2020, 15:08 użytkownik Serhiy Storchaka <report@bugs.python.org>
    napisał:

    Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

    It is intended to support circular imports. Let foo.py contains "import
    bar" and bar.py contains "import foo". When you execute "import foo", the
    import machinery first creates an empty module foo, adds it to sys.modules,
    reads foo.py and executes it in the namespace of module foo. When the
    interpreter encounters "import bar" in foo.py, the import machinery creates
    an empty module bar, adds it to sys.modules, reads bar.py and executes it
    in the namespace of module bar. When the interpreter encounters "import
    foo" in bar.py, the import machinery takes the module foo from sys.modules.
    So you break an infinite cycle and can import modules with cyclic
    dependencies.

    You can argue that cyclic import does not look as a good practice, but
    actually it is pretty common case when you import a submodule in a package.
    If foo/init.py contains "from .bar import Bar", the foo module must be
    imported before you import foo.bar, but is not completely initialized at
    that time yet.

    ----------


    Python tracker <report@bugs.python.org>
    <https://bugs.python.org/issue39430\>


    @brettcannon
    Copy link
    Member

    It's covered in the language reference where import semantics are explained. It isn't explicitly called out in higher-level docs that I'm aware of because it only really comes up in edge cases like importing in a thread which we actively discourage unless there's a compelling reason not to.

    @ethanfurman
    Copy link
    Member

    Thanks, Serhiy! I had no idea of those changes to import.

    This does raise the question, though, of whether accessing an empty module's attributes should either:

    • wait for the module to finish initializing; or
    • raise an exception immediately with better information
      ?

    It seems like this is a trap waiting for anybody using threaded code and on-demand imports.

    @ethanfurman
    Copy link
    Member

    New changeset 9017e0b by Ethan Furman (Serhiy Storchaka) in branch 'master':
    bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161)
    9017e0b

    @ethanfurman
    Copy link
    Member

    Brett:
    ------

    It's covered in the language reference where import semantics are
    explained. It isn't explicitly called out in higher-level docs that
    I'm aware of because it only really comes up in edge cases like
    importing in a thread which we actively discourage unless there's a
    compelling reason not to.

    When was this change made? Assuming it is, in fact, a change. It seems like most of the stdlib could be called in a thread, so the practice of lazily importing a module wholesale might need to be audited and proactively fixed. Maybe that could be a newcomer's easy(ish) project?

    @miss-islington
    Copy link
    Contributor

    New changeset ea4a61f by Miss Islington (bot) in branch '3.8':
    bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161)
    ea4a61f

    @miss-islington
    Copy link
    Contributor

    New changeset 1a27435 by Miss Islington (bot) in branch '3.7':
    bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161)
    1a27435

    @serhiy-storchaka
    Copy link
    Member

    I suppose it fixed the bug. But I cannot confirm because I cannot reproduce the original bug.

    @xdegaye
    Copy link
    Mannequin

    xdegaye mannequin commented Jan 25, 2020

    In msg360620 Serhiy wrote:

    When the interpreter encounters "import foo" in bar.py, the import machinery takes the module foo from sys.modules.
    So you break an infinite cycle and can import modules with cyclic dependencies.

    The following pdb session that is run when executing the foo.py script, shows that this is not quite accurate. When the interpreter encounters "import foo" in bar.py, the import machinery instead starts executing foo.py once again and detects that at line 9 of foo.py the 'import bar' statement does not have to be executed since the import machinery is already in the process of importing this module. So it is at that point that the infinite cycle is broken.
    Later on line 12 in foo.py the import machinery detects that the AttributeError is raised because of a circular import and warns that "partially initialized module 'bar' has no attribute 'BAR' (most likely due to a circular import)".

    $ cat -n foo.py
         1  import pdb
         2  debug = 1
         3
         4  # Prevent starting a new pdb session when foo is imported by bar.
         5  if debug and not hasattr(pdb, 'is_pdb_started'):
         6      pdb.Pdb(skip=['importlib*']).set_trace()
         7      pdb.is_pdb_started = True
         8
         9  import bar
        10  FOO = 'foo'
        11  try:
        12      print(bar.BAR)
        13  except AttributeError as e:
        14      # The exception is raised with an explicit reason when foo is imported by
        15      # bar due to partially initialized module 'bar'.
        16      print(e)
    
    $ cat -n bar.py
         1  import foo
         2  BAR = 'bar'
         3  print(foo.FOO)
    
    $ python foo.py
    > /tmp/foo.py(7)<module>()
    -> pdb.is_pdb_started = True
    (Pdb) step
    > /tmp/foo.py(9)<module>()
    -> import bar
    (Pdb) step
    --Call--
    > /tmp/bar.py(1)<module>()
    -> import foo
    (Pdb) step
    > /tmp/bar.py(1)<module>()
    -> import foo
    (Pdb) step
    --Call--
    > /tmp/foo.py(1)<module>()
    -> import pdb
    (Pdb) step
    > /tmp/foo.py(1)<module>()
    -> import pdb
    (Pdb) step
    > /tmp/foo.py(2)<module>()
    -> debug = 1
    (Pdb) step
    > /tmp/foo.py(5)<module>()
    -> if debug and not hasattr(pdb, 'is_pdb_started'):
    (Pdb) step
    > /tmp/foo.py(9)<module>()
    -> import bar
    (Pdb) step
    > /tmp/foo.py(10)<module>()
    -> FOO = 'foo'
    (Pdb) step
    > /tmp/foo.py(11)<module>()
    -> try:
    (Pdb) step
    > /tmp/foo.py(12)<module>()
    -> print(bar.BAR)
    (Pdb) step
    AttributeError: partially initialized module 'bar' has no attribute 'BAR' (most likely due to a circular import)
    > /tmp/foo.py(12)<module>()
    -> print(bar.BAR)
    (Pdb) step
    > /tmp/foo.py(13)<module>()
    -> except AttributeError as e:
    (Pdb) step
    > /tmp/foo.py(16)<module>()
    -> print(e)
    (Pdb) step
    partially initialized module 'bar' has no attribute 'BAR' (most likely due to a circular import)
    --Return--
    > /tmp/foo.py(16)<module>()->None
    -> print(e)
    (Pdb) step
    > /tmp/bar.py(2)<module>()
    -> BAR = 'bar'
    (Pdb) step
    > /tmp/bar.py(3)<module>()
    -> print(foo.FOO)
    (Pdb) step
    foo
    --Return--
    > /tmp/bar.py(3)<module>()->None
    -> print(foo.FOO)
    (Pdb) step
    > /tmp/foo.py(10)<module>()
    -> FOO = 'foo'
    (Pdb) step
    > /tmp/foo.py(11)<module>()
    -> try:
    (Pdb) step
    > /tmp/foo.py(12)<module>()
    -> print(bar.BAR)
    (Pdb) step
    bar
    --Return--
    > /tmp/foo.py(12)<module>()->None
    -> print(bar.BAR)
    (Pdb) step
    $

    @xdegaye
    Copy link
    Mannequin

    xdegaye mannequin commented Jan 25, 2020

    This fixed the bug.

    To reproduce the bug, substitute the "from _lzma import *" statement with "raise ImportError"
    in Lib/lzma.py to simulate that the _lzma module is missing and run the test.py script written and uploaded by Maciej.

    @brettcannon
    Copy link
    Member

    When was this change made?

    It has always been this way, so no change here. We have actually improved the situation over the years with attempts at better locking in importlib itself.

    @ethanfurman
    Copy link
    Member

    Cool. I appreciate all the work in this area, thank you!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    Status: Done
    Status: Done
    Development

    No branches or pull requests

    5 participants