classification
Title: tarfile.open(mode="r") race condition when importing lzma
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Maciej Gol, brett.cannon, eric.snow, ethan.furman, lars.gustaebel, miss-islington, ncoghlan, serhiy.storchaka, xdegaye
Priority: normal Keywords: patch

Created on 2020-01-23 11:15 by Maciej Gol, last changed 2020-01-27 21:29 by ethan.furman.

Files
File name Uploaded Description Edit
test.py Maciej Gol, 2020-01-23 11:19
Pull Requests
URL Status Linked Edit
PR 18161 merged serhiy.storchaka, 2020-01-24 09:06
PR 18171 merged miss-islington, 2020-01-24 17:56
PR 18172 merged miss-islington, 2020-01-24 17:56
Messages (18)
msg360551 - (view) Author: Maciej Gol (Maciej Gol) Date: 2020-01-23 11:15
Hey guys,

We have a component that archives and unarchives multiple files in separate threads that started to
misbehave recently.

We have noticed a bunch of `AttributeError: module 'lzma' has no attribute 'LZMAFile'` errors, which are
unexpected because our python is not compiled with LZMA support.

What is unfortunate, is that given the traceback:

    Traceback (most recent call last):
      File "test.py", line 18, in <module>
        list(pool.map(test_lzma, range(100)))
      File "/opt/lang/python37/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
        yield fs.pop().result()
      File "/opt/lang/python37/lib/python3.7/concurrent/futures/_base.py", line 428, in result
        return self.__get_result()
      File "/opt/lang/python37/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
        raise self._exception
      File "/opt/lang/python37/lib/python3.7/concurrent/futures/thread.py", line 57, in run
        result = self.fn(*self.args, **self.kwargs)
      File "test.py", line 14, in test_lzma
        tarfile.open(fileobj=buf, mode="r")
      File "/opt/lang/python37/lib/python3.7/tarfile.py", line 1573, in open
        return func(name, "r", fileobj, **kwargs)
      File "/opt/lang/python37/lib/python3.7/tarfile.py", line 1699, in xzopen
        fileobj = lzma.LZMAFile(fileobj or name, mode, preset=preset)
    AttributeError: module 'lzma' has no attribute 'LZMAFile'


the last line of the traceback is right AFTER this block (tarfile.py:1694):

    try:
        import lzma
    except ImportError:
        raise CompressionError("lzma module is not available")


Importing lzma in ipython fails properly:

    In [2]: import lzma                                                                                                                               
    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-2-c0255b54beb9> in <module>
    ----> 1 import lzma

    /opt/lang/python37/lib/python3.7/lzma.py in <module>
        25 import io
        26 import os
    ---> 27 from _lzma import *
        28 from _lzma import _encode_filter_properties, _decode_filter_properties
        29 import _compression

    ModuleNotFoundError: No module named '_lzma'

When trying to debug the problem, we have noticed it's not deterministic. In order to reproduce it,
we have created a test python that repeatedly writes an archive to BytesIO and then reads from it.
Using it with 5 threads and 100 calls, gives very good chances of reproducing the issue. For us it
was almost every time.

Race condition occurs both on Python 3.7.3 and 3.7.6.
Test script used to reproduce it attached.

I know that the test script writes uncompressed archives and during opening tries to guess the compression.
But I guess this is a legitimate scenario and should not matter in this case.
msg360552 - (view) Author: Maciej Gol (Maciej Gol) Date: 2020-01-23 11:19
Uploading fixed file (the former had a typo)
msg360598 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-01-24 09:14
PR 18161 fixes race condition by using "from ... import ..." which waits until the module be completely initialized if the specified names are not set.
msg360618 - (view) Author: Maciej Gol (Maciej Gol) Date: 2020-01-24 13:33
> PR 18161 fixes race condition by using "from ... import ..."
> which waits until the module be completely initialized if the specified
> names are not set.

Correct me if I'm wrong, but I think the behavior of 'import lzma' in
this case (vulnerable to race conditions) is not as intended? Shouldn't we also fix
the 'import' statement itself?

In general, I understand that due to how dynamic Python is, it might not be
possible to load every single name at import time using `import` and using
`from x import y` brings more determinism (because we have a safeguard now).

But, looking at the stacktrace from ipython the problem lies in a sequence of
import statements, not dynamic python coding. Shouldn't the importing mechanism
be more deterministic in such case? For sure, it should not return an empty module
(this is the case when race condition occurs).

I think a race condition caused by simply using `import` statements
(not `from x import y`) is a big caveat in the statement itself and how python
imports work.

Haven't checked if the cause is isolated to how tarfile works, or works in general, though :-(
msg360619 - (view) Author: Maciej Gol (Maciej Gol) Date: 2020-01-24 13:34
By the way, thanks a lot for the fix <3
msg360620 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-01-24 14:08
It is intended to support circular imports. Let foo.py contains "import bar" and bar.py contains "import foo". When you execute "import foo", the import machinery first creates an empty module foo, adds it to sys.modules, reads foo.py and executes it in the namespace of module foo. When the interpreter encounters "import bar" in foo.py, the import machinery creates an empty module bar, adds it to sys.modules, reads bar.py and executes it in the namespace of module bar. When the interpreter encounters "import foo" in bar.py, the import machinery takes the module foo from sys.modules. So you break an infinite cycle and can import modules with cyclic dependencies.

You can argue that cyclic import does not look as a good practice, but actually it is pretty common case when you import a submodule in a package. If foo/__init__.py contains "from .bar import Bar", the foo module must be imported before you import foo.bar, but is not completely initialized at that time yet.
msg360629 - (view) Author: Maciej Gol (Maciej Gol) Date: 2020-01-24 16:50
This is a HUGE eye opener! Didn't know of that 'import' vs 'from x import
y' difference. Thanks a lot! Is it documented somewhere ?

pt., 24 sty 2020, 15:08 u┼╝ytkownik Serhiy Storchaka <report@bugs.python.org>
napisał:

>
> Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:
>
> It is intended to support circular imports. Let foo.py contains "import
> bar" and bar.py contains "import foo". When you execute "import foo", the
> import machinery first creates an empty module foo, adds it to sys.modules,
> reads foo.py and executes it in the namespace of module foo. When the
> interpreter encounters "import bar" in foo.py, the import machinery creates
> an empty module bar, adds it to sys.modules, reads bar.py and executes it
> in the namespace of module bar. When the interpreter encounters "import
> foo" in bar.py, the import machinery takes the module foo from sys.modules.
> So you break an infinite cycle and can import modules with cyclic
> dependencies.
>
> You can argue that cyclic import does not look as a good practice, but
> actually it is pretty common case when you import a submodule in a package.
> If foo/__init__.py contains "from .bar import Bar", the foo module must be
> imported before you import foo.bar, but is not completely initialized at
> that time yet.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue39430>
> _______________________________________
>
msg360634 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2020-01-24 17:50
It's covered in the language reference where import semantics are explained. It isn't explicitly called out in higher-level docs that I'm aware of because it only really comes up in edge cases like importing in a thread which we actively discourage unless there's a compelling reason not to.
msg360635 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2020-01-24 17:51
Thanks, Serhiy!  I had no idea of those changes to import.

This does raise the question, though, of whether accessing an empty module's attributes should either:
- wait for the module to finish initializing; or
- raise an exception immediately with better information
?

It seems like this is a trap waiting for anybody using threaded code and on-demand imports.
msg360636 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2020-01-24 17:56
New changeset 9017e0bd5e124ae6d2ed94b9e9cacb2e86270980 by Ethan Furman (Serhiy Storchaka) in branch 'master':
bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161)
https://github.com/python/cpython/commit/9017e0bd5e124ae6d2ed94b9e9cacb2e86270980
msg360637 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2020-01-24 18:06
Brett:
------
> It's covered in the language reference where import semantics are
> explained. It isn't explicitly called out in higher-level docs that
> I'm aware of because it only really comes up in edge cases like
> importing in a thread which we actively discourage unless there's a
> compelling reason not to.

When was this change made?  Assuming it is, in fact, a change.  It seems like most of the stdlib could be called in a thread, so the practice of lazily importing a module wholesale might need to be audited and proactively fixed.  Maybe that could be a newcomer's easy(ish) project?
msg360649 - (view) Author: miss-islington (miss-islington) Date: 2020-01-24 20:10
New changeset ea4a61fec842c94107eef46e5030b89a086f94bb by Miss Islington (bot) in branch '3.8':
bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161)
https://github.com/python/cpython/commit/ea4a61fec842c94107eef46e5030b89a086f94bb
msg360650 - (view) Author: miss-islington (miss-islington) Date: 2020-01-24 20:10
New changeset 1a274359283d3d1f4f60dd527843f72e0368caf3 by Miss Islington (bot) in branch '3.7':
bpo-39430: Fix race condition in lazy imports in tarfile. (GH-18161)
https://github.com/python/cpython/commit/1a274359283d3d1f4f60dd527843f72e0368caf3
msg360653 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-01-24 20:25
I suppose it fixed the bug. But I cannot confirm because I cannot reproduce the original bug.
msg360708 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2020-01-25 20:20
In msg360620 Serhiy wrote:
> When the interpreter encounters "import foo" in bar.py, the import machinery takes the module foo from sys.modules.
> So you break an infinite cycle and can import modules with cyclic dependencies.

The following pdb session that is run when executing the foo.py script, shows that this is not quite accurate.  When the interpreter encounters "import foo" in bar.py, the import machinery instead starts executing foo.py once again and detects that at line 9 of foo.py the 'import bar' statement does not have to be executed since the import machinery is already in the process of importing this module.  So it is at that point that the infinite cycle is broken.
Later on line 12 in foo.py the import machinery detects that the AttributeError is raised because of a circular import and warns that "partially initialized module 'bar' has no attribute 'BAR' (most likely due to a circular import)".

$ cat -n foo.py
     1  import pdb
     2  debug = 1
     3
     4  # Prevent starting a new pdb session when foo is imported by bar.
     5  if debug and not hasattr(pdb, 'is_pdb_started'):
     6      pdb.Pdb(skip=['importlib*']).set_trace()
     7      pdb.is_pdb_started = True
     8
     9  import bar
    10  FOO = 'foo'
    11  try:
    12      print(bar.BAR)
    13  except AttributeError as e:
    14      # The exception is raised with an explicit reason when foo is imported by
    15      # bar due to partially initialized module 'bar'.
    16      print(e)

$ cat -n bar.py
     1  import foo
     2  BAR = 'bar'
     3  print(foo.FOO)

$ python foo.py
> /tmp/foo.py(7)<module>()
-> pdb.is_pdb_started = True
(Pdb) step
> /tmp/foo.py(9)<module>()
-> import bar
(Pdb) step
--Call--
> /tmp/bar.py(1)<module>()
-> import foo
(Pdb) step
> /tmp/bar.py(1)<module>()
-> import foo
(Pdb) step
--Call--
> /tmp/foo.py(1)<module>()
-> import pdb
(Pdb) step
> /tmp/foo.py(1)<module>()
-> import pdb
(Pdb) step
> /tmp/foo.py(2)<module>()
-> debug = 1
(Pdb) step
> /tmp/foo.py(5)<module>()
-> if debug and not hasattr(pdb, 'is_pdb_started'):
(Pdb) step
> /tmp/foo.py(9)<module>()
-> import bar
(Pdb) step
> /tmp/foo.py(10)<module>()
-> FOO = 'foo'
(Pdb) step
> /tmp/foo.py(11)<module>()
-> try:
(Pdb) step
> /tmp/foo.py(12)<module>()
-> print(bar.BAR)
(Pdb) step
AttributeError: partially initialized module 'bar' has no attribute 'BAR' (most likely due to a circular import)
> /tmp/foo.py(12)<module>()
-> print(bar.BAR)
(Pdb) step
> /tmp/foo.py(13)<module>()
-> except AttributeError as e:
(Pdb) step
> /tmp/foo.py(16)<module>()
-> print(e)
(Pdb) step
partially initialized module 'bar' has no attribute 'BAR' (most likely due to a circular import)
--Return--
> /tmp/foo.py(16)<module>()->None
-> print(e)
(Pdb) step
> /tmp/bar.py(2)<module>()
-> BAR = 'bar'
(Pdb) step
> /tmp/bar.py(3)<module>()
-> print(foo.FOO)
(Pdb) step
foo
--Return--
> /tmp/bar.py(3)<module>()->None
-> print(foo.FOO)
(Pdb) step
> /tmp/foo.py(10)<module>()
-> FOO = 'foo'
(Pdb) step
> /tmp/foo.py(11)<module>()
-> try:
(Pdb) step
> /tmp/foo.py(12)<module>()
-> print(bar.BAR)
(Pdb) step
bar
--Return--
> /tmp/foo.py(12)<module>()->None
-> print(bar.BAR)
(Pdb) step
$
msg360714 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2020-01-25 23:24
This fixed the bug.

To reproduce the bug, substitute the "from _lzma import *" statement with "raise ImportError"
in Lib/lzma.py to simulate that the _lzma module is missing and run the test.py script written and uploaded by Maciej.
msg360793 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2020-01-27 20:08
> When was this change made?

It has always been this way, so no change here. We have actually improved the situation over the years with attempts at better locking in importlib itself.
msg360805 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2020-01-27 21:29
Cool.  I appreciate all the work in this area, thank you!
History
Date User Action Args
2020-01-27 21:29:15ethan.furmansetmessages: + msg360805
2020-01-27 20:08:26brett.cannonsetmessages: + msg360793
2020-01-25 23:24:38xdegayesetmessages: + msg360714
2020-01-25 20:21:00xdegayesetnosy: + xdegaye
messages: + msg360708
2020-01-24 20:25:25serhiy.storchakasetmessages: + msg360653
2020-01-24 20:10:56miss-islingtonsetmessages: + msg360650
2020-01-24 20:10:50miss-islingtonsetnosy: + miss-islington
messages: + msg360649
2020-01-24 18:06:10ethan.furmansetmessages: + msg360637
2020-01-24 17:56:19miss-islingtonsetpull_requests: + pull_request17557
2020-01-24 17:56:12ethan.furmansetmessages: + msg360636
2020-01-24 17:56:11miss-islingtonsetpull_requests: + pull_request17556
2020-01-24 17:51:45ethan.furmansetmessages: + msg360635
2020-01-24 17:50:14brett.cannonsetmessages: + msg360634
2020-01-24 16:50:23Maciej Golsetmessages: + msg360629
2020-01-24 14:08:24serhiy.storchakasetmessages: + msg360620
2020-01-24 13:34:09Maciej Golsetmessages: + msg360619
2020-01-24 13:33:43Maciej Golsetmessages: + msg360618
2020-01-24 10:11:36xtreaksetnosy: + ethan.furman
2020-01-24 09:14:45serhiy.storchakasetnosy: + brett.cannon, ncoghlan, lars.gustaebel, eric.snow
2020-01-24 09:14:16serhiy.storchakasetversions: + Python 3.8, Python 3.9
nosy: + serhiy.storchaka

messages: + msg360598

components: + Library (Lib)
type: crash -> behavior
2020-01-24 09:06:33serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request17548
2020-01-23 11:19:33Maciej Golsetfiles: - test.py
2020-01-23 11:19:25Maciej Golsetfiles: + test.py

messages: + msg360552
2020-01-23 11:15:38Maciej Golcreate