classification
Title: tarfile not re-entrant for multi-threading
Type: behavior Stage:
Components: Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, lars.gustaebel, r.david.murray, sgnn7, xdegaye
Priority: normal Keywords: patch

Created on 2015-03-12 16:40 by sgnn7, last changed 2019-06-10 14:55 by vstinner.

Files
File name Uploaded Description Edit
mutithreading_tarfile.patch sgnn7, 2015-03-12 21:59 makdirs_patch review
extract_from_packages.py xdegaye, 2018-04-07 10:01
Messages (11)
msg237960 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 16:40
When running tarfile.extract through multiple threads, the archive reading pointer is not protected from simultaneous seeks and causes various convoluted bugs:

  <some code>
    self.archive_object.extract(member, extraction_path)
  File "/usr/lib/python3.4/tarfile.py", line 2019, in extract
    set_attrs=set_attrs)
  File "/usr/lib/python3.4/tarfile.py", line 2088, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/usr/lib/python3.4/tarfile.py", line 2127, in makefile
    source.seek(tarinfo.offset_data)
  File "/usr/lib/python3.4/gzip.py", line 573, in seek
    self.read(1024)
  File "/usr/lib/python3.4/gzip.py", line 365, in read
    if not self._read(readsize):
  File "/usr/lib/python3.4/gzip.py", line 449, in _read
    self._read_eof()
  File "/usr/lib/python3.4/gzip.py", line 485, in _read_eof
    hex(self.crc)))
OSError: CRC check failed 0x1036a2e1 != 0x0
msg237961 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 16:45
Also, extract_member in tarfile.py is not thread-safe since the check for folder existence might occur during another thread's creation of that same dir causing the code to error out.

  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./xdelta3-dir-patcher", line 499, in _apply_file_delta
    archive_object.expand(patch_file, staging_dir)
  File "./xdelta3-dir-patcher", line 284, in expand
    self.archive_object.extract(member, extraction_path)
  File "/usr/lib/python3.4/tarfile.py", line 2019, in extract
    set_attrs=set_attrs)
  File "/usr/lib/python3.4/tarfile.py", line 2080, in _extract_member
    os.makedirs(upperdirs)
  File "/usr/lib/python3.4/os.py", line 237, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/tmp/XDelta3DirPatcher_is0y4_5f/xdelta/updated folder'

Code causing problems:
2065     def _extract_member(self, tarinfo, targetpath, set_attrs=True):
...
2075         # Create all upper directories.
2076         upperdirs = os.path.dirname(targetpath)
2077         if upperdirs and not os.path.exists(upperdirs):
...
2080             os.makedirs(upperdirs)  # Fails since the dir might be already created between lines 2077 and 2080
msg237964 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 16:52
The code around tarfile multi-threading was fixed for me on the user-side with threading.Lock() usage so it might work to use this within the library and the directory creation could be improved by probably doing a try/except around the makedirs() call with ignoring of the exception if it's FileExistsError - my code I use elsewhere fixes this with:
    def _safe_makedirs(self, dir_path):
        try:
            makedirs(dir_path)
        # Concurrency problems need to be handled. If two threads create
        # the same dir, there might be a race between them checking and
        # doing makedirs so we handle that as gracefully as possible here.
        except FileExistsError as fee:
            if not os.path.isdir(dir_path):
                raise fee 

If I get time, I'll submit a patch but it seems like I probably won't for this.
msg237968 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-03-12 17:51
If you want to use an object that has state in more than one thread you generally have to put some locking around it.  Unless I'm missing something (which I might be) I don't think it is tarfile's responsibility to do this.
msg237970 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 17:57
I don't know if that's true of core libraries. Why complicate things for end users when those issues could be done in the library itself and be completely transparent to the devs? A simple RLock latch wouldn't pose almost any speed degradation but would work in both threaded and non-threaded situations as expected.
msg237986 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 21:05
After some thinking, for the makedirs it should only need makedirs(exist_ok=True)
msg237989 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 21:59
Patch for the multithreaded expansion of files and use of makedirs.
msg237990 - (view) Author: Srdjan Grubor (sgnn7) * Date: 2015-03-12 22:03
The whole lib still needs the threading locks added but the patch submitted should fix things for people that do the locking from their code.
msg238181 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-03-16 07:08
I agree with David that there is no need for tarfile to be thread-safe. There is nothing to be gained from distributing one TarFile object among multiple threads because it operates on a single resource which has to be accessed sequentially anyway. So, it seems best to me if we leave it like it is and let the user add locks around it as she/he sees fit.
msg238183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-16 08:39
Lars Gustäbel added the comment:
> I agree with David that there is no need for tarfile to be thread-safe. There is nothing to be gained from distributing one TarFile object among multiple threads because it operates on a single resource which has to be accessed sequentially anyway. So, it seems best to me if we leave it like it is and let the user add locks around it as she/he sees fit.

In asyncio, it was a design choice to not be thread-safe, to allow
more optimizations and support multiple implementations of asyncio,
without this important constraint.

I modified recently the asyncio doc to warn users in each class that
asyncio objects are *not* thread safe, with an explanation how to use
correctly asyncio with threads.

https://docs.python.org/dev/library/asyncio-eventloop.html#asyncio.BaseEventLoop
"This class is not thread safe."

Such change in tarfile doc is probably enough for tarfile.
msg315067 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2018-04-07 10:01
extract_from_pkgs() in the attached extract_from_packages.py script extracts /etc files from the tar files in PKG_DIR into WORK_DIR using a ThreadPoolExecutor (a ThreadPoolExecutor, when used to extract all the /etc files from the packages that build a whole ArchLinux system, divides the elapsed time by 2). Running this script that tests this function fails randomly with the same error as reported by Srdjan in msg237961.

Replacing ThreadPoolExecutor with ProcessPoolExecutor also fails randomly.

Using the safe_makedirs() context manager to enclose the statements than run ThreadPoolExecutor fixes the problem.

Obviously this in not a problem related to thread-safety (it occurs also with ProcessPoolExecutor) but a problem about the robustness of the tarfile module in a concurrent access context. The problem is insidious in that it may never occur in an application test suite.
History
Date User Action Args
2019-06-10 14:55:43vstinnersetnosy: - vstinner
2019-06-09 22:36:40Jeffrey.Kintschersetnosy: + Jeffrey.Kintscher
2018-04-07 10:01:02xdegayesetfiles: + extract_from_packages.py
nosy: + xdegaye
messages: + msg315067

2015-03-16 08:39:52vstinnersetmessages: + msg238183
2015-03-16 07:08:01lars.gustaebelsetmessages: + msg238181
2015-03-12 22:03:17sgnn7setmessages: + msg237990
2015-03-12 21:59:53sgnn7setfiles: + mutithreading_tarfile.patch
keywords: + patch
messages: + msg237989
2015-03-12 21:05:30sgnn7setmessages: + msg237986
2015-03-12 17:57:03sgnn7setmessages: + msg237970
versions: - Python 3.5
2015-03-12 17:51:22r.david.murraysetnosy: + r.david.murray
messages: + msg237968
2015-03-12 17:12:12vstinnersetnosy: + lars.gustaebel, vstinner

versions: + Python 3.5
2015-03-12 16:52:03sgnn7setmessages: + msg237964
2015-03-12 16:46:14sgnn7settype: enhancement -> behavior
2015-03-12 16:46:07sgnn7settype: behavior -> enhancement
2015-03-12 16:45:57sgnn7setmessages: + msg237961
2015-03-12 16:40:01sgnn7create