This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Reading ZipFile not thread-safe
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Spencer Brown, Thomas, cuibonobo, eric.smith, kevinmehall, khaledk, malin, mdavis-xyz, miss-islington, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2020-11-16 11:07 by Thomas, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
test.py Thomas, 2020-11-16 11:07 Test case and traceback
Pull Requests
URL Status Linked Edit
PR 26974 merged kevinmehall, 2021-06-30 19:02
PR 32008 merged miss-islington, 2022-03-20 14:26
PR 32009 merged miss-islington, 2022-03-20 14:26
Messages (13)
msg381090 - (view) Author: Thomas (Thomas) * Date: 2020-11-16 11:07
According to https://docs.python.org/3.5/whatsnew/changelog.html#id108 bpo-14099, reading multiple ZipExtFiles should be thread-safe, but it is not.

I created a small example where two threads try to read files from the same ZipFile simultaneously, which crashes with a Bad CRC-32 error. This is especially surprising since all files in the ZipFile only contain 0-bytes and have the same CRC.

My use case is a ZipFile with 82000 files. Creating multiple ZipFiles from the same "physical" zip file is not a satisfactory workaround because it takes several seconds each time. Instead, I open it only once and clone it for each thread:

with zipfile.ZipFile("/tmp/dummy.zip", "w") as dummy:
    pass

def clone_zipfile(z):
    z_cloned = zipfile.ZipFile("/tmp/dummy.zip")
    z_cloned.NameToInfo = z.NameToInfo
    z_cloned.fp = open(z.fp.name, "rb")
    return z_cloned

This is a much better solution for my use case than locking. I am using multiple threads because I want to finish my task faster, but locking defeats that purpose.

However, this cloning is somewhat of a dirty hack and will break when the file is not a real file but rather a file-like object.

Unfortunately, I do not have a solution for the general case.
msg381126 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2020-11-16 17:14
I'm changing from "crash" to "behavior". We use "crash" for a segfault or equivalent. I realize that most people are unlikely to know this, but we consider "crash" to be more alarming, so I want to make sure it's correct.

Also: when this happens, is it always for file 127, or does it change on each run?
msg381139 - (view) Author: Thomas (Thomas) * Date: 2020-11-16 18:40
I have not observed any segfaults yet. Only zipfile.BadZipFile exceptions so far.

The exact file at which it crashes is fairly random. It even crashes if all threads try to read the same file multiple times.

I think the root cause of the problem is that the reads of zef_file in ZipFile.read are not locked properly.

https://github.com/python/cpython/blob/c79667ff7921444911e8a5dfa5fba89294915590/Lib/zipfile.py#L1515

The underlying file object is shared between all ZipExtFiles. Every time a thread makes a call to ZipFile.read, a new lock is created in _SharedFile, but that lock only protects against multiple threads reading the same ZipExtFile. Multiple threads reading different ZipExtFiles with the same underlying file object will cause trouble. The locks do nothing in this scenario because they are individual to each thread and not shared.
msg381205 - (view) Author: Thomas (Thomas) * Date: 2020-11-17 05:08
Scratch what I said in the previous message. I thought that the lock was created in _SharedFile and did not notice that it was passed as a parameter.
msg381212 - (view) Author: Thomas (Thomas) * Date: 2020-11-17 07:17
I have simplified the test case a bit more:

import multiprocessing.pool, zipfile

# Create a ZipFile with two files and same content
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_STORED) as z:
    z.writestr("file1", b"0"*10000)
    z.writestr("file2", b"0"*10000)

# Read file1  with two threads at once
with zipfile.ZipFile("test.zip", "r") as z:
    pool = multiprocessing.pool.ThreadPool(2)
    while True:
        pool.map(z.read, ["file1", "file1"])

Two files are sufficient to cause the error. It does not matter which files are read or which content they have.

I also narrowed down the point of failure a bit. After

self._file.seek(self._pos)

in _SharedFile.read ( https://github.com/python/cpython/blob/c79667ff7921444911e8a5dfa5fba89294915590/Lib/zipfile.py#L742 ), the following assertion should hold:

assert(self._file.tell() == self._pos)

The issue occurs when seeking to position 35 (size of header + length of name). Most of the time, self._file.tell() will then be 35 as expected, but sometimes it is 8227 instead, i.e. 35 + 8192.

I am not sure how this can happen since the file object should be locked.
msg396792 - (view) Author: Kevin Mehall (kevinmehall) * Date: 2021-06-30 19:13
I think I found the root cause of this problem and proposed a fix in https://github.com/python/cpython/pull/26974

To monkey-patch this fix on existing versions of Python, I'm using:

class PatchedSharedFile(zipfile._SharedFile):
    def __init__(self, *args):
        super().__init__(*args)
        self.tell = lambda: self._pos
zipfile._SharedFile = PatchedSharedFile
msg396809 - (view) Author: Thomas (Thomas) * Date: 2021-07-01 06:54
The monkey patch works for me! Thank you very much! (I have only tested reading, not writing).

However, the lock contention of Python's ZipFile is so bad that using multiple threads actually makes the code run _slower_ than single threaded code when reading a zip file with many small files. For this reason, I am not using ZipFile any longer. Instead, I have implemented a subset of the zip spec without locks, which gives me a speedup of over 2500 % for reading many small files compared to ZipFile.

I think that the architecture of ZipFile should be reconsidered, but this exceeds the scope of this issue.
msg398741 - (view) Author: Khaled (khaledk) Date: 2021-08-02 09:52
Hi Thomas, I'm facing the same issue. Would you care to opensource your implementation.
msg409596 - (view) Author: Thomas (Thomas) * Date: 2022-01-03 17:15
@khaledk I finally got some time off, so here you go https://github.com/99991/ParallelZipFile

I can not offer any support for a more correct implementation of the zip specification due to time constraints, but maybe the code is useful for you anyway.
msg412320 - (view) Author: Matthew Davis (mdavis-xyz) * Date: 2022-02-01 23:57
In addition to fixing any unexpected behavior, can we update the documentation [1] to state what the expected behavior is in terms of thread safety?

[1] https://docs.python.org/3/library/zipfile.html
msg415607 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-20 14:26
New changeset e730ae7effe4f13b24f1b5fb1fca005709c86acb by Kevin Mehall in branch 'main':
bpo-42369: Fix thread safety of zipfile._SharedFile.tell (GH-26974)
https://github.com/python/cpython/commit/e730ae7effe4f13b24f1b5fb1fca005709c86acb
msg415609 - (view) Author: miss-islington (miss-islington) Date: 2022-03-20 14:51
New changeset 4352ca234e979ad1c7158981addf899b119cd448 by Miss Islington (bot) in branch '3.10':
bpo-42369: Fix thread safety of zipfile._SharedFile.tell (GH-26974)
https://github.com/python/cpython/commit/4352ca234e979ad1c7158981addf899b119cd448
msg415610 - (view) Author: miss-islington (miss-islington) Date: 2022-03-20 14:54
New changeset 4aa8b802513340d12a6ffea3d5e2228ac6c7d5b8 by Miss Islington (bot) in branch '3.9':
bpo-42369: Fix thread safety of zipfile._SharedFile.tell (GH-26974)
https://github.com/python/cpython/commit/4aa8b802513340d12a6ffea3d5e2228ac6c7d5b8
History
Date User Action Args
2022-04-11 14:59:38adminsetgithub: 86535
2022-03-20 16:00:32serhiy.storchakasetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.7, Python 3.8
2022-03-20 14:54:38miss-islingtonsetmessages: + msg415610
2022-03-20 14:51:22miss-islingtonsetmessages: + msg415609
2022-03-20 14:26:52miss-islingtonsetpull_requests: + pull_request30097
2022-03-20 14:26:48miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request30096
2022-03-20 14:26:23serhiy.storchakasetmessages: + msg415607
2022-02-01 23:57:32mdavis-xyzsetnosy: + mdavis-xyz
messages: + msg412320
2022-01-03 17:15:10Thomassetmessages: + msg409596
2021-11-07 11:06:22Spencer Brownsetnosy: + Spencer Brown
2021-08-02 09:52:11khaledksetnosy: + khaledk
messages: + msg398741
2021-07-01 06:54:55Thomassetmessages: + msg396809
2021-06-30 19:13:56kevinmehallsetmessages: + msg396792
2021-06-30 19:02:57kevinmehallsetkeywords: + patch
nosy: + kevinmehall

pull_requests: + pull_request25538
stage: patch review
2021-01-25 22:47:46cuibonobosetnosy: + cuibonobo
2020-11-17 08:28:25serhiy.storchakasetnosy: + serhiy.storchaka
2020-11-17 07:17:55Thomassetmessages: + msg381212
2020-11-17 05:08:06Thomassetmessages: + msg381205
2020-11-16 18:40:07Thomassetmessages: + msg381139
2020-11-16 17:14:36eric.smithsettype: crash -> behavior

messages: + msg381126
nosy: + eric.smith
2020-11-16 11:25:09malinsetnosy: + malin
2020-11-16 11:10:08Thomassettype: crash
components: + Library (Lib)
2020-11-16 11:07:51Thomascreate