This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: tarfile stream read performance regression
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: hajoscher, methane, miss-islington, vstinner
Priority: normal Keywords: 3.2regression, patch

Created on 2018-06-30 09:27 by hajoscher, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 8020 merged python-dev, 2018-06-30 09:36
PR 8082 merged miss-islington, 2018-07-04 08:14
PR 8083 merged miss-islington, 2018-07-04 08:15
Messages (10)
msg320763 - (view) Author: hajoscher (hajoscher) * Date: 2018-06-30 09:27
Buffer read of large files in a compressed tarfile stream performs poorly.

The buffered read in tarfile _Stream is extending a bytes object. 
It is much more efficient to use a list followed by a join. 
Using a list can mean seconds instead of minutes. 

This performance regression was introduced in b506dc32c1a. 

How to test:
# create random tarfile 50Mb
dd if=/dev/urandom of=test.bin count=50 bs=1M
tar czvf test.tgz test.bin

# read with tarfile as stream (note pipe symbol in 'r|gz')
import tarfile
tfile = tarfile.open("test.tgz", 'r|gz')
for t in tfile:
    file = tfile.extractfile(t)
    if file:
        print(len(file.read()))
msg320848 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-01 23:30
Nice catch.

I confirmed this is a hard regression of performance.
Decompressing a file must be O(n) when n=filesize, but O(n^2) now.

While we live with this regression for a long time, I feel it's worth enough to backport.
This can be DoS vulnerability.

Can you write NEWS entry for it?
msg321011 - (view) Author: hajoscher (hajoscher) * Date: 2018-07-04 05:22
Yes, it performance is really bad for large files, and memory consumption as well.  I will write something for NEWS.
msg321018 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-04 08:13
New changeset 12a08c47601cadea8e7d3808502cdbcca87b2ce2 by INADA Naoki (hajoscher) in branch 'master':
bpo-34010: Fix tarfile read performance regression (GH-8020)
https://github.com/python/cpython/commit/12a08c47601cadea8e7d3808502cdbcca87b2ce2
msg321020 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-04 08:25
thanks
msg321021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-04 08:28
https://github.com/python/cpython/pull/8020/files/77a54a39aace1a38794884218abe801b85b54e62#diff-ef64d8b610dda67977a63a9837f46349

-            buf = "".join(t)
+            buf = b"".join(t)


@hajoscher: "It never caused a problem, since this line is never called; size is never None in the function call. But still, should be fixed, I guess."

Would it be possible to have an unit test for this modified line? Untested code is broken, as you showed :-)
msg321022 - (view) Author: miss-islington (miss-islington) Date: 2018-07-04 08:32
New changeset c1b75b5fb92fda0ac5b931d7b18c1418557cb7c4 by Miss Islington (bot) in branch '3.7':
bpo-34010: Fix tarfile read performance regression (GH-8020)
https://github.com/python/cpython/commit/c1b75b5fb92fda0ac5b931d7b18c1418557cb7c4
msg321023 - (view) Author: miss-islington (miss-islington) Date: 2018-07-04 08:43
New changeset d7a0ad7dd7bd7dfbdbf6be2c89fde5a71813628a by Miss Islington (bot) in branch '3.6':
bpo-34010: Fix tarfile read performance regression (GH-8020)
https://github.com/python/cpython/commit/d7a0ad7dd7bd7dfbdbf6be2c89fde5a71813628a
msg321040 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-04 12:22
@Victor I think removing unused code is better than adding test for it and maintain it.

So I removed that unused code block in GH-8089.
msg321041 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-04 12:27
> @Victor I think removing unused code is better than adding test for it and maintain it.

Sure. I reviewed your PR 8089.
History
Date User Action Args
2022-04-11 14:59:02adminsetgithub: 78191
2018-07-04 12:27:59vstinnersetmessages: + msg321041
2018-07-04 12:22:31methanesetmessages: + msg321040
2018-07-04 08:43:52miss-islingtonsetmessages: + msg321023
2018-07-04 08:32:44miss-islingtonsetnosy: + miss-islington
messages: + msg321022
2018-07-04 08:28:58vstinnersetnosy: + vstinner
messages: + msg321021
2018-07-04 08:25:19methanesetstatus: open -> closed
resolution: fixed
messages: + msg321020

stage: patch review -> resolved
2018-07-04 08:15:29miss-islingtonsetpull_requests: + pull_request7684
2018-07-04 08:14:43miss-islingtonsetpull_requests: + pull_request7683
2018-07-04 08:13:21methanesetmessages: + msg321018
2018-07-04 07:49:42methanesetkeywords: + 3.2regression
title: tarfile stream read performance -> tarfile stream read performance regression
2018-07-04 05:22:01hajoschersetmessages: + msg321011
2018-07-03 11:25:24methanesetversions: - Python 3.4, Python 3.5
2018-07-01 23:30:23methanesetnosy: + methane
messages: + msg320848
2018-06-30 09:36:18python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request7628
2018-06-30 09:27:00hajoschercreate