This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author hajoscher
Recipients hajoscher
Date 2018-06-30.09:27:00
Message-id <1530350820.63.0.56676864532.issue34010@psf.upfronthosting.co.za>
Content
Buffered reads of large files from a compressed tarfile stream perform poorly.

The buffered read in tarfile's _Stream repeatedly extends a bytes object, which is quadratic: each extension copies everything accumulated so far. Collecting the chunks in a list and joining them once at the end is much more efficient. For large members this can mean seconds instead of minutes.
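The difference between the two accumulation strategies can be sketched as follows (a minimal illustration of the pattern, not the actual _Stream code):

```python
def read_extend_bytes(chunks):
    # Quadratic: each += copies the entire buffer accumulated so far
    buf = b""
    for c in chunks:
        buf += c
    return buf

def read_join_list(chunks):
    # Linear: collect chunks in a list, copy once in the final join
    parts = []
    for c in chunks:
        parts.append(c)
    return b"".join(parts)
```

Both return identical bytes; only the amount of copying differs, and the gap grows with the number and size of chunks.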

This performance regression was introduced in b506dc32c1a. 

How to test:
# create a tarball containing 50 MB of random data
dd if=/dev/urandom of=test.bin count=50 bs=1M
tar czvf test.tgz test.bin

# read with tarfile as stream (note pipe symbol in 'r|gz')
import tarfile
tfile = tarfile.open("test.tgz", 'r|gz')
for t in tfile:
    file = tfile.extractfile(t)
    if file:
        print(len(file.read()))
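The shell-based reproduction above needs dd and tar; a self-contained equivalent can be sketched in pure Python (payload reduced to 1 MiB here so it runs quickly — scale it up to make the slowdown visible):

```python
import io
import os
import tarfile

# Build a small compressed tarball in memory (stand-in for the
# dd/tar commands above).
payload = os.urandom(1024 * 1024)  # 1 MiB of random data
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:gz") as tf:
    info = tarfile.TarInfo("test.bin")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

# Stream-read it back; 'r|gz' forbids seeking, so extractfile()
# goes through _Stream's buffered read path.
raw.seek(0)
with tarfile.open(fileobj=raw, mode="r|gz") as tf:
    for member in tf:
        f = tf.extractfile(member)
        if f is not None:
            data = f.read()
            print(member.name, len(data))
```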