Issue 18744: doc: pathological performance using tarfile


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62944

classification

Title:	doc: pathological performance using tarfile
Type:	performance	Stage:	needs patch
Components:	Documentation	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	docs@python, lars.gustaebel, nadeem.vawda, r.david.murray, teamnoir
Priority:	normal	Keywords:	easy

Created on 2013-08-15 05:20 by teamnoir, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
tarproblem.py	teamnoir, 2013-08-15 19:23	a script that demonstrates the pathological behavior

Messages (7)
msg195232 - (view)	Author: K Richard Pixley (teamnoir)	Date: 2013-08-15 05:20
There's a problem with tarfile. Write a program to traverse the contents of a modest sized tar archive. Make sure your tar archive is compressed. Then read the tar archive with your program.

I'm finding that allowing tarfile to read a compressed archive costs me somewhere on the order of a 60x performance penalty by comparison to opening the file with gzip, then passing the gzip contents to tarfile. Programs that could take a few minutes are literally taking a few hours when using tarfile.

This seems stupid. The tarfile library could do the same thing I'm doing manually, in fact, I had assumed that it would and was surprised by the performance I was seeing, so I ran with the profiler and saw millions of decompression calls. It's almost as though the tarfile library is decompressing the entire archive for every member extraction.

Note, you can get even worse performance if you sort the member names and then extract in that order. I'm not sure whether this "should" matter since the tar file order is sequential.
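For readers following along, the manual workaround described in this message might look roughly like the sketch below. It is not the attached tarproblem.py, which is not reproduced on this page; the function name `read_members_via_memory` and its arguments are hypothetical, and the sketch assumes the uncompressed archive fits in memory.

```python
import gzip
import io
import tarfile


def read_members_via_memory(path, names):
    """Return {name: bytes} for the requested members, in the order given.

    The archive is decompressed once into memory, so later seeks (even
    backward ones) happen in a BytesIO buffer instead of triggering
    repeated gzip decompression. Assumes the uncompressed data fits in RAM.
    """
    with gzip.open(path, "rb") as compressed:
        raw = compressed.read()  # single decompression pass over the archive

    results = {}
    # mode="r:" means "read, no compression": the buffer is already plain tar.
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tar:
        for name in names:
            member = tar.getmember(name)
            extracted = tar.extractfile(member)
            if extracted is not None:  # None for directories and similar entries
                results[name] = extracted.read()
    return results
```

Calling it with a sorted list of names, for example `read_members_via_memory("archive.tar.gz", sorted(names))`, exercises the access pattern that the report describes as slowest when tarfile reads the compressed file directly.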
msg195235 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-08-15 07:12
Could you please provide a simple script which shows the problem?
msg195277 - (view)	Author: K Richard Pixley (teamnoir)	Date: 2013-08-15 19:22
New info... I see the degradation on most of the linux boxes I've tried:
* ubuntu-13.04, (raring), 64-bit
* rhel-5.4 64-bit
* rhel-5.7 64-bit
* suse-11 64-bit

I see some degradation on MacOsX-10.8.4 but it's in the acceptable range, more like 2x than 60x. That is still suspicious, but not as problematic.
msg195278 - (view)	Author: K Richard Pixley (teamnoir)	Date: 2013-08-15 19:23
Here's a script that tests for the problem.
msg195418 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-08-16 20:59
Thank you for the script, Richard. If you are talking about performance degradation when extracting a tarfile in a changed order, this behavior is expected. When you read a gzip file in random order, you need to seek in it. A gzip file is a one-way road. To seek in a gzip file you need to decompress all the data between your current position (or the start of the file) and the target position. With a random order you need to decompress, on average, 1/3 of the tarfile for every extracted file. The tarfile module can't do anything about this. It can't first extract all the files into memory, because the uncompressed archive can be too big. It can't re-sort the list of extracted files into natural order, because that can change the semantics (a tarfile can contain duplicates and symlinks). Just don't do this. Don't extract a large number of files from a compressed tarfile in a changed order.
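The 1/3 figure can be checked with a small model. The sketch below is only an illustration under simplifying assumptions that are not stated in the thread: the current and target positions are independent and evenly spread over the archive, a forward seek costs the distance travelled, a backward seek costs the full distance from the start of the file, and the size of the member being read is ignored. The name `mean_seek_cost` is hypothetical.

```python
def mean_seek_cost(n):
    """Average fraction of the archive decompressed per access.

    Positions are the n evenly spaced offsets k/n for k in 0..n-1.
    Every (current, target) pair is weighted equally.
    """
    total = 0.0
    for current in range(n):
        for target in range(n):
            if target >= current:
                # Forward seek: keep decompressing from where we are.
                total += (target - current) / n
            else:
                # Backward seek: rewind to the start, decompress up to target.
                total += target / n
    return total / (n * n)


print(mean_seek_cost(200))  # close to 1/3 for large n
```

Under this model the average works out to (n - 1)(2n - 1) / (6n^2), which tends to 1/3 as n grows.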
msg195424 - (view)	Author: K Richard Pixley (teamnoir)	Date: 2013-08-16 21:37
I see your point. The alternative would be to limit the size of archive that can be extracted from to the size of virtual memory, which is essentially what I'm doing manually.

Either way, someone will be surprised. I'm not sure which way will result in the least surprise, since I suspect that far more people will be extracting from compressed archives than will be extracting very large archives. The failure mode with limited file size seems much less frequent but also much more annoying. In comparison, the failure when reading compressed archives (and the pathological case is effectively a failure) seems much more common to me, although, granted, not a total failure.

I think this should be mentioned in the doc because I, at least, was extremely surprised by this behavior and it cost me some time to track it down. I might suggest something along the lines of:

Be careful when working with compressed archives. In order to support the largest file sizes possible, some approaches may result in pathological behavior causing the original archive to be decompressed, in full, many times. You should be able to avoid this behavior if you traverse the TarInfo items in file order. You might also consider decompressing the archive first, in memory, and then handing the memory copy to tarfile for processing.
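As an illustration of the "traverse the TarInfo items in file order" advice, here is a minimal sketch. The helper name `iter_members_in_file_order` is hypothetical. It uses tarfile's stream mode `"r|gz"`, which only permits forward reads, so an accidental out-of-order access raises an error instead of silently re-decompressing the archive.

```python
import tarfile


def iter_members_in_file_order(path):
    """Yield (name, data) for each regular file, in archive order.

    Mode "r|gz" treats the archive as a forward-only gzip stream, so the
    compressed data is decompressed in a single pass.
    """
    with tarfile.open(path, mode="r|gz") as tar:
        for member in tar:  # iteration follows the order stored in the archive
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            extracted = tar.extractfile(member)
            yield member.name, extracted.read()
```

Each member has to be read before moving on to the next one. If the members are needed in some other order, collect the results first (for example into a dict) and reorder them afterwards, memory permitting.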
msg195428 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-08-16 21:55
I think in most cases people extract archives in natural order and don't hit this failure. But adding a warning looks reasonable.

History
Date	User	Action	Args
2022-04-11 14:57:49	admin	set	github: 62944
2021-04-22 23:29:02	iritkatriel	set	keywords: + easy title: pathological performance using tarfile -> doc: pathological performance using tarfile versions: + Python 3.11, - Python 2.7, Python 3.3, Python 3.4
2013-09-13 04:27:34	serhiy.storchaka	set	nosy: - serhiy.storchaka
2013-08-16 21:55:44	serhiy.storchaka	set	assignee: docs@python components: + Documentation, - Library (Lib) versions: + Python 3.3, Python 3.4 nosy: + docs@python messages: + msg195428 stage: needs patch
2013-08-16 21:37:04	teamnoir	set	status: pending -> open messages: + msg195424
2013-08-16 20:59:35	serhiy.storchaka	set	status: open -> pending nosy: + nadeem.vawda, r.david.murray messages: + msg195418
2013-08-15 19:23:13	teamnoir	set	files: + tarproblem.py messages: + msg195278
2013-08-15 19:22:13	teamnoir	set	messages: + msg195277
2013-08-15 07:12:23	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg195235
2013-08-15 05:36:06	ned.deily	set	nosy: + lars.gustaebel
2013-08-15 05:20:25	teamnoir	create