Message 195232 - Python tracker

Author	teamnoir
Recipients	teamnoir
Date	2013-08-15.05:20:24
Message-id	<1376544025.63.0.959353089302.issue18744@psf.upfronthosting.co.za>
In-reply-to

Content

There's a problem with tarfile. To reproduce it, write a program that traverses the contents of a modest-sized tar archive, make sure the archive is compressed, and then read the archive with your program.

I'm finding that letting tarfile read a compressed archive directly costs me somewhere on the order of a 60x performance penalty compared with opening the file with gzip and then passing the gzip contents to tarfile. Programs that could take a few minutes are literally taking a few hours when using tarfile.
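
The two ways of reading described above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual program: the helper names (make_sample_archive, read_via_tarfile, read_via_gzip) and the sample data are hypothetical, and the second function is only one plausible reading of "opening the file with gzip, then passing the gzip contents to tarfile".

```python
import gzip
import io
import os
import tarfile
import tempfile


def make_sample_archive(path, n_members=5):
    """Write a small gzip-compressed tar archive (hypothetical test data)."""
    with tarfile.open(path, "w:gz") as tar:
        for i in range(n_members):
            data = f"member {i}\n".encode()
            info = tarfile.TarInfo(name=f"file{i}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_via_tarfile(path):
    """Let tarfile detect and handle the compression itself."""
    total = 0
    with tarfile.open(path) as tar:  # default mode "r" auto-detects gzip
        for member in tar:
            if member.isfile():
                total += len(tar.extractfile(member).read())
    return total


def read_via_gzip(path):
    """Open with gzip first, then hand the decompressed stream to tarfile."""
    total = 0
    with gzip.open(path, "rb") as gz:
        # "r:" tells tarfile the stream it receives is already uncompressed.
        with tarfile.open(fileobj=gz, mode="r:") as tar:
            for member in tar:
                if member.isfile():
                    total += len(tar.extractfile(member).read())
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "sample.tar.gz")
        make_sample_archive(archive)
        print(read_via_tarfile(archive), read_via_gzip(archive))
```

Timing both functions on the same real archive (for example with time.perf_counter) would be one way to check whether the gap reported here shows up on a given Python version.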

This seems stupid. The tarfile library could do the same thing I'm doing manually. In fact, I had assumed that it would, and I was surprised by the performance I was seeing, so I ran the program under the profiler and saw millions of decompression calls. It's almost as though the tarfile library is decompressing the entire archive for every member extraction.
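
A profiler run of the kind mentioned above might look something like this. It is only a sketch under stated assumptions: the archive is tiny, made-up, in-memory data, and the function names (build_archive_bytes, traverse, profile_traverse) are hypothetical. Sorting the report by call count is one way to make a large number of decompression calls stand out.

```python
import cProfile
import io
import pstats
import tarfile


def build_archive_bytes(n_members=20):
    """Return a small gzip-compressed tar archive as bytes (made-up data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for i in range(n_members):
            data = (f"payload {i}\n" * 100).encode()
            info = tarfile.TarInfo(name=f"file{i:03d}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def traverse(archive_bytes):
    """Read every regular member through tarfile's own gzip handling."""
    total = 0
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                total += len(tar.extractfile(member).read())
    return total


def profile_traverse(archive_bytes, top=15):
    """Profile traverse() and return the report text, sorted by call count."""
    profiler = cProfile.Profile()
    profiler.enable()
    traverse(archive_bytes)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("calls").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_traverse(build_archive_bytes()))
```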

Note that you can get even worse performance if you sort the member names and then extract in that order. I'm not sure whether this "should" matter, since the tar file order is sequential.
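
To illustrate the ordering point: extracting in name order can jump backwards in the archive, and a backward seek on a gzip stream can force decompression to restart from the beginning. One possible workaround, sketched below, is to re-sort the chosen members by TarInfo.offset (where each member's header starts) before extracting. Whether this helps with the slowdown reported here is an assumption and was not measured in this message; the function names and sample data are hypothetical.

```python
import io
import tarfile


def build_archive_bytes(names):
    """Return a gzip-compressed tar archive whose members appear in `names` order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in names:
            data = name.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def names_in_read_order(archive_bytes, by_offset):
    """Extract every member and return the names in the order they were read."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        # Start from name order, as in the scenario described in the report.
        members = sorted(tar.getmembers(), key=lambda m: m.name)
        if by_offset:
            # Restore archive order so reads only move forward in the stream.
            members.sort(key=lambda m: m.offset)
        order = []
        for member in members:
            tar.extractfile(member).read()
            order.append(member.name)
        return order


if __name__ == "__main__":
    data = build_archive_bytes(["c.txt", "a.txt", "b.txt"])
    print(names_in_read_order(data, by_offset=False))
    print(names_in_read_order(data, by_offset=True))
```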

History
Date	User	Action	Args
2013-08-15 05:20:25	teamnoir	set	recipients: + teamnoir
2013-08-15 05:20:25	teamnoir	set	messageid: <1376544025.63.0.959353089302.issue18744@psf.upfronthosting.co.za>
2013-08-15 05:20:25	teamnoir	link	issue18744 messages
2013-08-15 05:20:24	teamnoir	create