classification
Title: tarfile iterator without members caching
Type: resource usage Stage: resolved
Components: Library (Lib) Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: karstenw, lars.gustaebel
Priority: normal Keywords:

Created on 2010-10-31 11:19 by karstenw, last changed 2016-04-19 07:18 by lars.gustaebel. This issue is now closed.

Messages (5)
msg120041 - (view) Author: Karsten Wolf (karstenw) Date: 2010-10-31 11:19
It would be helpful to have a tarfile iterator that does not cache every archive member encountered.

Caching every member makes it nearly impossible to iterate over an archive with millions of files.
msg120042 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-10-31 11:34
I assume you're using Python 2.x, because tarfile's memory footprint was significantly reduced in Python 3.0; see the patch in issue2058 and r62337. This patch was not backported to the 2.x branch back then. As the 2.x branch has been closed for new features, it is not going to be backported in the future.
msg120043 - (view) Author: Karsten Wolf (karstenw) Date: 2010-10-31 11:58
Yes, I'm on 2.6. I checked the Python 3.x tarfile just for this one line in TarFile.next():

self.members.append(tarinfo)

to conclude it would have the same problem.

Reducing the 2.5 GB memory usage measured in my particular case by 60% still leaves 1.5 GB of RAM burned, which is too much on a 32-bit machine with 2 GB of RAM.

My solution was to comment out that line, which worked perfectly for my case but may not be the right solution for the module.
msg123835 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-12-12 12:17
There is no trivial or backwards-compatible solution to this problem. The way it is now, there is no alternative to storing all TarInfo objects: there is no central table of contents in an archive we could use, so we must create our own. In other words, tarfile does not "burn" memory without a reason.
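To illustrate the point about the self-built table of contents, here is a minimal sketch, assuming a small in-memory archive. The helpers `build_archive` and `cached_after_iteration` are hypothetical names introduced only for this example, and `tar.members` is the internal list discussed in this thread rather than a documented interface. After a full iteration, the list holds one TarInfo per archive entry, and name lookups such as `getmember()` are answered from it.

```python
import io
import tarfile


def build_archive(count):
    """Return the bytes of an uncompressed tar archive with `count` small files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(count):
            payload = b"x" * 10
            info = tarfile.TarInfo(name=f"file{i:04d}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def cached_after_iteration(data, name):
    """Iterate the whole archive, then report what the member cache holds."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        seen = sum(1 for _ in tar)       # walk every header once
        cached = len(tar.members)        # internal list of TarInfo objects
        looked_up = tar.getmember(name)  # served from the cached list
    return seen, cached, looked_up.name


if __name__ == "__main__":
    print(cached_after_iteration(build_archive(5), "file0003.txt"))
```

With millions of entries, the same list holds millions of TarInfo objects, which is the memory cost reported above.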

The problem you encountered is somewhat of a corner case, fortunately with a simple workaround:

for tarinfo in tar:
    ...
    tar.members = []
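A self-contained version of this workaround might look like the sketch below. It assumes Python 3 and an in-memory test archive; `build_archive` and `iterate_without_cache` are hypothetical helper names, and resetting `tar.members` relies on an internal attribute whose behavior could change between versions.

```python
import io
import tarfile


def build_archive(count):
    """Return the bytes of an uncompressed tar archive with `count` small files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(count):
            payload = b"x" * 10
            info = tarfile.TarInfo(name=f"file{i:04d}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iterate_without_cache(data):
    """Visit every member while discarding the cached TarInfo objects."""
    count = 0
    peak = 0  # largest size the member list reached during the loop
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        for tarinfo in tar:
            count += 1
            peak = max(peak, len(tar.members))
            # ... process tarinfo here, e.g. tar.extractfile(tarinfo) ...
            tar.members = []  # the workaround: drop what has been cached so far
    return count, peak


if __name__ == "__main__":
    count, peak = iterate_without_cache(build_archive(1000))
    print(count, peak)
```

Each member should be processed inside the loop body, before the list is reset, because it can no longer be looked up by name afterwards.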

There are two things that I will clearly refuse to do. One thing is to add yet another option to the TarFile class to switch off caching as this would make many TarFile methods dysfunctional without the user knowing why. The other thing is to add an extra non-caching Iterator class.
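The concern about dysfunctional methods can be seen with a small sketch, assuming Python 3 and the same `tar.members = []` workaround. The helper names `build_archive` and `lookup_after_clearing` are hypothetical. Once the cache has been discarded during a full iteration, methods that depend on it no longer see the archive's contents.

```python
import io
import tarfile


def build_archive(count):
    """Return the bytes of an uncompressed tar archive with `count` small files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(count):
            payload = b"x" * 10
            info = tarfile.TarInfo(name=f"file{i:04d}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def lookup_after_clearing(data, name):
    """Iterate with the cache discarded, then try the name-based methods."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        for tarinfo in tar:
            tar.members = []  # discard the cache on every step
        names = tar.getnames()  # built from the (now empty) member list
        try:
            tar.getmember(name)
            found = True
        except KeyError:
            # getmember() raises KeyError for names it cannot find
            found = False
    return names, found


if __name__ == "__main__":
    print(lookup_after_clearing(build_archive(5), "file0003.txt"))
```

The archive still contains the file, but the lookup fails without any hint that caching was the cause, which is the kind of silent breakage a caching switch would expose to users.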

Sorry that I have nothing more to offer. Maybe someone else will come up with a brilliant idea.
msg263714 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2016-04-19 07:18
Closing after six years of inactivity.
History
Date  User  Action  Args
2016-04-19 07:18:38  lars.gustaebel  set  status: open -> closed; resolution: wont fix; messages: + msg263714; stage: resolved
2010-12-12 12:17:27  lars.gustaebel  set  messages: + msg123835
2010-10-31 12:05:49  pitrou  set  type: enhancement -> resource usage; versions: + Python 3.1, Python 2.7, Python 3.2
2010-10-31 11:58:44  karstenw  set  messages: + msg120043
2010-10-31 11:34:39  lars.gustaebel  set  assignee: lars.gustaebel; messages: + msg120042; nosy: + lars.gustaebel
2010-10-31 11:19:56  karstenw  create