classification
Title: 3.2: tarfile.getmembers causes 100% cpu usage on Windows
Type: performance
Stage: resolved
Components: Library (Lib), Windows
Versions: Python 3.2, Python 3.3
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: lars.gustaebel
Nosy List: lars.gustaebel, srid
Priority: normal
Keywords: 3.2regression, patch

Created on 2011-02-16 18:12 by srid, last changed 2011-02-23 11:55 by lars.gustaebel. This issue is now closed.

Files
File name     Uploaded                          Description
tarfile.diff  lars.gustaebel, 2011-02-19 10:42
Messages (4)
msg128685 - Author: Sridhar Ratnakumar (srid) Date: 2011-02-16 18:12
tarfile.getmembers has become extremely slow on Windows. The regression was introduced in r85916, committed by Lars Gustaebel on Oct 29, 2010 to "add read support for all missing variants of the GNU sparse extensions".

To reproduce, use this "tgz" file:

  http://pypm-free.activestate.com/3.2/win32-x86/pool/a/as/as.mklruntime-1.2_win32-x86_3.2_1.pypm

It contains another tgz file called "data.tar.gz". Run `.getmembers()` on data.tar.gz.
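
The steps above can be sketched roughly as follows. This assumes the outer file has already been downloaded locally and that the inner member is stored under the exact name "data.tar.gz"; the helper name `time_inner_getmembers` and its parameters are illustrative only and not part of tarfile.

```python
import tarfile
import time


def time_inner_getmembers(outer_path, inner_name="data.tar.gz"):
    """Return (member_count, seconds) for getmembers() on the nested archive.

    Assumption: inner_name matches the member name inside the outer archive.
    """
    with tarfile.open(outer_path, "r:*") as outer:
        # extractfile() returns a file-like object that reads the member
        # directly out of the outer archive, without unpacking it to disk.
        inner_fileobj = outer.extractfile(inner_name)
        start = time.time()
        with tarfile.open(fileobj=inner_fileobj, mode="r:*") as inner:
            members = inner.getmembers()
        elapsed = time.time() - start
    return len(members), elapsed
```

On an affected build the elapsed time would be expected to be very large (or the call may appear to hang); on an unaffected build it should return quickly.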

...

This invokes tarfile._FileInFile.read(...), which seems to be the cause of the slowness (or rather a hang).
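
One way to check where the time goes, instead of setting breakpoints, is to profile the call. The sketch below is only an illustration: `profile_getmembers` is a hypothetical helper, and it assumes the archive to inspect (for example an extracted copy of data.tar.gz) is available at a local path.

```python
import cProfile
import io
import pstats
import tarfile


def profile_getmembers(path, top=10):
    """Profile TarFile.getmembers() and return the report as text."""
    profiler = cProfile.Profile()
    archive = tarfile.open(path)
    try:
        profiler.enable()
        archive.getmembers()
        profiler.disable()
    finally:
        archive.close()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()
```

If the read path is the bottleneck, the read-related functions of tarfile.py should show up near the top of the printed report.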

I had to work around this issue by monkey-patching the above `read` function to revert the change:

+if sys.version_info[:2] >= (3,2):
+    import tarfile
+    class _FileInFileNoSparse(tarfile._FileInFile):
+        # Plain read: one seek and one read, no sparse handling.
+        def read(self, size=None):
+            if size is None:
+                size = self.size - self.position
+            else:
+                size = min(size, self.size - self.position)
+            self.fileobj.seek(self.offset + self.position)
+            self.position += size
+            return self.fileobj.read(size)
+    tarfile._FileInFile = _FileInFileNoSparse
+    LOG.info('Monkey patching `tarfile.py` to disable part of r85916 (py3k)')

We caught this bug while testing ActiveState PyPM on Python 3.2:
http://bugs.activestate.com/show_bug.cgi?id=89376#c3

If you want the easiest way to reproduce this, I can send you (in private) an internal build of ActivePython-3.2 containing PyPM. Running "pypm install numpy" (with breakpoints in tarfile.py) is all that is required to reproduce.
msg128840 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2011-02-19 10:42
_FileInFile.read() does a lot of unnecessary seeking and reads the same block again and again. The attached patch fixes that. Please check whether it works for you.
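
To illustrate the idea of avoiding repeated seeks and re-reads, here is a small sketch of a reader that walks a block map once and touches each requested range only a single time. It is not the attached patch: the class `MappedReader`, the `block_map` layout and all parameter names are assumptions made up for this example.

```python
class MappedReader:
    """Hypothetical stand-in for a sparse-aware file-in-file reader."""

    def __init__(self, fileobj, block_map, size):
        # block_map: list of (is_data, start, stop, offset) tuples, sorted by
        # start and covering [0, size) without gaps. offset is the position
        # of a data block inside fileobj and is ignored for holes.
        self.fileobj = fileobj
        self.block_map = block_map
        self.size = size
        self.position = 0
        self.map_index = 0

    def read(self, size=None):
        remaining = self.size - self.position
        size = remaining if size is None else min(size, remaining)
        chunks = []
        while size > 0:
            is_data, start, stop, offset = self.block_map[self.map_index]
            if self.position >= stop:
                # Current block is exhausted; move on and never revisit it.
                self.map_index += 1
                continue
            length = min(size, stop - self.position)
            if is_data:
                # One seek and one read per block touched by this call.
                self.fileobj.seek(offset + (self.position - start))
                chunks.append(self.fileobj.read(length))
            else:
                # Holes are synthesized as NUL bytes without any I/O.
                chunks.append(b"\0" * length)
            self.position += length
            size -= length
        return b"".join(chunks)
```

The reader only moves forward, so each call resumes from the block where the previous call stopped rather than rescanning or re-reading earlier blocks.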
msg128931 - Author: Sridhar Ratnakumar (srid) Date: 2011-02-21 02:28
Lars, the attached patch fixes the issue. I'll add this to ActivePython 3.2. Thanks.
msg129178 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2011-02-23 11:55
Thanks for your great report. This is fixed now in r88528 (py3k) and r88529 (release32-maint).
History
Date                 User            Action  Args
2011-02-23 11:55:39  lars.gustaebel  set     status: open -> closed; versions: + Python 3.3; messages: + msg129178; keywords: + 3.2regression; resolution: accepted; stage: resolved
2011-02-21 02:28:06  srid            set     messages: + msg128931
2011-02-19 10:42:22  lars.gustaebel  set     files: + tarfile.diff; messages: + msg128840; keywords: + patch; assignee: lars.gustaebel
2011-02-16 18:17:04  pitrou          set     type: resource usage -> performance
2011-02-16 18:12:26  srid            create