
classification
Title: tarinfo has problems with longlinks
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: do3cc
Priority: normal Keywords:

Created on 2009-09-28 09:21 by do3cc, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (2)
msg93198 - Author: Patrick Gerken (do3cc) Date: 2009-09-28 09:21
Sadly, I have not been able to debug this far enough to provide a thorough
test case. I can provide information on how to reproduce the problem on
request. I have a tar file and a diff to tarfile.py with some pdb
breakpoints that only get activated in the middle of the file, just
before the problematic data.

Installing an egg fails, and setuptools eats the original error.
The original error is this:
ValueError: 'invalid literal for int(): \xcf\xcf\xdf\xfc\xe9\xcd\xa9\xa9'
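A minimal sketch of how an error like this can arise, assuming the numeric fields of a tar header are stored as octal text and converted with int(). The helper name parse_octal_field is hypothetical and is not part of tarfile.py:

```python
def parse_octal_field(field):
    """Convert one numeric tar header field (octal text) to an int.

    Hypothetical helper for illustration; it is not tarfile's own code.
    """
    # Header fields are padded with NUL bytes or spaces.
    text = field.decode("latin-1").rstrip("\0 ")
    return int(text, 8)


# A plausible mode field from a real header parses fine.
print(parse_octal_field(b"0000644\0"))

# The eight bytes from the error message are not octal digits,
# so the conversion raises ValueError.
try:
    parse_octal_field(b"\xcf\xcf\xdf\xfc\xe9\xcd\xa9\xa9")
except ValueError as exc:
    print("not a header field:", exc)
```

If the bytes being parsed come from file content instead of a header block, a failure of this kind is what one would expect.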

That happens in the call to next() in the class TarFile. There, a chunk
of file data is read and handed to TarInfo for parsing. But the chunk of
data is actually the beginning of an image in the tar file.
Here is a more thorough report of my pdb findings:

Environment:
I created an egg on Linux, which resulted in a tar.gz file. Installing
that egg fails, because the tarfile library has problems reading the tar
file. tar itself can extract the full file without problems.
I have a self-compiled Python 2.4.6.

The last file that TarFile.next apparently reads correctly is a
LONGLINK, tarinfo.type == 'L'.
This type has a callback method in TarInfo.TYPE_METH, which is used to
return the real TarInfo object. That leads into proc_gnulong in
tarfile.py.
This proc_gnulong method calls next again, to get the real file info, I
think.
The next buffer that is read contains a file name that is exactly
100 characters long and seems to be a directory, because it has a
trailing slash, but its file type is '0'.
I suspected the error here, and to cross-check, I looked at the output of
"tar -tf" on the tar file. I expect tar to list the file names in the
same order as Python reads them. Just before the directory that next
seems to find, the listing shows its parent directory, and the previous
tarinfo object is exactly about this parent directory. So it looks like
we actually have a directory entry here.
Enough wild guesses, on to more observations: the next call of
TarFile.next() creates a TarInfo object again. At about line 693, the
code checks whether the file is a regular file whose name ends with a
slash. If so, it changes the file type from '0', regular file, to '5',
DIRTYPE. It actually does that with our TarInfo object.
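The rule described here can be sketched as follows. This is an illustration only, assuming the header's name field holds at most 100 characters and that the header following a LONGLINK entry carries the truncated name; stored_name and effective_type are hypothetical names, not functions from tarfile.py:

```python
NAME_FIELD_LEN = 100  # assumed width of the name field in the header
REGTYPE = "0"         # regular file
DIRTYPE = "5"         # directory


def stored_name(path):
    """Return the part of the path that fits into the name field."""
    return path[:NAME_FIELD_LEN]


def effective_type(name, typeflag):
    """Apply the 'regular file ending in a slash is a directory' rule."""
    if typeflag == REGTYPE and name.endswith("/"):
        return DIRTYPE
    return typeflag


# A file whose directory part, including the slash, is exactly 100 characters.
path = "d" * 99 + "/image.gif"
name = stored_name(path)
print(len(name), name.endswith("/"))
print(effective_type(name, REGTYPE))               # treated as a directory
print(effective_type("short/image.gif", REGTYPE))  # stays a regular file
```

Under these assumptions the truncated name ends in a slash, so a regular file with data is reclassified as a directory.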

The TarInfo object is created successfully and the next method continues
to run. Then, around line 1650, there is a check:
if tarinfo.isreg() or tarinfo.type not in SUPPORTED_TYPES:...
Here the offset for reading the next TarInfo buffer is increased by the
size of the file's data in the tar file. But not for our TarInfo
object, because it is no longer a regular file. If I advance the offset
manually, everything continues to work. But I won't do it this time.
The call to next finishes, and after a while TarFile.next() is called
again. This time, the chunk of data that next reads is actual file
content. It starts with 'GIF89a...', which makes sense, since the
directory contains images. This is where parsing of the tar file fails.
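A rough model of the offset handling described above, assuming 512-byte blocks and file data padded to a whole number of blocks. The names data_span and next_header_offset are hypothetical, and SUPPORTED_TYPES here is a small illustrative subset, not the real constant from tarfile.py:

```python
BLOCKSIZE = 512
REGTYPE, DIRTYPE = "0", "5"
SUPPORTED_TYPES = {REGTYPE, DIRTYPE}  # illustrative subset only


def data_span(size):
    """Bytes occupied by 'size' bytes of data, rounded up to full blocks."""
    blocks, remainder = divmod(size, BLOCKSIZE)
    if remainder:
        blocks += 1
    return blocks * BLOCKSIZE


def next_header_offset(data_offset, size, typeflag):
    """Where the next header is expected, mirroring the check quoted above."""
    if typeflag == REGTYPE or typeflag not in SUPPORTED_TYPES:
        return data_offset + data_span(size)
    # For other supported types, such as directories, no data is skipped.
    return data_offset


# The same 700-byte member, once as a regular file and once after
# being reclassified as a directory.
print(next_header_offset(1024, 700, REGTYPE))
print(next_header_offset(1024, 700, DIRTYPE))
```

In this model the reclassified member leaves the offset pointing at its own data, so the next read picks up file content where a header is expected.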
msg93201 - Author: Patrick Gerken (do3cc) Date: 2009-09-28 09:49
Doh, I only searched for open bugs, not for closed ones.
This ticket is a duplicate of http://bugs.python.org/issue1471427,
which is fixed in Python 2.5.
If somebody has similar problems, here is a quick fix:
I was finally able to reproduce the issue. It only happens when the path
without the file name, but including the trailing slash, is exactly 100
characters long.
Then, because of the trailing slash, tarfile treats the entry as a
directory, and if the file itself was not empty, the next read cannot be
parsed as a tar header. Since I am bound to 2.4, I will rename the
directories.
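For anyone who wants to find the affected paths before renaming, here is a small sketch that flags member names matching the condition described above. It assumes a Python version whose tarfile reads the archive correctly (2.5 or later, per the duplicate ticket); risky_names and risky_members are hypothetical helper names:

```python
import posixpath
import tarfile

NAME_FIELD_LEN = 100  # assumed width of the name field in the header


def risky_names(names):
    """Return names whose directory part plus '/' is exactly 100 characters."""
    hits = []
    for name in names:
        prefix = posixpath.dirname(name) + "/"
        if len(prefix) == NAME_FIELD_LEN and not name.endswith("/"):
            hits.append(name)
    return hits


def risky_members(archive_path):
    """Open a tar archive and list the member names that match."""
    with tarfile.open(archive_path) as tar:
        return risky_names(tar.getnames())


# Example with plain names; only the first one matches.
deep = "d" * 99
print(risky_names([deep + "/image.gif", "short/image.gif"]))
```

Renaming any directory reported this way, so that its path is shorter or longer than 100 characters, should avoid the condition.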
History
Date                 User   Action  Args
2022-04-11 14:56:53  admin  set     github: 51260
2009-09-28 09:49:18  do3cc  set     status: open -> closed; messages: + msg93201
2009-09-28 09:21:22  do3cc  create