
Author Nit
Recipients Nit
Date 2020-02-19.14:34:50
Content
When tarfile opens a tar archive containing a sparse file whose actual data is larger than 0o77777777777 bytes (~8GB), it fails to list the members that come after this file. As a result, it is impossible to access or extract files located after such a file in the archive.

A tar file presenting the issue is available at https://genua.fr/sample.tar.xz
Uncompressed, the file is ~16GB.
It contains two files:
* disk.img a 50GB sparse file containing ~16GB of data
* README.txt a simple text file containing "This last file is not properly listed"

disk.img was generated using the following Python script:

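GB = 1024**3
LIMIT = 0o77777777777  # size threshold discussed in this report
buf = b"\xFF" * 1024**2  # 1 MiB of non-zero data
with open('disk.img', 'wb') as f:
    f.seek(10 * GB)  # leave a 10GB hole before the data
    written = 0
    while written < LIMIT:
        written += f.write(buf)
        f.flush()
        print(written / LIMIT * 100, '%')
    # extend the file to 50GB, leaving a trailing hole
    f.seek(50 * GB - 1)
    f.write(b'\0')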

sample.tar was generated using GNU tar 1.30 on Debian 10 with the following command:

tar --format pax -cvSf sample.tar disk.img README.txt

The following script exposes the issue:

import tarfile
t = tarfile.open('sample.tar')
print('members', t.getmembers())
print('offset', t.offset)

Its output is:

members [<TarInfo 'disk.img' at 0x7f5b14242b38>]
offset 17179806208

members should also list README.txt.


I think I have found the root cause of the bug: because the file is bigger than 0o77777777777, its size cannot be specified inside the tar ustar header, so a "size" pax extended header is generated. This header contains the full size of the file block in the tar.
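To make the limit concrete, here is a minimal sketch of the check. It assumes the ustar size field holds 11 octal digits, which is where 0o77777777777 comes from. The names USTAR_SIZE_DIGITS, USTAR_MAX_SIZE and needs_pax_size are hypothetical and are not part of tarfile.

```python
# Assumption: the ustar "size" field holds 11 octal digits.
USTAR_SIZE_DIGITS = 11
USTAR_MAX_SIZE = 8 ** USTAR_SIZE_DIGITS - 1  # == 0o77777777777


def needs_pax_size(stored_size: int) -> bool:
    """Return True when the size does not fit in 11 octal digits."""
    return stored_size > USTAR_MAX_SIZE


if __name__ == "__main__":
    print(oct(USTAR_MAX_SIZE))
    print(needs_pax_size(8 * 1024**3))
```

Any member whose stored block exceeds this value needs the "size" pax header, whether or not it is sparse.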

Because the file is sparse and stored using sparse format 1.0, the file block contains the sparse mapping first, followed by the file data. The size of this block is therefore the size of the mapping plus the size of the data.
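The layout can be sketched roughly as follows. This assumes my understanding of the sparse 1.0 map: newline-terminated decimal numbers (an entry count, then offset/size pairs), padded to a whole 512-byte block. parse_sparse_map_1_0 is a hypothetical helper written for illustration, not a tarfile function.

```python
BLOCKSIZE = 512  # tar block size


def parse_sparse_map_1_0(data: bytes):
    """Parse a sparse 1.0 map found at the start of a member's data block.

    Returns (entries, map_length): entries is a list of (offset, size)
    pairs and map_length is the map size rounded up to a whole block.
    """
    pos = 0

    def read_number() -> int:
        # Each number is decimal text terminated by a newline.
        nonlocal pos
        end = data.index(b"\n", pos)
        value = int(data[pos:end])
        pos = end + 1
        return value

    count = read_number()
    entries = [(read_number(), read_number()) for _ in range(count)]
    map_length = -(-pos // BLOCKSIZE) * BLOCKSIZE  # round up to a block
    return entries, map_length


if __name__ == "__main__":
    # Two data regions: 512 bytes at offset 0 and 1024 bytes at offset 4096.
    block = b"2\n0\n512\n4096\n1024\n".ljust(BLOCKSIZE, b"\0") + b"x" * 1536
    entries, map_length = parse_sparse_map_1_0(block)
    stored_size = map_length + sum(size for _, size in entries)
    print(entries, map_length, stored_size)
```

In this toy example the stored block size is the map length plus the data length, which is the value the "size" pax header is expected to carry.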

Because the file is sparse, a GNU.sparse.realsize header is also added, containing the full expanded file size (here 50GB).

At https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2fbbbb5253340801bb92dc7/Lib/tarfile.py#L1350 tarfile sets the tarinfo size to GNU.sparse.realsize (50GB). Then, in this block https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2fbbbb5253340801bb92dc7/Lib/tarfile.py#L1297, the file offset is moved forward by GNU.sparse.realsize (50GB) instead of pax_headers["size"]. Moreover, the move starts from next.offset_data, which is set at https://github.com/python/cpython/blob/master/Lib/tarfile.py#L1338 to the position just after the sparse mapping.
For a sparse file, the move forward should instead start from next.offset + BLOCKSIZE.
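The difference between the two computations can be sketched with toy numbers. This only illustrates the proposal above and is not the actual tarfile code: header_offset stands for the offset of the member's own header block, and all function names are hypothetical.

```python
BLOCKSIZE = 512  # tar block size


def round_up_to_block(n: int) -> int:
    """Round n up to a multiple of BLOCKSIZE."""
    return -(-n // BLOCKSIZE) * BLOCKSIZE


def proposed_next_offset(header_offset: int, stored_size: int) -> int:
    # Skip the header block, then the stored block (mapping + data).
    return header_offset + BLOCKSIZE + round_up_to_block(stored_size)


def observed_next_offset(offset_data: int, realsize: int) -> int:
    # Behaviour described in this report: start after the mapping and
    # advance by the expanded size instead of the stored size.
    return offset_data + round_up_to_block(realsize)


if __name__ == "__main__":
    header_offset = 1024            # toy value
    map_length, data_length = 512, 1536
    stored_size = map_length + data_length
    realsize = 5120                 # expanded size, holes included
    offset_data = header_offset + BLOCKSIZE + map_length
    print(proposed_next_offset(header_offset, stored_size))
    print(observed_next_offset(offset_data, realsize))
```

With these toy values the second computation lands well past the next header, which is consistent with the members after the sparse file being skipped.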