classification
Title: tarfile: GNU sparse 1.0 pax tar header offset not properly computed
Type: behavior
Stage: patch review
Components: Library (Lib)
Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Nit
Priority: normal
Keywords: patch

Created on 2020-02-19 14:34 by Nit, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked
PR 18562 open Nit, 2020-02-19 16:28
Messages (3)
msg362275 - Author: Valentin Samir (Nit) Date: 2020-02-19 14:34
When tarfile opens a tar archive containing a sparse file whose actual data is larger than 0o77777777777 bytes (~8GB), it fails to list the members that follow this file. As a result, it is impossible to access or extract files located after such a file inside the archive.

A tar file exhibiting the issue is available at https://genua.fr/sample.tar.xz
Uncompressed, the file is ~16GB.
It contains two files:
* disk.img a 50GB sparse file containing ~16GB of data
* README.txt a simple text file containing "This last file is not properly listed"

disk.img was generated using the following Python script:

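GB = 1024**3
LIMIT = 0o77777777777  # largest size a ustar header can hold (8 GiB - 1)
buf = b"\xFF" * 1024**2  # 1 MiB of non-zero data
with open('disk.img', 'wb') as f:
    f.seek(10 * GB)  # leave a 10 GB hole at the start
    written = 0
    while written < LIMIT:
        written += f.write(buf)
        f.flush()
        print(written / LIMIT * 100, '%')
    # extend the file to 50 GB with a trailing hole
    f.seek(50 * GB - 1)
    f.write(b'\0')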

sample.tar was generated using GNU tar 1.30 on Debian 10 with the following command:

tar --format pax -cvSf sample.tar disk.img README.txt

The following script exposes the issue:

import tarfile
t = tarfile.open('sample.tar')
print('members', t.getmembers())
print('offset', t.offset)

Its output is:

members [<TarInfo 'disk.img' at 0x7f5b14242b38>]
offset 17179806208

members should also list README.txt.


I think I have found the root cause of the bug: because the file is bigger than 0o77777777777 bytes, its size cannot be specified inside the tar ustar header, so a "size" pax extended header is generated. This header contains the full size of the file block in the tar.
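
To make the limit concrete, here is a minimal sketch of the arithmetic. It assumes the ustar size field holds 11 octal digits; the names MAX_USTAR_SIZE and needs_pax_size are made up for illustration and are not part of tarfile.

```python
# Assumption: the ustar size field holds 11 octal digits.
USTAR_SIZE_DIGITS = 11
MAX_USTAR_SIZE = 8 ** USTAR_SIZE_DIGITS - 1  # 0o77777777777


def needs_pax_size(size):
    """Return True when size does not fit in the ustar octal size field.

    Hypothetical helper, for illustration only.
    """
    return size > MAX_USTAR_SIZE


print(oct(MAX_USTAR_SIZE))               # 0o77777777777
print(MAX_USTAR_SIZE)                    # 8589934591, i.e. 8 GiB - 1
print(needs_pax_size(8 * 1024**3))       # True: 8 GiB no longer fits
```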

Because the file is sparse and stored in sparse format 1.0, the file block first contains a sparse mapping, then the file data. So this block size is the size of the mapping plus the size of the data.
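
As a rough illustration of that layout, the sketch below reads a sparse 1.0 style mapping from an in-memory block. It assumes the mapping is a newline-separated list of decimal numbers (an entry count, then offset/length pairs) padded to a 512-byte boundary. The function name read_sparse_map_10 and the sample values are hypothetical, and this is not the tarfile implementation.

```python
import io

BLOCKSIZE = 512  # tar block size


def read_sparse_map_10(fileobj):
    """Read a sparse 1.0 style map (hypothetical helper).

    Returns (entries, map_size), where entries is a list of
    (offset, numbytes) pairs and map_size is the number of bytes
    the mapping occupies before the actual file data starts.
    """
    start = fileobj.tell()
    buf = fileobj.read(BLOCKSIZE)
    count, buf = buf.split(b"\n", 1)
    numbers = []
    while len(numbers) < int(count) * 2:
        if b"\n" not in buf:
            chunk = fileobj.read(BLOCKSIZE)
            if not chunk:
                raise ValueError("truncated sparse map")
            buf += chunk
        number, buf = buf.split(b"\n", 1)
        numbers.append(int(number))
    entries = list(zip(numbers[::2], numbers[1::2]))
    return entries, fileobj.tell() - start


# Made-up map with two data regions, padded to one block.
raw_map = b"2\n0\n512\n4096\n1024\n".ljust(BLOCKSIZE, b"\0")
data = b"\xff" * (512 + 1024)
entries, map_size = read_sparse_map_10(io.BytesIO(raw_map + data))

# The stored block size is the mapping plus the data.
block_size = map_size + sum(numbytes for _, numbytes in entries)
print(entries, map_size, block_size)
```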

Because the file is sparse, a GNU.sparse.realsize header is also added containing the full expanded file size (here 50GB).

Here https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2fbbbb5253340801bb92dc7/Lib/tarfile.py#L1350 tarfile sets the tarinfo size to GNU.sparse.realsize (50GB). Then, in this block https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2fbbbb5253340801bb92dc7/Lib/tarfile.py#L1297 the file offset is moved forward by GNU.sparse.realsize (50GB) instead of pax_headers["size"]. Moreover, the move is made from next.offset_data, which is set at https://github.com/python/cpython/blob/master/Lib/tarfile.py#L1338 to the position just after the sparse mapping.
The move forward in the sparse file should be made from next.offset + BLOCKSIZE.
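
The difference between the two computations can be sketched with small made-up numbers. This only illustrates the description above and is not the tarfile code: header_offset stands for the offset of the member's own header block, stored_size for the pax "size" value, and real_size for GNU.sparse.realsize. All helper names are hypothetical.

```python
BLOCKSIZE = 512  # tar block size


def round_up_to_block(size):
    """Round size up to a whole number of tar blocks."""
    blocks, remainder = divmod(size, BLOCKSIZE)
    return (blocks + (1 if remainder else 0)) * BLOCKSIZE


def next_header_offset(header_offset, stored_size):
    """Offset of the next header: skip this header block, then the stored block."""
    return header_offset + BLOCKSIZE + round_up_to_block(stored_size)


# Made-up member: header at 1024, one block of sparse map, 1536 bytes of data.
header_offset = 1024
map_size = 512
stored_size = map_size + 1536   # what the pax "size" header would hold
real_size = 5120                # expanded size (GNU.sparse.realsize)

# Computation suggested in this report.
expected = next_header_offset(header_offset, stored_size)

# Computation described as faulty: start after the map, advance by realsize.
offset_data = header_offset + BLOCKSIZE + map_size
reported = offset_data + round_up_to_block(real_size)

print(expected, reported)  # 3584 7168
```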
msg362279 - Author: Valentin Samir (Nit) Date: 2020-02-19 15:29
This commit fixes the issue: https://github.com/nitmir/cpython/commit/50c1f686054e41738f14de453ede30e942064200

I am currently unable to create a pull request on GitHub (error 500).
msg362282 - Author: Valentin Samir (Nit) Date: 2020-02-19 15:51
Hmm, I was trying to be clever and made mistakes.

This commit is simpler and actually fixes the issue: https://github.com/nitmir/cpython/commit/682138a3544a2d7de457c88712e738938568f908

tarinfo.offset_data is where tarfile starts reading the file data, and thus must be set to the beginning of the actual data for a sparse file, even if the data block starts earlier.
History
Date                 User   Action  Args
2022-04-11 14:59:26  admin  set     github: 83869
2020-02-19 16:28:29  Nit    set     keywords: + patch; stage: patch review; pull_requests: + pull_request17942
2020-02-19 15:51:55  Nit    set     messages: + msg362282
2020-02-19 15:29:13  Nit    set     messages: + msg362279
2020-02-19 14:34:50  Nit    create