
classification
Title: tarfile: some content different output
Type: behavior Stage: resolved
Components: Library (Lib) Versions:
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: FFY00, yellowhat
Priority: normal Keywords:

Created on 2021-05-28 15:39 by yellowhat, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
compress.py yellowhat, 2021-05-28 15:39
compress.py yellowhat, 2021-05-30 15:36
Messages (7)
msg394666 - (view) Author: Vasco Gervasi (yellowhat) Date: 2021-05-28 15:39
Hi,
I am seeing some irregularities in the tar files created using Python.

Consider the attached script.
This is the output from the script:
```
  # gz
b'0f2eb7b3cac63267b1cf51d2bd5e3144f53cc5b172bbad3dccd5adf4ffb2d220  /tmp/py.gz\n'
9bde8fdb44d98c5a838a9fedaff6e66cd536d91022f8a64a6ecc514f38ce01af
b'e37c3d30ae3c12e872c6aade55ac0a40da8b3f357ce8ed77287bc9f8f024e587  /tmp/py.gz\n'
7ac976e3c94b90abff3c4138a2d153e9be9cc87e2b5a97baf2be95ca04029936

  # bz2
b'd04678e749491e4de1065d3f72ba66395d6bd8ffba3d6360ed9ca2c514586fd3  /tmp/py.bz2\n'
9aa293624df8c40f47614045602af41cc603ca92c97c94926296ef38396d6e3f
b'd04678e749491e4de1065d3f72ba66395d6bd8ffba3d6360ed9ca2c514586fd3  /tmp/py.bz2\n'
9aa293624df8c40f47614045602af41cc603ca92c97c94926296ef38396d6e3f

  # xz
b'a050baa1ab765fa037524ff061d59f62ad37bc6d1bacf98f9bff2f4b4c312fab  /tmp/py.xz\n'
ca39f034d7812d2420573218c69313ac31fd516ffebe1a57f4e41a32e1e840b9
b'a050baa1ab765fa037524ff061d59f62ad37bc6d1bacf98f9bff2f4b4c312fab  /tmp/py.xz\n'
ca39f034d7812d2420573218c69313ac31fd516ffebe1a57f4e41a32e1e840b9

b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  /tmp/tar_a0.tgz\n'
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  /tmp/tar_a1.tgz\n'
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  /tmp/gzp_a0.tgz\n'
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  /tmp/gzp_a1.tgz\n'
```

As you can see, the archives generated using the `tar` command are always the same, whereas the ones generated using Python are not.

Am I missing some arguments?

Thanks
msg394744 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-05-29 23:38
I modified the script to keep both Python-generated tarballs and ran diffoscope, which shows the issue very clearly:


$ diffoscope py.gz py2.gz
--- py.gz
+++ py2.gz
├── filetype from file(1)
│ @@ -1 +1 @@
│ -gzip compressed data, was "py", last modified: Sat May 29 23:24:02 2021, max compression
│ +gzip compressed data, was "py2", last modified: Sat May 29 23:24:03 2021, max compression


The issue is that, by default, Python stores a modification time (mtime) in the gzip header when writing gzip files, and that value defaults to the current time. This is helpful, but might be unwanted in some cases. You can change the mtime as shown in [1].
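
As a minimal sketch of that point (written for illustration, not copied from the linked project): `gzip.GzipFile` accepts an `mtime` argument, and passing a fixed value keeps the header identical between runs. The helper name `gzip_bytes` is hypothetical, and the in-memory buffer is only there to make the example self-contained.

```python
import gzip
import hashlib
import io


def gzip_bytes(payload, mtime=None):
    """Compress payload in memory; mtime=None lets gzip use the current time."""
    buf = io.BytesIO()
    # filename="" keeps any file name out of the gzip header as well.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


# Two independent runs with a pinned mtime give byte-identical output.
fixed_a = gzip_bytes(b"Text 0", mtime=0)
fixed_b = gzip_bytes(b"Text 0", mtime=0)
print(hashlib.sha256(fixed_a).hexdigest() == hashlib.sha256(fixed_b).hexdigest())
```

With `mtime=None` the same call would embed the time of compression, so two runs a second apart could differ, as in the diffoscope output above.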

Now let's take a look at the difference between the file Python generated and the one the `tar` command generated.


$ diffoscope py.gz tar_a0.tgz
--- py.gz
+++ tar_a0.tgz
├── filetype from file(1)
│ @@ -1 +1 @@
│ -gzip compressed data, was "py", last modified: Sat May 29 23:24:02 2021, max compression
│ +gzip compressed data, from Unix


It seems like it generates the same output here because the `tar` command does not set any mtime on the archive by default.


[1] https://github.com/FFY00/trampolim/blob/dbd03c90eaa2cc732e1a01268786b491dc872fb7/trampolim/_build.py#L354
msg394766 - (view) Author: Vasco Gervasi (yellowhat) Date: 2021-05-30 15:36
Dear Filipe,
thanks for your answer.
Following your suggestion, I have tried the attached file.

The output is:
$ python /data/compress.py
b'68963e137ced6ee2aa5a93e155b201a3c172e2683d4b15a0eab7c1d8d43e48b4  /tmp/py_gzip.tgz\n'
b'68963e137ced6ee2aa5a93e155b201a3c172e2683d4b15a0eab7c1d8d43e48b4  /tmp/py_gzip.tgz\n'
$ rm -rf a/
$ mv py_gzip.tgz py_gzip.tgz0
$ python /data/compress.py
b'9c897d82c332f0d5443fe66112abe5f318bf6e6574e44c5c3c385f398784ac35  /tmp/py_gzip.tgz\n'
b'9c897d82c332f0d5443fe66112abe5f318bf6e6574e44c5c3c385f398784ac35  /tmp/py_gzip.tgz\n'
$ diffoscope py_gzip.tgz0 py_gzip.tgz
--- py_gzip.tgz0
+++ py_gzip.tgz
│   --- py_gzip.tgz0-content
├── +++ py_gzip.tgz-content
│ ├── file list
│ │ @@ -1,4 +1,4 @@
│ │ -drwxr-xr-x   0 root         (0) root         (0)        0 2021-05-30 15:32:56.566535 a/
│ │ --rw-r--r--   0 root         (0) root         (0)        6 2021-05-30 15:32:56.566535 a/eph0
│ │ --rw-r--r--   0 root         (0) root         (0)        6 2021-05-30 15:32:56.566535 a/eph1
│ │ --rw-r--r--   0 root         (0) root         (0)        6 2021-05-30 15:32:56.566535 a/eph2
│ │ +drwxr-xr-x   0 root         (0) root         (0)        0 2021-05-30 15:33:16.956535 a/
│ │ +-rw-r--r--   0 root         (0) root         (0)        6 2021-05-30 15:33:16.956535 a/eph0
│ │ +-rw-r--r--   0 root         (0) root         (0)        6 2021-05-30 15:33:16.956535 a/eph1
│ │ +-rw-r--r--   0 root         (0) root         (0)        6 2021-05-30 15:33:16.966535 a/eph2

Even though I am specifying an mtime, it is not applied correctly.

Thanks
msg394771 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-05-30 16:55
tarfile keeps the mtime of each file. The issue is that you are writing to the files at the beginning of the script. When you write to a file, you change its mtime (last modified time), which produces a different TarInfo. If you comment out the code that writes to the files, you get the exact same output.


#dir0 = Path("/tmp/a")
#dir0.mkdir(parents=True, exist_ok=True)
#fil0 = dir0 / "eph0"
#fil0.write_text("Text 0", encoding="UTF-8")
#fil1 = dir0 / "eph1"
#fil1.write_text("Text 1", encoding="UTF-8")
#fil2 = dir0 / "eph2"
#fil2.write_text("Text 2", encoding="UTF-8")


$ python compress.py
b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535  /tmp/py_gzip.tgz\n'
b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535  /tmp/py_gzip.tgz\n'
$ python compress.py
b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535  /tmp/py_gzip.tgz\n'
b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535  /tmp/py_gzip.tgz\n'


If you are in a situation where the mtime may change, but you want the same output, you can reset it. See the last example in https://docs.python.org/3/library/tarfile.html#tar-examples.
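
A rough sketch of what that looks like, assuming an in-memory archive and a throwaway directory (the names `reset_mtime` and `member_mtimes` are made up for this example): the stored mtime follows the file on disk unless a filter passed to `TarFile.add()` overrides it.

```python
import io
import os
import tarfile
import tempfile
from pathlib import Path


def reset_mtime(tarinfo):
    """Filter for TarFile.add(): pin every member to a fixed timestamp."""
    tarinfo.mtime = 0
    return tarinfo


def member_mtimes(src, member_filter=None):
    """Archive src in memory and return {member name: stored mtime}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(src, arcname="a", filter=member_filter)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return {m.name: m.mtime for m in tar.getmembers()}


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a")
    os.mkdir(src)
    eph0 = Path(src) / "eph0"
    eph0.write_text("Text 0", encoding="UTF-8")

    # Changing the on-disk mtime changes what tarfile records.
    os.utime(eph0, (1_000_000, 1_000_000))
    before = member_mtimes(src)
    os.utime(eph0, (2_000_000, 2_000_000))
    after = member_mtimes(src)

    # With the filter, every member gets the same fixed mtime.
    normalized = member_mtimes(src, reset_mtime)

print(before["a/eph0"], after["a/eph0"], set(normalized.values()))
```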
msg394780 - (view) Author: Vasco Gervasi (yellowhat) Date: 2021-05-30 19:33
Dear Filipe,
sorry I did not explain the use case; obviously this is a toy example to show my problem.
I have a pipeline that generates a tar file from a repository using a Python script; if the hash of the tar file is different, it triggers other things.
As you can imagine, each time the pipeline runs, the content is the same (for the same commit), but the file timestamps are different, and so the tar file is different.

Thanks for pointing out those examples, I will check and let you know.

Thanks
msg394781 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021-05-30 19:48
Yeah, I understand. What you want is achieved by making sure that the mtime of the tar archive, and of the files in the archive, is reproducible, as I demonstrated here.

Can this be closed?
msg394787 - (view) Author: Vasco Gervasi (yellowhat) Date: 2021-05-31 07:23
Yes, you can close it.

For future reference:

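import tarfile

tar_reset = "/tmp/py_tar_reset.tar"

def reset(tarinfo):
    # Normalize ownership and timestamp so the archive does not depend on
    # who created the files or when they were last written.
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = "root"
    tarinfo.mtime = 1
    return tarinfo

# The xz container stores no timestamp, so only member metadata is reset.
with tarfile.open(tar_reset, "w:xz") as tar_obj:
    tar_obj.add("/tmp/a", arcname="a", filter=reset)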
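
For a gzip-compressed archive, the header timestamp discussed earlier in this thread has to be pinned too. The following is only a sketch under a few assumptions: the helper names `normalize`, `build_tgz` and `make_tree` are hypothetical, temporary directories stand in for `/tmp/a`, and the archive is built in memory so the result can be compared directly.

```python
import gzip
import hashlib
import io
import os
import tarfile
import tempfile
from pathlib import Path


def normalize(tarinfo):
    """Strip owner and timestamp details that vary between runs."""
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = "root"
    tarinfo.mtime = 0
    return tarinfo


def build_tgz(src_dir, arcname):
    """Return the bytes of a .tar.gz whose headers carry fixed metadata."""
    raw = io.BytesIO()
    # mtime=0 pins the gzip header; filename="" keeps the name out of it.
    with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            tar.add(str(src_dir), arcname=arcname, filter=normalize)
    return raw.getvalue()


def make_tree(root):
    """Recreate the toy directory from this issue under root."""
    d = Path(root) / "a"
    d.mkdir(parents=True, exist_ok=True)
    for i in range(3):
        (d / f"eph{i}").write_text(f"Text {i}", encoding="UTF-8")
    return d


with tempfile.TemporaryDirectory() as t1, tempfile.TemporaryDirectory() as t2:
    first = build_tgz(make_tree(t1), "a")
    tree2 = make_tree(t2)
    os.utime(tree2 / "eph0", (1, 1))  # force a different on-disk mtime
    second = build_tgz(tree2, "a")

print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

The tar stream is written through an explicit `GzipFile` rather than with `"w:gz"` because the latter gives no way to pass `mtime` to the gzip layer.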
History
Date User Action Args
2022-04-11 14:59:46 admin set github: 88428
2021-06-01 18:38:54 FFY00 set status: open -> closed; resolution: not a bug; stage: resolved
2021-05-31 07:23:21 yellowhat set messages: + msg394787
2021-05-30 19:48:42 FFY00 set messages: + msg394781
2021-05-30 19:33:02 yellowhat set messages: + msg394780
2021-05-30 16:55:10 FFY00 set messages: + msg394771
2021-05-30 15:36:20 yellowhat set files: + compress.py; messages: + msg394766
2021-05-29 23:39:00 FFY00 set nosy: + FFY00; messages: + msg394744
2021-05-28 15:39:24 yellowhat create