Issue44262
Created on 2021-05-28 15:39 by yellowhat, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Files | |||
---|---|---|
File name | Uploaded | Description |
compress.py | yellowhat, 2021-05-28 15:39 | |
compress.py | yellowhat, 2021-05-30 15:36 | |
Messages (7) | |||
---|---|---|---|
msg394666 - (view) | Author: Vasco Gervasi (yellowhat) | Date: 2021-05-28 15:39 | |
Hi,
I am seeing some irregularities in the tar files created using Python. Consider the attached script. This is its output:

```
# gz
b'0f2eb7b3cac63267b1cf51d2bd5e3144f53cc5b172bbad3dccd5adf4ffb2d220 /tmp/py.gz\n'
9bde8fdb44d98c5a838a9fedaff6e66cd536d91022f8a64a6ecc514f38ce01af
b'e37c3d30ae3c12e872c6aade55ac0a40da8b3f357ce8ed77287bc9f8f024e587 /tmp/py.gz\n'
7ac976e3c94b90abff3c4138a2d153e9be9cc87e2b5a97baf2be95ca04029936
# bz2
b'd04678e749491e4de1065d3f72ba66395d6bd8ffba3d6360ed9ca2c514586fd3 /tmp/py.bz2\n'
9aa293624df8c40f47614045602af41cc603ca92c97c94926296ef38396d6e3f
b'd04678e749491e4de1065d3f72ba66395d6bd8ffba3d6360ed9ca2c514586fd3 /tmp/py.bz2\n'
9aa293624df8c40f47614045602af41cc603ca92c97c94926296ef38396d6e3f
# xz
b'a050baa1ab765fa037524ff061d59f62ad37bc6d1bacf98f9bff2f4b4c312fab /tmp/py.xz\n'
ca39f034d7812d2420573218c69313ac31fd516ffebe1a57f4e41a32e1e840b9
b'a050baa1ab765fa037524ff061d59f62ad37bc6d1bacf98f9bff2f4b4c312fab /tmp/py.xz\n'
ca39f034d7812d2420573218c69313ac31fd516ffebe1a57f4e41a32e1e840b9
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /tmp/tar_a0.tgz\n'
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /tmp/tar_a1.tgz\n'
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /tmp/gzp_a0.tgz\n'
b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /tmp/gzp_a1.tgz\n'
```

As you can see, the archives generated with the `tar` command are always the same, whereas the ones generated with Python are not. Am I missing some arguments?
Thanks
|||
msg394744 - (view) | Author: Filipe Laíns (FFY00) * | Date: 2021-05-29 23:38 | |
I modified the script to keep both Python-generated tarballs and ran diffoscope, which shows the issue very clearly:

    $ diffoscope py.gz py2.gz
    --- py.gz
    +++ py2.gz
    ├── filetype from file(1)
    │ @@ -1 +1 @@
    │ -gzip compressed data, was "py", last modified: Sat May 29 23:24:02 2021, max compression
    │ +gzip compressed data, was "py2", last modified: Sat May 29 23:24:03 2021, max compression

The issue is that, by default, Python records a timestamp (mtime) in the header when writing gzip files; if none is given, the current time is used. This is helpful, but might be unwanted in some cases. You can change the mtime as shown in [1].

Now let's take a look at the difference between the file Python generated and the one the `tar` command generated:

    $ diffoscope py.gz tar_a0.tgz
    --- py.gz
    +++ tar_a0.tgz
    ├── filetype from file(1)
    │ @@ -1 +1 @@
    │ -gzip compressed data, was "py", last modified: Sat May 29 23:24:02 2021, max compression
    │ +gzip compressed data, from Unix

It seems the `tar` command generates the same output every time because it does not set any mtime on the archive by default.

[1] https://github.com/FFY00/trampolim/blob/dbd03c90eaa2cc732e1a01268786b491dc872fb7/trampolim/_build.py#L354
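
As a minimal sketch of the gzip point above (not the code behind [1]): the `mtime` argument of `gzip.GzipFile` controls the timestamp written into the header, so fixing it makes the compressed bytes repeatable. The helper name `gzip_bytes` and the sample payload are made up for illustration.

```python
import gzip
import hashlib
import io


def gzip_bytes(data: bytes, mtime=None) -> bytes:
    """Compress data in memory; mtime=None lets gzip use the current time."""
    buf = io.BytesIO()
    # filename="" keeps any file name out of the gzip header as well.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"Text 0" * 100  # hypothetical sample data

# With a fixed mtime the header is constant, so the digests match.
first = hashlib.sha256(gzip_bytes(payload, mtime=0)).hexdigest()
second = hashlib.sha256(gzip_bytes(payload, mtime=0)).hexdigest()
print(first == second)

# A different timestamp changes the header bytes, and therefore the hash.
print(gzip_bytes(payload, mtime=1) == gzip_bytes(payload, mtime=2))
```

The timestamp occupies bytes 4 to 7 of the gzip header, which is why two otherwise identical archives written a second apart hash differently.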
|||
msg394766 - (view) | Author: Vasco Gervasi (yellowhat) | Date: 2021-05-30 15:36 | |
Dear Filipe,
thanks for your answer. Following your suggestion, I have tried the attached file. The output is:

    $ python /data/compress.py
    b'68963e137ced6ee2aa5a93e155b201a3c172e2683d4b15a0eab7c1d8d43e48b4 /tmp/py_gzip.tgz\n'
    b'68963e137ced6ee2aa5a93e155b201a3c172e2683d4b15a0eab7c1d8d43e48b4 /tmp/py_gzip.tgz\n'
    $ rm -rf a/
    $ mv py_gzip.tgz py_gzip.tgz0
    $ python /data/compress.py
    b'9c897d82c332f0d5443fe66112abe5f318bf6e6574e44c5c3c385f398784ac35 /tmp/py_gzip.tgz\n'
    b'9c897d82c332f0d5443fe66112abe5f318bf6e6574e44c5c3c385f398784ac35 /tmp/py_gzip.tgz\n'
    $ diffoscope py_gzip.tgz0 py_gzip.tgz
    --- py_gzip.tgz0
    +++ py_gzip.tgz
    │ --- py_gzip.tgz0-content
    ├── +++ py_gzip.tgz-content
    │ ├── file list
    │ │ @@ -1,4 +1,4 @@
    │ │ -drwxr-xr-x 0 root (0) root (0) 0 2021-05-30 15:32:56.566535 a/
    │ │ --rw-r--r-- 0 root (0) root (0) 6 2021-05-30 15:32:56.566535 a/eph0
    │ │ --rw-r--r-- 0 root (0) root (0) 6 2021-05-30 15:32:56.566535 a/eph1
    │ │ --rw-r--r-- 0 root (0) root (0) 6 2021-05-30 15:32:56.566535 a/eph2
    │ │ +drwxr-xr-x 0 root (0) root (0) 0 2021-05-30 15:33:16.956535 a/
    │ │ +-rw-r--r-- 0 root (0) root (0) 6 2021-05-30 15:33:16.956535 a/eph0
    │ │ +-rw-r--r-- 0 root (0) root (0) 6 2021-05-30 15:33:16.956535 a/eph1
    │ │ +-rw-r--r-- 0 root (0) root (0) 6 2021-05-30 15:33:16.966535 a/eph2

Even though I am specifying an mtime, it is not applied correctly.
Thanks
|||
msg394771 - (view) | Author: Filipe Laíns (FFY00) * | Date: 2021-05-30 16:55 | |
tarfile keeps the mtime of each file. The issue is that you are touching the files at the beginning of the script: when you write to a file, you change its mtime (last modified time), which produces a different TarInfo. If you comment out the code that writes to the files, you get exactly the same output.

    #dir0 = Path("/tmp/a")
    #dir0.mkdir(parents=True, exist_ok=True)
    #fil0 = dir0 / "eph0"
    #fil0.write_text("Text 0", encoding="UTF-8")
    #fil1 = dir0 / "eph1"
    #fil1.write_text("Text 1", encoding="UTF-8")
    #fil2 = dir0 / "eph2"
    #fil2.write_text("Text 2", encoding="UTF-8")

    $ python compress.py
    b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535 /tmp/py_gzip.tgz\n'
    b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535 /tmp/py_gzip.tgz\n'
    $ python compress.py
    b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535 /tmp/py_gzip.tgz\n'
    b'cc3bd1bf99edc4f0796e1c466d251b0f808db790cbdd55bc920c041fb405e535 /tmp/py_gzip.tgz\n'

If you are in a situation where the mtime may change but you want the same output, you can reset it. See the last example in https://docs.python.org/3/library/tarfile.html#tar-examples.
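
The effect described above can be reproduced with a small sketch. It assumes an uncompressed in-memory tar so that only the member metadata matters; `normalize`, `tar_digest`, and the timestamps passed to `os.utime` are hypothetical names and values chosen for the demonstration, not part of the attached script.

```python
import hashlib
import io
import os
import tarfile
import tempfile
from pathlib import Path


def normalize(tarinfo: tarfile.TarInfo) -> tarfile.TarInfo:
    """Reset metadata that can vary between runs (values are arbitrary)."""
    tarinfo.mtime = 0
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = "root"
    return tarinfo


def tar_digest(directory: Path, member_filter=None) -> str:
    """Build an uncompressed tar in memory and return its SHA-256."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(directory, arcname="a", filter=member_filter)
    return hashlib.sha256(buf.getvalue()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "a"
    src.mkdir()
    fil0 = src / "eph0"
    fil0.write_text("Text 0", encoding="UTF-8")

    os.utime(fil0, (1_000_000_000, 1_000_000_000))
    plain_before = tar_digest(src)
    fixed_before = tar_digest(src, member_filter=normalize)

    # Simulate a later run: same content, newer timestamp.
    os.utime(fil0, (1_600_000_000, 1_600_000_000))
    plain_after = tar_digest(src)
    fixed_after = tar_digest(src, member_filter=normalize)

print(plain_before == plain_after)  # the file mtime leaks into the archive
print(fixed_before == fixed_after)  # the filter removes the difference
```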
|||
msg394780 - (view) | Author: Vasco Gervasi (yellowhat) | Date: 2021-05-30 19:33 | |
Dear Filipe, sorry, I did not explain the use case; obviously this is a toy example to show my problem. I have a pipeline that generates a tar file from a repository using a Python script; if the hash of the tar file is different, it triggers other things. As you can imagine, each time the pipeline runs, the content is the same (for the same commit) but the file timestamps are different, and so the tar is different. Thanks for pointing out those examples, I will check them and let you know. Thanks |
|||
msg394781 - (view) | Author: Filipe Laíns (FFY00) * | Date: 2021-05-30 19:48 | |
Yeah, I understand. What you want is achieved by making sure that the mtime of the tar archive, and of the files in the archive, is reproducible, as I demonstrated here. Can this be closed? |
|||
msg394787 - (view) | Author: Vasco Gervasi (yellowhat) | Date: 2021-05-31 07:23 | |
Yes, you can close it.
For future reference:

    import tarfile

    tar_reset = "/tmp/py_tar_reset.tar"


    def reset(tarinfo):
        # Normalize ownership and mtime so the archive is reproducible.
        tarinfo.uid = tarinfo.gid = 0
        tarinfo.uname = tarinfo.gname = "root"
        tarinfo.mtime = 1
        return tarinfo


    with tarfile.open(tar_reset, "w:xz") as tar_obj:
        tar_obj.add("/tmp/a", arcname="a", filter=reset)
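
The snippet above uses xz, which showed stable hashes in the first message. For a gzip-compressed archive, one possible approach (a sketch under assumptions, not something confirmed in this thread) is to wrap the output in a `gzip.GzipFile` with a fixed `mtime` and let tarfile write an uncompressed stream into it. The helper `build_tgz`, the temporary paths, and the sample files are hypothetical.

```python
import gzip
import hashlib
import os
import tarfile
import tempfile
from pathlib import Path


def reset(tarinfo):
    # Same normalization as in the message above.
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = "root"
    tarinfo.mtime = 1
    return tarinfo


def build_tgz(src: Path, dest: Path) -> str:
    """Write src into a .tgz at dest and return the SHA-256 of the file."""
    with open(dest, "wb") as raw:
        # mtime=0 and filename="" keep run-specific data out of the gzip header.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w") as tar:
                tar.add(src, arcname="a", filter=reset)
    return hashlib.sha256(dest.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "a"
    src.mkdir()
    for index in range(3):
        (src / f"eph{index}").write_text(f"Text {index}", encoding="UTF-8")

    digest_a = build_tgz(src, Path(tmp) / "run0.tgz")

    # Simulate a fresh checkout: same content, different file timestamps.
    for path in src.iterdir():
        os.utime(path, (1_600_000_000, 1_600_000_000))
    digest_b = build_tgz(src, Path(tmp) / "run1.tgz")

print(digest_a == digest_b)
```

Passing the gzip stream as `fileobj` keeps the tar layer and the compression layer separate, so each timestamp can be pinned where it is written.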
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:46 | admin | set | github: 88428 |
2021-06-01 18:38:54 | FFY00 | set | status: open -> closed resolution: not a bug stage: resolved |
2021-05-31 07:23:21 | yellowhat | set | messages: + msg394787 |
2021-05-30 19:48:42 | FFY00 | set | messages: + msg394781 |
2021-05-30 19:33:02 | yellowhat | set | messages: + msg394780 |
2021-05-30 16:55:10 | FFY00 | set | messages: + msg394771 |
2021-05-30 15:36:20 | yellowhat | set | files: + compress.py messages: + msg394766 |
2021-05-29 23:39:00 | FFY00 | set | nosy: + FFY00 messages: + msg394744 |
2021-05-28 15:39:24 | yellowhat | create |