classification
Title: Make shutil.make_archive have deterministic sorting
Type: enhancement
Stage:
Components: Library (Lib)
Versions: Python 3.6
process
Status: open
Resolution:
Dependencies: 30693
Superseder:
Assigned To:
Nosy List: lars.gustaebel, r.david.murray, rhettinger, samthursfield
Priority: normal
Keywords: patch

Created on 2015-06-18 12:14 by samthursfield, last changed 2017-06-18 22:09 by martin.panter.

Files
File name | Uploaded | Description
tar-reproducible-testcase.py | samthursfield, 2015-06-18 12:14 | Testcase for stable tar ordering patch
tarfile-stable-ordering.patch | samthursfield, 2015-06-18 12:21 | Patch to fix issue
make_archive-stable-ordering.patch | samthursfield, 2015-06-22 13:19 | Patch to make shutil.make_archive(format='tar') deterministic, but not tar.add(recursive=True)
Messages (9)
msg245464 - Author: Sam Thursfield (samthursfield) * Date: 2015-06-18 12:14
I want shutil.make_archive() to produce deterministic output when given identical data as inputs.

Right now there are two holes in this. One is that mtimes might not match. This can be fixed by the caller. The second is that the order in which files in a subdirectory are added to the tarfile is not deterministic. This can't be fixed by the caller.

Attached is a trivial patch to sort the results of os.listdir() to ensure the output tarfile is stable.

This only applies to the 'tar' format.

I've attached my testcase for this, which creates 3 tarfiles in /tmp. When this patch is applied, the 3 tarfiles it creates are identical according to `sha1sum`. Without this patch, they are all different.
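A minimal sketch of this kind of check, not the attached testcase itself: it pins mtimes (the part the caller can fix) and then hashes several archives of the same tree with SHA-1, as `sha1sum` would. The helper names `normalize_mtimes` and `archive_digests` and the fixed timestamp are assumptions made for illustration.

```python
import hashlib
import os
import shutil
import tempfile


def normalize_mtimes(root_dir, timestamp=1000000000):
    """Hypothetical helper: set every file and directory mtime to one value."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.utime follows symlinks by default, so skip links here.
            if not os.path.islink(path):
                os.utime(path, (timestamp, timestamp))
    os.utime(root_dir, (timestamp, timestamp))


def archive_digests(source_dir, runs=3):
    """Hypothetical helper: archive source_dir several times, return SHA-1s."""
    digests = []
    with tempfile.TemporaryDirectory() as out_dir:
        for i in range(runs):
            base_name = os.path.join(out_dir, "archive-%d" % i)
            path = shutil.make_archive(base_name, "tar", root_dir=source_dir)
            with open(path, "rb") as f:
                digests.append(hashlib.sha1(f.read()).hexdigest())
    return digests
```

If the digests differ between runs, or between two copies of the same tree, the archive contents are not reproducible.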
msg245465 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-06-18 12:35
This would go beyond what the tar command itself does. I'm not sure we want to do that, as we are pretty much modeling our behavior on tar. However, that doesn't automatically mean we can't do it. We'll see what other people think. Personally I'm -0.

I've changed the issue title since your proposed patch is to tarfile, not shutil.
msg245466 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-18 13:04
You don't need to patch the tarfile module. You could use os.walk() in shutil._make_tarball() and add each file with TarFile.add(recursive=False).
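A rough sketch of that suggestion, written as a standalone function rather than as a change to shutil._make_tarball(): it walks the tree with os.walk(), sorts the names at each level, and adds each entry with TarFile.add(recursive=False). The name `make_sorted_tar` is hypothetical and this is not one of the attached patches.

```python
import os
import tarfile


def make_sorted_tar(archive_path, root_dir):
    """Hypothetical helper: add entries under root_dir in sorted order."""
    with tarfile.open(archive_path, "w") as tar:
        for dirpath, dirnames, filenames in os.walk(root_dir):
            # Sorting dirnames in place also fixes the order in which
            # os.walk() descends into subdirectories.
            dirnames.sort()
            for name in sorted(dirnames + filenames):
                full_path = os.path.join(dirpath, name)
                arcname = os.path.relpath(full_path, root_dir)
                # recursive=False adds a directory entry without its
                # contents; those are added when os.walk() reaches them.
                tar.add(full_path, arcname=arcname, recursive=False)
    return archive_path
```

Each directory entry is written before anything inside it, so extraction creates parents first. Symlinks are stored as links, because TarFile.add() does not dereference them by default.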
msg245467 - Author: Sam Thursfield (samthursfield) * Date: 2015-06-18 14:25
Thanks for the comments! Would you be happy for the patch to be merged if it were implemented by modifying shutil.make_archive() instead? I will rework it if so.
msg245469 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-06-18 15:13
I don't see any downside for this simple patch and think there is some merit for wanting a reproducible archive.
msg245493 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-06-19 07:15
The patch would change behaviour for all tarfile users by the back door, that's why I am a little reluctant. And if the same can be achieved by a reasonably simple change to shutil I think it's just as well.
msg245497 - Author: Sam Thursfield (samthursfield) * Date: 2015-06-19 10:21
I've discovered that this patch introduces a nasty failure case! If you have a relative symlink pointing to a directory that's alphabetically sorted after the symlink, and files inside the symlink, 'tar -x' won't be able to create those files because the symlink target won't exist yet.

I'll rework this to only affect shutil.make_archive(), and to avoid hitting this bug.
msg245498 - Author: Sam Thursfield (samthursfield) * Date: 2015-06-19 10:24
Having tested, the problem I described above doesn't happen with this patch. It's a mistake in some other code I wrote, which follows symlinks when it should not.
msg245628 - Author: Sam Thursfield (samthursfield) * Date: 2015-06-22 13:19
Here's a patch which does the same thing but only for shutil.make_archive().

Note that the final output will still be non-deterministic if you use format='gztar', because time.time() and the base_name argument get added to the gzip header. It might be nice to add an option to make that deterministic too, as a separate thing. This patch is useful to me as-is though.
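One way a caller might work around the gzip header today, as a sketch only and not part of the patch: write the tar stream through gzip.GzipFile with an empty filename and a fixed mtime, so that neither the archive name nor the current time ends up in the header. The function name `make_gztar_fixed_header` and the choice of mtime=0 are assumptions for illustration.

```python
import gzip
import os
import tarfile


def make_gztar_fixed_header(archive_path, root_dir, mtime=0):
    """Hypothetical helper: .tar.gz with a fixed gzip header timestamp
    and no embedded file name."""
    with open(archive_path, "wb") as raw:
        # filename="" keeps the name out of the gzip header;
        # mtime replaces the default of time.time().
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw,
                           mtime=mtime) as gz:
            with tarfile.open(fileobj=gz, mode="w") as tar:
                for dirpath, dirnames, filenames in os.walk(root_dir):
                    dirnames.sort()  # fixed traversal order
                    for name in sorted(dirnames + filenames):
                        full_path = os.path.join(dirpath, name)
                        arcname = os.path.relpath(full_path, root_dir)
                        tar.add(full_path, arcname=arcname, recursive=False)
    return archive_path
```

The mtimes of the archived files are still recorded in the tar headers, so they need to match between runs as well.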
History
Date User Action Args
2017-06-18 22:09:51 martin.panter set dependencies: + tarfile add uses random order
title: Make tarfile have deterministic sorting -> Make shutil.make_archive have deterministic sorting
2015-06-22 13:19:37 samthursfield set files: + make_archive-stable-ordering.patch
messages: + msg245628
2015-06-19 10:24:50 samthursfield set messages: + msg245498
2015-06-19 10:21:52 samthursfield set messages: + msg245497
2015-06-19 07:15:23 lars.gustaebel set messages: + msg245493
2015-06-18 15:13:15 rhettinger set nosy: + rhettinger
messages: + msg245469
2015-06-18 14:25:14 samthursfield set messages: + msg245467
2015-06-18 13:04:11 lars.gustaebel set nosy: + lars.gustaebel
messages: + msg245466
2015-06-18 12:35:05 r.david.murray set nosy: + r.david.murray
title: Make tar files created by shutil.make_archive() have deterministic sorting -> Make tarfile have deterministic sorting
messages: + msg245465
versions: + Python 3.6
2015-06-18 12:21:45 samthursfield set files: + tarfile-stable-ordering.patch
keywords: + patch
2015-06-18 12:14:12 samthursfield create