classification
Title: Support tarfile.PAX_FORMAT in shutil.make_archive
Type: enhancement Stage: resolved
Components: Documentation Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: CAM-Gerlach, docs@python, lars.gustaebel, ncoghlan
Priority: normal Keywords: patch

Created on 2017-06-14 01:25 by ncoghlan, last changed 2019-04-07 04:50 by ncoghlan. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12355 merged CAM-Gerlach, 2019-03-15 18:32
PR 12635 merged CAM-Gerlach, 2019-03-30 17:36
Messages (9)
msg295974 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-14 01:25
shutil.make_archive currently just uses the default tar format, which is GNU_FORMAT.

This format doesn't ensure that all character paths are encoded as UTF-8, and hence may end up embedding platform specific encoding assumptions into the generated tarball.

I see a few possible ways of resolving this:

1. Change the default tar format to PAX_FORMAT. It's been 16 years since that was defined, and Python itself has supported it since 2.6 was released in 2008, so perhaps we can rely on other tools supporting it now? (My main open question on that front would be "What happens if you specify "format=GNU_FORMAT" when attempting to read a PAX formatted archive?")

2. Add new shutil level "pax", "gzpax", "bzpax", "xzpax" format definitions to explicitly request PAX_FORMAT

3. Add a mechanism to shutil.make_archive that allows format-dependent settings to be passed down to the underlying archive creation functions (e.g. "format=tarfile.PAX_FORMAT").
msg295976 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-14 01:49
The main benefit I'd see to the last option is that it would also cover passing a "filter" option for tarfile.TarFile.add(). Dropping down to the lower level API for that isn't *hard*, it's just a bit fiddly (note: currently untested example code):

   sdist = tarfile.open(sdist_path, "w:gz", format=tarfile.PAX_FORMAT)
   sdist.add(os.getcwd(), arcname=sdist_subdir, filter=_exclude_hidden_and_special_files)
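
A self-contained version of the snippet above might look like the following sketch. The body of `_exclude_hidden_and_special_files`, the `build_sdist` helper, and the file and directory names are all assumptions made for illustration; only `tarfile.open(..., format=tarfile.PAX_FORMAT)` and the `filter` argument of `TarFile.add()` come from the message itself.

```python
import os
import tarfile
import tempfile


def _exclude_hidden_and_special_files(info):
    """Hypothetical filter: keep only visible regular files and directories."""
    if os.path.basename(info.name).startswith("."):
        return None  # returning None drops the member from the archive
    if not (info.isfile() or info.isdir()):
        return None  # skip symlinks, devices, FIFOs, and so on
    return info


def build_sdist(source_dir, sdist_path, sdist_subdir):
    """Write source_dir into a gzip-compressed pax archive under sdist_subdir."""
    with tarfile.open(sdist_path, "w:gz", format=tarfile.PAX_FORMAT) as sdist:
        sdist.add(source_dir, arcname=sdist_subdir,
                  filter=_exclude_hidden_and_special_files)
    return sdist_path


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.mkdir(src)
    # One visible file and one hidden file; only the first should be archived.
    for name in ("setup.py", ".hidden"):
        with open(os.path.join(src, name), "w") as f:
            f.write("# placeholder\n")

    archive = build_sdist(src, os.path.join(tmp, "demo.tar.gz"), "demo-1.0")
    with tarfile.open(archive) as tf:
        archived_names = sorted(tf.getnames())

print(archived_names)
```

Because the filter sees each `TarInfo` before it is written, returning `None` for a directory also skips everything beneath it.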
msg338021 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-15 19:21
FYI, [GH-12355](https://github.com/python/cpython/pull/12355) will implement pax as default, as discussed in [bpo-36268](https://bugs.python.org/issue36268), which should be equivalent to option 1 here, thus also resolving this issue. Could you confirm that this is the case, and do you have any other comments on the change? Thanks!
msg339192 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2019-03-30 12:23
Aye, I agree that changing the default resolves the feature request here. I've recategorised this as a documentation issue, as the initial PR only changed the `tarfile` documentation, so the impact on `shutil` isn't obvious.

So the changes needed will be:

* add a "What's New" entry for shutil, noting that shutil.make_archive inherited the change in default archive format from tarfile
* corresponding "version changed" note in the shutil.make_archive documentation


An addition to the "Porting" section in What's New may also be needed, depending on how tarfile.TarFile behaves if you tell it to open a PAX_FORMAT archive using GNU_FORMAT or vice-versa (tarfile.open and shutil.unpack_archive will be fine, since they query the file's own metadata to find out which format to use).
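
As a rough illustration of the point about shutil inheriting the tarfile default, the sketch below builds an archive with shutil.make_archive and reads it back with tarfile.open, without naming a tar format anywhere. The directory layout and file names are hypothetical, and the value printed for the default format depends on the Python version in use.

```python
import os
import shutil
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.mkdir(src)
    with open(os.path.join(src, "data.txt"), "wb") as f:
        f.write(b"payload\n")

    # No tar format is passed here: make_archive uses whatever
    # tarfile.DEFAULT_FORMAT is for the running interpreter.
    archive = shutil.make_archive(os.path.join(tmp, "demo"), "gztar", root_dir=src)

    # tarfile.open detects the on-disk format itself when reading.
    with tarfile.open(archive) as tf:
        member = next(m for m in tf.getmembers()
                      if m.isfile() and m.name.endswith("data.txt"))
        restored = tf.extractfile(member).read()

print("default is pax:", tarfile.DEFAULT_FORMAT == tarfile.PAX_FORMAT)
print("round trip ok:", restored == b"payload\n")
```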
msg339218 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-30 18:47
I opened a PR to implement both those changes, and also added some minor related clarifications and fixes to the format section of the tarfile docs.

> how tarfile.TarFile behaves if you tell it to open a PAX_FORMAT archive using GNU_FORMAT or vice-versa

I tested tarfile.TarFile() and extractall() on the resulting object with several simple to moderately complex (including Unicode filenames) real-world pax- and GNU-format archives packed with different archivers, using both format=GNU_FORMAT and format=PAX_FORMAT for each one. I got no warnings or errors with debug=3 and errorlevel=2, extraction was successful and yielded identical results for either format argument, and no PAXHEADERS file was output in either case. Furthermore, tracing the code, it's not clear that TarFile() (with 'r'), extract(), etc. use the passed `format` at all.

Even if so, in order to produce an error after this change but not before, all of the following would seem to have to be the case:

* The tarfile being read would have to be in GNU format, i.e. from an external source or produced with an older version of Python
* The tarfile would have to make use of specific extended/non-standard GNU tar features not tested above
* The user would have to use TarFile() to open the tarfile, rather than one of the other, more common/higher-level methods
* The user's call to TarFile() would have to omit an explicit format, thereby implicitly relying on DEFAULT_FORMAT == GNU_FORMAT

Therefore, this seems like a very specific corner case. However, if you think I should include it, I'll go ahead with it. Also, let me know whether these doc changes should have a separate NEWS entry or whether the previous one adequately covers them.
msg339300 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2019-04-01 16:45
tarfile does not use the `format` argument for reading; the format is detected from the archive itself. You can even mix different formats in one archive and tarfile will be fine with it.
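
A small in-memory check along these lines, offered as a sketch rather than a definitive test: it writes one archive per format and then reads each back while passing both the matching and the mismatched `format`. The helper names `build_archive` and `read_names` and the sample member name are made up for the example, and `encoding="utf-8"` is set explicitly so the result does not depend on the platform's filesystem encoding.

```python
import io
import tarfile


def build_archive(fmt, name, payload):
    """Return the raw bytes of an uncompressed tar archive written with fmt."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt, encoding="utf-8") as tf:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_names(raw, fmt):
    """Open raw for reading while passing fmt, and list the member names."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r", format=fmt,
                      encoding="utf-8") as tf:
        return tf.getnames()


formats = (tarfile.GNU_FORMAT, tarfile.PAX_FORMAT)
member_name = "h\u00e9llo.txt"  # non-ASCII on purpose

results = {}
for written_as in formats:
    raw = build_archive(written_as, member_name, b"hi\n")
    for read_as in formats:
        results[(written_as, read_as)] = read_names(raw, read_as)

# Every combination should report the same single member name.
print(all(names == [member_name] for names in results.values()))
```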
msg339305 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-04-01 20:23
Thanks for the confirmation!
msg339555 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2019-04-07 04:47
New changeset 89a894403cfa880d7f9d1d67070f61456d14cbde by Nick Coghlan (CAM Gerlach) in branch 'master':
bpo-30661: Improve docs for tarfile pax change and effect on shutil (GH-12635)
https://github.com/python/cpython/commit/89a894403cfa880d7f9d1d67070f61456d14cbde
msg339556 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2019-04-07 04:50
Thanks for the technical clarification, Lars, and for the docs update, C.A.M.
History
Date User Action Args
2019-04-07 04:50:17 ncoghlan set status: open -> closed; resolution: fixed; messages: + msg339556; stage: patch review -> resolved
2019-04-07 04:47:53 ncoghlan set messages: + msg339555
2019-04-01 20:23:38 CAM-Gerlach set messages: + msg339305
2019-04-01 16:45:51 lars.gustaebel set nosy: + lars.gustaebel; messages: + msg339300
2019-03-30 18:47:13 CAM-Gerlach set messages: + msg339218
2019-03-30 17:36:23 CAM-Gerlach set stage: needs patch -> patch review; pull_requests: + pull_request12566
2019-03-30 12:23:10 ncoghlan set nosy: + docs@python; messages: + msg339192; assignee: docs@python; components: + Documentation, - Library (Lib); stage: patch review -> needs patch
2019-03-15 19:22:05 CAM-Gerlach set versions: + Python 3.8, - Python 3.7
2019-03-15 19:21:20 CAM-Gerlach set nosy: + CAM-Gerlach; messages: + msg338021; components: + Library (Lib)
2019-03-15 18:32:37 CAM-Gerlach set keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request12321
2017-06-14 01:49:10 ncoghlan set messages: + msg295976
2017-06-14 01:25:39 ncoghlan create