classification
Title: Change default tar format to modern POSIX 2001 (pax) for better portability/interop, support and standards conformance
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: CAM-Gerlach, lars.gustaebel, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-03-12 02:14 by CAM-Gerlach, last changed 2019-03-30 17:36 by CAM-Gerlach. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12355 merged CAM-Gerlach, 2019-03-15 18:32
PR 12635 merged CAM-Gerlach, 2019-03-30 17:36
Messages (9)
msg337710 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-12 02:14
I propose changing tarfile.DEFAULT_FORMAT to tarfile.PAX_FORMAT, rather than the legacy tarfile.GNU_FORMAT, for Python 3.8. This would offer several benefits:

• Removes limitations of the old GNU tar format, including limits on maximum UID/GID values and on the number of bits in device major and minor numbers; pax is currently the most flexible and feature-rich tar format
• Encodes all filenames as UTF-8 in a portable way, ensuring consistent and correct handling on all platforms, avoiding errors like [this one](https://stackoverflow.com/questions/19902544/tarfile-produce-garbled-file-name-in-the-tar-gz-archivement), and generally providing expected, sensible defaults
• Is the current interoperable POSIX standard, used by all modern platforms (Linux, Unix, macOS, and third-party unarchivers on Windows), rather than a vendor-specific extension like GNU tar
• Is backwards compatible with any unarchiver capable of reading ustar format (unlike GNU tar), as the extended pax headers will just be ignored
• Fixes bpo-30661, "support tarfile.PAX_FORMAT in shutil.make_archive" (this change was proposed there as a fix, but it was never followed up on and the issue remains open)
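
As a rough illustration of the UTF-8 point in the list above, here is a minimal sketch (not part of the proposed patch) that writes an in-memory archive with an explicit format and reads the member name back. The helper names `make_archive_bytes` and `read_names` and the sample filename are made up for this example; only `tarfile.open`, `TarInfo`, `addfile`, and `getnames` are standard library calls.

```python
import io
import tarfile


def make_archive_bytes(name, payload, fmt):
    """Write a single member into an uncompressed in-memory tar archive."""
    buf = io.BytesIO()
    # Passing format= explicitly avoids depending on tarfile.DEFAULT_FORMAT.
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_names(data):
    """Return the member names stored in an in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        return tar.getnames()


# Hypothetical non-ASCII filename used only for this illustration.
name = "données/résumé.txt"
data = make_archive_bytes(name, b"hello", tarfile.PAX_FORMAT)
names = read_names(data)
print(names)
```

With pax, the non-ASCII path is carried in an extended header record as UTF-8, so it should round-trip regardless of the local filesystem encoding.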

This change would have no effect on reading existing archives, only writing new ones, and should be broadly compatible with any remotely modern system, as pax support is included in all the widely used libraries/systems:

* POSIX 2001 (major Unix vendors), released in 2001 (18 years ago)
* GNU tar 1.14 (Linux, etc), released in 2004 (15 years ago)
* bsdtar/libtar ~1.2.51 (BSD, macOS, etc), at least as of 2006 (13 years ago), with significant bug fixes up through 2011 (8 years ago)
* 7-zip (Windows) at some point before 2011 (>8 years ago), with significant bug fixes up to 2011 (8 years ago)
* Python 2.6, released in 2008 (11 years ago)

Furthermore, essentially every existing archiver supports ustar format, which would allow interoperability on very old/exotic platforms that don't support pax for some reason (and would certainly not support GNU). Therefore, it should be more than safe to make the change now, with archivers on the three major platforms supporting the modern standard for nearly a decade, and any esoteric ones at least as likely to support the POSIX standard as the vendor-specific GNU extension.
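
To make the "no effect on reading existing archives" point concrete, here is a small sketch under the assumption that the reader is Python's own tarfile module: it writes the same member in each of the three formats and reads it back without telling the reader which format was used. The `FORMATS` mapping and the `roundtrip` helper are names invented for this example.

```python
import io
import tarfile

# Labels are for display only; the values are the real tarfile constants.
FORMATS = {
    "ustar": tarfile.USTAR_FORMAT,
    "gnu": tarfile.GNU_FORMAT,
    "pax": tarfile.PAX_FORMAT,
}


def roundtrip(fmt):
    """Write one member using the given format, then read it back."""
    payload = b"hi"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo("hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    # No format argument on the read side: the reader works it out itself.
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.extractfile(tar.getmember("hello.txt")).read()


results = {label: roundtrip(fmt) for label, fmt in FORMATS.items()}
print(results)
```

This only demonstrates Python's reader; behavior of other unarchivers is as described in the text, not verified by this snippet.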

Is there any particular reason why we shouldn't make this change? Is there a particular group/list I should contact to follow up about seeing this implemented? It seems it should only require a one-line change [here](https://github.com/python/cpython/blob/3fe7fa316f74ed630fbbcdf54564f15cda7cb045/Lib/tarfile.py#L108), aside from updating the docs and presumably the NEWS file, which I would be willing to do (I would think it should make a fairly straightforward first contribution).
msg337860 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-13 16:39
Looks reasonable.

Do you know whether it is supported on OpenBSD and NetBSD? In other popular programming languages?
msg337871 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-13 19:33
In general, pax is a backwards-compatible superset of the standard, portable ustar format, unlike the vendor-specific GNU format, which even GNU tar itself no longer recommends, favoring an eventual switch to pax by default. To my understanding, pax is therefore essentially always the better choice. The only exception would be systems that support GNU tar but not POSIX 2001 and where the limitations of the old ustar must be bypassed, which as far as I'm aware is basically just really old (>10-15 years) GNU/Linux.
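
One of the ustar limitations in question is the size of the numeric owner fields. The sketch below assumes CPython's tarfile behavior as I understand it: a UID that does not fit in the 7 octal digits of the ustar header field is rejected with `ValueError` when writing ustar, while pax stores it in an extended header. `BIG_UID`, `write_member`, and `read_uid` are illustrative names, not anything from the patch.

```python
import io
import tarfile

# Assumption for illustration: any value above 8**7 - 1 (2097151)
# no longer fits in the 7 octal digits of the ustar uid field.
BIG_UID = 3_000_000


def write_member(fmt, uid):
    """Write one empty member owned by `uid` using the given tar format."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo("owned.txt")
        info.uid = uid
        tar.addfile(info)
    return buf.getvalue()


def read_uid(data):
    """Read the uid of the member back from the archive bytes."""
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return tar.getmember("owned.txt").uid


try:
    write_member(tarfile.USTAR_FORMAT, BIG_UID)
    ustar_ok = True
except ValueError:
    # Plain ustar cannot represent the value in its fixed-width field.
    ustar_ok = False

pax_uid = read_uid(write_member(tarfile.PAX_FORMAT, BIG_UID))
print(ustar_ok, pax_uid)
```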

NetBSD and OpenBSD both use bsdtar implementations, which as far as I could find means they support the POSIX 2001-standard pax format, and (unless they use libarchive which supports all three) likely *don't* support the current GNU format which is specific to GNU tar. Even if they don't, their ustar support means they can read pax archives as legacy ustar archives (as pax is backwards-compatible), while the same is not necessarily true of GNU tar archives. Therefore, pax is strictly a better choice than GNU or ustar.

Most other programming languages I could find did not have internal/standard library implementations, instead relying on the aforementioned libraries or varying third party packages:

* For C/C++, Libarchive and GNU tar are the two modern heavy hitters, and they have both supported it for a very long time. Modern versions of old-style bsdtar should as well, but if not then they don't support GNU tar either. These are commonly used when needed with C/C++, or programmers implement their own bespoke solutions.
* Libtar (C) does not, but it hasn't been updated for 6 years (and has been in minimal maintenance mode for over 15) so I'm not sure it's really relevant anymore. Virtually any platform will also have one of the previous.
* The major implementation for Java, Apache Commons Compress, added support for both pax and GNU in its 1.2 version, back in 2011 (8 years ago)
* R uses the system's tar executable (or bundled modern tar), so will have the same support as that (i.e. any remotely modern system should be compatible). Their documentation explicitly recommends against GNU tar in favor of pax or ustar instead for portability: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/tar.html
* git-archive uses pax exclusively
* PHP supports ustar only, not pax or GNU; in that case pax is generally the more compatible of the two extended formats
* The node-tar library, the apparent standard for JavaScript, supports it
* The standard tar package for Go supports it
* What seems to be the major current implementation for C#, SharpZipLib, supports it
* Ruby has no apparent standard implementation; a few third-party libraries have a mix of support
msg337929 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-14 15:02
Would you mind creating a PR?
msg337951 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-14 17:19
Sure, working on it now. It's my first contribution to CPython, so bear with me. I presume this is too trivial to go in the What's New in Python article, but does merit a NEWS entry so users are aware of the change? Aside from changing [this line](https://github.com/python/cpython/blob/3fe7fa316f74ed630fbbcdf54564f15cda7cb045/Lib/tarfile.py#L108), updating the documentation to reflect the change, and possibly adding a NEWS entry, is there anything else that needs to be done? Thanks.
msg338020 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-15 19:18
PR is up with CI checks green as [GH-12355](https://github.com/python/cpython/pull/12355). I also had to fix one test which implicitly assumed that DEFAULT_FORMAT == GNU_FORMAT.
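
For anyone whose code or tests make a similar implicit assumption about the default, a quick check along these lines may help. It is only a sketch: `FORMAT_NAMES` and `default_format_name` are names invented here, and the version check simply reflects that this change targets Python 3.8.

```python
import sys
import tarfile

# Human-readable labels for the three format constants (labels are ours).
FORMAT_NAMES = {
    tarfile.USTAR_FORMAT: "ustar",
    tarfile.GNU_FORMAT: "gnu",
    tarfile.PAX_FORMAT: "pax",
}


def default_format_name():
    """Return a label for the format tarfile uses when none is given."""
    return FORMAT_NAMES[tarfile.DEFAULT_FORMAT]


print(sys.version_info[:2], default_format_name())
```

Code that needs a specific format on every Python version can pass `format=` explicitly to `tarfile.open` rather than relying on the default.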
msg338151 - (view) Author: C.A.M. Gerlach (CAM-Gerlach) * Date: 2019-03-18 00:44
Also, one additional minor note (since I apparently can't edit comments here). Windows 10 (since the April 2018 update a year ago) now includes libarchive-based bsdtar built-in by default and accessible from the standard command prompt, which as mentioned fully supports pax.

Therefore, all modern platforms should support extracting pax archives out of the box. The exceptions are Windows 7/Server 2008, for which extended support will end within two months of Python 3.8's initial release; Windows 10 pre-1803, for which enterprise support will end a few months after that; and Windows 8.1/Server 2012, which will be in extended support for a few more years but has very low enterprise/developer/power user adoption. Of course, these don't include any built-in tar support at all anyway.
msg338546 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-21 14:45
New changeset e680c3db80efc4a1d637dd871af21276db45ae03 by Serhiy Storchaka (CAM Gerlach) in branch 'master':
bpo-36268: Change default tar format to pax from GNU. (GH-12355)
https://github.com/python/cpython/commit/e680c3db80efc4a1d637dd871af21276db45ae03
msg338547 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-03-21 14:46
Thank you for your contribution!
History
Date | User | Action | Args
2019-03-30 17:36:23 | CAM-Gerlach | set | pull_requests: + pull_request12567
2019-03-21 14:46:40 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; messages: + msg338547; stage: patch review -> resolved
2019-03-21 14:45:05 | serhiy.storchaka | set | messages: + msg338546
2019-03-18 00:44:50 | CAM-Gerlach | set | messages: + msg338151
2019-03-15 19:18:16 | CAM-Gerlach | set | messages: + msg338020
2019-03-15 18:32:37 | CAM-Gerlach | set | keywords: + patch; stage: patch review; pull_requests: + pull_request12320
2019-03-14 17:19:42 | CAM-Gerlach | set | messages: + msg337951
2019-03-14 15:02:50 | serhiy.storchaka | set | messages: + msg337929
2019-03-13 19:33:33 | CAM-Gerlach | set | messages: + msg337871
2019-03-13 16:39:28 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg337860
2019-03-12 02:39:16 | xtreak | set | nosy: + lars.gustaebel
2019-03-12 02:14:28 | CAM-Gerlach | create |