
Classification
Title: tarfile module considers anything starting with 512 bytes of zero bytes to be a valid tar file
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5, Python 2.7

Process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jeffrey.Kintscher, cks, eitan.adler, hitbox, lars.gustaebel, rthugh02, serhiy.storchaka
Priority: normal
Keywords:

Created on 2019-04-11 02:07 by cks, last changed 2022-04-11 14:59 by admin.

Messages (4)
msg339915 - Author: Chris Siebenmann (cks) Date: 2019-04-11 02:07
The easiest reproduction of this is:

    import tarfile
    tarfile.open("/dev/zero", "r:")

(If you use plain "r" you get a hang in attempted lzma decoding.)
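
For platforms without /dev/zero, the reproduction can be approximated in memory. This is only a sketch: `opens_as_tar` is a hypothetical helper name, and what it prints for the zero-block case depends on the Python version in use.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def opens_as_tar(data: bytes) -> bool:
    """Return True if tarfile.open() accepts `data` as an uncompressed archive."""
    try:
        # "r:" disables compression detection, as in the reproduction above.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:"):
            return True
    except tarfile.ReadError:
        return False


if __name__ == "__main__":
    # A single block of zeros, the case this report is about.
    print("one zero block:", opens_as_tar(bytes(BLOCK)))
    # A block of non-zero text that is clearly not a tar header.
    print("non-tar text:  ", opens_as_tar(b"x" * BLOCK))
```

Comparing the two printed results shows whether a leading zero block is treated differently from other invalid headers.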

I believe this is probably because the 'except EOFHeaderError' handler lacks the 'elif self.offset == 0:' check that almost all of the other exception handlers have.

Based on the history of the code, this appears to be a very long-standing issue.

msg340488 - Author: Read Hughes (rthugh02) Date: 2019-04-18 14:05
GNU description of tar file format: http://www.gnu.org/software/tar/manual/html_node/Standard.html

Particular quotes that are relevant:

>Physically, an archive consists of a series of file entries terminated by an end-of-archive entry, which consists of two 512 blocks of zero bytes

>Each file archived is represented by a header block which describes the file, followed by zero or more blocks which give the contents of the file. At the end of the archive file there are two 512-byte blocks filled with binary zeros as an end-of-file marker

In the original V7 layout the header fields themselves occupy 257 bytes (the ustar layout extends them to 500); the rest of the 512-byte block is padded with NUL.
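
As a rough illustration of the block layout, the sketch below asks tarfile to build a single ustar header and inspects it. The member name "hello.txt" is made up for the example; the offsets used (magic at byte 257, fields ending at byte 500) are those of the ustar layout.

```python
import tarfile

# Hypothetical member name, used only for illustration.
info = tarfile.TarInfo(name="hello.txt")

# Build the raw header block in ustar format.
header = info.tobuf(format=tarfile.USTAR_FORMAT)

print(len(header))                # size of the header block in bytes
print(header[257:263])            # the ustar magic field
print(header[500:] == bytes(12))  # whether the trailing padding is all NUL
```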

I have no input beyond this; I am just trying to bring together any relevant information that may help with this issue.

msg342764 - Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-17 20:47
I did some testing with BSD and GNU tar to compare with Python's behavior.

jfoo:~ jeff$ tar --version
bsdtar 2.8.3 - libarchive 2.8.3

jeff@albarino:~$ tar --version
tar (GNU tar) 1.28

Both BSD tar and GNU tar can create an empty tar file that consists of all zero bytes. BSD tar creates a 1 KB file:

jfoo:~ jeff$ tar -cf tarfilename.tar -T /dev/null
jfoo:~ jeff$ hexdump tarfilename.tar 
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000400
jfoo:~ jeff$ tar -tf tarfilename.tar
jfoo:~ jeff$ echo $?
0

while GNU tar creates a 10 KB file:

jeff@albarino:~$ tar -cf tarfilename.tar -T /dev/null
jeff@albarino:~$ hexdump tarfilename.tar
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0002800
jeff@albarino:~$ tar -tf tarfilename.tar 
jeff@albarino:~$ echo $?
0
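
For comparison, Python's own tarfile module can be asked to write an empty archive. The sketch below creates one in memory and checks its size and contents; it assumes the default record size used by tarfile when writing.

```python
import io
import tarfile

buf = io.BytesIO()
# Create an archive and close it without adding any members.
with tarfile.open(fileobj=buf, mode="w"):
    pass

data = buf.getvalue()
print(len(data))                   # total size of the empty archive in bytes
print(data.count(0) == len(data))  # True if every byte is zero

# Reading it back (without compression detection) yields no members.
with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
    members = tar.getmembers()
print(members)
```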

GNU tar will also leave a tar file with 10 KB of zeros when all contents have been deleted (BSD tar doesn't support deletion):

jeff@albarino:~$ tar cf empty.tar tarfilename.tar 
jeff@albarino:~$ hexdump empty.tar 
0000000 6174 6672 6c69 6e65 6d61 2e65 6174 0072
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
0000060 0000 0000 3030 3030 3636 0034 3030 3130
0000070 3537 0031 3030 3130 3537 0031 3030 3030
0000080 3030 3432 3030 0030 3331 3634 3637 3430
0000090 3331 0037 3130 3432 3637 2000 0030 0000
00000a0 0000 0000 0000 0000 0000 0000 0000 0000
*
0000100 7500 7473 7261 2020 6a00 6665 0066 0000
0000110 0000 0000 0000 0000 0000 0000 0000 0000
0000120 0000 0000 0000 0000 6a00 6665 0066 0000
0000130 0000 0000 0000 0000 0000 0000 0000 0000
*
0005000
jeff@albarino:~$ tar --delete -f empty.tar tarfilename.tar
jeff@albarino:~$ hexdump empty.tar 
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0002800
jfoo:~ jeff$ tar -tf empty.tar
jfoo:~ jeff$ echo $?
0


According to the POSIX.1 standard, "[t]he last physical block shall always be the full size, so logical records after the two zero logical records may contain undefined data." (http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#).

It looks like any file starting with 1,024 bytes of zeros is a valid tar archive per BSD tar, GNU tar, and the POSIX.1 standard.
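
That rule can be written down as a small predicate. This is a sketch only, and `starts_with_end_of_archive` is a hypothetical name, not something provided by tarfile.

```python
BLOCK = 512  # tar block size in bytes


def starts_with_end_of_archive(data: bytes) -> bool:
    """Return True if `data` begins with two full 512-byte blocks of zeros."""
    marker = data[: 2 * BLOCK]
    # Both blocks must be present in full and contain only zero bytes.
    return len(marker) == 2 * BLOCK and not any(marker)


if __name__ == "__main__":
    print(starts_with_end_of_archive(bytes(1024) + b"trailing data"))
    print(starts_with_end_of_archive(bytes(512) + b"\x01" * 512))
```

Anything after the first 1,024 bytes is deliberately ignored, in line with the POSIX wording quoted above.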

However, BSD tar and GNU tar disagree about files starting with 512 bytes of zeros followed by 512 bytes of garbage. First, I constructed such a file for testing (zr.tar):

jfoo:~ jeff$ dd if=/dev/zero of=zr.tar bs=512 count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000060 secs (8521761 bytes/sec)
jfoo:~ jeff$ dd if=/dev/random of=zr.tar bs=512 count=1 oseek=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000056 secs (9138228 bytes/sec)
jfoo:~ jeff$ ls -l zr.tar
-rw-r--r--  1 jeff  staff  1024 May 17 13:14 zr.tar
jfoo:~ jeff$ hexdump zr.tar 
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000200 d7 56 a9 8d 26 11 a4 d8 9a 96 15 04 8d 4b 31 5d
0000210 33 2b 20 ae a2 23 09 8c 60 a1 73 12 a1 ab 73 61
0000220 69 eb 88 bf 8a 7d 6b 9a c5 79 b6 c9 9b a9 5a 6d
0000230 4b 4a 81 a7 71 da 90 24 3f 8f 43 a9 95 a0 20 bb
0000240 93 0f b2 be 7e 4d 80 49 aa 61 19 a2 6b b5 5c f4
0000250 e0 34 7f 99 a0 d3 29 08 9a 25 97 96 d4 d0 07 e4
0000260 90 1c 60 97 9a 23 d3 25 38 54 97 8b 71 a0 83 40
0000270 a6 f9 19 1b 3f 6e bc 5b 06 22 20 fc ff fe 7b eb
0000280 35 9b 52 57 14 83 90 7f d3 e8 f4 72 58 96 16 8c
0000290 09 ad 2a 2f ad fd 43 09 96 eb 7c 8f fc a6 14 d9
00002a0 18 34 38 b6 6a 5a ff 66 6d 46 cb 77 7a 5c 1e 72
00002b0 3e 27 05 3a b0 c4 52 7b c8 cc 26 b9 c3 5f 39 27
00002c0 a3 49 9e f1 3f f8 7e 46 98 df 7c 9d e3 86 c3 72
00002d0 e1 ef 98 7d a1 96 4e 4b 82 bb f4 2b f3 71 6f 16
00002e0 fe 38 2d bc 2b 70 b3 e6 db 1b ad 44 13 06 28 e5
00002f0 3d 05 07 3c 5f 09 5b 90 67 09 0b 5a db 79 b7 27
0000300 8a 4b e5 b3 66 f0 7a 9d a5 c4 e3 a8 b4 b2 d2 c8
0000310 5d d1 27 81 03 25 33 f4 fb 6f 77 b1 df 9d fa cf
0000320 01 a7 70 40 b4 7f 6b ac 04 70 5c 29 06 6a 73 64
0000330 4f 15 92 3b 5e a4 34 95 e0 4b 04 be ca 87 e9 73
0000340 1e 63 98 f3 f1 fd be 7a de fe 84 27 b7 e4 db e0
0000350 fb 04 7f 9d f0 ae af a3 8e 0f c2 a7 80 e0 32 38
0000360 17 1e 47 37 48 9b 99 35 58 9d d5 83 1b 67 d4 e8
0000370 15 0d 00 bb 79 f3 37 59 c3 5e e9 1d 87 79 96 de
0000380 6c 89 35 34 0b b1 12 b2 a8 2d 61 dd f5 9a 19 e7
0000390 c1 c5 24 46 fa 23 f0 db 72 7f a5 18 aa e2 db 04
00003a0 1e cc a6 0f 9e 4e 00 d9 2d eb f9 fc c4 d5 8e 46
00003b0 ab c3 ed 53 98 df a8 81 26 f4 b5 0f b4 7f 12 a4
00003c0 4a aa 14 4c f5 aa dd ba 69 e5 a8 d5 b3 68 0b 9f
00003d0 1a aa 34 a4 60 09 c2 30 22 32 72 dd 2e f9 7a 79
00003e0 88 a3 6a 99 13 4f f4 27 db 02 2e cb a0 ec d8 4d
00003f0 fe 68 44 0c 7b 3a 74 8d 8e cd ba 3e d8 ef cb 97
0000400


GNU tar outputs a warning message, but still returns zero:

jeff@albarino:~$ tar -tvf zr.tar 
tar: A lone zero block at 1
jeff@albarino:~$ echo $?
0

while BSD tar silently accepts the file:

jfoo:~ jeff$ tar -tvf zr.tar 
jfoo:~ jeff$ echo $?
0

Python also accepts the file as valid:

>>> tarfile.open("zr.tar", "r")
<tarfile.TarFile object at 0x10efa2820>

Personally, I think that an error should be returned if the file starts with a zero block followed by a non-zero block. However, changing Python to do that would make its behavior inconsistent with two of the most widely used tar utilities.
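
Callers who want the stricter behavior today could approximate it outside the library. The following is a minimal sketch, assuming the whole archive fits in memory; `open_strict` is a hypothetical wrapper and not part of tarfile.

```python
import io
import tarfile

BLOCK = 512  # tar block size in bytes


def open_strict(data: bytes) -> tarfile.TarFile:
    """Open `data` as a tar archive, rejecting a lone leading zero block.

    Hypothetical helper, not part of the tarfile module.
    """
    first, second = data[:BLOCK], data[BLOCK:2 * BLOCK]
    # A full zero block followed by any non-zero byte is treated as an error.
    if len(first) == BLOCK and not any(first) and any(second):
        raise tarfile.ReadError("zero block followed by non-zero data")
    return tarfile.open(fileobj=io.BytesIO(data), mode="r:")


if __name__ == "__main__":
    # Same shape as zr.tar: 512 zero bytes, then 512 non-zero bytes.
    zr = bytes(BLOCK) + bytes(range(256)) * 2
    try:
        open_strict(zr)
    except tarfile.ReadError as exc:
        print("rejected:", exc)
```

An archive consisting only of zero blocks still opens through this wrapper, so empty archives written by BSD tar or GNU tar are not affected.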

msg343732 - Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-28 05:00
I recommend closing this issue since the behavior is the same as the BSD and GNU tar utilities.

History
Date                 User               Action  Args
2022-04-11 14:59:13  admin              set     github: 80777
2019-05-28 05:00:37  Jeffrey.Kintscher  set     type: behavior
                                                messages: + msg343732
2019-05-17 20:47:58  Jeffrey.Kintscher  set     nosy: + Jeffrey.Kintscher
                                                messages: + msg342764
2019-04-18 14:05:03  rthugh02           set     nosy: + rthugh02
                                                messages: + msg340488
2019-04-12 03:14:00  eitan.adler        set     nosy: + eitan.adler
2019-04-11 07:14:24  hitbox             set     nosy: + hitbox
2019-04-11 02:16:39  xtreak             set     nosy: + lars.gustaebel, serhiy.storchaka
2019-04-11 02:07:20  cks                create