classification
Title: tarfile: accessing (listing and extracting) tarball fails with UnicodeDecodeError
Type: behavior Stage: resolved
Components: Unicode Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Tomas Tomecek, ezio.melotti, iritkatriel, lars.gustaebel, snizio, vstinner
Priority: normal Keywords:

Created on 2016-04-12 08:32 by Tomas Tomecek, last changed 2021-05-28 16:44 by iritkatriel. This issue is now closed.

Messages (5)
msg263237 - (view) Author: Tomas Tomecek (Tomas Tomecek) Date: 2016-04-12 08:32
I have a tarball (generated by docker-1.10 via `docker export`) and am trying to extract it with python 2.7 tarfile:

```
import tarfile

with tarfile.open(name=tarball_path) as tar_fd:
    tar_fd.extractall(path=path)
```

Output from a pytest run:

```
/usr/lib64/python2.7/tarfile.py:2072: in extractall
    for tarinfo in members:
/usr/lib64/python2.7/tarfile.py:2507: in next
    tarinfo = self.tarfile.next()
/usr/lib64/python2.7/tarfile.py:2355: in next
    tarinfo = self.tarinfo.fromtarfile(self)
/usr/lib64/python2.7/tarfile.py:1254: in fromtarfile
    return obj._proc_member(tarfile)
/usr/lib64/python2.7/tarfile.py:1276: in _proc_member
    return self._proc_pax(tarfile)
/usr/lib64/python2.7/tarfile.py:1406: in _proc_pax
    value = value.decode("utf8")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

input = '\x01\x00\x00\x02\xc0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', errors = 'strict'

    def decode(input, errors='strict'):
>       return codecs.utf_8_decode(input, errors, True)
E       UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 4: invalid start byte

/usr/lib64/python2.7/encodings/utf_8.py:16: UnicodeDecodeError
```

Since I know nothing about tars, I have no idea if this is a bug or there is a proper solution/workaround.

When using GNU tar, I'm able to list and extract the tarball.
msg263239 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-12 08:42
Can you give a link to the tar archive, or for example the first 256 KB of the archive?
msg263241 - (view) Author: Tomas Tomecek (Tomas Tomecek) Date: 2016-04-12 10:00
Unfortunately I can't, since it's an internal Docker image. I have found a bug report in Red Hat Bugzilla with more info: https://bugzilla.redhat.com/show_bug.cgi?id=1194473 There's even a commit with a fix (via monkeypatching): https://github.com/goldmann/docker-squash/commit/81d1c4c18960a5d940be9b986ccbfaa7853aceb1

If needed, I can construct a minimal reproducer.
msg329285 - (view) Author: Sławomir Nizio (snizio) Date: 2018-11-05 08:01
I had the same problem with entries:

SCHILY.xattr.system.posix_acl_default, SCHILY.xattr.system.posix_acl_access

in a tarball with pax header.
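To check which pax header keys an archive carries, Python 3's `tarfile` exposes them per member through `TarInfo.pax_headers`. The following is a minimal sketch, not taken from the affected tarball: it builds a small in-memory archive with a made-up, text-valued key (`SCHILY.xattr.user.demo` and the member name `demo.txt` are illustrative assumptions) and reads it back.

```python
import io
import tarfile

# Build a tiny pax-format archive in memory. The header key and value
# are invented for illustration; they are not from the reported tarball.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="demo.txt")
    info.size = len(data)
    info.pax_headers = {"SCHILY.xattr.user.demo": "value"}
    tar.addfile(info, io.BytesIO(data))

# Read the archive back and collect the pax headers of every member.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    headers = {member.name: dict(member.pax_headers) for member in tar.getmembers()}

for name, pax in headers.items():
    print(name, pax)
```

Running the same loop over a real archive would show whether keys such as the `SCHILY.xattr.*` entries above are present.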

This seems to be fixed for Python 3 in issue 8633, commit 1465cc2 in cpython.

Tarfile from Python 2 assumes (in _proc_pax) that the values can always be decoded as UTF-8 strings.
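The failure mode can be sketched in isolation. The snippet below is an illustration only, not the CPython fix or the docker-squash monkeypatch: it takes a value with the same leading bytes as the one in the traceback, shows that a strict UTF-8 decode rejects it, and uses a hypothetical helper, `decode_pax_value`, that keeps the raw bytes when the value is not valid UTF-8.

```python
# Same leading bytes as the pax value shown in the traceback; 0xc0 is
# never a valid UTF-8 start byte, so a strict decode must fail.
RAW_VALUE = b"\x01\x00\x00\x02\xc0" + b"\x00" * 15


def decode_pax_value(raw):
    """Hypothetical helper: decode a pax value, keeping raw bytes on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Binary payloads (for example ACL xattrs) are not text.
        return raw


# Strict decoding, which is what _proc_pax effectively does in Python 2.7.
strict_failed = False
try:
    RAW_VALUE.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

print(strict_failed)
print(type(decode_pax_value(RAW_VALUE)).__name__)
print(decode_pax_value(b"user.name"))
```

Returning bytes for undecodable values is just one possible policy; decoding with a lenient error handler would be another.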
msg394670 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-05-28 16:44
Python 2.7 is no longer maintained. There aren't enough details here to tell whether the issue was fixed in python 3.

If you are having this problem with python 3.9+, please create a new issue.
History
Date User Action Args
2021-05-28 16:44:34 iritkatriel set status: open -> closed; nosy: + iritkatriel; messages: + msg394670; resolution: out of date; stage: resolved
2018-11-05 08:01:21 snizio set nosy: + snizio; messages: + msg329285
2016-04-12 10:00:44 Tomas Tomecek set messages: + msg263241
2016-04-12 08:42:52 vstinner set messages: + msg263239
2016-04-12 08:36:25 SilentGhost set nosy: + lars.gustaebel; type: behavior
2016-04-12 08:32:18 Tomas Tomecek create