Title: tarfile extract fails when Unicode in pathname
Type: behavior Stage: resolved
Components: Library (Lib), Unicode Versions: Python 2.7
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Vadim Markovtsev2, ZackerySpytz, ezio.melotti, hynek, lars.gustaebel, vinay.sajip
Priority: normal Keywords:

Created on 2013-02-07 16:43 by vinay.sajip, last changed 2021-05-31 22:27 by vinay.sajip. This issue is now closed.

File name Uploaded Description
failing.tar.gz vinay.sajip, 2013-02-07 16:44 Failing archive
Messages (7)
msg181631 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2013-02-07 16:43
The attached file failing.tar.gz contains a path with UTF-8-encoded Unicode. This causes extractall() to fail, but only when the destination path is Unicode. That's because it leads to an implicit str->unicode conversion using ASCII.

Test script:

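import shutil, tarfile, tempfile

tf = tarfile.open('failing.tar.gz', 'r:gz')
workdir = tempfile.mkdtemp()
try:
    # N.B. ensure dest path is Unicode to trigger the failure
    tf.extractall(unicode(workdir))
finally:
    # remove the temporary directory whether or not extraction succeeded
    shutil.rmtree(workdir)
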

$ python
Traceback (most recent call last):
  File "", line 8, in <module>
  File "/usr/lib/python2.7/", line 2046, in extractall
    self.extract(tarinfo, path)
  File "/usr/lib/python2.7/", line 2083, in extract
    self._extract_member(tarinfo, os.path.join(path,
  File "/usr/lib/python2.7/", line 71, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44: ordinal not in range(128)
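
The failing step can be reproduced in isolation. The following is only a sketch written for Python 3, where the conversion that Python 2 performs implicitly inside os.path.join() has to be spelled out by hand. The destination path, the member name, and the helper py2_style_join are all hypothetical and are not taken from failing.tar.gz.

```python
# Hypothetical values: a Unicode destination and a member name held as raw
# UTF-8 bytes, which is how a Python 2 str from the archive would look.
dest = u'/tmp/workdir'
member = u'caf\u00e9.txt'.encode('utf-8')


def py2_style_join(path, name):
    """Mimic Python 2's `path += '/' + b` for a unicode path and a byte-string b."""
    # Python 2 decodes the byte string with the default ASCII codec here.
    return path + u'/' + name.decode('ascii')


try:
    py2_style_join(dest, member)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first non-ASCII byte (0xc3) is what the ASCII codec rejects.
print(error)
```

Decoding the same bytes as UTF-8 succeeds, which is why the failure only shows up when the default ASCII codec is applied.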
msg221135 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-20 23:46
@Lars can we have a response on this issue please?
msg222553 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2014-07-08 10:40
IIRC, tarfile under 2.7 has never been explicitly unicode-safe; support for unicode objects is heterogeneous at best. The obvious work-around is to work exclusively with str objects.

What we can't do is to decode the utf-8 pathname from the archive to a unicode object, because we have no way to detect an archive's encoding. We can either emit a warning if the user passes a unicode object to extract() or we implicitly encode the passed unicode object using TarFile.encoding, so that the os.path.join() succeeds.

Unfortunately, I am not entirely sure if there was possibly a rationale behind the current behaviour of extract(). This needs more inspection.
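
As a rough illustration of the second option only (implicitly encoding the destination), here is a sketch written for Python 3, where text is str and archive-style names are bytes. It is not tarfile's actual code: the helper join_dest and its encoding parameter, which stands in for TarFile.encoding, are assumptions made for this example, and posixpath is used so the result is the same on every platform.

```python
import posixpath


def join_dest(dest, member_name, encoding="utf-8"):
    """Join a destination directory with a byte-string member name.

    Hypothetical helper: `encoding` stands in for TarFile.encoding.
    """
    if isinstance(dest, str):
        # Encode the text destination so both operands are byte strings
        # and no implicit ASCII decoding can take place.
        dest = dest.encode(encoding)
    return posixpath.join(dest, member_name)


member = "caf\u00e9.txt".encode("utf-8")   # made-up member name
joined = join_dest("/tmp/workdir", member)
print(joined)
```

The join never needs to know the archive's encoding, because the member name is left as bytes and only the caller-supplied path is converted.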
msg272329 - (view) Author: Vadim Markovtsev (Vadim Markovtsev2) Date: 2016-08-10 12:50
So... The bug persists in 3.5 and 3.6. It prevents, e.g., unpacking tarballs coming from GitHub repos with Unicode file names.
msg272330 - (view) Author: Vadim Markovtsev (Vadim Markovtsev2) Date: 2016-08-10 12:54
Relevant issue in pip:
msg272370 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2016-08-10 20:01
Could you point to some suitable projects from GitHub whose tarballs fail on 3.5 / 3.6? My script in the first post, with "unicode(...)" replaced by "str(...)" and run against my original failing archive, works on Python 3.5 and 3.6 on Linux. Which platform have you seen failures on?
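
For anyone wanting to check their own platform, a self-contained Python 3 variant of that script might look like the sketch below. It does not use failing.tar.gz: it builds a small archive of its own, and the member name and contents are made up. The filter argument is passed only where the running Python provides tarfile.data_filter.

```python
import io
import os
import shutil
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
try:
    archive = os.path.join(workdir, "sample.tar.gz")
    name = "d\u00e9mo/caf\u00e9.txt"   # made-up non-ASCII member name
    data = b"hello"

    # Build a tiny gzip-compressed archive with one non-ASCII path.
    with tarfile.open(archive, "w:gz") as tf:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

    dest = os.path.join(workdir, "out")
    os.mkdir(dest)

    # Newer Pythons accept an extraction filter; older ones do not.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tf:
        tf.extractall(str(dest), **kwargs)

    extracted = os.path.exists(os.path.join(dest, name))
finally:
    # Clean up the temporary directory regardless of the outcome.
    shutil.rmtree(workdir)

print(extracted)
```

Whether this succeeds can depend on the filesystem encoding of the platform, which is the point of the question above.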
msg394828 - (view) Author: Zackery Spytz (ZackerySpytz) * (Python triager) Date: 2021-05-31 21:06
Python 2.7 is no longer supported, so I think this issue should be closed.
Date                 User               Action  Args
2021-05-31 22:27:36  vinay.sajip        set     status: open -> closed
                                                resolution: out of date
                                                stage: resolved
2021-05-31 21:06:36  ZackerySpytz       set     nosy: + ZackerySpytz
                                                messages: + msg394828
2016-08-11 15:17:31  BreamoreBoy        set     nosy: - BreamoreBoy
2016-08-10 20:01:27  vinay.sajip        set     messages: + msg272370
2016-08-10 12:54:14  Vadim Markovtsev2  set     messages: + msg272330
2016-08-10 12:50:55  Vadim Markovtsev2  set     nosy: + Vadim Markovtsev2
                                                messages: + msg272329
2014-07-08 10:40:12  lars.gustaebel     set     messages: + msg222553
2014-06-20 23:46:17  BreamoreBoy        set     nosy: + BreamoreBoy
                                                messages: + msg221135
2013-02-08 10:19:47  hynek              set     nosy: + hynek
2013-02-07 16:45:09  vinay.sajip        set     nosy: + lars.gustaebel
2013-02-07 16:44:07  vinay.sajip        set     files: + failing.tar.gz
2013-02-07 16:43:21  vinay.sajip        create