classification
Title: tarfile.py: fix GNU and USTAR formats to properly handle paths with special characters that are encoded with more than one byte each
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.6, Python 3.5
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: lars.gustaebel
Nosy List: Roddy Shuler, lars.gustaebel, python-dev, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2015-08-10 18:04 by Roddy Shuler, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
fix-tarfile-path-truncation.patch Roddy Shuler, 2015-08-10 18:04 Patch to fix tarfile truncation with multi-byte characters
Messages (7)
msg248363 - (view) Author: Roddy Shuler (Roddy Shuler) * Date: 2015-08-10 18:04
The GNU and USTAR formats handle a file path specially when it is longer than 100 bytes. The check for this case, however, compares the length against 100 characters rather than 100 bytes. So if a path is at most 100 characters long but contains special characters that are each encoded with more than one byte, so that the encoded length exceeds 100 bytes, the encoded string is truncated to 100 bytes and the file name stored in the tar file is cut short.

For example...

/gt-education/Colección Educativa Guatemala/thumbs/Libro de Texto Comunicacion y Lenguaje 1 Grado.jpg

is truncated as:

/gt-education/Colección Educativa Guatemala/thumbs/Libro de Texto Comunicacion y Lenguaje 1 Grado.jp

The attached patch fixes this. I initially found the problem on Python 3.3. The patch is tested on Linux with version 3.4.3-6 from Debian. Looking at the source code, I am pretty confident that the problem still exists upstream in Python 3.5.
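
The mismatch between character count and byte count can be reproduced without tarfile at all. The following is a minimal sketch, not the code from the patch: `fits_in_name_field` is a hypothetical helper, UTF-8 is assumed as the archive encoding, and the leading slash of the example path is assumed to have been stripped before the name is stored.

```python
# Sketch of the character-count vs byte-count mismatch described in the report.
LENGTH_NAME = 100  # size of the name field in a ustar/GNU header, in bytes


def fits_in_name_field(name, encoding="utf-8"):
    """Hypothetical helper: True if the *encoded* name fits the 100-byte field."""
    return len(name.encode(encoding)) <= LENGTH_NAME


# Path from the report, without the leading slash (assumed to be stripped).
name = ("gt-education/Colección Educativa Guatemala/thumbs/"
        "Libro de Texto Comunicacion y Lenguaje 1 Grado.jpg")

char_len = len(name)                  # what the buggy check looked at
byte_len = len(name.encode("utf-8"))  # what actually has to fit in the field

# Cutting the encoded name at 100 bytes drops the final character.
truncated = name.encode("utf-8")[:LENGTH_NAME].decode("utf-8")

print(char_len, byte_len)
print(char_len <= LENGTH_NAME, fits_in_name_field(name))
print(truncated[-8:])
```

The character-based comparison accepts the name while the byte-based one rejects it, which is the gap the report describes.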
msg248576 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2015-08-14 09:38
Thanks for the detailed report and the patch. I haven't checked yet, but I suppose that the entire 3.x branch is affected. The first thing I have to do now is to come up with a comprehensive testcase.
msg263713 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-04-19 06:54
New changeset d08d6b776694 by Lars Gustäbel in branch '3.5':
Issue #24838: tarfile's ustar and gnu formats now correctly calculate name and
https://hg.python.org/cpython/rev/d08d6b776694

New changeset e281a57d5b29 by Lars Gustäbel in branch 'default':
Issue #24838: Merge tarfile fix from 3.5.
https://hg.python.org/cpython/rev/e281a57d5b29
msg263719 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-19 08:04
Tests fail on FreeBSD:

http://buildbot.python.org/all/builders/AMD64%20FreeBSD%209.x%203.5/builds/713/steps/test/logs/stdio

Example:



======================================================================
FAIL: test_unicode_link1 (test.test_tarfile.UstarUnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/buildbot/python/3.5.koobs-freebsd9/build/Lib/test/test_tarfile.py", line 1807, in test_unicode_link1
    self._test_ustar_link("0123456789" * 9 + "01234567\xff")
  File "/usr/home/buildbot/python/3.5.koobs-freebsd9/build/Lib/test/test_tarfile.py", line 1826, in _test_ustar_link
    self.assertEqual(name, t.linkname)
AssertionError: '0123[44 chars]89012345678901234567890123456789012345678901234567\xff' != '0123[44 chars]89012345678901234567890123456789012345678901234567\udcc3\udcbf'
- 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567\xff
?                                                                                                   ^
+ 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567\udcc3\udcbf
?                                                                                                   ^^
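
The `\udcc3\udcbf` pair in the assertion is what appears when the UTF-8 bytes for `\xff` are decoded with a non-UTF-8 codec and the `surrogateescape` error handler. The sketch below reproduces that effect in isolation. It assumes the reading side used ASCII, which the build log does not state, and it does not go through tarfile.

```python
# Name used by test_unicode_link1 in the traceback.
name = "0123456789" * 9 + "01234567\xff"

# "\xff" (U+00FF) is encoded in UTF-8 as the two bytes C3 BF.
utf8_bytes = name.encode("utf-8")

# Assumption: the bytes are read back with ASCII plus "surrogateescape",
# as could happen with a non-UTF-8 filesystem encoding.
misread = utf8_bytes.decode("ascii", "surrogateescape")

print(len(name), len(utf8_bytes))
print(ascii(name[-1]), ascii(misread[-2:]))
```

Each undecodable byte becomes one lone surrogate, so the single character `\xff` comes back as two code points and the comparison fails.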
msg263722 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-04-19 10:01
New changeset 78ede2baa146 by Lars Gustäbel in branch '3.5':
Issue #24838: Fix test_tarfile.py for non-utf8 filesystem encodings.
https://hg.python.org/cpython/rev/78ede2baa146

New changeset 08835d1e7a50 by Lars Gustäbel in branch 'default':
Issue #24838: Merge test_tarfile.py fix from 3.5.
https://hg.python.org/cpython/rev/08835d1e7a50
msg263723 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2016-04-19 10:02
Sorry for the glitch. I suppose everything works fine now.
msg281986 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-29 12:04
FYI the first release including the fix 78ede2baa146 is Python 3.5.2.
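
One way to check whether a given interpreter behaves correctly is an in-memory round trip with the path from the original report. This is a sketch rather than the test added to test_tarfile.py: `round_trip` is a hypothetical helper, and UTF-8 is passed explicitly so the result does not depend on the filesystem encoding.

```python
import io
import tarfile

# Path from the report without the leading slash: 100 characters, 101 UTF-8 bytes.
name = ("gt-education/Colección Educativa Guatemala/thumbs/"
        "Libro de Texto Comunicacion y Lenguaje 1 Grado.jpg")


def round_trip(member_name, fmt, encoding="utf-8"):
    """Hypothetical helper: write one member to an in-memory archive
    and return the member names read back from it."""
    buf = io.BytesIO()
    payload = b"x"
    with tarfile.open(fileobj=buf, mode="w", format=fmt, encoding=encoding) as tf:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r", encoding=encoding) as tf:
        return tf.getnames()


results = {fmt: round_trip(name, fmt)
           for fmt in (tarfile.USTAR_FORMAT, tarfile.GNU_FORMAT)}

for fmt, names in results.items():
    # True means the name survived intact; False points at truncation.
    print(fmt, names == [name])
```

On an interpreter that includes the fix, both formats should report that the name came back unchanged.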
History
Date  User  Action  Args
2022-04-11 14:58:19  admin  set  github: 69026
2016-11-29 12:04:09  vstinner  set  messages: + msg281986
2016-11-29 12:02:10  serhiy.storchaka  link  issue28831 superseder
2016-04-19 10:55:31  berker.peksag  set  resolution: fixed
2016-04-19 10:02:47  lars.gustaebel  set  status: open -> closed; messages: + msg263723
2016-04-19 10:01:33  python-dev  set  messages: + msg263722
2016-04-19 10:01:21  serhiy.storchaka  set  nosy: + serhiy.storchaka
2016-04-19 08:04:49  vstinner  set  status: closed -> open; nosy: + vstinner; messages: + msg263719; resolution: fixed -> (no value)
2016-04-19 06:56:12  lars.gustaebel  set  status: open -> closed; stage: test needed -> resolved; resolution: fixed; versions: - Python 3.2, Python 3.3, Python 3.4
2016-04-19 06:54:49  python-dev  set  nosy: + python-dev; messages: + msg263713
2015-08-14 09:38:57  lars.gustaebel  set  assignee: lars.gustaebel; components: + Library (Lib); versions: + Python 3.2, Python 3.3, Python 3.4, Python 3.6; nosy: + lars.gustaebel; messages: + msg248576; stage: test needed
2015-08-10 18:04:23  Roddy Shuler  create