This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: behavioral differences between shutil.unpack_archive and ZipFile.extractall
Type: behavior Stage: patch review
Components: Documentation, Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: MarSoft, andrei.avk, docs@python, eric.araujo, huantian, peter@psantoro.net, pitrou, r.david.murray
Priority: normal Keywords: patch

Created on 2014-03-13 09:46 by peter@psantoro.net, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description Edit
shutil.diff peter@psantoro.net, 2014-03-13 09:46 patch
test_unpack.zip peter@psantoro.net, 2014-03-13 23:12 test script and test zips
Pull Requests
URL Status Linked Edit
PR 29910 open andrei.avk, 2021-12-04 00:12
Messages (10)
msg213374 - (view) Author: Peter Santoro (peter@psantoro.net) * Date: 2014-03-13 09:46
Since Python 3.3.1, ZipFile.extractall was enhanced to better handle absolute paths and illegal characters.  The associated logic within shutil._unpack_zipfile essentially skips zip members with these issues.

If a zip file contains all absolute paths, ZipFile.extractall works as expected (i.e. the zip file is unpacked), but shutil._unpack_zipfile (normally called indirectly via shutil.unpack_archive) appears to do nothing (i.e. it silently fails to unpack the zip file).

The attached patch attempts to unify the behavior of extracting zip files between shutil.unpack_archive with ZipFile.extractall.
msg213494 - (view) Author: Peter Santoro (peter@psantoro.net) * Date: 2014-03-13 23:12
I've attached a zip file which contains a test script and test zip files for the previously submitted Python 3.3.5 patch.  See the included README.txt for more information.  To view the contents of the included bad.zip file, use the following command:

> unzip -l bad.zip
msg214082 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-03-19 11:40
Well, you have to assume the code you're removing is there for a reason (e.g. perhaps this is meant to protect from attacks when opening a zip file uploaded by a user). 

I'd like to hear from Éric on this.
msg214093 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-19 13:19
First step would be to get rid of the warning in the zipfile docs and replace it with the info that the absolute path '/' and any relative path are stripped silently before the file is extracted.

It would also be worth adding an enhancement to zipfile to optionally not do it silently.

I hope the same considerations apply to tarfile, but I haven't checked.

In other words, I do think that code is a holdover from when zipfile *wasn't* safe, but since I didn't write it I don't know for sure.
msg214127 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2014-03-19 19:24
shutil.unpack_archive was extracted from distutils by Tarek.  I can do some Mercurial archaelogy to find more about the behaviour.
msg214196 - (view) Author: Peter Santoro (peter@psantoro.net) * Date: 2014-03-20 09:56
It seems clear to me that the logic in shutil._unpack_zipfile that silently skips paths that start with '/' (indicates absolute path) or that contain references to the parent directory ('..') was added to prevent malicious zip files from making potential malicious/unwanted modifications to the filesystem (perhaps at a time when zipfile did not itself contain such logic).  This conservative approach works, but it can have unexpected results.  For example, if all entries in a zip file contain these invalid characters, then shutil._unpack_zipfile appears to do nothing (i.e. the zip file is not unpacked).  This is good (except for the silent part), if the zip file is truly malicious.  However, I recently had to deal with thousands of zip files created by well known software vendors where hundreds of the zip files were created incorrectly and contained these invalid characters.  These files were not malicious, but they were created improperly. Note that shutil._unpack_zipfile silently failed to unzip these files, but by using ZipFile.extractall I could unzip them.

It appears that most unzipping software today either either ignores (sometimes silently) potentially malicious zip entries (e.g. Windows 7 Explorer displays an invalid zip file error) or it attempts to filter out/replace known bad characters so that the zip entries can be extracted (e.g. WinZip, gnu unzip).  I created this issue because the Python library uses both approaches, which may need rethinking.

The newer logic in ZipFile._extract_member, which is used by ZipFile.extractall, takes a different approach.  Instead of silently ignoring potentially malicious zip entries, it attempts to filter out or replace known invalid characters before extracting the zip entries.

From the Python zipfile docs:
---
If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows illegal characters (:, <, >, |, ", ?, and *) replaced by underscore (_).
---

As ZipFile._extract_member filters out/replaces more invalid characters than shutil._unpack_zipfile handles, one could argue that the (apparent older) approach used by shutil._unpack_zipfile is less safe.

The approaches used by shutil._unpack_zipfile and ZipFile.extractall to deal with potentially malicious zip file entries are different.  This issue could be closed if not deemed important by the Python core developers or it could be handled by documentation and/or coding changes.
msg214214 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-03-20 12:45
Note that unix unzip does exactly the same thing as zipfile extractall (except that it does issue warnings), and I believe this is considered "best practice" these days for extraction tools: strip out absolute/relative path components and extract to the destination directory.
msg397062 - (view) Author: huantian (huantian) Date: 2021-07-07 03:38
This issue is somewhat old, but I think either applying the patch attached earlier or noting this difference in behavior in the documentation would be ideal, as it can be quite difficult to debug shutil.unpack_archive silently skipping files. If the files you expect to unpack do contain '..' for non-malicious purposes this makes shutil.unpack_archive unsuitable for the situation.
msg407620 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-12-04 00:29
I think it may be good enough to add a warning on skipped files in _unpack_zipfile().

 - this way we keep backwards compatibility (especially since behavior in both modules differed for such a long time.)

 - it's not clear that ZipFile behavior is superior -- for example, what if a file with stripped path components overwrites existing files?

 - if requested in the future, a parameter can be added to enable ZipFile-like behavior

 - it can be very confusing if files are silently skipped, especially if an archive has thousands of files. 

I've added a PR, note that the test in PR also tests that files with '..' are indeed skipped, we don't have a test for that now, so that's an added benefit.
msg407622 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-12-04 01:34
I forgot to add this:

 - we may not want to follow the behavior of command line unzip - it's interactive so considerations are somewhat different. For example, it will warn if file is being overwritten and show a prompt on whether to skip, overwrite, abort, etc.
History
Date User Action Args
2022-04-11 14:57:59adminsetgithub: 65106
2022-04-07 21:02:08MarSoftsetnosy: + MarSoft
2021-12-12 18:58:58iritkatrielsetversions: + Python 3.11, - Python 3.3, Python 3.4
2021-12-04 01:34:53andrei.avksetmessages: + msg407622
2021-12-04 00:29:55andrei.avksetmessages: + msg407620
2021-12-04 00:12:50andrei.avksetnosy: + andrei.avk

pull_requests: + pull_request28135
stage: needs patch -> patch review
2021-07-07 03:38:31huantiansetnosy: + huantian
messages: + msg397062
2014-03-20 12:45:25r.david.murraysetmessages: + msg214214
2014-03-20 09:56:39peter@psantoro.netsetmessages: + msg214196
2014-03-19 19:24:30eric.araujosetmessages: + msg214127
2014-03-19 13:19:05r.david.murraysetnosy: + r.david.murray, docs@python
messages: + msg214093

assignee: docs@python
components: + Documentation
stage: test needed -> needs patch
2014-03-19 11:40:54pitrousetnosy: + eric.araujo, pitrou
messages: + msg214082
2014-03-13 23:12:18peter@psantoro.netsetfiles: + test_unpack.zip

messages: + msg213494
2014-03-13 09:51:20eric.araujosetstage: test needed
2014-03-13 09:46:05peter@psantoro.netcreate