classification
Title: zipfile does not handle arcnames with non-ascii characters on Windows
Type: behavior
Stage: resolved
Components: Windows
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Shane Lee, paul.moore, serhiy.storchaka, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords: patch

Created on 2019-02-21 01:59 by Shane Lee, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files

| File name | Uploaded | Description |
| --- | --- | --- |
| test.zip | Shane Lee, 2019-02-21 01:59 | A zip file containing files with non-ascii filenames |

Pull Requests

| URL | Status | Linked |
| --- | --- | --- |
| PR 11965 | closed | python-dev, 2019-02-21 02:19 |

Messages (2)
msg336172 - Author: Shane Lee (Shane Lee) Date: 2019-02-21 01:59
Python 2.7.15 (probably affects newer versions as well)

Given an archive containing one or more files with non-ASCII characters in their file names, `zipfile` raises a `UnicodeDecodeError` when extracting them to the file system:

```
Traceback (most recent call last):
  File "c:\dev\salt\salt\modules\archive.py", line 1081, in unzip
    zfile.extract(target, dest, password)
  File "c:\python27\lib\zipfile.py", line 1028, in extract
    return self._extract_member(member, path, pwd)
  File "c:\python27\lib\zipfile.py", line 1069, in _extract_member
    targetpath = os.path.join(targetpath, arcname)
  File "c:\python27\lib\ntpath.py", line 85, in join
    result_path = result_path + p_path
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 3: ordinal not in range(128)
```
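
The last line of the traceback is the error Python 2 raises when a `unicode` string and a byte `str` are concatenated, because the byte string is implicitly decoded with the ASCII codec. As a minimal sketch, assuming the arcname was a byte string with 0x82 at index 3 (the sample name below is hypothetical and not taken from test.zip), the same decode can be made explicit:

```python
# Hypothetical arcname bytes; 0x82 is "é" in cp437 but is not valid ASCII.
raw_arcname = b"caf\x82.txt"

message = None
try:
    # Python 2 performs this decode implicitly when joining unicode + str.
    raw_arcname.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The sketch runs on Python 3 as well, since only the explicit `decode("ascii")` call is involved.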
msg336183 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2019-02-21 05:40
You cannot simply add .decode('cp437') to arcname.

1. This will fail if the ZIP archive contains file names encoded with UTF-8. Such names are already unicode and contain non-ASCII characters. Calling decode() on them implicitly encodes them to str first (with the ASCII codec), and that will fail.

2. This will fail when targetpath is an 8-bit string containing non-ASCII characters. Currently this works (maybe incorrectly).

3. While cp437 is the only official encoding in ZIP archives if UTF-8 is not used, de facto other encodings (like cp866) are used on localized Windows.
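
The first and third points above can be sketched as a small decoding helper. This is only an illustration under stated assumptions, not the rejected patch and not `zipfile`'s own code: `decode_arcname` and its `legacy_encoding` parameter are hypothetical names, and the sketch is written for Python 3, where `bytes` and `str` play the roles of Python 2's `str` and `unicode`.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def decode_arcname(name, flag_bits, legacy_encoding="cp437"):
    """Return a text arcname (hypothetical helper, for illustration only)."""
    if not isinstance(name, bytes):
        # Point 1: names that are already text must not be decoded again.
        return name
    if flag_bits & UTF8_FLAG:
        return name.decode("utf-8")
    # Point 3: cp437 is the official fallback, but archives created on
    # localized Windows may really use another code page such as cp866.
    return name.decode(legacy_encoding)


print(decode_arcname(b"caf\x82.txt", 0))           # cp437 interpretation
print(decode_arcname(b"caf\x82.txt", 0, "cp866"))  # same bytes read as cp866
```

The same byte 0x82 decodes to a different character under each code page, and nothing in the archive records which one was intended, so the caller has to choose.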

Fixing the problem without introducing other problems or breaking existing working code is hard. One possible solution is to use Python 3.
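
As a rough sketch of the Python 3 route: there `zipfile` returns member names as `str`, decoding them as UTF-8 when the entry's UTF-8 flag is set and as cp437 otherwise. A name that was really written in another code page can then be recovered by reversing the cp437 decode. The helper name `redecode_name` and the choice of cp866 are assumptions made for illustration.

```python
import zipfile

UTF8_FLAG = 0x800  # set in ZipInfo.flag_bits when the name is UTF-8


def redecode_name(info, actual_encoding):
    """Re-interpret a cp437-decoded member name (hypothetical helper)."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded correctly
    # cp437 maps every byte to a character, so this round trip is lossless.
    return info.filename.encode("cp437").decode(actual_encoding)


# Stand-in for an entry read from an archive written with cp866 names.
info = zipfile.ZipInfo("\u00e9.txt")  # "é" corresponds to byte 0x82 in cp437
print(redecode_name(info, "cp866"))
```

With a real archive, the same helper could be applied to each entry of `ZipFile.infolist()`. The correct `actual_encoding` still has to be known or guessed, which is the difficulty the third point describes.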

I suggest closing this issue as "won't fix".

History

| Date | User | Action | Args |
| --- | --- | --- | --- |
| 2022-04-11 14:59:11 | admin | set | github: 80242 |
| 2019-02-21 11:06:19 | methane | set | status: open -> closed; resolution: wont fix; stage: patch review -> resolved |
| 2019-02-21 05:40:42 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg336183 |
| 2019-02-21 02:19:01 | python-dev | set | keywords: + patch; stage: patch review; pull_requests: + pull_request11991 |
| 2019-02-21 01:59:30 | Shane Lee | create | |