classification
Title: ZipFile does not support Unicode Path Extra Field (0x7075) zip header field
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: andreaerdna, ivan.sorokin.tech
Priority: normal Keywords: patch

Created on 2020-10-04 11:21 by ivan.sorokin.tech, last changed 2021-01-22 01:30 by andreaerdna.

Files
File name Uploaded Description
23.zip ivan.sorokin.tech, 2020-10-04 11:21
Pull Requests
URL Status Linked
PR 23736 open andreaerdna, 2020-12-10 19:45
Messages (3)
msg377931 Author: Ivan Sorokin (ivan.sorokin.tech) Date: 2020-10-04 11:21
See the attached sample. The well-known unzip command-line tool lists its contents correctly:

$ unzip -l 23.zip
Archive:  23.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    81408  2012-10-23 19:03   Β' ΦΑΣΗ ΠΕ06 ΣΧΟΛΕΙΑ ΕΑΕΠ (ΙΝΤ).xls
---------                     -------
    81408                     1 file

But ZipFile lists the same file inside this archive as:
ü' öÇæå Åä06 æòÄèäêÇ äÇäÅ (êîÆ).xls
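
The garbled name above is consistent with a Greek OEM name being decoded as cp437, which is what ZipFile does when the UTF-8 flag is not set. A minimal sketch, assuming the archive stored the name in cp737 (this encoding is an inference from the output, not something stated in the report; the helper name is hypothetical):

```python
def misdecode_as_cp437(name, stored_encoding="cp737"):
    """Encode a name in the assumed OEM code page, then decode it as cp437.

    This mimics reading OEM-encoded filename bytes with the wrong code page.
    The default "cp737" (Greek DOS) is an assumption for illustration.
    """
    return name.encode(stored_encoding).decode("cp437")


# Name as listed by unzip; the result can be compared with the ZipFile listing.
print(misdecode_as_cp437("Β' ΦΑΣΗ ΠΕ06 ΣΧΟΛΕΙΑ ΕΑΕΠ (ΙΝΤ).xls"))
```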

This happens because ZipFile completely ignores the Unicode Path Extra Field (0x7075) in the zip header.

See the .ZIP specification for details on this field's meaning and usage:
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
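
For illustration, here is a rough sketch of how the field could be read from an entry's raw extra data. It is not the code from any patch. It assumes the layout described in the specification: a 2-byte header ID (0x7075), a 2-byte data size, a 1-byte version (1), a 4-byte CRC-32 of the standard filename bytes, then the UTF-8 name. The function name and arguments are hypothetical.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # header ID of the Unicode Path Extra Field


def unicode_path_from_extra(extra, raw_name):
    """Return the UTF-8 name from the 0x7075 extra field, or None.

    extra    -- raw extra-field bytes of one zip entry
    raw_name -- filename bytes exactly as stored in the header
    """
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        # Skip other fields, truncated data, or a field too short to hold
        # the version byte plus the CRC.
        if header_id != UNICODE_PATH_ID or len(data) != size or size < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data)
        # The CRC guards against a stale field left behind after a rename.
        if version != 1 or name_crc != zlib.crc32(raw_name):
            return None
        try:
            return data[5:].decode("utf-8")
        except UnicodeDecodeError:
            return None
    return None
```

With the standard library, the extra bytes are available as `ZipInfo.extra`. When the UTF-8 flag is not set, the stored name bytes can presumably be recovered with `info.orig_filename.encode("cp437")`, since ZipFile decoded them with that code page.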
msg377945 Author: Ivan Sorokin (ivan.sorokin.tech) Date: 2020-10-04 15:24
Grand unified algorithm to read filenames from zip files correctly:

1. Does the zip entry have a «Unicode Path Extra Field» (0x7075)? Use it for the file name.
2. Is the Unicode flag (0x800) set in the «Flags» field of the zip entry? Assume the «Filename» field is in UTF-8.
3. Does the «HostOS» field of the zip entry have a value of 0 (FAT) or 11 (NTFS)? Assume the «Filename» field is in the OEM charset corresponding to the system locale.
4. Otherwise, assume the «Filename» field is in UTF-8.
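
The four steps above can be sketched as a single function. This is only an illustration of the decision order, not the p7zip or CPython implementation. The function and parameter names are hypothetical, the host OS values 0 and 11 are taken from the list as written, and the OEM code page is passed in by the caller because finding the one that matches the system locale is platform-specific.

```python
UTF8_FLAG = 0x800  # "Unicode" bit in the general purpose flags


def decode_zip_filename(raw_name, flag_bits, host_os,
                        unicode_path=None, oem_encoding="cp437"):
    """Pick a filename for a zip entry following the four steps.

    raw_name     -- filename bytes as stored in the header
    flag_bits    -- general purpose bit flags of the entry
    host_os      -- "HostOS" value of the entry
    unicode_path -- name from the 0x7075 extra field, if present and valid
    oem_encoding -- assumed OEM code page for the system locale
    """
    # 1. The Unicode Path Extra Field wins when it is available.
    if unicode_path is not None:
        return unicode_path
    # 2. The UTF-8 flag says the stored name is UTF-8.
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    # 3. Archives created on FAT (0) or NTFS (11) use the OEM charset.
    if host_os in (0, 11):
        return raw_name.decode(oem_encoding)
    # 4. Everything else is assumed to be UTF-8.
    return raw_name.decode("utf-8")
```

With ZipFile, the flags and host OS would correspond to `ZipInfo.flag_bits` and `ZipInfo.create_system`. For the sample in this issue, a caller on a Greek Windows system would presumably pass `oem_encoding="cp737"`.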

p7zip with the oemcp patch (https://github.com/unxed/oemcp/) uses exactly this method and correctly processes all zip files in my test set, which contains several zips generated by different packers on Windows, macOS, and Linux, and by online services. Any zip unpacker that wants to handle non-Latin filenames as gracefully as possible should use the same algorithm.
msg385467 Author: Andrea Giudiceandrea (andreaerdna) Date: 2021-01-22 01:30
More than a month ago I submitted a PR that adds support for the Unicode Path Extra Field in ZipFile.
The PR https://github.com/python/cpython/pull/23736 is awaiting a review in order to be merged.
History
Date User Action Args
2021-01-22 01:30:26 andreaerdna set messages: + msg385467
2020-12-10 19:45:19 andreaerdna set keywords: + patch; stage: patch review; pull_requests: + pull_request22595
2020-12-09 17:19:56 andreaerdna set nosy: + andreaerdna
2020-10-04 15:24:54 ivan.sorokin.tech set messages: + msg377945
2020-10-04 11:21:41 ivan.sorokin.tech create