
Author ivan.sorokin.tech
Recipients ivan.sorokin.tech
Date 2020-10-04.15:24:54
Message-id <1601825094.51.0.960614764386.issue41928@roundup.psfhosted.org>
Content
Grand unified algorithm to read filenames from zip files correctly:

1. Does the zip entry have a «Unicode Path Extra Field» (0x7075)? Use it for the file name.
2. Is the Unicode flag (0x800) set in the «Flags» field of the zip entry? Assume the «Filename» field is in UTF-8.
3. Does the «HostOS» field of the zip entry have a value of 0 (FAT) or 11 (NTFS)? Assume the «Filename» field is in the OEM charset corresponding to the system locale.
4. Otherwise, assume the «Filename» field is in UTF-8.
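
The four steps above could be sketched in Python roughly as follows. This is only an illustration, not the p7zip/oemcp code: the function names and the `oem_encoding` parameter are hypothetical, and the `cp437` default is an assumption, since the right OEM code page depends on the system locale (for example `cp866` for Russian). The layout of the 0x7075 field (version byte, CRC32 of the original name, UTF-8 name) and the CRC check are taken from the zip specification, not from this message.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # «Unicode Path Extra Field» header ID
UTF8_FLAG = 0x800         # Unicode flag in the «Flags» field
OEM_HOSTS = (0, 11)       # «HostOS» values: 0 (FAT), 11 (NTFS)


def unicode_path_from_extra(extra, raw_name):
    """Return the name stored in extra field 0x7075, or None if unusable."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data, 0)
        # Assumption from the zip spec: the field is only valid if its CRC32
        # matches the raw «Filename» bytes of the same entry.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return data[5:].decode("utf-8")
    return None


def decode_zip_filename(raw_name, flags, host_os, extra=b"",
                        oem_encoding="cp437"):
    """Decode raw «Filename» bytes following steps 1-4 (hypothetical helper)."""
    # Step 1: prefer the Unicode Path Extra Field when present.
    name = unicode_path_from_extra(extra, raw_name)
    if name is not None:
        return name
    # Step 2: the Unicode flag says the name is UTF-8.
    if flags & UTF8_FLAG:
        return raw_name.decode("utf-8")
    # Step 3: FAT/NTFS hosts are assumed to use the OEM charset.
    if host_os in OEM_HOSTS:
        return raw_name.decode(oem_encoding)
    # Step 4: everything else is assumed to be UTF-8.
    return raw_name.decode("utf-8")
```

With the standard `zipfile` module, the inputs would roughly correspond to `ZipInfo.flag_bits`, `ZipInfo.create_system` and `ZipInfo.extra`. The raw name bytes are not exposed directly; when the Unicode flag is unset, `zipfile` decodes names as cp437, so re-encoding `ZipInfo.orig_filename` with cp437 is one way that may recover them.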

p7zip with the oemcp patch (https://github.com/unxed/oemcp/) uses exactly this method and is able to process all zip files in my test set correctly (my test set contains several zips generated by different packers on Windows, macOS, and Linux, and by online services). The same algorithm should be used in any zip unpacker that wants to handle non-Latin filenames as gracefully as possible.
History
Date                 User               Action  Args
2020-10-04 15:24:54  ivan.sorokin.tech  set     recipients: + ivan.sorokin.tech
2020-10-04 15:24:54  ivan.sorokin.tech  set     messageid: <1601825094.51.0.960614764386.issue41928@roundup.psfhosted.org>
2020-10-04 15:24:54  ivan.sorokin.tech  link    issue41928 messages
2020-10-04 15:24:54  ivan.sorokin.tech  create