Author dhillier
Recipients dhillier, iritkatriel, yudilevi
Date 2021-05-27.01:45:35
Looking into this more, it appears that while Appendix D of the ZIP specification says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding", where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising, as cp437 doesn't have every character for every language! In particular, some archive programs on Windows will use the user's locale code page.
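
To make the mismatch concrete, here is a minimal sketch of what happens when a name written in a locale code page is read back as cp437. `decode_header_name` is a hypothetical helper (not part of zipfile), and cp866 is only an example of a locale code page an archiver might use.

```python
UTF8_FLAG = 0x800  # general purpose bit 11


def decode_header_name(raw: bytes, flag_bits: int,
                       legacy_encoding: str = "cp437") -> str:
    """Hypothetical helper: decode a header filename as the spec suggests.

    Bit 11 set   -> the bytes are UTF-8.
    Bit 11 unset -> the bytes are in a legacy code page (cp437 by default).
    """
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode(legacy_encoding)


name = "тест.txt"
raw = name.encode("cp866")  # what an archiver using a cp866 locale might write

mojibake = decode_header_name(raw, 0)             # spec default: wrong text
recovered = decode_header_name(raw, 0, "cp866")   # correct only if the caller
                                                  # knows the real code page
```

Decoding as cp437 never fails, because every byte value maps to some character, so the reader gets a wrong name rather than an error.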

A UTF-8 filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is an open issue to add handling of these fields (for reading) to zipfile: and that issue may be related to this one.
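
As a rough sketch of what reading that field could look like: the record layout assumed here is the Info-ZIP Unicode Path layout (1-byte version, 4-byte CRC-32 of the header filename bytes, then the UTF-8 name). `find_unicode_path` is a hypothetical name, not an existing zipfile function.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field


def find_unicode_path(extra: bytes, header_name: bytes):
    """Return the UTF-8 name from a 0x7075 record, or None if absent or stale."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra record is: header ID (2 bytes), data size (2 bytes), data.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data)
        # The CRC covers the header's filename bytes; a mismatch means the
        # header name was changed without updating this record.
        if version == 1 and name_crc == zlib.crc32(header_name):
            return data[5:].decode("utf-8")
    return None
```

A reader would pass the raw extra bytes and the undecoded header filename, and fall back to the header name when this returns None.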

For this issue, with regard to encoding, I prefer the current situation where general purpose bit 11 (UTF-8) is used preferentially, because it doesn't change the behaviour compared to previous Python versions and it reduces file size, as the filename isn't repeated in the extra field.

For compatibility with other archive programs that don't support general purpose bit 11, I suggest we add a mechanism that allows the code page for the path name (and comment) to be set, and use the 0x7075 extra field to store the UTF-8 name in those cases where the filename can't be encoded in ASCII (and 0x6375 to store the UTF-8 comment where it can't be encoded in ASCII).
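
A minimal sketch of the writing side of this suggestion, assuming the Info-ZIP record layout (version byte, CRC-32 of the legacy bytes, UTF-8 text) and the Info-ZIP IDs 0x7075 for the path and 0x6375 for the comment. `build_unicode_extra` and `encode_name` are hypothetical names; nothing like them exists in zipfile today.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075     # Info-ZIP Unicode Path
UNICODE_COMMENT_ID = 0x6375  # Info-ZIP Unicode Comment


def build_unicode_extra(header_id: int, legacy_bytes: bytes, text: str) -> bytes:
    """Build one extra record: ID, size, version 1, CRC of legacy bytes, UTF-8."""
    payload = struct.pack("<BL", 1, zlib.crc32(legacy_bytes)) + text.encode("utf-8")
    return struct.pack("<HH", header_id, len(payload)) + payload


def encode_name(name: str, codepage: str):
    """Hypothetical: return (header filename bytes, extra field bytes)."""
    try:
        # Pure ASCII names need no extra field at all.
        return name.encode("ascii"), b""
    except UnicodeEncodeError:
        # Raises UnicodeEncodeError if the code page lacks a character.
        legacy = name.encode(codepage)
        return legacy, build_unicode_extra(UNICODE_PATH_ID, legacy, name)
```

The same builder would serve for comments by passing `UNICODE_COMMENT_ID` and the encoded comment bytes. What to do when the chosen code page cannot represent the name is left open here.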
Date                 User      Action  Args
2021-05-27 01:45:36  dhillier  set     recipients: + dhillier, yudilevi, iritkatriel
2021-05-27 01:45:36  dhillier  set     messageid: <>
2021-05-27 01:45:36  dhillier  link    issue40172 messages
2021-05-27 01:45:35  dhillier  create