Message 394505 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	dhillier
Recipients	dhillier, iritkatriel, yudilevi
Date	2021-05-27.01:45:35
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1622079936.22.0.965645884092.issue40172@roundup.psfhosted.org>
In-reply-to

Content
Looking into this more and it appears that while Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising as cp437 doesn't have every character for every language! In particular, some archive programs on windows will use the user's locale code page. https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i A UTF filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is a open issue to add handling these fields (for reading) to zipfile: https://bugs.python.org/issue41928 and that issue may be related to this one https://bugs.python.org/issue40407 For this issue, with regards to encoding, I prefer the current situation where general purpose bit 11 for UTF is preferentially used because it doesn't change the behaviour compared to previous Python versions and it reduces file size as the filename isn't repeated in the extra field. For compatibility with other archive programs that don't support the general purpose bit 11, I suggest we add an additional mechanism to allow the code page for the path name (and comment) to be set and use the 0x7075 extra field to store the UTF name in those cases where the filename can't be encoded in ascii (and 0x6075 to store the utf comment where it can't be encoded in ascii)

Looking into this more and it appears that while Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising as cp437 doesn't have every character for every language! In particular, some archive programs on windows will use the user's locale code page.

https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i

A UTF filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is a open issue to add handling these fields (for reading) to zipfile: https://bugs.python.org/issue41928 and that issue may be related to this one https://bugs.python.org/issue40407

For this issue, with regards to encoding, I prefer the current situation where general purpose bit 11 for UTF is preferentially used because it doesn't change the behaviour compared to previous Python versions and it reduces file size as the filename isn't repeated in the extra field.

For compatibility with other archive programs that don't support the general purpose bit 11, I suggest we add an additional mechanism to allow the code page for the path name (and comment) to be set and use the 0x7075 extra field to store the UTF name in those cases where the filename can't be encoded in ascii (and 0x6075 to store the utf comment where it can't be encoded in ascii)

History
Date	User	Action	Args
2021-05-27 01:45:36	dhillier	set	recipients: + dhillier, yudilevi, iritkatriel
2021-05-27 01:45:36	dhillier	set	messageid: <1622079936.22.0.965645884092.issue40172@roundup.psfhosted.org>
2021-05-27 01:45:36	dhillier	link	issue40172 messages
2021-05-27 01:45:35	dhillier	create