Issue 40172: ZipInfo corrupts file names in some old zip archives

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84353

classification

Title:	ZipInfo corrupts file names in some old zip archives
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	dhillier, gregory.p.smith, iritkatriel, yudilevi
Priority:	normal	Keywords:	patch

Created on 2020-04-03 14:55 by yudilevi, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description
example.zip	yudilevi, 2020-04-03 14:55

Pull Requests
URL	Status	Linked
PR 19335	open	yudilevi, 2020-04-03 15:19

Messages (8)
msg365701 - (view)	Author: Yudi Levi (yudilevi) *	Date: 2020-04-03 14:55
Some old zip files that don't yet use unicode file names might have entries with characters beyond the ascii range. ZipInfo seems to encode these file names with 'cp437' codepage (correct for old zips) but decode them back with 'ascii' code page which might corrupt them.
msg393766 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-05-16 23:08
Can you suggest a unit test for this?
msg393767 - (view)	Author: Yudi Levi (yudilevi) *	Date: 2021-05-16 23:15
Hey :) Sorry that I'm not responsive, just busy. I'll add one soon.

Yudi
msg394293 - (view)	Author: Daniel Hillier (dhillier) *	Date: 2021-05-25 04:31
zipfile decodes filenames using cp437 or unicode and encodes using ascii or unicode. It seems like zipfile has a preference for writing filenames in unicode rather than cp437. Is zipfile's preference for writing filenames in unicode rather than cp437 intentional?

Is the bug you're seeing related to using zipfile to open and rewrite old zips and not being able to open the rewritten files in an old program that doesn't support the unicode flag?

We could address this in two ways:
- Change ZipInfo._encodeFilenameFlags() to always encode to cp437 if possible
- Add a flag to write filenames in cp437 or unicode, otherwise keep the current situation of ascii or unicode

I guess the choice will depend on whether preferring unicode rather than cp437 is intentional and whether writing filenames in cp437 will break anything (it shouldn't break anything according to Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)

Here's a test for your current patch (I'd probably put it alongside OtherTests.test_read_after_write_unicode_filenames as this test was adapted from that one)

    class OtherTests(unittest.TestCase):
        ...

        def test_read_after_write_cp437_filenames(self):
            fname = 'test_cp437_é'
            with zipfile.ZipFile(TESTFN2, 'w') as zipfp:
                zipfp.writestr(fname, b'sample')

            with zipfile.ZipFile(TESTFN2) as zipfp:
                zinfo = zipfp.infolist()[0]
                # Ensure general purpose bit 11 (Language encoding flag
                # (EFS)) is unset to indicate the filename is not unicode
                self.assertFalse(zinfo.flag_bits & 0x800)
                self.assertEqual(zipfp.read(fname), b'sample')
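
For reference, the write-side behaviour described in msg394293 ("encodes using ascii or unicode") can be observed without the patch. The following is a minimal sketch, assuming an unpatched interpreter; `written_name_info` and `UTF8_FLAG` are hypothetical names introduced here, not part of zipfile.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11 (language encoding flag, EFS)


def written_name_info(name):
    """Write one member called `name`, reopen the archive, and report
    the name as read back plus whether the UTF-8 flag was set."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"sample")
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return info.filename, bool(info.flag_bits & UTF8_FLAG)


# An ASCII-only name is stored as ASCII with the flag unset.
ascii_result = written_name_info("plain.txt")

# A non-ASCII name is stored as UTF-8 with the flag set, even though
# 'é' is representable in cp437.
non_ascii_result = written_name_info("test_cp437_é")
```

With PR 19335 applied, the second call would presumably report the flag as unset, which is what the proposed unit test checks.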
msg394505 - (view)	Author: Daniel Hillier (dhillier) *	Date: 2021-05-27 01:45
Looking into this more, it appears that while Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding", where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising as cp437 doesn't have every character for every language! In particular, some archive programs on Windows will use the user's locale code page.
https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i

A UTF-8 filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is an open issue to add handling of these fields (for reading) to zipfile: https://bugs.python.org/issue41928 and that issue may be related to this one: https://bugs.python.org/issue40407

For this issue, with regard to encoding, I prefer the current situation where general purpose bit 11 for UTF-8 is preferentially used, because it doesn't change the behaviour compared to previous Python versions and it reduces file size as the filename isn't repeated in the extra field.

For compatibility with other archive programs that don't support general purpose bit 11, I suggest we add an additional mechanism to allow the code page for the path name (and comment) to be set, and use the 0x7075 extra field to store the UTF-8 name in those cases where the filename can't be encoded in ascii (and 0x6375 to store the UTF-8 comment where it can't be encoded in ascii).
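
To make the 0x7075 extra field mentioned in msg394505 concrete, here is a rough sketch of how such a field could be built and parsed. It assumes the layout given in APPNOTE.TXT for the Info-ZIP Unicode Path extra field (header ID, data size, a version byte equal to 1, the CRC-32 of the header's raw filename bytes, then the UTF-8 name). The function names are hypothetical; zipfile does not expose helpers like these.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field


def build_unicode_path_field(header_name: bytes, unicode_name: str) -> bytes:
    """Build the extra-field record for `unicode_name`.

    `header_name` is the raw (code-page encoded) filename stored in the
    header; its CRC-32 ties the record to that header entry.
    """
    body = struct.pack("<BL", 1, zlib.crc32(header_name))
    body += unicode_name.encode("utf-8")
    return struct.pack("<HH", UNICODE_PATH_ID, len(body)) + body


def parse_unicode_path_field(extra: bytes, header_name: bytes):
    """Return the UTF-8 name from `extra`, or None if there is no
    matching record (absent, wrong version, or stale CRC)."""
    pos = 0
    while pos + 4 <= len(extra):
        field_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if field_id != UNICODE_PATH_ID or len(body) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", body)
        if version == 1 and name_crc == zlib.crc32(header_name):
            return body[5:].decode("utf-8")
    return None


# Header name in a locale code page (cp1252 chosen arbitrarily here).
raw_name = "café.txt".encode("cp1252")
field = build_unicode_path_field(raw_name, "café.txt")
parsed = parse_unicode_path_field(field, raw_name)
```

The CRC check lets a reader ignore the record when another tool has since rewritten the header filename without updating the extra field.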
msg414601 - (view)	Author: Yudi Levi (yudilevi) *	Date: 2022-03-05 23:46
The main issue is that when extracting older zip files, files are actually written to disk with corrupted (altered) names. Unfortunately it's been a while since I saw this issue and I can't tell if it was fixed or if I simply can't reproduce it. I do see that encoding/decoding in ZipInfo is still inconsistent: it sometimes uses the ascii codepage and sometimes the cp437 codepage, which seems wrong to me. I'm not sure how we should handle it, but I think that switching the default ascii encoding to cp437, to be consistent with the old implementation (and with the filename decoding), is the right way to go.
msg415741 - (view)	Author: Daniel Hillier (dhillier) *	Date: 2022-03-22 04:21
Related to issue https://bugs.python.org/issue28080 which has a patch that covers a bit of this issue
msg415745 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2022-03-22 06:31
Examining Lib/zipfile.py code, the existing code makes sense. Python's zipfile module produces modern zipfiles when writing by setting the utf-8 flag and storing the filename as utf-8 when it is not ASCII. This is desirable for use with all normal zip implementations in the past 10-15 years.

When decoding a zipfile, if the utf-8 flag is not set, we assume cp437 per the pkware zip APPNOTE.TXT "spec". So our reading is correct as well, even for very old files.

This is being strict in what we produce and lenient in what we accept.

Caveats? Yes: if someone does need to produce zipfiles for use with ancient software that does not support utf-8, and that also does not identify the unknown utf-8 flag as an error condition, that software will interpret the name in a corrupt manner for non-ascii names.

Similarly, even if written with cp437 names (as PR 19335 would do), in old zip system implementations where the implementation blindly uses the user's locale encoding instead of cp437, it will always see corrupt data in that scenario. (aka mojibake?)

These are not what I'd expect to be normal use cases. Do you have a common practical example of a need for this?

(The PR on issue28080 provides a way to _read_ legacy zip files that used a codec other than cp437 if you know what it was.)

---

https://www.loc.gov/preservation/digital/formats/fdd/fdd000354.shtml may also be of interest regarding the zip format.
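
The read-side behaviour described in msg415745 can be illustrated with a small sketch. zipfile cannot itself write a non-ASCII name with the utf-8 flag unset, so the hypothetical helper `make_legacy_zip` below simulates an old archiver by writing an ASCII placeholder name and patching the raw bytes afterwards. That trick and the choice of cp1252 as the "locale" code page are assumptions made purely for illustration.

```python
import io
import zipfile


def make_legacy_zip(raw_name: bytes) -> bytes:
    """Return archive bytes whose single member is named `raw_name`,
    stored verbatim with the utf-8 flag unset (simulated legacy writer)."""
    placeholder = "x" * len(raw_name)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(placeholder, b"sample")
    # The name occurs in the local header and the central directory;
    # neither occurrence is covered by the member's CRC.
    return buf.getvalue().replace(placeholder.encode("ascii"), raw_name)


# Case 1: a spec-conforming old archive. 0x82 is 'é' in cp437.
with zipfile.ZipFile(io.BytesIO(make_legacy_zip(b"caf\x82.txt"))) as zf:
    info = zf.infolist()[0]
    cp437_name = info.filename
    utf8_flag = bool(info.flag_bits & 0x800)
    payload = zf.read(info)

# Case 2: an archiver that used the locale code page instead.
# 0xE9 is 'é' in cp1252, but zipfile decodes it as cp437.
with zipfile.ZipFile(io.BytesIO(make_legacy_zip(b"caf\xe9.txt"))) as zf:
    garbled = zf.infolist()[0].filename

# If the real code page is known, the name can be recovered because
# cp437 maps every byte value, so the round trip is lossless.
recovered = garbled.encode("cp437").decode("cp1252")
```

In the first case the name is read correctly. In the second it comes back as mojibake, and it is recoverable only when the original code page is known, which is the situation the issue28080 patch is meant to address.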

History
Date	User	Action	Args
2022-04-11 14:59:29	admin	set	github: 84353
2022-03-22 06:31:22	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg415745
2022-03-22 04:21:02	dhillier	set	messages: + msg415741
2022-03-05 23:46:14	yudilevi	set	messages: + msg414601
2021-05-27 01:45:36	dhillier	set	messages: + msg394505
2021-05-25 04:31:47	dhillier	set	nosy: + dhillier messages: + msg394293
2021-05-16 23:15:01	yudilevi	set	messages: + msg393767
2021-05-16 23:08:54	iritkatriel	set	nosy: + iritkatriel messages: + msg393766
2020-04-03 15:19:26	yudilevi	set	keywords: + patch stage: patch review pull_requests: + pull_request18697
2020-04-03 14:55:40	yudilevi	create