classification
Title: ZipInfo corrupts file names in some old zip archives
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: dhillier, iritkatriel, yudilevi
Priority: normal Keywords: patch

Created on 2020-04-03 14:55 by yudilevi, last changed 2021-05-27 01:45 by dhillier.

Files
File name Uploaded Description
example.zip yudilevi, 2020-04-03 14:55
Pull Requests
URL Status Linked
PR 19335 open yudilevi, 2020-04-03 15:19
Messages (5)
msg365701 - Author: Yudi Levi (yudilevi) * Date: 2020-04-03 14:55
Some old zip files that predate unicode file names can have entries with characters beyond the ASCII range.
ZipInfo seems to encode these file names with the 'cp437' code page (correct for old zips) but decode them back with the 'ascii' code page, which can corrupt them.
msg393766 - Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-05-16 23:08
Can you suggest a unit test for this?
msg393767 - Author: Yudi Levi (yudilevi) * Date: 2021-05-16 23:15
Hey :)

Sorry that I'm not responsive, just busy.
I'll add one soon.

Yudi

msg394293 - Author: Daniel Hillier (dhillier) * Date: 2021-05-25 04:31
zipfile decodes filenames using cp437 or unicode and encodes them using ascii or unicode, so it appears to prefer writing non-ascii filenames in unicode rather than cp437. Is that preference intentional?
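The read and write behaviour described above can be checked with a short script. This is only a sketch, not part of the patch: it uses in-memory archives, and the byte-patching step that fakes a legacy cp437 entry (the placeholder name `cafX.txt` and the variable `legacy_raw`) is an illustrative trick invented for this example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11 (language encoding flag, EFS)

# Write side: one ASCII name and one name containing a non-ASCII character.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"sample")
    zf.writestr("test_cp437_é", b"sample")

with zipfile.ZipFile(buf) as zf:
    # Map each stored name to whether bit 11 was set when it was written.
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}
    roundtrip = zf.read("test_cp437_é")
print(flags)

# Read side: fake an old archive by writing an ASCII placeholder name and
# then swapping one byte for 0x82, which is 'é' in cp437. The name length
# is unchanged, so all offsets in the archive stay valid.
legacy_buf = io.BytesIO()
with zipfile.ZipFile(legacy_buf, "w") as zf:
    zf.writestr("cafX.txt", b"sample")
legacy_raw = legacy_buf.getvalue().replace(b"cafX.txt", b"caf\x82.txt")

with zipfile.ZipFile(io.BytesIO(legacy_raw)) as zf:
    legacy_names = zf.namelist()          # bit 11 unset, so cp437 is used
    legacy_data = zf.read(legacy_names[0])
print(legacy_names)
```

Printing `flags` shows which encoding path the installed Python takes for the non-ASCII name; the legacy entry is read back through the cp437 path because its bit 11 is unset.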

Is the bug you're seeing related to using zipfile to open and rewrite old zips and not being able to open the rewritten files in an old program that doesn't support the unicode flag?

We could address this in two ways:
- Change ZipInfo._encodeFilenameFlags() to always encode to cp437 if possible
- Add a flag to write filenames in cp437 or unicode, and otherwise keep the current behaviour of ascii or unicode

I guess the choice will depend on whether preferring unicode over cp437 is intentional and whether writing filenames in cp437 will break anything (it shouldn't break anything according to Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT).
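As a rough illustration of the first option, the encoding step could look like the sketch below. `encode_filename_flags` is a hypothetical standalone function written for this comment; it is not the body of ZipInfo._encodeFilenameFlags() and not the code in the PR.

```python
UTF8_FLAG = 0x800  # general purpose bit 11


def encode_filename_flags(filename, flag_bits=0):
    """Hypothetical helper: prefer cp437, fall back to UTF-8 with bit 11 set."""
    try:
        # cp437 covers ASCII, so plain ASCII names take this branch too.
        return filename.encode("cp437"), flag_bits
    except UnicodeEncodeError:
        # The name has characters outside cp437: store UTF-8 and flag it.
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG


print(encode_filename_flags("test_cp437_é"))
print(encode_filename_flags("日本語.txt"))
```

The second option would presumably wrap the same logic behind an opt-in argument so that the default output stays as it is today.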

Here's a test for your current patch (I'd probably put it alongside OtherTests.test_read_after_write_unicode_filenames, as this test was adapted from that one):

class OtherTests(unittest.TestCase):
    ...

    def test_read_after_write_cp437_filenames(self):
        fname = 'test_cp437_é'
        with zipfile.ZipFile(TESTFN2, 'w') as zipfp:
            zipfp.writestr(fname, b'sample')

        with zipfile.ZipFile(TESTFN2) as zipfp:
            zinfo = zipfp.infolist()[0]
            # Ensure general purpose bit 11 (Language encoding flag
            # (EFS)) is unset to indicate the filename is not unicode
            self.assertFalse(zinfo.flag_bits & 0x800)
            self.assertEqual(zipfp.read(fname), b'sample')
msg394505 - Author: Daniel Hillier (dhillier) * Date: 2021-05-27 01:45
Looking into this more, it appears that while Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising as cp437 doesn't have every character for every language! In particular, some archive programs on Windows will use the user's locale code page.

https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i
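To make the locale code page problem concrete, here is a small sketch using only Python's codecs. The choice of cp866 and the sample name are assumptions for illustration; a real archive could use any code page, and the archive itself does not record which one.

```python
# A name as an archiver using the cp866 code page might store it,
# with bit 11 left unset.
original = "привет.txt"
stored_bytes = original.encode("cp866")

# A reader that follows the spec decodes those bytes as cp437.
as_read = stored_bytes.decode("cp437")
print(as_read)  # mojibake, not the original name

# Python's cp437 codec maps all 256 byte values, so the raw bytes can be
# recovered and decoded again once the real code page is known.
recovered = as_read.encode("cp437").decode("cp866")
print(recovered)
```

The recovery step only works when the caller already knows (or guesses) the original code page.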

A UTF-8 filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is an open issue to add handling of these fields (for reading) to zipfile: https://bugs.python.org/issue41928 and that issue may be related to this one: https://bugs.python.org/issue40407

For this issue, with regard to encoding, I prefer the current situation where general purpose bit 11 for UTF-8 is preferentially used, because it doesn't change the behaviour compared to previous Python versions and it reduces file size as the filename isn't repeated in the extra field.

For compatibility with other archive programs that don't support general purpose bit 11, I suggest we add an additional mechanism to allow the code page for the path name (and comment) to be set, and use the 0x7075 extra field to store the UTF-8 name in those cases where the filename can't be encoded in ascii (and 0x6375 to store the UTF-8 comment where it can't be encoded in ascii).
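For reference, the 0x7075 field could be built and parsed roughly as follows. This is a sketch based on my reading of the Info-ZIP Unicode Path Extra Field layout in APPNOTE.TXT (version byte, CRC-32 of the header filename bytes, then the UTF-8 name). The function names are hypothetical and zipfile does not currently provide anything like them.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def build_unicode_path_extra(header_name: bytes, unicode_name: str) -> bytes:
    """Hypothetical helper: pack a 0x7075 block for the given header name."""
    payload = struct.pack("<BL", 1, zlib.crc32(header_name))
    payload += unicode_name.encode("utf-8")
    return struct.pack("<HH", UNICODE_PATH_ID, len(payload)) + payload


def parse_unicode_path_extra(extra: bytes, header_name: bytes):
    """Hypothetical helper: return the UTF-8 name, or None if absent/stale."""
    pos = 0
    while pos + 4 <= len(extra):
        field_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if field_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data)
        # The CRC ties the block to the header name it was written for.
        if version == 1 and name_crc == zlib.crc32(header_name):
            return data[5:].decode("utf-8")
    return None


header_name = "привет.txt".encode("cp866")  # code page chosen by the writer
extra = build_unicode_path_extra(header_name, "привет.txt")
print(parse_unicode_path_extra(extra, header_name))
```

The CRC check means a reader ignores the UTF-8 name if some other tool has renamed the entry in the header without updating the extra field.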
History
Date User Action Args
2021-05-27 01:45:36 dhillier set messages: + msg394505
2021-05-25 04:31:47 dhillier set nosy: + dhillier; messages: + msg394293
2021-05-16 23:15:01 yudilevi set messages: + msg393767
2021-05-16 23:08:54 iritkatriel set nosy: + iritkatriel; messages: + msg393766
2020-04-03 15:19:26 yudilevi set keywords: + patch; stage: patch review; pull_requests: + pull_request18697
2020-04-03 14:55:40 yudilevi create