Author jgoerzen
Recipients jgoerzen
Date 2019-11-20.02:52:21
Message-id <1574218342.23.0.381286067182.issue38861@roundup.psfhosted.org>
In-reply-to
Content
The zipfile.py standard library module handles non-UTF-8 filenames in several questionable ways.  Because the ZIP file format predates Unicode by many years, such filenames are fairly common in older archives.

Here is a very simple reproduction case. 

mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1.  In the "t" directory, "ls | hd" displays:

00000000  74 65 73 74 f7 2e 74 78  74 0a                    |test..txt.|
0000000a


Now, here's a simple Python 3 program:

import zipfile

z = zipfile.ZipFile("t.zip")
z.extractall()

If you run this on the relevant ZIP file, the single 0xf7 byte is replaced with the three-byte UTF-8 sequence e2 89 88; "ls | hd" now displays:

00000000  74 65 73 74 e2 89 88 2e  74 78 74 0a              |test....txt.|
0000000c
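
The bytes e2 89 88 are the UTF-8 encoding of U+2248, which is the character cp437 assigns to 0xf7. That is consistent with the name having been decoded as cp437 and then written back out as UTF-8. A minimal sketch of that round trip, using only the built-in codecs (the variable names are illustrative):

```python
# The name bytes exactly as they appear in the "ls | hd" output above.
raw = b"test\xf7.txt"

# Decoding as cp437 turns 0xf7 into U+2248 (almost equal to).
as_cp437 = raw.decode("cp437")

# Decoding as ISO-8859-1 gives the intended U+00F7 (division sign).
as_latin1 = raw.decode("iso-8859-1")

# Re-encoding the cp437 interpretation as UTF-8 yields the mangled name on disk.
mangled = as_cp437.encode("utf-8")

print(repr(as_cp437), repr(as_latin1), mangled)
```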

The impact within Python programs is equally bad.  Fundamentally, the zipfile interface is broken: it should not try to decode filenames into strings, and should instead treat them as bytes and leave any decoding up to applications.  Down various code paths, it appears to try to decode filenames as ASCII, cp437, or UTF-8.  However, the ZIP file format was often used on Unix systems as well, which didn't tend to use cp437 (ISO-8859-* was more common).  In short, there is no way that zipfile.py can reliably guess the encoding of a filename in a ZIP file, so it is a data-loss bug that it attempts to do so and fails.  It is a further bug that extractall mangles filenames; unzip(1) is perfectly capable of extracting these files correctly.  I'm attaching this zip file for reference.

At the very least, zipfile should provide a bytes interface for filenames for people that care about correctness.
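
Until such an interface exists, an application can often get the stored bytes back by reversing the decode step. The sketch below is a workaround under stated assumptions, not a zipfile feature: it assumes names without the UTF-8 flag (general-purpose bit 11, 0x800) were decoded as cp437, and it relies on Python's cp437 codec round-tripping all 256 byte values. The helpers raw_filename and make_demo_archive are hypothetical names introduced here. The demo archive is built in memory by patching an ASCII name of the same length, since zipfile itself will not write a non-UTF-8 name.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def raw_filename(info: zipfile.ZipInfo) -> bytes:
    """Best-effort recovery of the name bytes as stored in the archive."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # Assumption: without the flag, zipfile decoded the stored bytes as cp437,
    # so encoding with the same codec reverses the step.
    return info.filename.encode("cp437")


def make_demo_archive() -> bytes:
    """Build a small archive whose member name contains the raw byte 0xf7."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("t/testX.txt", "hi\n")
    # Patch the name in both the local header and the central directory.
    # The length is unchanged, so no offsets or sizes need updating.
    return buf.getvalue().replace(b"testX.txt", b"test\xf7.txt")


with zipfile.ZipFile(io.BytesIO(make_demo_archive())) as z:
    info = z.infolist()[0]
    decoded_name = info.filename        # what zipfile reports
    stored_bytes = raw_filename(info)   # what the archive contains
    # The application, not zipfile, chooses the encoding it believes applies.
    latin1_name = stored_bytes.decode("iso-8859-1")

print(repr(decoded_name), stored_bytes, repr(latin1_name))
```

This only recovers the bytes; it does not fix extractall, which still writes the cp437-decoded name to disk.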