Message 357023 - Python tracker


Author	jgoerzen
Recipients	jgoerzen
Date	2019-11-20.02:52:21
Message-id	<1574218342.23.0.381286067182.issue38861@roundup.psfhosted.org>

Content
The zipfile.py standard library component handles non-UTF8 filenames in a number of questionable ways. As the ZIP file format predated Unicode by a significant number of years, such filenames are actually fairly common in archives produced by older software.

Here is a very simple reproduction case. 

mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1.  In the "t" directory, "ls | hd" displays:

00000000  74 65 73 74 f7 2e 74 78  74 0a                    |test..txt.|
0000000a


Now, here's a simple Python3 program:

import zipfile

z = zipfile.ZipFile("t.zip")
z.extractall()

If you run this on the relevant ZIP file, the 0xf7 byte is replaced with a three-byte UTF-8 sequence (e2 89 88, which encodes U+2248, the character that 0xf7 maps to in cp437); "ls | hd" now displays:

00000000  74 65 73 74 e2 89 88 2e  74 78 74 0a              |test....txt.|
0000000c
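
The byte substitution can be reproduced without any ZIP file at all. The following is a minimal sketch, assuming that the name was decoded as cp437 on the way in and that the filesystem encoding is UTF-8 on the way out (typical on Linux); the variable names are only for illustration.

```python
# The filename bytes as created by the shell and stored in the archive.
raw = b"test\xf7.txt"

# What the creator meant: 0xf7 is the division sign in ISO-8859-1.
as_latin1 = raw.decode("iso-8859-1")

# What a cp437 decode produces: 0xf7 is U+2248 (ALMOST EQUAL TO) in cp437.
as_cp437 = raw.decode("cp437")

# Writing that str back out with a UTF-8 filesystem encoding yields the
# three-byte sequence seen in the second hex dump.
on_disk = as_cp437.encode("utf-8")

print(ascii(as_latin1))   # 'test\xf7.txt'
print(ascii(as_cp437))    # 'test\u2248.txt'
print(on_disk.hex(" "))   # 74 65 73 74 e2 89 88 2e 74 78 74
```

The trailing 0a in the hex dumps is just the newline printed by ls, so it does not appear here.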

The impact within Python programs is equally bad. Fundamentally, the zipfile interface is broken; it should not try to decode filenames into strings and should instead treat them as bytes and leave potential decoding up to applications.

It appears to try, down various code paths, to decode filenames as ascii, cp437, or utf-8. However, the ZIP file format was often used on Unix systems as well, which didn't tend to use cp437 (iso-8859-* was more common). In short, there is no way that zipfile.py can reliably guess the encoding of a filename in a ZIP file, so it is a data-loss bug that it attempts and fails to do so.

It is a further bug that extractall mangles filenames; unzip(1) is perfectly capable of extracting these files correctly. I'm attaching this zip file for reference.

At the very least, zipfile should provide a bytes interface for filenames for people who care about correctness.
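
Until such an interface exists, the stored bytes can usually be recovered by reversing the decode. The sketch below is a workaround, not an official API: raw_filename and make_legacy_archive are hypothetical helpers, and it assumes zipfile decoded the name as UTF-8 when general-purpose flag bit 0x800 is set and as cp437 otherwise (i.e. no other metadata encoding was requested). Because Python's cp437 codec maps all 256 byte values, re-encoding is lossless in that case. The demo archive is built in memory by writing an ASCII placeholder name and patching the bytes, since zipfile itself will not write a raw 0xf7 name.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: filename is stored as UTF-8


def raw_filename(info: zipfile.ZipInfo) -> bytes:
    """Hypothetical helper: undo zipfile's decoding to get the stored bytes."""
    encoding = "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
    return info.filename.encode(encoding)


def make_legacy_archive() -> bytes:
    """Build an archive whose member name contains a raw 0xf7 byte."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("testX.txt", b"hi\n")
    # Same-length patch of the name in both the local header and the
    # central directory; the CRC covers only the file data, not the name.
    return buf.getvalue().replace(b"testX.txt", b"test\xf7.txt")


with zipfile.ZipFile(io.BytesIO(make_legacy_archive())) as zf:
    info = zf.infolist()[0]
    decoded_name = info.filename      # what zipfile hands to applications
    recovered = raw_filename(info)    # what was actually stored

print(ascii(decoded_name))
print(recovered)
```

An application that knows the archive came from an ISO-8859-1 system could then call recovered.decode("iso-8859-1") itself, which is exactly the decision that should be left to applications.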

History
Date	User	Action	Args
2019-11-20 02:52:22	jgoerzen	set	recipients: + jgoerzen
2019-11-20 02:52:22	jgoerzen	set	messageid: <1574218342.23.0.381286067182.issue38861@roundup.psfhosted.org>
2019-11-20 02:52:22	jgoerzen	link	issue38861 messages
2019-11-20 02:52:21	jgoerzen	create