classification
Title: zipfile: Corrupts filenames containing non-UTF8 characters
Type: behavior Stage:
Components: Library (Lib) Versions:
process
Status: open Resolution:
Dependencies: 28080 Superseder:
Assigned To: Nosy List: jgoerzen, jnalley, serhiy.storchaka
Priority: normal Keywords:

Created on 2019-11-20 02:52 by jgoerzen, last changed 2019-11-27 11:45 by serhiy.storchaka.

Files
File name Uploaded Description
t.zip jgoerzen, 2019-11-20 02:52 Test ZIP file with ISO-8859-1 filename
Messages (6)
msg357023 - (view) Author: John Goerzen (jgoerzen) Date: 2019-11-20 02:52
The zipfile.py standard library component contains a number of pieces of questionable handling of non-UTF8 filenames.  As the ZIP file format predated Unicode by a significant number of years, this is actually fairly common with older code.

Here is a very simple reproduction case. 

mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1.  In the "t" directory, "ls | hd" displays:

00000000  74 65 73 74 f7 2e 74 78  74 0a                    |test..txt.|
0000000a


Now, here's a simple Python3 program:

import zipfile

z = zipfile.ZipFile("t.zip")
z.extractall()

If you run this on the relevant ZIP file, the single 0xf7 byte is replaced with a three-byte UTF-8 sequence (e2 89 88); "ls | hd" now displays:

00000000  74 65 73 74 e2 89 88 2e  74 78 74 0a              |test....txt.|
0000000c
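
The bytes e2 89 88 are what you get if the stored 0xf7 byte is decoded as cp437 and the resulting string is then encoded as UTF-8. A minimal sketch to check this with the built-in codecs only; that zipfile takes exactly this path internally is an inference from the output above, not something shown in this report:

```python
# The name as stored in the archive: 0xf7 is the division sign in ISO-8859-1.
raw = b"test\xf7.txt"

# Interpreting the same byte as cp437 yields U+2248 (ALMOST EQUAL TO) instead.
as_cp437 = raw.decode("cp437")

# Writing that string to a UTF-8 filesystem turns one byte into three.
on_disk = as_cp437.encode("utf-8")

print(on_disk.hex(" "))  # 74 65 73 74 e2 89 88 2e 74 78 74
```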

The impact within Python programs is equally bad.  Fundamentally, the zipfile interface is broken; it should not try to decode filenames into strings and should instead treat them as bytes and leave potential decoding up to applications.  It appears to try, down various code paths, to decode filenames as ascii, cp437, or utf-8.  However, the ZIP file format was often used on Unix systems as well, which didn't tend to use cp437 (iso-8859-* was more common).  In short, there is no way that zipfile.py can reliably guess the encoding of a filename in a ZIP file, so it is a data-loss bug that it attempts and fails to do so.  It is a further bug that extractall mangles filenames; unzip(1) is perfectly capable of extracting these files correctly.  I'm attaching this zip file for reference.

At the very least, zipfile should provide a bytes interface for filenames for people that care about correctness.
msg357422 - (view) Author: Jon Nalley (jnalley) Date: 2019-11-25 01:04
I think the Python implementation is adhering to the zip specification.

From the specification v6.3.6 (Revised: April 26, 2019):

If general purpose bit 11 is unset, the file name and comment SHOULD conform 
to the original ZIP character encoding.  If general purpose bit 11 is set, the 
filename and comment MUST support The Unicode Standard, Version 4.1.0 or 
greater using the character encoding form defined by the UTF-8 storage 
specification.

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
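
For reference, general purpose bit 11 is visible from Python as part of ZipInfo.flag_bits. The following is only a sketch for inspecting an archive; the names UTF8_FLAG, name_is_flagged_utf8 and report are made up for illustration and are not part of zipfile:

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11 (1 << 11)


def name_is_flagged_utf8(info: zipfile.ZipInfo) -> bool:
    """Return True if the member declares its name and comment as UTF-8."""
    return bool(info.flag_bits & UTF8_FLAG)


def report(path):
    """Print, for each member, whether bit 11 is set."""
    with zipfile.ZipFile(path) as z:
        for info in z.infolist():
            if name_is_flagged_utf8(info):
                label = "bit 11 set (UTF-8)"
            else:
                label = "bit 11 unset (legacy encoding)"
            print(f"{info.filename!r}: {label}")
```

Running report("t.zip") on the attached file should show whether zip(1) set the bit for the test member.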
msg357424 - (view) Author: John Goerzen (jgoerzen) Date: 2019-11-25 03:01
I can tell you that the zip(1) on Unix systems has never done re-encoding to cp437; on a system that uses latin-1 (or any other latin-* for that matter) the filenames in the ZIP will be encoded in latin-1.  Furthermore, this doesn't explain the corruption that extractall() causes.
msg357446 - (view) Author: Jon Nalley (jnalley) Date: 2019-11-25 16:56
Please see a detailed explanation of the behavior here:
https://gist.github.com/jnalley/cec21bca2d865758bc5e23654df28bd5
msg357480 - (view) Author: John Goerzen (jgoerzen) Date: 2019-11-26 04:28
Hi Jon,

I've read your article in the gist, the ZIP spec, and the article you linked to.  As the article you linked to (https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/) states, "Implementers just encode file names however they want (usually byte for byte as they are in the OS".  That is certainly my observation.  CP437 has NEVER been guaranteed, *even on DOS*.  See https://en.wikipedia.org/wiki/Category:DOS_code_pages and https://www.aivosto.com/articles/charsets-codepages-dos.html for details on DOS code pages.  I do not recall any translation between DOS codepages being done in practice, or even possible - since the whole point of multiple codepages was the need for more than 256 symbols.  So (leaving aside utf-8 encodings for a second) no operating system or ZIP implementation I am aware of performs a translation to cp437, such translation is often not even possible, and they're just copying literal bytes to ZIP -- as the POSIX filesystem itself is.

So, from the above paragraph, it's clear that the assumption in zipfile that cp437 is in use is faulty.  Your claim that Python "fixes" a problem is also faulty.  Converting from a latin-1 character, using a cp437 codeset, and generating a filename with that cp437 character represented as a Unicode code point is wrong in many ways.  Python should not take an opinion on this; it should be agnostic and copy the bytes that represent the filename in the ZIP to bytes that represent the filename on the filesystem.

POSIX filenames may contain any of 254 byte values (only 0x00 and '/' are invalid).  The filesystem is encoding-agnostic; POSIX filenames are just streams of bytes.  There is no alternative but to treat ZIP filenames (without the Unicode flag) the same way.  Copy bytes to bytes.  It is not possible to identify the encoding of the filename in the absence of the Unicode flag.

zipfile should:

1) expose a bytes interface to filename
2) use byte-for-byte extraction when no Unicode flag is present
3) not make the assumption that cp437 was the original encoding

Your proposal only "works" cross-platform because it is broken on every platform!
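
Until zipfile offers this, the three points above can be approximated from outside the module. This is a rough sketch, not a proposed patch: raw_name and extract_raw are hypothetical helpers, and it assumes a POSIX filesystem that accepts arbitrary bytes in names and that zipfile decoded unflagged names as cp437 (its default). Because cp437 maps every byte value to a distinct character, encoding the name again restores the stored bytes.

```python
import os
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11


def raw_name(info: zipfile.ZipInfo) -> bytes:
    """Recover the member name bytes as stored in the archive."""
    # Assumption: unflagged names were decoded as cp437 by zipfile.
    codec = "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
    return info.filename.encode(codec)


def extract_raw(zip_source, dest):
    """Extract members under their stored name bytes; return written paths."""
    dest_b = os.path.realpath(os.fsencode(dest))
    written = []
    with zipfile.ZipFile(zip_source) as z:
        for info in z.infolist():
            target = os.path.realpath(os.path.join(dest_b, raw_name(info)))
            # Refuse names that would escape the destination directory.
            if os.path.commonpath([dest_b, target]) != dest_b:
                raise ValueError(f"unsafe member name: {info.filename!r}")
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                out.write(z.read(info))
            written.append(target)
    return written
```

Unlike extractall(), this does not restore permissions or sanitize names beyond the directory-escape check, so treat it as a demonstration of byte-for-byte naming rather than a replacement.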
msg357565 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-27 11:45
The standard requires interpreting the filename encoding as cp437 or UTF-8. But for practical reasons it would be handy to allow specifying another encoding (which is not necessarily equal to the local filesystem encoding). This is issue28080. But I left this issue open so that we do not forget to ensure that there is an option to use extractall() with the encoding that matches the encoding of the local zip tool. There may be peculiarities on Windows and macOS. Also, there should be an option for the CLI.
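
To illustrate what a caller-specified encoding would change, here is a sketch that reinterprets a name after the fact. redecode_name is a hypothetical helper, not the interface proposed in issue28080, and it assumes the name was decoded as cp437 when bit 11 is unset:

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11


def redecode_name(info: zipfile.ZipInfo, encoding: str) -> str:
    """Reinterpret an unflagged member name in a caller-chosen encoding."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # declared as UTF-8; nothing to reinterpret
    # Undo the assumed cp437 decode, then apply the caller's encoding.
    return info.filename.encode("cp437").decode(encoding)


# The name from this report, as it appears after a cp437 decode.
info = zipfile.ZipInfo("test\u2248.txt")
name = redecode_name(info, "iso-8859-1")  # name == "test\u00f7.txt"
```

The caller still has to know or guess the encoding used by the tool that created the archive; the sketch only shows that the stored bytes are recoverable once that is known.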
History
Date User Action Args
2019-11-27 11:45:52 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg357565; dependencies: + Allow reading member names with bogus encodings in zipfile; components: + Library (Lib)
2019-11-26 04:28:17 jgoerzen set messages: + msg357480
2019-11-25 16:56:57 jnalley set messages: + msg357446
2019-11-25 03:01:14 jgoerzen set messages: + msg357424
2019-11-25 01:04:39 jnalley set nosy: + jnalley; messages: + msg357422
2019-11-20 02:52:22 jgoerzen create