classification
Title: zipfile: wrong encoding charset of member filename
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: duplicate
Dependencies: Superseder: ZipFile: add a filename_encoding argument
View: 10614
Assigned To: Nosy List: loewis, monson
Priority: normal Keywords:

Created on 2012-08-09 06:20 by monson, last changed 2012-08-09 08:47 by loewis. This issue is now closed.

Messages (2)
msg167760 - Author: monson (monson) * Date: 2012-08-09 06:20
In /cpython/Lib/zipfile.py, there is code like this:

            if flags & 0x800:
                # UTF-8 file names extension
                filename = filename.decode('utf-8')
            else:
                # Historical ZIP filename encoding
                filename = filename.decode('cp437')


But there is actually no "historical ZIP filename encoding", because zip files contain no charset information.
In English-speaking countries this is usually not a big deal. But if the archive was created on a system that is not cp437-based (for example in China or Japan), the filename is encoded in a charset such as gb18030, yet ZipFile decodes the bytes as cp437. The resulting names are garbled, and it is hard for people to find the cause.

This problem is new in py3k; I found it on python3.2 and python3.4.
I suggest returning the filename as a bytes object, or adding a decoding parameter when opening the zip file.
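
For archives that were already written this way, the garbled names can usually be repaired after the fact, because decoding as cp437 maps every byte to exactly one character and can be reversed. Below is a minimal sketch of that workaround, assuming the real charset is known in advance (gb18030 is only an example). `recover_name` and `list_names` are hypothetical helpers written for this illustration; they are not part of zipfile.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def recover_name(name, flag_bits, real_encoding):
    """Undo a cp437 decode of a name that was really in real_encoding."""
    if flag_bits & UTF8_FLAG:
        # The name was decoded as UTF-8, so there is nothing to repair.
        return name
    # Get the original bytes back, then decode them with the
    # charset the archive was actually written in (an assumption).
    return name.encode("cp437").decode(real_encoding)


def list_names(path, real_encoding="gb18030"):
    """Return the member names of the archive at path, repaired."""
    with zipfile.ZipFile(path) as zf:
        return [
            recover_name(info.filename, info.flag_bits, real_encoding)
            for info in zf.infolist()
        ]
```

This only helps when the caller can guess the right charset, and it may not round-trip in every case (for instance if zipfile normalized path separators in the decoded name), so treat it as a workaround rather than a fix.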
msg167775 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-08-09 08:47
You are mistaken: there *is* a character set specification for file names in zip files, see

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D says

"The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437.  This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages."

Using bytes objects for file names is not acceptable; in Python 3, file names are (unicode) strings.

Adding a new parameter is an option, and already discussed in issue 10614.

People using non-437 code sets should really start using UTF-8 encoded file names in the zip files, and set the general purpose bit 11.
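
As a small illustration of the point about bit 11: to my knowledge, Python 3's own zipfile module stores a member name as UTF-8 and sets that bit by itself whenever the name cannot be encoded as ASCII. The sketch below writes an in-memory archive and checks the flag on read; the member name and contents are made up for the example.

```python
import io
import zipfile

# Write an archive with a non-ASCII member name into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("中文.txt", b"hello")

# Read it back and inspect the general purpose flags of the member.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    info = zf.infolist()[0]
    utf8_flag_set = bool(info.flag_bits & 0x800)  # bit 11
    name = info.filename
```

Archives produced this way carry their encoding with them, so a reader does not have to guess the charset.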

Closing this report as a duplicate.
History
Date User Action Args
2012-08-09 08:47:49 loewis set status: open -> closed; nosy: + loewis; messages: + msg167775; superseder: ZipFile: add a filename_encoding argument; resolution: duplicate
2012-08-09 06:20:45 monson create