classification
Title: ZipFile: add a filename_encoding argument
Type: enhancement Stage:
Components: Extension Modules Versions: Python 3.3, Python 3.2
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, haypo, loewis, ocean-city, serhiy.storchaka, umedoblock
Priority: normal Keywords: patch

Created on 2010-12-03 07:41 by ocean-city, last changed 2012-07-13 14:44 by umedoblock.

Files
File name Uploaded Description
non-ascii-cp932.zip ocean-city, 2010-12-04 10:45 built with python2.7
zipfile.patch umedoblock, 2012-07-13 04:16 decode_filename zipfile.patch
encodings.py umedoblock, 2012-07-13 14:44
Messages (10)
msg123197 - Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-12-03 07:41
Currently, ZipFile only accepts ASCII or UTF-8 as file
name encodings. On Japanese Windows, CP932 is usually
used instead. So when we extract such a ZIP file
with py3k, non-ASCII file names come out garbled. Can we handle
this issue? (e.g. by adding an encoding option to ZipFile.__init__)
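
To illustrate the symptom, here is a minimal sketch. It assumes the archive stores raw CP932 bytes without the UTF-8 flag, so that the name gets decoded as cp437; the sample file name is made up for illustration.

```python
# Hypothetical file name, chosen only for illustration.
original = "日本語.txt"

# A tool on Japanese Windows stores the name as raw CP932 bytes.
raw = original.encode("cp932")

# Without the UTF-8 flag, the bytes are decoded as cp437: mojibake.
garbled = raw.decode("cp437")

# cp437 maps every byte value, so the original bytes can be recovered
# and decoded again with the encoding that was actually used.
recovered = garbled.encode("cp437").decode("cp932")

print(repr(garbled))
print(recovered)
```

The round trip only works if you already know the archive was written with CP932, which is why an explicit option is being asked for.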
msg123201 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-12-03 08:07
The ZIP format specification mentions only cp437 and utf8: http://www.pkware.com/documents/casestudies/APPNOTE.TXT (see Appendix D).
Do zip files created on Japanese Windows contain some information about the encoding they use?
Or do some programs write cp932 where they are supposed to use one of the encodings above?
msg123202 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-12-03 08:13
No, there is no indication in the zipfile that it deviates from the spec. That doesn't stop people from creating such zipfiles, anyway; many zip tools ignore the spec and use CP_ACP (the Windows system ANSI code page) instead, which, of course, will then get misinterpreted if extracted on a different system.

I think we must support this case somehow, but must be careful to avoid creating such files unless explicitly requested. One approach might be to have two encodings given: one to interpret the existing filenames, and one to be used for new filenames (with a recommendation to never use that parameter since zip now supports UTF-8 in a well-defined manner).
msg123229 - Author: STINNER Victor (haypo) * (Python committer) Date: 2010-12-03 12:27
@Hirokazu: Can you attach a small test archive?

Yes, we can add a "default_encoding" attribute to ZipFile and add an optional default_encoding argument to its constructor.
msg123332 - Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2010-12-04 10:45
I'm not sure why, but I get a BadZipFile error now. Anyway,
here is a cp932 zip file created with python2.7.
msg126791 - Author: STINNER Victor (haypo) * (Python committer) Date: 2011-01-21 22:39
In #10972, I proposed adding an option to set the filename encoding to UTF-8. But that issue is about forcing UTF-8 when creating a ZIP file; it doesn't concern the decompression of a ZIP file.

Here is a proposed specification to fix both issues at the same time.

The "default_encoding" name is confusing because it doesn't specify whether it is the encoding of the (text?) file content or the encoding of the filename. Why not simply "filename_encoding"?

The option can be added in multiple places:
 - argument to ZipFile constructor: this is needed to decompress
 - argument to ZipFile.write() and ZipInfo, because they are 3 different ways to add files

ZipFile.filename_encoding (and ZipInfo.filename_encoding) will be None by default: in this case, use the current algorithm (try cp437 or use UTF-8). Otherwise, use the given encoding. If the encoding is UTF-8, set the Unicode flag.

Examples:
---
zipfile.ZipFile("non-ascii-cp932.zip", filename_encoding="cp932")

f = zipfile.ZipFile("test.zip", "w")
f.write(filename, filename_encoding="UTF-8")
info = ZipInfo(filename, filename_encoding="UTF-8")
f.writestr(info, b'data')
---

Don't add a filename_encoding argument to ZipFile.writestr(): it could conflict when a ZipInfo is passed and ZipInfo.filename_encoding differs from filename_encoding.
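
The selection rule proposed above could be sketched roughly as follows. This is not zipfile code: decode_filename() and encode_filename() are hypothetical helpers, and two points are assumptions rather than part of the proposal: a name carrying the UTF-8 flag is always decoded as UTF-8 even when an encoding is passed, and the "current algorithm" for writing is taken to be plain ASCII if possible, otherwise flagged UTF-8.

```python
import codecs

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def decode_filename(raw, flag_bits, filename_encoding=None):
    """Decode a stored name (hypothetical helper, not part of zipfile)."""
    if flag_bits & UTF8_FLAG:
        # Assumption: an explicit UTF-8 flag wins over the caller's hint.
        return raw.decode("utf-8")
    if filename_encoding is None:
        return raw.decode("cp437")  # default when nothing is specified
    return raw.decode(filename_encoding)


def encode_filename(name, filename_encoding=None):
    """Return (raw_bytes, flag_bits) for a name that is about to be written."""
    if filename_encoding is None:
        # Assumed current behaviour: ASCII if possible, else flagged UTF-8.
        try:
            return name.encode("ascii"), 0
        except UnicodeEncodeError:
            return name.encode("utf-8"), UTF8_FLAG
    raw = name.encode(filename_encoding)
    # codecs.lookup() normalises spellings such as "UTF-8" and "utf8".
    if codecs.lookup(filename_encoding).name == "utf-8":
        return raw, UTF8_FLAG
    return raw, 0
```

With this shape, a legacy encoding is only ever written when the caller asks for it explicitly, which matches the earlier concern about not creating non-conforming archives by accident.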
msg136233 - Author: STINNER Victor (haypo) * (Python committer) Date: 2011-05-18 12:02
I closed issue #12048 as a duplicate of this issue: yaoyu wants to uncompress a ZIP file whose filenames are encoded in GBK.
msg165351 - Author: umedoblock (umedoblock) Date: 2012-07-13 04:16
I fixed this problem.
I made a new method, _decode_filename().
msg165384 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-07-13 14:02
umedoblock: your patch is incorrect, as it produces mojibake. If there is a file name b'f\x94n', it will be decoded as sjis under your patch (to u'f\u99ac'), even though it was meant as cp437 (i.e. u'f\xf6n').
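
The ambiguity can be reproduced with the codecs alone; a minimal sketch using the byte string from the example above:

```python
raw = b"f\x94n"

# The same three bytes are valid in both encodings, so decoding cannot
# fail and there is nothing in the bytes to tell the two readings apart.
as_sjis = raw.decode("shift_jis")
as_cp437 = raw.decode("cp437")

print(ascii(as_sjis))   # 'f\u99ac'
print(ascii(as_cp437))  # 'f\xf6n'
```

Since both decodes succeed, guessing from the bytes is not reliable; the intended encoding has to come from the caller.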
msg165386 - Author: umedoblock (umedoblock) Date: 2012-07-13 14:44
Hi, Martin.
I tried your test case with the attached file
and got the result below.

p3 ./encodings.py
encoding: sjis, filename: f馬
encoding: cp437, filename: fön
sjis_filename = f馬
cp437_filename = fön

Both cases succeed.
So I think that the patch needs to set default_encoding
before or inside _decode_filename().

But I have no idea how to change default_encoding.
History
Date User Action Args
2012-08-09 08:47:49 loewis link issue15602 superseder
2012-07-13 14:44:19 umedoblock set files: + encodings.py; messages: + msg165386
2012-07-13 14:02:39 loewis set messages: + msg165384
2012-07-13 04:16:20 umedoblock set files: + zipfile.patch; nosy: + umedoblock; messages: + msg165351; keywords: + patch
2012-04-07 19:21:35 serhiy.storchaka set nosy: + serhiy.storchaka
2011-05-18 12:02:34 haypo set messages: + msg136233
2011-02-01 00:03:16 haypo set nosy: loewis, amaury.forgeotdarc, haypo, ocean-city; title: ZipFile and CP932 encoding -> ZipFile: add a filename_encoding argument
2011-01-21 22:39:16 haypo set nosy: loewis, amaury.forgeotdarc, haypo, ocean-city; messages: + msg126791
2010-12-04 10:45:06 ocean-city set files: + non-ascii-cp932.zip; messages: + msg123332
2010-12-03 12:27:16 haypo set nosy: + haypo; messages: + msg123229
2010-12-03 08:13:29 loewis set nosy: + loewis; messages: + msg123202
2010-12-03 08:07:38 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg123201
2010-12-03 07:41:56 ocean-city create