ZipFile: add a filename_encoding argument #54823
Comments
Currently, ZipFile only accepts ascii or utf8 as file |
The ZIP format specification mentions only cp437 and utf8: http://www.pkware.com/documents/casestudies/APPNOTE.TXT (see Appendix D). |
No, there is no indication in the zipfile that it deviates from the spec. That doesn't stop people from creating such zipfiles, anyway; many zip tools ignore the spec and use instead CP_ACP (which, of course, will then get misinterpreted if extracted on a different system). I think we must support this case somehow, but must be careful to avoid creating such files unless explicitly requested. One approach might be to have two encodings given: one to interpret the existing filenames, and one to be used for new filenames (with a recommendation to never use that parameter since zip now supports UTF-8 in a well-defined manner). |
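To make the two-encoding idea above concrete, here is a minimal sketch of the writing side only. The helper name `encode_member_name` and its `legacy_encoding` parameter are hypothetical and not part of `zipfile`; the one fact taken from the ZIP spec is that bit 11 (0x800) of the general purpose flags marks a UTF-8 name. Names that fit in ASCII are stored as-is, everything else becomes UTF-8 unless a legacy encoding is explicitly requested.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def encode_member_name(name, legacy_encoding=None):
    """Return (raw_bytes, flag_bits) for a new entry.

    Hypothetical helper for illustration; not part of zipfile.
    """
    try:
        # Pure ASCII names are unambiguous in every encoding discussed here.
        return name.encode("ascii"), 0
    except UnicodeEncodeError:
        pass
    if legacy_encoding is None:
        # Default: the well-defined path, UTF-8 with the flag set.
        return name.encode("utf-8"), UTF8_FLAG
    # Only on explicit request: legacy bytes, with no marker in the archive.
    return name.encode(legacy_encoding), 0
```

With this shape, a non-spec archive can only be produced by passing `legacy_encoding` deliberately, which is the safeguard the comment asks for.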
@hirokazu: Can you attach a small test archive? Yes, we can add a "default_encoding" attribute to ZipFile and add an optional default_encoding argument to its constructor. |
I'm not sure why, but I got BadZipFile error now. Anyway, |
In bpo-10972, I proposed adding an option to force the filename encoding to UTF-8. But that option only concerns creating a ZIP file; it doesn't concern the decompression of a ZIP file. Here is a proposed specification to fix both issues at the same time. The name "default_encoding" is confusing because it doesn't say whether it is the encoding of the (text?) file content or the encoding of the filename. Why not simply "filename_encoding"? The option can be added in multiple places:
ZipFile.filename_encoding (and ZipInfo.filename_encoding) will be None by default: in this case, use the current algorithm (try cp437 or use UTF-8). Otherwise, use the encoding. If the encoding is UTF-8: set unicode flag.
Examples:
zipfile.ZipFile("non-ascii-cp932.zip", filename_encoding="cp932")
f = zipfile.ZipFile("test.zip", "w")
f.write(filename, filename_encoding="UTF-8")
info = ZipInfo(filename, filename_encoding="UTF-8")
f.writestr(info, b'data')
Don't add filename_encoding argument to ZipFile.writestr(), because it may conflict if a ZipInfo is passed and ZipInfo.filename_encoding and filename_encoding are different. |
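The reading half of this proposal could be sketched as below. `decode_member_name` is a hypothetical helper, not an existing `zipfile` function, and it assumes that the UTF-8 flag (bit 11, 0x800) still takes priority over an explicit `filename_encoding`; the comment above does not settle that point.

```python
UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def decode_member_name(raw_name, flag_bits, filename_encoding=None):
    """Decode a stored member name following the proposed rules.

    Hypothetical helper for illustration; not part of zipfile.
    """
    if flag_bits & UTF8_FLAG:
        # Assumption: an entry that declares UTF-8 is always trusted.
        return raw_name.decode("utf-8")
    if filename_encoding is None:
        # Current behaviour: fall back to the historical cp437.
        return raw_name.decode("cp437")
    # Caller knows better, e.g. "cp932" for an archive made on Japanese Windows.
    return raw_name.decode(filename_encoding)
```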
I closed issue bpo-12048 as a duplicate of this issue: yaoyu wants to uncompress a ZIP file having filenames encoded to GBK. |
I fixed this problem. |
umedoblock: your patch is incorrect, as it produces mojibake. If there is a file name b'f\x94n', it will be decoded as sjis under your patch (to u'f\u99ac'), even though it was meant as cp437 (i.e. u'f\xf6n'). |
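The ambiguity described above is easy to reproduce with the standard codecs. This snippet only decodes the same three bytes twice; it assumes Python's `shift_jis` codec stands in for the "sjis" mentioned in the comment.

```python
raw = b"f\x94n"  # the raw filename bytes from the example above

as_cp437 = raw.decode("cp437")     # byte 0x94 is a single character in cp437
as_sjis = raw.decode("shift_jis")  # 0x94 0x6e is read as one two-byte character

print(ascii(as_cp437), len(as_cp437))
print(ascii(as_sjis), len(as_sjis))
```

Both decodes succeed without error, so nothing in the bytes themselves tells a reader which interpretation was intended.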
Hi, Martin. p3 ./encodings.py There are two success cases. But I have no idea about how to change a default_encoding. |
I'd like to submit a patch to support zip archives created on systems that use a non-US codepage (e.g. Russian CP866).
--- zipfile.py-orig 2013-09-18 16:45:56.000000000 +0400
- def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
+ def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False, codepage='cp437'):
"""Open the ZIP file with mode read "r", write "w" or append "a"."""
if mode not in ("r", "w", "a"):
raise RuntimeError('ZipFile() requires mode "r", "w", or "a"')
@@ -901,6 +901,7 @@
self.mode = key = mode.replace('b', '')[0]
self.pwd = None
self._comment = b''
+ self.codepage = codepage
# Check if we were passed a file-like object
if isinstance(file, str):
@@ -1002,7 +1003,7 @@
filename = filename.decode('utf-8')
else:
# Historical ZIP filename encoding
- filename = filename.decode('cp437')
+ filename = filename.decode(self.codepage)
# Create ZipInfo instance to store file information
x = ZipInfo(filename)
x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH])
@@ -1157,7 +1158,7 @@
# UTF-8 filename
fname_str = fname.decode("utf-8")
else:
- fname_str = fname.decode("cp437")
+ fname_str = fname.decode(self.codepage)
if fname_str != zinfo.orig_filename:
raise BadZipFile( |
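Without a patch like the one above, a common workaround is to undo the cp437 decoding after the fact. The sketch below assumes the archive's legacy names are CP866; `redecode` and `fixed_names` are hypothetical helper names. It relies on Python's cp437 codec mapping every byte value, so the encode step recovers the original bytes.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name was stored as UTF-8


def redecode(name, encoding):
    """Undo zipfile's cp437 decoding and decode the bytes with `encoding`."""
    return name.encode("cp437").decode(encoding)


def fixed_names(zf, encoding):
    """Map each name as zipfile reports it to its re-decoded form."""
    return {
        info.filename: (
            info.filename  # UTF-8 entries were already decoded correctly
            if info.flag_bits & UTF8_FLAG
            else redecode(info.filename, encoding)
        )
        for info in zf.infolist()
    }


# zipfile itself writes non-ASCII names as UTF-8, so simulate the misread:
raw = "привет.txt".encode("cp866")  # bytes a CP866 tool would have stored
misread = raw.decode("cp437")       # what an unpatched ZipFile would report
recovered = redecode(misread, "cp866")

# On an archive written by zipfile, every name should come back unchanged.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"x")
    zf.writestr("привет.txt", b"y")
with zipfile.ZipFile(buf) as zf:
    names = fixed_names(zf, "cp866")
```

Checking the flag first matters: re-decoding a name that was stored as UTF-8 would corrupt it or raise an error.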
Please rename codepage to encoding. By the way, 437 is a codepage, cp437 is I don't think that ZIP is limited to windows. I uncompressed zip files many |
OK, here you are:
--- zipfile.py-orig 2013-09-18 16:45:56.000000000 +0400
- def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False):
+ def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=False, encoding='cp437'):
"""Open the ZIP file with mode read "r", write "w" or append "a"."""
if mode not in ("r", "w", "a"):
raise RuntimeError('ZipFile() requires mode "r", "w", or "a"')
@@ -901,6 +901,7 @@
self.mode = key = mode.replace('b', '')[0]
self.pwd = None
self._comment = b''
+ self.encoding = encoding
# Check if we were passed a file-like object
if isinstance(file, str):
@@ -1001,8 +1002,8 @@
# UTF-8 file names extension
filename = filename.decode('utf-8')
else:
- # Historical ZIP filename encoding
- filename = filename.decode('cp437')
+ # Historical ZIP filename encoding, default is CP437
+ filename = filename.decode(self.encoding)
# Create ZipInfo instance to store file information
x = ZipInfo(filename)
x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH])
@@ -1157,7 +1158,7 @@
# UTF-8 filename
fname_str = fname.decode("utf-8")
else:
- fname_str = fname.decode("cp437")
+ fname_str = fname.decode(self.encoding)
if fname_str != zinfo.orig_filename:
raise BadZipFile(
|
See also bpo-28080. |
Thanks. Patch posted in bpo-28080 looks better than mine. |