classification
Title: zipfile.write, arcname should be allowed to be a byte string
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 3.1, Python 3.2
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Patrik Dufresne, aimacintyre, connexion2000, loewis, ozialien, r.david.murray
Priority: normal
Keywords:

Created on 2010-12-22 12:44 by connexion2000, last changed 2016-01-02 23:23 by Patrik Dufresne.

Messages (7)
msg124499 - Author: Jacek Jabłoński (connexion2000) Date: 2010-12-22 12:44
import zipfile

file = 'somefile.dat'
filename = "ółśąśółąś.dat"
zip = zipfile.ZipFile('archive.zip', 'w', zipfile.ZIP_DEFLATED)
zip.write(file, filename)  # arcname passed as str
zip.close()

The above produces a very nasty filename in the zip archive.
*************************************************************
import zipfile

file = 'somefile.dat'
filename = "ółśąśółąś.dat"
zip = zipfile.ZipFile('archive.zip', 'w', zipfile.ZIP_DEFLATED)
zip.write(file, filename.encode('cp852'))  # arcname passed as bytes

This produces TypeError: expected an object with the buffer interface

Documentation says that:
There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write().

I convert them to a byte string, but it ends with an error.
If this is a documentation bug, what is the proper way to have filenames like "ółśąśółąś" in a zip archive?
msg124518 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-12-22 20:07
This is not a bug. Your code that produces a "very nasty filename" is the right one - the file name is actually the one you asked for. The second snippet is also behaving correctly: filename already *is* a bytestring, so calling .encode on it is meaningless.
msg124519 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-12-22 20:12
Oops, I take this back - I didn't notice you were using Python 3.1.

In any case, your first code is correct. What you get is the best you can ask for.

That the second case fails is indeed a bug.
msg124641 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-25 16:37
See also msg79724 of issue 4871. From looking at the code, it appears that the filename must be a string: if it contains only ASCII characters it is stored as ASCII, while if it contains non-ASCII characters it is encoded to UTF-8 and the appropriate flag bits are set in the archive to indicate this. (I know nothing about the archive format, by the way; I'm just looking at the code.)

So, in reverse of issue 4871, it appears that in this case the API should reject bytes input with an appropriate error message.
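
The behaviour described in this message can be checked with a minimal sketch, assuming a current Python 3 zipfile module. The helper name stored_name_and_flags and the in-memory buffer are illustrative and not part of the report. Bit 11 (0x800) of the general-purpose flags is the ZIP format's marker for UTF-8 encoded names.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11 in the ZIP format


def stored_name_and_flags(arcname):
    """Write one member under arcname to an in-memory archive, then read it back.

    Hypothetical helper for illustration only.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(arcname, b"payload")
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return info.filename, info.flag_bits


ascii_name, ascii_flags = stored_name_and_flags("somefile.dat")
unicode_name, unicode_flags = stored_name_and_flags("ółśąśółąś.dat")

# Report only the flag state, to avoid depending on the console encoding.
print("ASCII name uses UTF-8 flag:", bool(ascii_flags & UTF8_NAME_FLAG))
print("non-ASCII name uses UTF-8 flag:", bool(unicode_flags & UTF8_NAME_FLAG))
```

If the reading of the code above is right, the flag is clear for the ASCII name and set for the non-ASCII one, and zipfile itself reads both names back unchanged. Other zip tools that ignore the flag may still display the UTF-8 bytes as a "nasty" name.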
msg124686 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-12-26 23:54
> So, in reverse of issue 4871, it appears that in this case the API should reject bytes input with an appropriate error message.

-1. It is quite common to produce ill-formed zipfiles, and other
ziptools are interpreting them in violation of the format spec.
Python needs to support creation of such broken zipfiles,
even though it may not be able to read them back.
msg124690 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-27 00:45
Well, this is the same treat-strings-and-byte-strings-equivalently-in-the-same-API problem that we've had elsewhere. It'll require a bit of refactoring to make it work.

On read, zipfile decodes filenames using cp437 if the utf-8 flag isn't set. Logically, then, a binary string should be encoded using cp437. Since cp437 has a character corresponding to each of the 256 bytes, it seems to me it should be enough to decode a binary filename using cp437 and set a flag that _encodeFilenameFlags would respect, re-encoding to cp437 instead of utf-8. That might produce unexpected results if someone passes in a binary filename encoded in some other character set, but it would be consistent with how zipfiles work and so should be at least as interoperable as zipfiles normally are.
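
The round-trip property this proposal relies on can be sketched as follows. The helper names bytes_to_arcname and arcname_to_bytes are hypothetical; nothing here is existing zipfile API, and the sketch covers only the decode and re-encode steps, not the proposed flag.

```python
# Hypothetical helpers illustrating the cp437 round trip; not part of zipfile.
def bytes_to_arcname(raw):
    """Decode an arbitrary byte-string filename to str via cp437."""
    return raw.decode("cp437")


def arcname_to_bytes(name):
    """Re-encode a cp437-decoded name back to its original bytes."""
    return name.encode("cp437")


# Every one of the 256 byte values decodes to a character and encodes back.
ALL_BYTES = bytes(range(256))
decoded = bytes_to_arcname(ALL_BYTES)
round_trip_ok = arcname_to_bytes(decoded) == ALL_BYTES

# A name encoded in another code page (cp852, as in the original report)
# also survives byte-for-byte, although the intermediate str looks wrong.
raw_cp852 = "ółśąśółąś.dat".encode("cp852")
cp852_ok = arcname_to_bytes(bytes_to_arcname(raw_cp852)) == raw_cp852

print(round_trip_ok, cp852_ok)
```

The intermediate str is only a carrier for the bytes. Without the proposed flag, a non-ASCII str handed to zipfile would still be written as UTF-8, which is why the message suggests teaching _encodeFilenameFlags about this case.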
msg257385 - Author: Patrik Dufresne (Patrik Dufresne) Date: 2016-01-02 23:23
This bug is very old; has there been any development on the subject? This issue is hitting me while porting my project (rdiffweb) to Python 3. It receives a lot of broken filenames with invalid encodings, and I need to create a meaningful zip archive from them. Currently, it just fails because zipfile doesn't accept arcname as bytes.
History
Date | User | Action | Args
2016-01-02 23:23:30 | Patrik Dufresne | set | nosy: + Patrik Dufresne
    messages: + msg257385
2015-07-21 07:19:00 | ethan.furman | set | nosy: - ethan.furman
2015-04-13 21:25:50 | ozialien | set | nosy: + ozialien
2013-10-14 22:39:46 | ethan.furman | set | nosy: + ethan.furman
2010-12-27 00:45:06 | r.david.murray | set | nosy: loewis, aimacintyre, r.david.murray, connexion2000
    messages: + msg124690
    title: zipfile.write, arcname should be bytestring -> zipfile.write, arcname should be allowed to be a byte string
2010-12-26 23:54:25 | loewis | set | nosy: loewis, aimacintyre, r.david.murray, connexion2000
    messages: + msg124686
2010-12-25 16:37:05 | r.david.murray | set | nosy: + r.david.murray
    messages: + msg124641
2010-12-24 21:54:48 | terry.reedy | set | nosy: + aimacintyre
    stage: test needed
    type: compile error -> behavior
    versions: + Python 3.2
2010-12-22 20:12:05 | loewis | set | status: closed -> open
    messages: + msg124519
    resolution: not a bug ->
2010-12-22 20:07:48 | loewis | set | status: open -> closed
    nosy: + loewis
    messages: + msg124518
    resolution: not a bug
2010-12-22 12:44:03 | connexion2000 | create |