classification
Title: zipfile returns string but expects binary
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: francescor, skreft, talund, v+python, vstinner
Priority: normal Keywords: patch

Created on 2008-12-10 16:27 by francescor, last changed 2011-05-18 12:00 by vstinner. This issue is now closed.

Files
File name Uploaded Description
test.zip francescor, 2008-12-10 16:27
x.zip vstinner, 2008-12-20 14:18
patch.diff skreft, 2008-12-21 02:58
testzip.py v+python, 2010-03-27 05:48 test case for opening zip members using \ separator
Messages (14)
msg77555 - (view) Author: Francesco Ricciardi (francescor) Date: 2008-12-10 16:27
Each entry of a zip file, as read by the zipfile module, can be accessed
via a ZipInfo object. The filename attribute of ZipInfo is a string.
However, the read method of a ZipFile object expects a binary string as
its argument, or at least this is what I can deduce from the following behavior:

>>> import zipfile
>>> testzip = zipfile.ZipFile('test.zip')
>>> t1 = testzip.infolist()[0]
>>> t1.filename
'tést.xml'
>>> data = testzip.read(testzip.infolist()[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\zipfile.py", line 843, in read
    return self.open(name, "r", pwd).read()
  File "C:\Python30\lib\zipfile.py", line 883, in open
    % (zinfo.orig_filename, fname))
zipfile.BadZipfile: File name in directory 'tést.xml' and header
b't\x82st.xml' differ.

The test.zip file is attached as help in reproducing this error.
msg78004 - (view) Author: (skreft) Date: 2008-12-18 01:34
The error you got is caused by giving the wrong parameters. You gave a
ZipInfo object instead of a filename.

If you execute data = testzip.read(t1.filename) you will have no problems.
msg78014 - (view) Author: Francesco Ricciardi (francescor) Date: 2008-12-18 07:33
If that is what is requested, then the manual entry for ZipFile.read
must be corrected, because it states:

"ZipFile.read(name[, pwd]) .... name is the name of the file in the
archive, or a ZipInfo object."
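For reference, a minimal sketch of the two documented call forms, using an archive built in memory rather than the attached test.zip (the member content is made up for illustration). On a current Python 3 both forms are expected to return the same bytes; this does not reproduce the 3.0 failure reported here.

```python
import io
import zipfile

# Build a small archive in memory; the payload is illustrative only.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("tést.xml", b"<root/>")

with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]
    by_name = zf.read(info.filename)  # pass the member name (str)
    by_info = zf.read(info)           # pass the ZipInfo object itself
```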


However, Eddie, you haven't tried what you suggested, because this is
what you would get:

>>> import zipfile
>>> testzip = zipfile.ZipFile('test.zip')
>>> t1 = testzip.infolist()[0]
>>> t1.filename
'tést.xml'
>>> data = testzip.read(t1.filename)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\zipfile.py", line 843, in read
    return self.open(name, "r", pwd).read()
  File "C:\Python30\lib\zipfile.py", line 883, in open
    % (zinfo.orig_filename, fname))
zipfile.BadZipfile: File name in directory 'tést.xml' and header
b't\x82st.xml' differ.
msg78025 - (view) Author: (skreft) Date: 2008-12-18 14:17
Sorry, my bad.
I did try it, but with the wrong version (2.5), and there it worked perfectly.

So sorry again for my mistake.

Anyways, I've found the error.

The problem is caused by different encodings used when zipping.

In open, the method is comparing b't\x82st.xml' against
b't\xc3\xa9st.xml', and of course they are different.
But they are not so different, because b't\x82st.xml' is
'tést.xml'.encode('cp437') and b't\xc3\xa9st.xml' is 'tést.xml'.encode('utf-8').

The problem arises because the open method assumes the filename is in
utf-8 encoding, whereas __init__ recognizes that the encoding depends on
the flags:
if flags & 0x800:
    filename = filename.decode('utf-8')
else:
    filename = filename.decode('cp437')
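
A minimal sketch of the mismatch and of a symmetric rule that would avoid it. The helper names and the UTF8_FLAG constant are hypothetical, chosen for illustration; this is not the code from patch.diff or from zipfile.py.

```python
UTF8_FLAG = 0x800  # general-purpose bit that marks UTF-8 member names

name = "tést.xml"
cp437_bytes = name.encode("cp437")  # what the local header of test.zip holds
utf8_bytes = name.encode("utf-8")   # what open() compared it against


def decode_member_name(raw: bytes, flags: int) -> str:
    """Decode a stored name the way the snippet above does."""
    return raw.decode("utf-8" if flags & UTF8_FLAG else "cp437")


def encode_member_name(text: str, flags: int) -> bytes:
    """Hypothetical inverse: re-encode with the same flag-dependent codec."""
    return text.encode("utf-8" if flags & UTF8_FLAG else "cp437")
```

If the comparison re-encodes with the codec selected by the flags, the bytes from the directory and from the header agree for both flag values.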
msg78104 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-20 14:03
In the ZIP file format, a filename is a byte string because we don't 
know the encoding. You cannot guess the encoding because it's not 
stored in the ZIP file and it depends on the OS and the OS 
configuration. So t1.filename has to be a byte string and 
testzip.read() has to use bytes and not str.
msg78105 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-20 14:06
Oh, I see that zipfile.py uses the following code to choose the 
filename encoding:
            if flags & 0x800:
                # UTF-8 file names extension
                filename = filename.decode('utf-8')
            else:
                # Historical ZIP filename encoding
                filename = filename.decode('cp437')

So maybe I'm wrong: the encoding is known from a flag?
msg78107 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-20 14:18
Test on Ubuntu Gutsy (utf8 file system) with zip 2.32:
$ mkdir x
$ touch x/hé
$ zip -r x.zip x
  adding: x/ (stored 0%)
  adding: x/hé (stored 0%)

$ python # 3.0 trunk
>>> import zipfile
>>> testzip = zipfile.ZipFile('x.zip')
>>> testzip.infolist()[1].filename
'x/hé'
>>> print(ascii(testzip.infolist()[1].filename))
'x/h\u251c\u2310'

Using my own file parser (hachoir-wx), I can see that flags=0 and 
filename=bytes {78 2f 68 c3 a9} ("x/hé" in UTF-8).

You can try x.zip: I attached the file.
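
The flag can also be inspected from Python through ZipInfo.flag_bits. A minimal sketch, assuming an archive written by a current zipfile module (which, unlike zip 2.32 in the test above, sets bit 0x800 for non-ASCII names); the member names are illustrative.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit that marks UTF-8 member names

# Write one ASCII-named and one non-ASCII-named member into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("x/plain.txt", b"")
    zf.writestr("x/hé", b"")

# Read the archive back and record whether each member carries the flag.
with zipfile.ZipFile(buf) as zf:
    utf8_flagged = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }
```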
msg78111 - (view) Author: (skreft) Date: 2008-12-20 16:06
The problem is not about reading the filenames, but about reading the contents
of a file whose filename has non-ASCII characters.
msg78137 - (view) Author: (skreft) Date: 2008-12-21 02:52
I read again what STINNER Victor wrote, and I think that he found another bug.

When listing the filenames of that zip file, the names are not
displayed correctly. In fact
'x/h├⌐' == 'x/hé'.encode('utf-8').decode('cp437')

So, there is again a problem with encodings when reading the contents.

The problem here is that when reading one cannot give the real filename,
because it is not a key in the NameToInfo dictionary.
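
A minimal sketch of that round trip, and of a caller-side workaround that assumes the stored bytes really were UTF-8. That assumption cannot be checked from the archive itself when the flag is not set.

```python
original = "x/hé"

# UTF-8 bytes decoded as cp437: the name that infolist() reported above.
mojibake = original.encode("utf-8").decode("cp437")

# Workaround sketch: undo the cp437 decoding, then decode as UTF-8.
recovered = mojibake.encode("cp437").decode("utf-8")
```

Note that the mangled name, not the recovered one, is the key under which the member is registered, so the mangled form is the one to pass when reading.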
msg78138 - (view) Author: (skreft) Date: 2008-12-21 02:58
Attached is a patch that solves (I hope) the initial problem, the one
from Francesco Ricciardi.
msg101820 - (view) Author: Glenn Linderman (v+python) * Date: 2010-03-27 05:48
I just "discovered" that attempting to open zip member "test\file" fails where attempting to open "test/file" works.  Granted the zip contains "/" not "\" characters, but using the os.path stuff (on windows) to manipulate the names before attempting to open the zip member produces "\" characters.  Clearly, I could switch them back.  It seems pretty clear that zipfile should do that for me, though.

A small, self-contained zip file test case is attached, being a zip that is named .py 

My testing used Python 3.1.1.
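
A minimal sketch of the caller-side conversion mentioned above. The helper to_member_name is hypothetical and not part of zipfile, and the archive is built in memory for illustration rather than taken from testzip.py.

```python
import io
import zipfile


def to_member_name(path: str) -> str:
    """Hypothetical helper: turn Windows-style separators into '/'."""
    return path.replace("\\", "/")


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("test/file", b"payload")

with zipfile.ZipFile(buf) as zf:
    try:
        zf.read("test\\file")  # lookup by the backslash form
        backslash_lookup_ok = True
    except KeyError:
        # No member is registered under the backslash form.
        backslash_lookup_ok = False
    data = zf.read(to_member_name("test\\file"))
```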
msg136224 - (view) Author: Tor Arvid Lund (talund) Date: 2011-05-18 11:04
I was wondering what has prevented Eddie's patch from being included in Python. Has nobody volunteered to verify that it works? I would be willing to do that, though I have never compiled Python on any platform before.

It just seems a bit silly to me that python cannot work with zip files with unicode file names... I just now had to do 'os.system("unzip.exe ...")' because zipfile did not work for me...
msg136227 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-18 11:31
This issue looks to be a duplicate of #10801 which was only fixed (33543b4e0e5d) in Python 3.2. See also #12048: similar issue in Python 3.1.
msg136231 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-05-18 12:00
The initial problem is clearly a duplicate of issue #10801 which is now fixed in Python 3.1+ (I just backported the fix to Python 3.1).

> I just "discovered" that attempting to open zip member "test\file"
> fails where attempting to open "test/file" works. (...)
> It seems pretty clear that zipfile should do that for me, though.

@v+python: I don't think so, but others may agree with you. Please open a new issue, because it is unrelated to the initial bug report.

I'm closing this issue because the initial problem is now fixed.

For the x.zip problem (UTF-8 encoded filenames without the "Unicode" flag), there is already issue #10614, which handles this case.
History
Date User Action Args
2011-05-18 12:00:18 vstinner set status: open -> closed
resolution: fixed
messages: + msg136231
2011-05-18 11:31:32 vstinner set messages: + msg136227
2011-05-18 11:04:32 talund set nosy: + talund
messages: + msg136224
2010-03-27 05:48:21 v+python set files: + testzip.py
nosy: + v+python
messages: + msg101820
2008-12-21 02:58:08 skreft set files: + patch.diff
keywords: + patch
messages: + msg78138
2008-12-21 02:52:40 skreft set messages: + msg78137
2008-12-20 16:06:55 skreft set messages: + msg78111
2008-12-20 14:18:26 vstinner set files: + x.zip
messages: + msg78107
2008-12-20 14:06:34 vstinner set messages: + msg78105
2008-12-20 14:03:03 vstinner set nosy: + vstinner
messages: + msg78104
2008-12-18 14:17:54 skreft set messages: + msg78025
2008-12-18 07:33:03 francescor set messages: + msg78014
2008-12-18 01:34:14 skreft set nosy: + skreft
messages: + msg78004
2008-12-10 16:27:46 francescor create