classification
Title: zipfile can't extract file
Type: behavior Stage: needs patch
Components: Library (Lib), Windows Versions: Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: tim.golden Nosy List: NewerCookie, alanmcintyre, amaury.forgeotdarc, chuck, ronaldoussoren, terry.reedy, tim.golden
Priority: normal Keywords: patch

Created on 2009-09-04 19:56 by NewerCookie, last changed 2010-09-14 11:35 by ronaldoussoren.

Files
File name Uploaded Description
test.zip NewerCookie, 2009-09-04 19:56
zlib_forward_slash.patch chuck, 2009-09-19 17:01
Messages (10)
msg92265 - (view) Author: Kim Kyung Don (NewerCookie) Date: 2009-09-04 19:57
The following exception occurred when I tried to extract the attached archive on Windows.

"zipfile.BadZipfile: File name in directory "test\test2.txt" and header
"test/test2.txt" differ."

It seems to be a problem with the slashes.
I tested with zipfile revision 72893.
msg92297 - (view) Author: Kim Kyung Don (NewerCookie) Date: 2009-09-06 04:02
P.S
I tested extraction by using 7-zip.
It works fine.
msg92309 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2009-09-06 12:40
The zipfile is technically incorrect: the zipfile specification prescribes 
that all filenames use '/' as the directory separator.

Even without that caveat the file is corrupt because the zipfile directory 
header and the per-file header don't agree on the name of the file.

That said: IMHO the current code in zipfile.ZipFile.open is too strict, it 
shouldn't raise an error when the two names aren't exactly the same 
because there are valid reasons for them to be different (such as renaming 
a file in the zipfile without rewriting the entire zipfile).
msg92326 - (view) Author: Alan McIntyre (alanmcintyre) (Python committer) Date: 2009-09-06 18:58
FileRoller doesn't complain about the mismatched slashes either.  Where
did the ZIP come from, by the way?  I seem to recall that there have
been other instances in which ZIP applications were more "forgiving"
than the zipfile module.  How far should zipfile go in bending the
interpretation of the ZIP standard?  

As far as the renaming goes, it seems the standard says the header name
should be used if the two names are different.  If nobody else has time
to make a patch and tests I can take a stab at it in the next few days.
msg92330 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2009-09-06 20:41
alan: I don't quite understand which filename you want to use when the 
name in the per-file header and the central directory don't match. 

Where in the standard is this prescribed? I couldn't find anything in 
the PKWare zipfile appnote [1]

My preference would be to use the central directory as the canonical 
value because scanning the entire zipfile to read the per-file headers 
would add significant overhead. This might not be very noticeable with 
small zipfiles, but I regularly use zipfiles with over 100K files in 
them; for those files a scan of the zipfile is prohibitively expensive.

Furthermore, when the two are different the most reasonable explanation 
is that an in-place edit of the zipfile changed the directory without 
rewriting the entire zipfile (just like you can "delete" files from a 
zipfile by dropping them from the directory rather than completely 
rewriting the entire archive).

[1] 
APPNOTE.TXT - .ZIP File Format Specification Version: 6.3.2 
Revised: September 28, 2007 
Copyright (c) 1989 - 2007 PKWARE Inc., All Rights Reserved.
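
To make the two name fields under discussion concrete, here is a minimal sketch that reads the name stored in each per-file (local) header and pairs it with the name from the central directory. The helper names `local_header_name` and `name_pairs` are hypothetical; the 30-byte local header layout follows the PKWARE appnote, and `ZipInfo.header_offset` and `ZipInfo.orig_filename` are existing attributes of the standard library's `ZipInfo`. It reads one local header per member, which is the kind of extra I/O the message above is worried about for very large archives.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header (signature through extra length).
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def local_header_name(fileobj, info):
    """Return the member name stored in the per-file (local) header."""
    fileobj.seek(info.header_offset)
    fields = LOCAL_HEADER.unpack(fileobj.read(LOCAL_HEADER.size))
    if fields[0] != b"PK\x03\x04":
        raise ValueError("bad local file header signature")
    raw_name = fileobj.read(fields[9])  # fields[9] is the file name length
    # General purpose bit 11 (0x800) marks a UTF-8 encoded name.
    encoding = "utf-8" if fields[2] & 0x800 else "cp437"
    return raw_name.decode(encoding)


def name_pairs(fileobj):
    """Return (central directory name, local header name) for each member."""
    with zipfile.ZipFile(fileobj) as zf:
        infos = zf.infolist()
    # orig_filename is the name as read from the central directory.
    return [(info.orig_filename, local_header_name(fileobj, info))
            for info in infos]


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("test/test2.txt", b"hello")
    for central, local in name_pairs(buf):
        print(central, local, central == local)
```

For a well-formed archive the two names agree; for an archive like the attached test.zip they would be expected to differ only in the separator.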
msg92335 - (view) Author: Alan McIntyre (alanmcintyre) (Python committer) Date: 2009-09-06 21:26
Sorry about the confusion--I think I confused myself by looking at the
bit about CRC checksums in the "Info-ZIP Unicode Path Extra Field"
section before I posted.  I meant to say that the central directory name
looks preferred over the per-file header.

In section J, under "file name (Variable)" there's a bit that says:

"If input came from standard input, there is no file name field.  If
encrypting the central directory and general purpose bit flag 13 is set
indicating masking, the file name stored in the Local Header will not be
the actual file name.  A masking value consisting of a unique
hexadecimal value will be stored."

So in these cases the central directory name has to be used.  And, as
you pointed out, some operations like "deleting" a member from the
archive are implemented by editing the central directory, so it would
seem that the central directory should be used if there's a conflict.
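
As a small illustration of the appnote passage quoted above, the following sketch tests general purpose bit flag 13 on a member. `MASKED_LOCAL_HEADER` and `local_name_is_masked` are hypothetical names introduced here; `ZipInfo.flag_bits` is the existing attribute holding the general purpose flags. This only detects the flag; it does not implement central directory encryption.

```python
import zipfile

# General purpose bit flag 13: local header values (including the name) are masked.
MASKED_LOCAL_HEADER = 1 << 13


def local_name_is_masked(info):
    """Return True if the member's flags say the local header name is masked."""
    return bool(info.flag_bits & MASKED_LOCAL_HEADER)


if __name__ == "__main__":
    info = zipfile.ZipInfo("test/test2.txt")
    print(local_name_is_masked(info))
```

When the flag is set, comparing the local header name against the central directory name cannot succeed, so the central directory name is the only usable one.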
msg92516 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2009-09-11 18:57
In the case at issue, the file name is the same (contrary to the error
message). The two representations of the *path* are different, but
equivalent. There is no ambiguity: the file should be put in directory
'test' and named 'test2.txt'. So I think zipfile should do what 7zip
does and do just that.

An actual filename difference might be argued differently.
msg92874 - (view) Author: chuck (chuck) Date: 2009-09-19 17:01
I added a patch to replace backslashes with forward slashes in three 
places, only one of them actually relevant to the errors in the attached 
.zip file.

I kept the exception for mismatching filenames, but if you think it is 
appropriate to remove it I could do that as well.
msg116384 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-14 11:26
I agree with the change, but the code should be factored out into a function (normalize_filename, for example).
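
A minimal sketch of what such a helper could look like, assuming it only needs to map backslashes to the forward slashes the ZIP specification prescribes. This is not the code from the attached patch; `normalize_filename` takes its name from the suggestion above, and `same_member_path` is a hypothetical companion for comparing two stored names.

```python
def normalize_filename(name):
    """Return the member name with backslashes replaced by forward slashes."""
    return name.replace("\\", "/")


def same_member_path(directory_name, header_name):
    """Compare two stored names, ignoring which slash is used as separator."""
    return normalize_filename(directory_name) == normalize_filename(header_name)


if __name__ == "__main__":
    # The two names from the reported error message.
    print(same_member_path("test\\test2.txt", "test/test2.txt"))
```

With a single helper, the three call sites touched by the patch would share one definition of what counts as an equivalent path, while a genuinely different file name would still compare unequal.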
msg116385 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-09-14 11:35
I'd prefer if the code no longer checked if the filename in the directory matches the name in the per-file header.

The reason is that the two don't have to match: it is relatively cheap to rename a file in the zipfile by rewriting the directory, while rewriting the entire zipfile can be pretty expensive when zipfiles get large.

It's probably worthwhile to test what other zipfile tools do in this respect (e.g., create a zipfile where the filename in the header doesn't match the name in the directory and extract that zip using a number of popular tools).


(I have a slightly odd perspective on this because I regularly deal with zipfiles containing over 100K files and over 10GByte of data).
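
One way to produce such a test archive is sketched below, under stated assumptions: it writes a single stored member with the standard `zipfile` module, locates the central directory through the end-of-central-directory record, and swaps the separator in the central directory name only. `make_mismatched_zip` is a hypothetical helper; the replacement keeps the name length unchanged so no offsets need fixing, and the archive is assumed to have no comment, which puts the 22-byte end record at the very end.

```python
import io
import struct
import zipfile


def make_mismatched_zip(name="test/test2.txt", payload=b"hello"):
    """Build ZIP bytes whose central directory name uses backslashes
    while the per-file header keeps forward slashes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    data = buf.getvalue()

    # End of central directory record: 22 bytes when there is no comment.
    eocd = struct.unpack("<4s4H2LH", data[-22:])
    if eocd[0] != b"PK\x05\x06":
        raise ValueError("end of central directory record not found")
    cd_offset = eocd[6]  # offset of the start of the central directory

    local_part, central_part = data[:cd_offset], data[cd_offset:]
    old = name.encode("ascii")
    new = old.replace(b"/", b"\\")  # same length, so offsets stay valid
    return local_part + central_part.replace(old, new, 1)


if __name__ == "__main__":
    with open("mismatched.zip", "wb") as f:
        f.write(make_mismatched_zip())
```

The resulting file can then be handed to each tool under test to see whether it extracts the member, and under which name.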
History
Date User Action Args
2010-09-14 11:35:01 ronaldoussoren set messages: + msg116385
2010-09-14 11:26:24 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg116384
2010-08-06 15:35:16 tim.golden set assignee: tim.golden; nosy: + tim.golden
2009-09-19 17:01:06 chuck set files: + zlib_forward_slash.patch; nosy: + chuck; messages: + msg92874; keywords: + patch
2009-09-11 21:01:53 amaury.forgeotdarc set stage: needs patch
2009-09-11 18:57:17 terry.reedy set nosy: + terry.reedy; messages: + msg92516
2009-09-06 21:26:51 alanmcintyre set messages: + msg92335
2009-09-06 20:41:54 ronaldoussoren set messages: + msg92330
2009-09-06 18:58:40 alanmcintyre set nosy: + alanmcintyre; messages: + msg92326
2009-09-06 12:40:05 ronaldoussoren set nosy: + ronaldoussoren; messages: + msg92309
2009-09-06 04:02:01 NewerCookie set messages: + msg92297
2009-09-04 19:57:57 NewerCookie set messages: + msg92265
2009-09-04 19:56:50 NewerCookie create