This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: ZipFile.namelist() does not match the actual files in .zip file
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: alanmcintyre, andrei.avk, longavailable, serhiy.storchaka, twouters
Priority: normal Keywords:

Created on 2020-06-24 12:19 by longavailable, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (17)
msg372247 - (view) Author: Xiaolong Liu (longavailable) Date: 2020-06-24 12:19
I used the zipfile module to archive thousands of .geojson files into zip files, and accessed those .geojson files with the ZipFile.open() method. Across hundreds of runs, one archive turned out abnormal.
As the title says, ZipFile.namelist() did not match all the files in the .zip file. When I extracted it with the extractall() method, I got only the files included in the namelist. When I extracted it with my compression software (360zip), I got the other files, the ones not included in namelist(). Only one file (2654.geojson) appeared with both extraction methods.
The ZipFile.extractall() method got 674 files, from '2654.geojson' to '3989.geojson'.
360zip got 1399 files, from '0000.geojson' to '2654.geojson'.
The abnormal file is too big to upload to this page, so I uploaded it to Google Drive:
https://drive.google.com/file/d/1UE2N2qwjn4m7uE6YF2A1FhdXYHP_7zQr/view?usp=sharing
msg399590 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-14 12:49
Xiaolong: The file no longer exists on Google Drive. How big was the file?
msg399706 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-17 01:48
Dear Andrei,

It's about 63MB. I reshared it on Google Drive. Please check the following link.
https://drive.google.com/file/d/1534MdIcGbXtMwYfuo2zeFxm6BVgHa4XX/view?usp=sharing 

Bruce / Xiaolong Liu / 刘小龙

msg400115 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-23 02:05
Liu: Thanks!

Google zip also shows 674 files in this archive, so it appears that it's not a Python-only issue. Can you test it with a few zip programs, ideally on two or more platforms, and check which of them report more than 674 files?

(I simply opened the Google Drive link, and Google showed the listing and a total of 674 files.)
msg400116 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-23 02:07
Liu: also please delete the quoted text when replying via email, otherwise it's hard to read on the issue page; much thanks!
msg400117 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-23 03:29
Andrei: Exactly, different extraction tools gave different results.

Files   Tool                          Platform
674     the built-in tool on Win10    Win10
674     winrar                        Win10
1399    7zip                          Win10
1399    360zip                        Win10
674     unzip                         Ubuntu 20.10
1399    7zip                          Ubuntu 20.10
msg400118 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-23 04:30
Liu: 

The built-in macOS archive tool also reports 674.

Since the Google archiver and the built-in tools on all major platforms report 674 files, it looks like 7zip might be the outlier here (note also that 7zip has its own format).

Unless it can be shown that Python and all of the archivers showing 674 have a common bug, I think there's not much we can do, as Python seems to be reading the archive in a de facto standard way.
msg400120 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-23 05:16
Andrei: 

The zipped file was created by Python's zipfile module. That's the reason I posted it here.

I archived more than 2000 files into the abnormal zipped file, and no tool can extract all of the files archived in it. 7zip got the first part, and the other tools got the rest.
msg400121 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-08-23 05:32
It is not a bug in the archivers; it is just a broken archive. A ZIP archive contains a central directory that lists all files in the archive, so an archiver can get the list of files by reading just the central directory, without reading the whole archive. There are also local headers, one for every file, that contain the file names. In 000.zip the central directory does not match the local headers: it lists only part of the files. It is not the fault of the archivers that they trust the central directory, because that is the purpose of the central directory.

There are specialized tools that can restore files missing from the central directory, much as there are tools that can restore recently deleted files on a disk. But that is not a task for general archivers.
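
To illustrate the difference between the central directory and the local headers, here is a minimal sketch, not a recovery tool. It lists the names that zipfile reads from the central directory and compares them with the names found by scanning the raw bytes for local file header signatures. The helper names `local_header_names` and `make_archive` are hypothetical, and joining two archives end to end is only a convenient way to fabricate a mismatch for the demo; it is not a claim about how 000.zip was produced.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # signature that starts every local file header
LOCAL_HEADER_SIZE = 30            # fixed-size part of a local file header


def local_header_names(data):
    """Return the file names found by scanning raw bytes for local headers.

    This ignores the central directory. The signature can also occur by
    chance inside compressed data, so treat the result as a hint only.
    """
    names = []
    pos = data.find(LOCAL_HEADER_SIG)
    while pos != -1:
        if pos + LOCAL_HEADER_SIZE <= len(data):
            # The 2-byte little-endian file name length sits at offset 26.
            (name_len,) = struct.unpack_from("<H", data, pos + 26)
            start = pos + LOCAL_HEADER_SIZE
            raw_name = data[start:start + name_len]
            names.append(raw_name.decode("utf-8", errors="replace"))
        pos = data.find(LOCAL_HEADER_SIG, pos + 4)
    return names


def make_archive(member_name):
    """Build a one-file, uncompressed archive in memory (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr(member_name, "{}")
    return buffer.getvalue()


# Two complete archives back to back: only the second central directory,
# the one at the end of the data, is found by readers that trust it.
data = make_archive("0000.geojson") + make_archive("0001.geojson")

with zipfile.ZipFile(io.BytesIO(data)) as archive:
    listed = archive.namelist()

scanned = local_header_names(data)
print("central directory:", listed)
print("local headers:    ", scanned)
```

A tool that walks local headers from the start of the file and a tool that reads the central directory at the end can therefore report different file lists for the same bytes.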
msg400122 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-08-23 05:34
If the zipped file was created by Python's zipfile module, it is a bug. Could you reproduce the sequence of actions that creates such a broken archive?
msg400140 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-23 15:08
Serhiy: Thanks for your explanation. 

The file was created by the zipfile module. I used the script hundreds of times, and only once (the uploaded zipped file) was the result abnormal.

Since the project ended a long time ago, I cannot reproduce the error right now. I will post the related code snippet I used.

import zipfile
import pathlib

def fileIsValid(filename):
    filename = pathlib.Path(filename)
    return True if filename.is_file() and filename.stat().st_size > 0 else False

def compress2zip(sourceFile,zipFile,destinationFile):
    if not fileIsValid(zipFile):
        pathlib.Path(zipFile).parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zipFile,'w',compression=zipfile.ZIP_DEFLATED) as myzip:
            myzip.write(sourceFile,destinationFile)
            dest_size = myzip.getinfo(destinationFile).file_size
    else:
        with zipfile.ZipFile(zipFile,'a',compression=zipfile.ZIP_DEFLATED) as myzip:
            if not destinationFile in myzip.namelist():
                myzip.write(sourceFile,destinationFile)
            dest_size = myzip.getinfo(destinationFile).file_size
    source_size = pathlib.Path(sourceFile).stat().st_size
    if source_size == dest_size:
        print('Succeeded -- compress -- %s' % str(destinationFile))
        return True
    else:
        print('Failed -- compress -- %s' %str(destinationFile))
        return False

files = list(pathlib.Path('000-original').glob('*.geojson'))
zipFile = pathlib.Path('0000.zip')
for file in files:
    comp_re = compress2zip(file, zipFile, file.name)
msg400142 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-23 15:34
The message box can strip indentation, so I also shared a formatted copy of the code snippet here:

https://www.online-python.com/cDojvl2CMS
msg400176 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-23 20:32
Liu: apologies for the confusion, I missed that when coming back to the thread for the 2nd time.

I took a look at the code, and one potential issue I see is that you open the archive in 'w' mode if its stat size is 0 (or if the file isn't valid), and in 'a' mode otherwise. It's better to always use 'a' mode, since the file is then created if it doesn't exist.

Because of buffering, this may lead to the file being opened in 'w' mode after some files were already meant to have been written.

A second issue is that you're closing and reopening the archive for each file written. I'm not sure if you had a reason for that, but it's better to open the archive once and write all of the files if you have them ready.

It's hard to say exactly what led to this incomplete archive without a reproducible way to create it.
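
The two suggestions above (always use 'a' mode, and open the archive once for all files) could look roughly like the sketch below. The function name `compress_all` and its parameters are hypothetical, and this is an illustration of the suggestion rather than a confirmed fix for the incomplete archive.

```python
import pathlib
import zipfile


def compress_all(source_dir, zip_path, pattern="*.geojson"):
    """Add every matching file to one archive, opening the archive once.

    Mode 'a' creates the archive if it does not exist yet and appends to
    it otherwise, so no separate 'w' branch is needed.
    """
    source_dir = pathlib.Path(source_dir)
    zip_path = pathlib.Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)

    added = []
    with zipfile.ZipFile(zip_path, "a", compression=zipfile.ZIP_DEFLATED) as archive:
        existing = set(archive.namelist())  # names already in the archive
        for source in sorted(source_dir.glob(pattern)):
            if source.name in existing:
                continue  # skip files that were archived earlier
            archive.write(source, source.name)
            existing.add(source.name)
            added.append(source.name)
    # The central directory is written once, when the with block exits.
    return added
```

For example, `compress_all('000-original', '0000.zip')` would return the names of the files it added on that call, and an empty list if everything was already in the archive.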
msg400190 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-24 01:35
Andrei: Never mind. 

Yes. It is nearly impossible to sort out a problem when it cannot be reproduced. Please just close it.

Anyway, many thanks for your suggestions.
msg400191 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-24 01:57
Liu: Thanks for the report!

By the way, on re-reading my message: by opening and closing the archive for each file with the `with` block, you of course already take care of the first issue I mentioned (although it's not very efficient).

And I assume there was no multiprocessing / multithreading involved in creating the zip files? (I know it's not in the code you posted, but just to be sure?)

I agree with closing this until someone else runs into it and is able to reproduce reliably.
msg400197 - (view) Author: Xiaolong Liu (longavailable) Date: 2021-08-24 04:23
Andrei: No multiprocessing or multithreading was used when creating the zip file.
msg407344 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-11-30 00:39
Closing by request of OP.
History
Date                 User              Action  Args
2022-04-11 14:59:32  admin             set     github: 85274
2021-11-30 00:39:09  andrei.avk        set     status: open -> closed; resolution: not a bug; messages: + msg407344; stage: resolved
2021-08-24 04:23:33  longavailable     set     messages: + msg400197
2021-08-24 01:57:10  andrei.avk        set     messages: + msg400191
2021-08-24 01:35:15  longavailable     set     messages: + msg400190
2021-08-23 20:32:45  andrei.avk        set     messages: + msg400176
2021-08-23 15:34:49  longavailable     set     messages: + msg400142
2021-08-23 15:08:11  longavailable     set     messages: + msg400140
2021-08-23 05:34:57  serhiy.storchaka  set     messages: + msg400122
2021-08-23 05:32:38  serhiy.storchaka  set     messages: + msg400121
2021-08-23 05:16:20  longavailable     set     messages: + msg400120
2021-08-23 04:30:55  andrei.avk        set     messages: + msg400118
2021-08-23 03:29:01  longavailable     set     messages: + msg400117
2021-08-23 02:07:36  andrei.avk        set     messages: + msg400116
2021-08-23 02:05:06  andrei.avk        set     messages: + msg400115
2021-08-17 01:48:47  longavailable     set     messages: + msg399706
2021-08-14 12:49:46  andrei.avk        set     nosy: + andrei.avk; messages: + msg399590
2020-06-24 12:19:05  longavailable     create