Message 376862 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	danieljewell
Recipients	Arfrever, benjamin.peterson, danieljewell, eric.snow, flox, georg.brandl, gregory.p.smith, iMath, larry, ncoghlan, python-dev, serhiy.storchaka, stutzbach, superluser, tulir
Date	2020-09-14.01:44:07
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1600047847.83.0.682392282365.issue19081@roundup.psfhosted.org>
In-reply-to

Content
In playing with Lib/zipfile.py and Lib/zipimport.py, I noticed that zipfile has supported opportunistic loading of bz2/lzma for ~9 years. However, zipimport assumes only zlib will be used. (Yet, zipfile.PyZipFile will happily create zlib/bz2/lzma ZIP archives ... zipfile.PyZipFile('mod.zip', 'w', compression=zipfile.ZIP_LZMA) for example.) At first I wondered why zipimport essentially duplicates a lot from zipfile but then realized (after reading some of the commit messages around the pure-python rewrite of zipimport a few years ago) that since zipimport is called as part of startup, there's a need to avoid importing certain modules. I'm wondering if this specific issue with zipimport is possibly more of an indicator of a larger issue? Specifically: * The duplication of code between zipfile and zipimport seems like a potential source of bugs - I get the rationale but perhaps the "base" ZIP functionality ought to be refactored out of both zipimport and zipfile so they can share... And I mean the low-level stuff (compressor, checksum, etc.). Zipfile definitely imports more than zipimport but I haven't looked at what those imports are doing extensively. Ultimately: the behavior of the new ZipImport appears to be, essentially, the same as zipimport.c: Per PEP-302 [https://www.python.org/dev/peps/pep-0302/], zipimport.zipimporter gets registered into sys.path_hooks. When you import anything in a zip file, all of the paths get cached into sys.path_importer_cache as zipimport.zipimporter objects. The zipimporter objects, when instantiated, run zipimport._read_directory() which returns a low level dict with each key being a filename (module) and each value being a tuple of low-level metadata about that file including the byte offset into the zip file, time last modified, CRC, etc. (See zipimport.py:330 or so). This is then stored in zipimporter._files. Critically, the contents of the zip file are not decompressed at this stage: only the metadata of what is in the zip file and (most importantly) where it is, is stored in memory: only when a module is actually called for loading is the data read utilizing the cached metadata. There appears to be no provision for (a) verifying that the zip file itself hasn't changed or (b) refreshing the metadata. So it's no surprise really that this error is happening: the cached file contents metadata instructs zipimporter to decompress a specific byte offset in the zip file when an import is called. If the zip file changes on disk between the metadata scan (e.g. first read of the zip file) and actual loading, bam: error. There appear to be several ways to fix this ... I'm not sure of the best: * Possibly lock the ZIP file on first import so it doesn't change (this presents many new issues) * Rescan the ZIP before each import; but the point of caching the contents appears to be the avoidance of this * Hash the entire file and compare (expensive CPU-wise) * Rely on modified time? e.g. cache the whole zip modified time at read and then if that's not the same invalidate the cache and rescan * Cache the entire zip file into memory at first load - this has some advantages (can store the ZIP data compressed; would make the import all or nothing; faster?) But then there would need to be some kind of variable to limit the size/total size - it becomes a memory hog...

In playing with Lib/zipfile.py and Lib/zipimport.py, I noticed that zipfile has supported opportunistic loading of bz2/lzma for ~9 years. However, zipimport assumes only zlib will be used. (Yet, zipfile.PyZipFile will happily create zlib/bz2/lzma ZIP archives ... zipfile.PyZipFile('mod.zip', 'w', compression=zipfile.ZIP_LZMA) for example.)

At first I wondered why zipimport essentially duplicates a lot from zipfile but then realized (after reading some of the commit messages around the pure-python rewrite of zipimport a few years ago) that since zipimport is called as part of startup, there's a need to avoid importing certain modules. 

I'm wondering if this specific issue with zipimport is possibly more of an indicator of a larger issue? 

Specifically:

* The duplication of code between zipfile and zipimport seems like a potential source of bugs - I get the rationale but perhaps the "base" ZIP functionality ought to be refactored out of both zipimport and zipfile so they can share... And I mean the low-level stuff (compressor, checksum, etc.). Zipfile definitely imports more than zipimport but I haven't looked at what those imports are doing extensively. 

Ultimately: the behavior of the new ZipImport appears to be, essentially, the same as zipimport.c:

Per PEP-302 [https://www.python.org/dev/peps/pep-0302/], zipimport.zipimporter gets registered into sys.path_hooks. When you import anything in a zip file, all of the paths get cached into sys.path_importer_cache as zipimport.zipimporter objects.

The zipimporter objects, when instantiated, run zipimport._read_directory() which returns a low level dict with each key being a filename (module) and each value being a tuple of low-level metadata about that file including the byte offset into the zip file, time last modified, CRC, etc. (See zipimport.py:330 or so). This is then stored in zipimporter._files.

Critically, the contents of the zip file are not decompressed at this stage: only the metadata of what is in the zip file and (most importantly) where it is, is stored in memory: only when a module is actually called for loading is the data read utilizing the cached metadata. There appears to be no provision for (a) verifying that the zip file itself hasn't changed or (b) refreshing the metadata. So it's no surprise really that this error is happening: the cached file contents metadata instructs zipimporter to decompress a specific byte offset in the zip file *when an import is called*. If the zip file changes on disk between the metadata scan (e.g. first read of the zip file) and actual loading, bam: error.

There appear to be several ways to fix this ... I'm not sure of the best:

* Possibly lock the ZIP file on first import so it doesn't change (this presents many new issues)
* Rescan the ZIP before each import; but the point of caching the contents appears to be the avoidance of this
* Hash the entire file and compare (expensive CPU-wise)
* Rely on modified time? e.g. cache the whole zip modified time at read and then if that's not the same invalidate the cache and rescan
* Cache the entire zip file into memory at first load - this has some advantages (can store the ZIP data compressed; would make the import all or nothing; faster?) But then there would need to be some kind of variable to limit the size/total size - it becomes a memory hog...

History
Date	User	Action	Args
2020-09-14 01:44:07	danieljewell	set	recipients: + danieljewell, georg.brandl, gregory.p.smith, ncoghlan, larry, benjamin.peterson, stutzbach, Arfrever, flox, python-dev, eric.snow, serhiy.storchaka, superluser, iMath, tulir
2020-09-14 01:44:07	danieljewell	set	messageid: <1600047847.83.0.682392282365.issue19081@roundup.psfhosted.org>
2020-09-14 01:44:07	danieljewell	link	issue19081 messages
2020-09-14 01:44:07	danieljewell	create