Issue 652586: New import hooks + Import from Zip files

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37602

classification

Title:	New import hooks + Import from Zip files
Type:		Stage:
Components:	Interpreter Core	Versions:	Python 2.3

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	ahlstromjc, gvanrossum, jvr, loewis, nnorwitz, paul.moore
Priority:	normal	Keywords:	patch

Created on 2002-12-12 10:34 by jvr, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
importhooks_zipimport.patch	jvr, 2002-12-23 22:31	importhooks & zipimport #11

Messages (34)
msg41948 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 10:34
This patch implements two things: - a new set of import hooks, modelled after iu.py - builtin support for imports from Zip archives (a competing implementation for PEP 273) The new set of hooks probably need a better document explaining them (perhaps a PEP). My motivations have been posted to python-dev. Here's a brief description. Three new objects are added to the sys module: - path_hooks - path_importer_cache - meta_path sys.path_hooks is a list of callable objects that take a string as their only argument. A hook will be called with a sys.path or pkg.__path__ item. It should return an "importer" object (see below), or raise ImportError or return None if it can't deal with the path item. By default, sys.path_hooks only contains the zipimporter type, if the zipimport module is available. sys.path_importer_cache is a dict that caches the results of sys.path_hooks to avoid repeated hook lookups. sys.meta_path is a list of importer objects that are invoked before the builtin import mechanism kicks in. This allows overriding of builtin module and frozen module import, but the main feature is that it allows importer objects without a corresponding sys.path item (just like builtin and frozen modules). Importer objects must conform to the following protocol: i.find_module(fullname) -> None or an importer object i.load_module(fullname) -> the imported module (or raise ImportError) The 'fullname' is always the fully qualified module name, ie. a dotted name for a submodule. This patch adds one more feature: a sys.path item may itself be an importer object. This is convenient for experimentation, but using it may break third-party code that assumes sys.path contains only strings.
msg41949 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 10:37
Logged In: YES user_id=92689 btw. the attached file contains a patch for various files as well as a new file: zipimporter.c. Place the latter in the Modules/ directory.
msg41950 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-12-12 11:15
Logged In: YES user_id=21627 Some comments: - zipimport should normally be built as a builtin module, there should also be a patch for the Windows build procedure - zipimporters need to support garbage collection, as they can occur in cycles - there should be a mechanism to prevent a stack overflow in case of recursive import of zlib. - get_long needs to accept an unsigned char*
msg41951 - (view)	Author: Paul Moore (paul.moore) *	Date: 2002-12-12 11:23
Logged In: YES user_id=113328 I can provide details on how to patch the Windows build process - it's not hard. The question of whether to have zipimport as builtin or dynamic should be resolved first (the processes are different in the 2 cases). It should be pointed out that if zipimport is builtin, it still relies on a dynamic module (zlib) for importing from compressed zips. I don't know whether this affects the decision, though...
msg41952 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 12:35
Logged In: YES user_id=92689 Martin: - Have you actually seen a recursive import of zlib? I don't see how it's possible to cause a stackoverflow. Hm, maybe if someone stuffs a zlib.py in a zip archive? I'll add a guard. - Are you really saying it should be a builtin module, or was that comment supposed to start with "if"? See also Paul's remark about zlib: if zlib remains a shared lib, having zipimport as a builtin only helps for non-compressed archives. - gc: I'll give this a go (never done it before!) - get_long() -> it's called with a signed char *, and the buf[i] & 0xff should take care of the rest, so I'm not sure I understand. But trust your judgement and changed it in my working copy. Thank you very much for the quick reply!
msg41953 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 12:39
Logged In: YES user_id=92689 Hm, regarding gc again: zipimporter objects can only theoretically be involved in cycles, and only if people muck with the "files" attribute. As it is, the "files" dict only contains strings (keys) and tuples (values) which contain strings and ints. So is it really neccesary?
msg41954 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-12-12 13:16
Logged In: YES user_id=21627 recursive imports: having foo.zip as the first thing in sys.path, and having a compressed zlib.py in there was indeed the case I was thinking of (without actually trying). GC: It is absolutely necessary. If this is not done now, somebody will spend days of research five years from now, because somebody else thought that invoking .update on this files attribute was a good idea. This could be reduced to C- level documentation if the dictionary was not exposed to Python. builtin: I think it ought to be builtin. It's a small module, it does not rely on additional libraries, it is not maintained externally, and it reduces bootstrap dependencies to have it builtin. OTOH, zlib can't be builtin, as it relies on an additional library which may not always be present. get_long: you are right; the 0xff does make the values positive, again. I somehow thought the result might still be negative.
msg41955 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 14:05
Logged In: YES user_id=92689 I've attached test_zipimport.py, a simple test script, to be placed in Lib/test/.
msg41956 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2002-12-12 14:28
Logged In: YES user_id=33168 Just, in case you didn't know, you can do a cvs diff -N to include new/removed files in a patch. I think you need to do a cvs add/remove before -N works though.
msg41957 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 16:34
Logged In: YES user_id=92689 I didn't know, thanks! I actually did find -N but it "didn't work" ;-) I've attached a new patch (slightly updated as well) as one file. Also included is Lib/test/output/test_zipimport.
msg41958 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 19:34
Logged In: YES user_id=92689 I've uploaded a fresh version, with GC support added. Tested to the extent that it doesn't crash ;-)
msg41959 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 21:59
Logged In: YES user_id=92689 I've uploaded a slightly updated version: the potential zlib recursion has been fixed (I was in fact able to trigger it); I've added a test for it, although it's tricky to do. Also cleaned up a few exception messages in zipimport.
msg41960 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-12 23:19
Logged In: YES user_id=92689 Yet another new version: - zipimport is a builtin module now; includes patches for PC\config.c and PBbuild\pythoncore.dsp, contributed and tested by Paul Moore. - tweaked doc strings in zipimport. I'll quit for today...
msg41961 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-12-13 17:52
Logged In: YES user_id=6380 Thanks -- I'm reviewing this now (if not too many distracting email arrives :-).
msg41962 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-13 19:29
Logged In: YES user_id=92689 I've attached a new version that fixes a problem that Thomas Heller noticed: package imports failed when the zip archive didn't contain a separate entry for the package directory.
msg41963 - (view)	Author: James C. Ahlstrom (ahlstromjc)	Date: 2002-12-13 19:55
Logged In: YES user_id=64929 The PEP 273 addition of the standard zip archive name /usr/local/lib/python23.zip (or Windows name) to sys.path to support import of the Python standard lib (including site.py) from a zip archive is missing. That was the point of my patch to site.py, getpath.c, getpathp.c, main.c, pythonrun.c, and sysmodule.c. I believe those patches can be used with this import.c if you want the feature.
msg41964 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-12-13 20:02
Logged In: YES user_id=6380 Just: it looks like your latest version requires that the entries for package directories are present. Is that really a good idea?
msg41965 - (view)	Author: Paul Moore (paul.moore) *	Date: 2002-12-13 21:20
Logged In: YES user_id=113328 I'd say it's pretty much guaranteed not to be a good idea. I am fairly certain that at least some Windows zip utilities fail to have a separate zip entry for the directory itself.
msg41966 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-13 22:44
Logged In: YES user_id=92689 The previous version requires it and failed if they weren't, the present version still requires it, but missing entries are added in the read_directory() function (around line 542). So it's guaranteed they're there if needed. I could have solved it differently at import time, but I thought it be better to do slightly more work at zipdirectory-read time so we have to do less work later. I doubt it matters much in reality, this was just the simplest way to solve the problem... (I'll update the test to reflect the solution of this problem.)
msg41967 - (view)	Author: Paul Moore (paul.moore) *	Date: 2002-12-14 19:39
Logged In: YES user_id=113328 I have some concerns about a design issue. I've been playing about writing my own hooks in Python, and I have found that for most of my tests, the find_module method is simply not useful. I understand that it can be cheaper (sometimes a lot cheaper) to just check for the existence of a module than actually loading up the code, etc, etc. So in those cases, find_module() offers a shortcut if the module isn't found. But if the module is found, and find_module has to do significant work, then all this work has to be repeated at load_module time. There's no mechanism for the result of find_module (say, a pointer to the module in a zip directory, or whatever) to be passed to load_module. So load_module has no alternative than to redo the work (short of the hook implementing some form of internal cache mechanism, which could be quite messy to do). I don't know what the likely uses of hooks will be. So I've no idea if find_module will turn out in practice to be a cheap operation. But I can certainly imagine cases (the hypthetical "load a module from a URL" hook) where find_module could be far from cheap. Is it possible to let find_module be optional? If the hook implements it, let it be used as a quick check for existence. If the hook doesn't implement it, just go for it - call load_module and be prepared for a "can't find it" response. This probably means that more extensive changes to import.c will be needed. But once we expose the design, we're going to be stuck with backward compatibility issues forever, so it's important we get it right now, rather than later. (Although having find_module mandatory initially, but made optional later, is probably less problematic than most other forms of design change...)
msg41968 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-14 20:17
Logged In: YES user_id=92689 Paul, this is the reason why the find_module() method returns an object (and I guess now is a good time to explain this <wink>; it also answers a question from Guido he asked in private mail). zipimporter returns self, but you don't have to do that: you can return anything with a load_module() method. So all state you gather during find_module() you can simply pack in a different object and return that. Something like this class MyHook: ... def find_module(self, fullname): info = self._find_module(fullname) if info is None: return None return Loader(info) class Loader: def __init__(self, info): self.info = info def load_module(self, fullname): ...load module using self.info... A good showcase for this pattern would be a wrapper for the builtin import mechanism: it could stuff the (file, filename, (suffix, mode, type)) tuple that imp.find_module() returns in a loader object, whose load_module() method would simply call imp.load_module() with that stuff. (I think such a wrapper is needed in import.c if we want to properly expose the new hooks in the imp module.) You can see the importer protocol as a further abstraction (and OO-ification) of the current imp.find_module()/imp.load_module() calls.
msg41969 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-14 20:46
Logged In: YES user_id=92689 Btw. it would be nicer if I indeed provided an "importer" wrapper for the builtin mechanism and simply placed it on and invoke it through sys.meta_path. The reason I haven't done that yet is that A) this would change import.c even more (although it would be better factored then) and B) it would have _some_ impact on the performance of the builtin import mechanism, as all imports would then pass through yet another Python call. But I hope to play with this a bit later next week. It could also be a project for 2.4. Note that this is exactly what Gordon does in iu.py, except he goes even further: he has separate importer objects on the metapath for builtin modules, frozen modules and file system imports. I think that's fantastic, but it would be more controversial as it might affect performance and might be harder to integrate with the current imp module.
msg41970 - (view)	Author: Paul Moore (paul.moore) *	Date: 2002-12-14 21:52
Logged In: YES user_id=113328 Thanks for the explanation of find_module. That sounds very useful. On a separate note, why do you expose sys.path_importer_cache? I can't imagine any reason why the user would ever touch it manually (other than for interactive experimentation).
msg41971 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-14 22:12
Logged In: YES user_id=92689 If you install a hook on sys.path_hooks and want it to affect sys.path items that may already have been used you'd need to clear sys.path_importer_cache. Say if you install a directory caching hook after site.py has been imported. Tricky, but possibly not uncommon. But at the very least the test script needs it <wink>. I think it's on a similar level as sys.modules: you normally don't need it, except when you're mucking with custom import hooks... (I know, that's not an entirely fair comparison, but still.)
msg41972 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-15 14:28
Logged In: YES user_id=92689 Uploaded a new version. Many changes, but most are fairly small. Addressed a fairly long list of comments from Guido. General: - lots more comments, especially for functions. - wrapped long lines - site.py: doesn't check for existence in the file system at all anymore zipimport.c: - fixed a really dumb bug that prevented zipimporter objects be found in the cache; this improved performance a bit in the presence of packages. (On my box, zipimports are now always faster than file system imports, but the difference is much better for uncompressed zip files.) - removed zipimporter_new (PyType_GenericNew is all we need) - replaced all PyMapping_* calls with PyDict_* calls - reworked module/package finding code: now really doesn't depend on dir entries to be present in the archive. Basically went back to how ahlstromjc did it, which makes much more sense to me now. - fixed some refcount bugs - fixed a potential buffer overrun - reduced the number of fseek() calls somewhat (XXX there's an unused and unfinished replacement for read_directory() in zipimport.c that reads the entire directory into memory in one gulp. This avoids more fseek() calls and avoids many PyMarshal_Read* calls. Nevertheless I only managed to shave off about 10% of dir-reading time, which I think makes it not worth the hassle. If you think this may have more effect on your platform: please play!)
msg41973 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-16 18:41
Logged In: YES user_id=92689 Uploaded new patch; no code changes, but a new test script: test_importhooks.py (uses unittest). Rewrote test_zipimport.py to also use unittest. test_importhooks.py contains some examples of trivial importer objects as well as an importer wrapper for imp.find_module/imp.load_module calls. Also contains some notes about the role and constraints of pkg.__path__.
msg41974 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-17 22:28
Logged In: YES user_id=92689 Fresh patch. Added zipimporter.get_data(path) method: returns the (uncompressed) data for a file as a string. Implemented Wiktor Sadowski's suggestion: a module loaded with zipimport will have an __importer__ attribute, which is set to the zipimporter instance that did the load. Possible quirk: for a package __init__.py file, this is the zipimporter that handles the directory containing the package dir, so __importer__.get_data(filename) looks for filename in that directory, not the package directory. An __importer__ of an actual submodule does look in the package directory itself. (Btw. I did this without adding a new PyImport_ExecCodeModuleEx2 API.)
msg41975 - (view)	Author: Paul Moore (paul.moore) *	Date: 2002-12-18 09:26
Logged In: YES user_id=113328 Be aware that this doesn't solve all of the issues (although it does solve a large class of them). Some applications require a real filename (wxPython did until recently for image data) and some need to stat the file (Guido quoted Zope). At the very least it should be made clear that this is an optional part of the interface. Application code should always be prepared for an importer not to support get_data (if only by saying it can't work from a data store managed by this sort of hook). My preference is for keeping the importer interface minimal and clean. At the moment 2 functions (find_module and get_module) are all that is required - and these can be in separate classes. Which class should have the get_data method? Which class gets assigned to __importer__? Is it the hook's responsibility to set __importer__ (yet another responsibility)? I think it's time for a PEP. Not for zipimporter, but purely for the import hook interface. That would be the place to argue such issues as I raised in the paragraph above. We need to remain clear on what the API requires vs what zipimporter does, to give writers of new hooks a decent chance.
msg41976 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-18 09:47
Logged In: YES user_id=92689 I totally agree that get_data() should be an optional part of the protocol. I consider it an experimental addition right now. Quick thoughts: __importer__ should be set to the object with the load_module() method. Yes, load_module() is responsible for setting it. It has to be, as it is load_module() that actually runs the code, and __import__ must be available to the running code. PEP: Yes. I started on it last night, doing some work on it as we speak.
msg41977 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2002-12-18 13:31
Logged In: YES user_id=6380 OTOH I will certainly be able to work around the lacking stat()ability; but I need to access the data. I'm occupied with urgent Zope Corp work this week, so I don't have time to review more of this code, alas. But I recommend that Just incorporate the changes that Jim Ahlstrom suggested (from Jim's patch) to allow the standard library to be a zip archive (adding a standard zip file on the path). If after that you still have time and energy, I'd like the search for builtins and for frozen modules to be turned into meta-importers (or whatever you call them) that are placed on the meta-importers hook list. That would be a nice refactoring of import.c!
msg41978 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-18 14:54
Logged In: YES user_id=92689 Std zip entry on sys.path: agreed, I've been meaning to look at that for a while now... Refactoring: would be nice (but needs a third item: PathImporter), but it's quite a bit of extra work and makes it harder (I think) to keep imp.find_module/load_module compatible. I will have a closer look, but I tend to think this is a better project for 2.4. I currently prefer working on a PEP, if only to increase the chance my work hasn't been a waste of time ;-) Btw. zi.get_data() currently raises ZipImporterError if a file is not found. IOError is probably better. Will also consider list_dir() & list_modules(). (Hm, I really dislike those underscores there, but I also don't want to remove them from find/load_module...)
msg41979 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-19 21:29
Logged In: YES user_id=92689 I added the default sys.path zip file item for unix builds, as PEP 273 prescribes. I've not done the refactoring that the PEP 273 implementation does (I really don't see why), but this means I can't use Jim Ahlstrom's PC/getpathp.c patch as is. I don't develop on Windows, so I wouldn't be able to test. Paul, do you think you can provide a minimal patch for PC/getpathp.c that adds the right thing to the default path? No rush. Other changes: - zipimporter.get_data() now raises IOError instead of ZipImportError if the "file" wasn't found. - removed the (dubious) mechanism that would not add a None value on sys.path_importer_cache for path items that have no hook and are not directories. Needed for the default zip item change: if the default zipfile isn't available it would otherwise keep trying to find a hook for it. This change also means that if you install a hook on sys.path_hooks and you want that hook to handle items that are already on sys.path, you must clear sys.path_importer_cache.
msg41980 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-23 22:31
Logged In: YES user_id=92689 Several changes/fixes. general: - incorporated patch from Paul Moore that adds a default zip archive path to sys.path on Windows (it was already there on unix). Thanks Paul! import.c: - matches latest version of PEP 302 (rev 1.3), regarding the new 'path' argument of importer.find_module(fullname, path=None) zipimporter.c - removed the subdir feature, which allowed the path to the zip archive to be extended with a subdirectory. PEP 273 stated this was needed for package support (and only for that). However, with the new import hooks this is no longer true: a path item containing the plain zip archive path can also deal with submodules (find_module receives the full module name after all). Therefore a pkg.__path__ from a package loaded from a zip archive will contain the plain zip archive path. - as a consequence I could simplify and clean up lots of things (esp. zipimporter_init: eg. it no longer needs to check sys.path_importer_cache; yay). Getting rid of the zipimporter.prefix attribute altogether helped a lot in other places. - this change additionally enabled me to get rid of the restriction that zip paths must end in .ZIP or .zip; any extension (or even no extension) will now work. - implemented all the (optional) extensions of the Importer Protocol that the latest version of PEP 302 defines.
msg41981 - (view)	Author: Just van Rossum (jvr) *	Date: 2002-12-30 22:34
Logged In: YES user_id=92689 Checked in with several changes: - importers on sys.path are no longer allowed - zipimporter was reverted to properly support pkg.__path__ and archive.zip/subdir/ again, this time without caring about the zip file extension. - get_data() semantics adapted to latest version of PEP 302 - minor import.c tweaks from Guido Commit details: Include/pythonrun.h, rev. 2.57 Lib/site.py, rev. 1.47 Lib/test/test_importhooks.py, rev. 1.1 Lib/test/test_zipimport.py, rev. 1.1 Modules/Setup.dist, rev. 1.35 Modules/getpath.c, rev. 1.45 Modules/zipimport.c, rev. 1.1 PC/config.c, rev. 1.37 PC/getpathp.c, rev. 1.29 PCbuild/pythoncore.dsp, rev. 1.40 Python/import.c, rev. 2.215 Python/importdl.h, rev. 2.19 Python/pythonrun.c, rev. 2.172

History
Date	User	Action	Args
2022-04-10 16:05:59	admin	set	github: 37602
2002-12-12 10:34:35	jvr	create