Issue 15031: Split .pyc parsing from module loading

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/59236

classification

Title:	Split .pyc parsing from module loading
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	brett.cannon	Nosy List:	Arfrever, Marc.Abramowitz, Ronan.Lamy, brett.cannon, dino.viehland, eric.snow, jcea, ncoghlan, pjenvey, python-dev
Priority:	normal	Keywords:	patch

Created on 2012-06-07 19:10 by Ronan.Lamy, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
loads_pyc.diff	Ronan.Lamy, 2012-06-07 19:09		review
parse_pyc.diff	brett.cannon, 2012-11-17 02:43		review
issue15031_parse_container_function.diff	ncoghlan, 2012-11-18 10:50	Module level function with private helpers	review

Messages (33)
msg162489 - (view)	Author: Ronan Lamy (Ronan.Lamy) *	Date: 2012-06-07 19:09
I think it would be beneficial to extract the handling of the binary format of bytecode files from the rest of the loader/finder code in importlib._bootstrap. That means exposing, internally at least, functions that do nothing more than convert the binary contents of a .pyc file to/from metadata + code object. They would help in testing and implementing alternate loaders and would prevent the kind of code duplication that led to issue 15030. I'm adding a patch implementing extracting a _loads_pyc() function from _bytes_to_bytecode(). Note that: * _loads_pyc() has only one parameter, instead of 5 for _bytes_from_bytecode. * The raising of exceptions is more consistent: _loads_pyc being an IO helper, it never raises ImportError. Thanks to exception chaining, no information is lost, though. Obviously, this should be complemented with a _dumps_pyc(). Then there's the question of whether these functions should be public and what the best interface is - I actually feel that the natural home of both functions is in a lower level module, either imp or marshal. So, should I get on with it, or are there things I overlooked that make this a bad idea?
msg162517 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-06-08 02:49
So the problem with the function is that had this been implemented in Python 3.2 it already would be outdated thanks to the 3.3 change of storing the file size in the .pyc file, and simply changing the length of the sequence returned would probably break code. I have thought about exposing something like this over the years, and I always come back to the conclusion that it is really touchy and the best you could do is pass in the data and info you have from the source and either raise the proper exception or return the code object/bytes.
msg162535 - (view)	Author: Ronan Lamy (Ronan.Lamy) *	Date: 2012-06-08 16:53
I see. However, I think that breaking code noisily is better than breaking it silently, which is what happens when the same protocol is reimplemented many times. And _loads_pyc() could be made more forward-compatible by returning (path_stats, code) instead of (timestamp, size, code).
msg162543 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-06-08 18:16
I point is that it shouldn't break at all if possible, although path_stats does potentially provide a way to deal with this (as does using keyword-only arguments for what needs to be verified). I'll have to think about this. One issue I have with this is I don't love exposing bytecode files too much as people should not typically muck with bytecode files, and if they do they should really know what they are doing. I honestly view them as an implementation detail and not something for people to play with. I also don't want people to get to into bytecode files as they are not necessarily the best way to store bytecode. E.g. a sqlite DB storing bytecode really shouldn't store all of that info in a single column but instead in separate columns and exposing a function as proposed kind of pushes against that idea.
msg164115 - (view)	Author: Marc Abramowitz (Marc.Abramowitz) *	Date: 2012-06-26 22:50
Another package that inspects pyc files and which also ran into trouble because of the 8 to 12 byte change is distribute. See: https://bitbucket.org/tarek/distribute/issue/283/bdist_egg-issues-with-python-330ax Some kind of abstraction for loading pyc files would be nice.
msg164156 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-06-27 13:52
Why is distribute reading bytecode to begin with? What's the use case, especially considering that using bytecode screws over other Python VMs like Jython and IronPython.
msg164158 - (view)	Author: Marc Abramowitz (Marc.Abramowitz) *	Date: 2012-06-27 14:08
Well, it may be a vestige from setuptools and I don't know if it's still needed/appropriate, but distribute scans the pyc modules to try to see whether stuff is zip_safe or not when you run `python setup.py bdist_egg`: https://bitbucket.org/tarek/distribute/src/da2848d34282/setuptools/command/bdist_egg.py#cl-420 Personally, I always set zip_safe to True as I find zipped distributions to be a pain in the butt, but not all distributions out there always set it and currently `python setup.py bdist_egg` (and probably other commands?) bombs on such packages. We should probably get Tarek to weigh in here.
msg164161 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-06-27 14:37
This all goes back to my original point: I don't want to promote people shipping around bytecode only as it hampers the use of other VMs. Anyway, I'll continue to contemplate this. Any function would have to verify the magic number and flat-out fail otherwise to prevent parsing issues (e.g. Marc's patch works with or without the file size field to stay compatible with PyPycLoader.write_bytecode() implementations that might not set the file size field; a design flaw that I fixed in SourceLoader by taking away the ability to write bytecode). It would also have it return a dict much like path_stats() for easy comparison / verification. But just to be very clear in case someone is hoping, I will NOT support an API to write bytecode as it make compatibility a nightmare and hampers changing the format as necessary.
msg168014 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-08-12 00:39
So the more I think about this, the more I'm willing to do this in Python 3.4. First, though, the docs would need to be updated in importlib to distinguish between bytecode and byte-compiled files. From there, I think importlib.abc.SourceLoader.parse_byte_compiled_file(data) could exist. It would be a classmethod so people can call it directly if they really want. It would verify the magic number, and if that's good, return a dict that can be directly compared against what path_stats() returns plus the bytecode for the module. This should properly shield the format of bytecode from users while still providing an API people can rely on.
msg168457 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-08-17 18:10
While thinking about terminology, source files/modules vs. bytecode files/modules make the distinction clear enough to differentiate just straight bytecode for an object (i.e. what dis outputs) vs. bytecode plsu the other stuff that goes with .pyc files.
msg175634 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-15 19:56
Here is my current plan:: parse_bytecode_file(data, source_stats, source_path) -> code This will take in the bytes from the bytecode file and a stats dict from SourceLoader.path_stats(). The method will parse the bytecode file, verify the magic number, timestamp, and file size. If all of that passes, then the bytecode will be compiled into a code object (with its source path fixed if necessary based on source_path) and returned. All arguments are required as you should explicitly state you want to skip sanity checks of the bytecode and that you don't want to fix the file path.
msg175636 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-15 19:58
Might name it compile_bytecode_file() instead.
msg175661 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-11-16 00:27
Perhaps go a bit more generic and call it something like "load_cached_data"?
msg175664 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-16 02:30
You can't use "data" as there are connotations for that word thanks to get_data(). Same goes for "load" since this is something that get_code() would use more directly than load_module().
msg175705 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-16 19:08
In order to have exceptions that have messages like "bad magic number in module" there would need to be a technically unneeded fullname parameter. People cool with that? I personally dislike having info passed in just for error reporting, but in this case import exceptions typically try to clarify what actually caused the issue by naming the module since you can end up with deep implicit module chains. The other stuff related to bytecode paths are just for verbose logging or for setting path on ImportError, both of which can be worked around.
msg175711 - (view)	Author: Eric Snow (eric.snow) *	Date: 2012-11-16 20:01
Don't underestimate the potential value of having the fullname when overriding the method in a subclass.
msg175721 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-17 02:43
Attached is a patch that introduces _LoaderBasics.parse_bytecode_file(). It takes in basically what _bytes_from_bytecode() did plus a keyword-only boolean flag for exception suppression as that's needed to ignore stale bytecode when source is available. I still need to write specific tests for the method, but test.test_importlib passes.
msg175753 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-17 15:15
Another name for the method I thought of last night was parse_bytecode_container(). People like that more? That would lead to new terminology to distinguish bytecode ala dis module compared to bytecode file format ala importlib that makes it clear that what is written to bytecode files does not have to be an actual file.
msg175829 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-11-17 23:38
"parse_cache_contents" is my last proposed colour for the shed, but I can live with "parse_bytecode_container"
msg175839 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-18 01:22
I can go with parse_cache_contents() as that aligns with imp.cache_from_source(). Does the API look reasonable (other than the parameter names since those would have to be renamed to be more in line with "cache")? Only thing I'm worried about.
msg175844 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-11-18 08:18
The suppression flag rings alarm bells for me, as does the fact that all the arguments are optional. Do you remember the rationale for allowing the marshalling errors to propagate rather than falling back to loading from source? It seems weird that a truncated read in the first 12 bytes means to fall back to compiling from source, but truncation after that is a user visible error. Allowing those exceptions to be suppressed as well would simplify things a fair bit. Regardless, looking at the full patch in context on a real computer (rather than through my phone), suggests to me there needs to be two private static methods on _LoaderBasics._parse_bytecode_file: _validate_bytecode_header _unmarshal_code SourcelessLoader.get_code would just call these directly, while SourceLoader.parse_cache_contents would suppress exceptions from the first one. I'll put together a patch illustrating this approach.
msg175858 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-18 10:33
The rationale is that was the way it already was prior to importlib. As for the approach you are suggesting, I am understand it, it will just not give the public method the same functionality which might not be that important. On Nov 18, 2012 5:18 AM, "Nick Coghlan" <report@bugs.python.org> wrote: > > Nick Coghlan added the comment: > > The suppression flag rings alarm bells for me, as does the fact that all > the arguments are optional. Do you remember the rationale for allowing the > marshalling errors to propagate rather than falling back to loading from > source? It seems weird that a truncated read in the first 12 bytes means to > fall back to compiling from source, but truncation after that is a user > visible error. Allowing those exceptions to be suppressed as well would > simplify things a fair bit. > > Regardless, looking at the full patch in context on a real computer > (rather than through my phone), suggests to me there needs to be two > private static methods on _LoaderBasics._parse_bytecode_file: > > _validate_bytecode_header > _unmarshal_code > > SourcelessLoader.get_code would just call these directly, while > SourceLoader.parse_cache_contents would suppress exceptions from the first > one. > > I'll put together a patch illustrating this approach. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue15031> > _______________________________________ >
msg175862 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-11-18 10:50
OK, rereading the whole issue and getting completely back up to speed with the problem we're trying to solve, I think parse_bytecode_container is a better name than any of my suggestions, since there is no cache involved for SourcelessLoader and similar cases. I also think a static method is a bad idea, since that makes the preferred method of invocation unclear as the method is visible via a number of different loaders, and if you're just wanting to get the code object out of a bytecode file, it isn't clear why a loader is involved at all. So, I've gone back to Ronan's approach of a module level function. The attached patch goes down that route - the public function always reraises the header errors, but in a way that SourceLoader.get_code can easily suppress. SourcelessLoader.get_code simply bypasses the public helper entirely (it could call it and unwrap the header exceptions for backwards compatibility, but that seems pointlessly convoluted). The "fullname, data, bytecode_path" part of the API remains consistent across the 3 helpers, with optional source_* parameters at the end where appropriate.
msg175870 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-18 12:42
It's a method so that it can be overridden. Otherwise I can't develop my own format and have that be the only differing option from SourceLoader. On Nov 18, 2012 7:50 AM, "Nick Coghlan" <report@bugs.python.org> wrote: > > Nick Coghlan added the comment: > > OK, rereading the whole issue and getting completely back up to speed with > the problem we're trying to solve, I think parse_bytecode_container is a > better name than any of my suggestions, since there is no cache involved > for SourcelessLoader and similar cases. > > I also think a static method is a bad idea, since that makes the preferred > method of invocation unclear as the method is visible via a number of > different loaders, and if you're just wanting to get the code object out of > a bytecode file, it isn't clear why a loader is involved at all. So, I've > gone back to Ronan's approach of a module level function. > > The attached patch goes down that route - the public function always > reraises the header errors, but in a way that SourceLoader.get_code can > easily suppress. SourcelessLoader.get_code simply bypasses the public > helper entirely (it could call it and unwrap the header exceptions for > backwards compatibility, but that seems pointlessly convoluted). > > The "fullname, data, bytecode_path" part of the API remains consistent > across the 3 helpers, with optional source_* parameters at the end where > appropriate. > > ---------- > Added file: > http://bugs.python.org/file28022/issue15031_parse_container_function.diff > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue15031> > _______________________________________ >
msg175872 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-18 13:45
Or to put it another way, without making it a method other interpreters like IronPython and Jython will never be able to reuse SourceLoader effectively if they want their own cached file format.
msg175873 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-11-18 14:14
FYI I'm talking with Dino for IronPython and I just emailed some Jython folks to try to get an opinion on whether they would prefer to have a method to override, something in _imp that they can implement, or simply implement marshal.loads() and dumps() such that the bytecode headers stay and the rest of the file is up to them to deal with. Their answers will dictate whether the code is a public method or just some private functions in importlib.
msg175933 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2012-11-18 23:49
I thought about that, but decided that being able to hook the compilation step and the reading from the cache step without overriding get code completely is a separate design problem from providing a public API for reading the bytecode container format.
msg176873 - (view)	Author: Philip Jenvey (pjenvey) *	Date: 2012-12-04 04:02
From the perspective of Jython we'd want the easiest way to hook into this as possible of course, but I think that overriding marshal to handle a $py.class or whatever format would be a misappropriation of the marshal module Jython actually has a slow, preliminary .pyc bytecode interpreter, so it needs marshal the way it is. Correct me if I'm wrong but I think the overriding a method of a Loader option could allow you to even have the import system support .pyc and $py.class at the same time in Jython (just by addding another Loader into the mix) I'm not sure anyone would ever want to do that in practice, but it's probably worth considering. Overriding a Loader method is probably the 'most work' for alternative implementations, right? But it's still fairly trivial
msg176921 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2012-12-04 15:42
So we have IronPython who has their own format, Jython who has their own format and .pyc files (which would screw with __cached__ so I assume they won't try to use both at the same time), PyPy which use .pyc files, and CPython which uses .pyc files. My motivation for exposing this publicly is quickly dwindling, although factoring it out for private use is fine.
msg179742 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-01-11 23:09
New changeset a4292889e942 by Brett Cannon in branch 'default': Issue #15031: Refactor some code in importlib pertaining to validating http://hg.python.org/cpython/rev/a4292889e942
msg179743 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2013-01-11 23:14
OK, so I took Nick's suggestion of doing header verification separately from the actual compilation, but I ended up just writing the patch from scratch to see how a possible API might play out. ATM the API is as functions and private, but if I find the need/desire to make it public (or enough ask for it) I will before 3.4 comes out. Probably will know better once I rewrite py_compile/compileall to use this code. Thanks to everyone for spending so much time talking this issue out with me.
msg179795 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2013-01-12 14:12
Nick had some good suggestions on improvements: http://mail.python.org/pipermail/python-dev/2013-January/123602.html Re-opening to remind me to do them.
msg180605 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2013-01-25 18:50
Nick's suggestions done in changeset 792810303239 .

History
Date	User	Action	Args
2022-04-11 14:57:31	admin	set	github: 59236
2013-01-25 18:50:11	brett.cannon	set	status: open -> closed messages: + msg180605
2013-01-12 14:12:57	brett.cannon	set	status: closed -> open messages: + msg179795
2013-01-11 23:14:15	brett.cannon	set	status: open -> closed resolution: fixed messages: + msg179743 stage: test needed -> resolved
2013-01-11 23:09:34	python-dev	set	nosy: + python-dev messages: + msg179742
2012-12-04 15:42:30	brett.cannon	set	messages: + msg176921
2012-12-04 04:02:09	pjenvey	set	nosy: + pjenvey messages: + msg176873
2012-11-18 23:49:53	ncoghlan	set	messages: + msg175933
2012-11-18 14:14:33	brett.cannon	set	nosy: + dino.viehland
2012-11-18 14:14:03	brett.cannon	set	messages: + msg175873
2012-11-18 13:45:54	brett.cannon	set	messages: + msg175872
2012-11-18 12:42:22	brett.cannon	set	messages: + msg175870
2012-11-18 10:50:36	ncoghlan	set	files: + issue15031_parse_container_function.diff messages: + msg175862
2012-11-18 10:33:41	brett.cannon	set	messages: + msg175858
2012-11-18 08:18:03	ncoghlan	set	messages: + msg175844
2012-11-18 01:22:18	brett.cannon	set	messages: + msg175839
2012-11-17 23:38:44	ncoghlan	set	messages: + msg175829
2012-11-17 16:07:08	brett.cannon	unlink	issue16213 dependencies
2012-11-17 15:15:06	brett.cannon	set	messages: + msg175753
2012-11-17 02:43:37	brett.cannon	set	files: + parse_pyc.diff messages: + msg175721 components: + Library (Lib), - Interpreter Core
2012-11-16 20:01:32	eric.snow	set	messages: + msg175711
2012-11-16 19:08:04	brett.cannon	set	messages: + msg175705
2012-11-16 02:30:00	brett.cannon	set	messages: + msg175664
2012-11-16 00:27:39	ncoghlan	set	messages: + msg175661
2012-11-15 19:58:08	brett.cannon	set	messages: + msg175636
2012-11-15 19:57:00	brett.cannon	set	messages: + msg175634 stage: test needed
2012-11-13 05:12:22	eric.snow	set	nosy: + eric.snow
2012-10-12 17:02:52	brett.cannon	link	issue16213 dependencies
2012-08-17 18:10:08	brett.cannon	set	messages: + msg168457
2012-08-13 19:37:07	brett.cannon	set	assignee: brett.cannon
2012-08-12 00:39:29	brett.cannon	set	messages: + msg168014
2012-08-12 00:23:17	brett.cannon	set	versions: + Python 3.4, - Python 3.3
2012-07-02 18:39:16	brett.cannon	unlink	issue15030 dependencies
2012-06-27 15:55:11	Arfrever	set	nosy: + Arfrever
2012-06-27 14:37:49	brett.cannon	set	messages: + msg164161
2012-06-27 14:08:56	Marc.Abramowitz	set	messages: + msg164158
2012-06-27 13:52:47	brett.cannon	set	messages: + msg164156
2012-06-26 22:50:21	Marc.Abramowitz	set	nosy: + Marc.Abramowitz messages: + msg164115
2012-06-08 18:16:25	brett.cannon	set	messages: + msg162543
2012-06-08 16:53:22	Ronan.Lamy	set	messages: + msg162535
2012-06-08 02:49:25	brett.cannon	set	messages: + msg162517
2012-06-08 01:31:32	jcea	link	issue15030 dependencies
2012-06-08 01:30:22	jcea	set	nosy: + jcea
2012-06-07 19:10:12	Ronan.Lamy	create