Message 400454 - Python tracker


Author	larry
Recipients	BTaskaya, Mark.Shannon, brandtbucher, brett.cannon, eric.snow, gvanrossum, larry, lemburg, nascheme, ronaldoussoren
Date	2021-08-28.01:48:38
Message-id	<1630115318.9.0.994102735025.issue45020@roundup.psfhosted.org>
In-reply-to

Content

Since nobody's said so in so many words (so far in this thread anyway): the prototype from Jeethu Rao in 2018 was a different technology than what Eric is doing.  The "Programs/_freeze_importlib.c" Eric's playing with essentially inlines a .pyc file as C static data.  The Jeethu Rao approach is more advanced: instead of serializing the objects, it stores the objects from the .pyc file as pre-initialized C static objects.  So it saves the un-marshalling step, and therefore should be faster.  To import the module you still need to execute the module body code object though--that seems unavoidable.
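To make the distinction concrete, here is a minimal sketch of the "inline the .pyc as C static data" flavor, using only the stdlib `marshal` module. The `as_c_array` helper and its output layout are made up for illustration and are not the actual output of `_freeze_importlib.c`. The last three lines show the two steps import still has to do with this approach: un-marshal, then execute the module body. The static-objects approach would skip only the first of those.

```python
import marshal
import types

SOURCE = "ANSWER = 6 * 7\ndef greet(name):\n    return 'hello ' + name\n"

# Compile to a code object, roughly what happens when a .py file is imported.
code = compile(SOURCE, "<demo>", "exec")

# Serialized form of the code object: the payload a .pyc carries after its header.
payload = marshal.dumps(code)


def as_c_array(name, data, per_line=12):
    """Render bytes as a C array (hypothetical layout, for illustration only)."""
    rows = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        rows.append("    " + ", ".join(str(b) for b in chunk) + ",")
    return "const unsigned char %s[%d] = {\n%s\n};\n" % (
        name, len(data), "\n".join(rows))


c_text = as_c_array("demo_frozen", payload)

# What import still has to do at runtime with the serialized form:
restored = marshal.loads(payload)      # 1. un-marshal (the step static objects avoid)
module = types.ModuleType("demo")
exec(restored, module.__dict__)        # 2. run the module body (needed either way)
```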

The python-dev thread covers nearly everything I remember about this.  The one thing I guess I never mentioned is that building and working with the prototype was frightful; it had both Python code and C code, and it was fragile and hard to get working.  My hunch at the time was that it shouldn't be so fragile; it should be possible to write the converter in Python: read in .pyc file, generate .c file.  It might have to make assumptions about the internal structure of the CPython objects it instantiates as C static data, but since we'd ship the tool with CPython this should be only a minor maintenance issue.
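A rough sketch of the front half of such a converter, assuming the Python 3.7+ .pyc layout (a 16-byte header followed by the marshalled code object). It reads a .pyc, walks the object tree children-first so that anything referenced is visited before the thing referencing it, and emits one placeholder C comment per object. The function names and the comment format are hypothetical; the real work, emitting initialized structs that match CPython's internal layouts, is deliberately left out because that is exactly the part that has to track the interpreter's internals.

```python
import marshal
import os
import py_compile
import tempfile
import types


def load_code_from_pyc(path):
    """Return the code object stored in a .pyc (assumes the 3.7+ header)."""
    with open(path, "rb") as f:
        data = f.read()
    # Header: 4-byte magic, 4-byte flags, 8 bytes of mtime+size or source hash.
    return marshal.loads(data[16:])


def walk(obj, order, seen):
    """Post-order walk: children are appended before their parent."""
    if id(obj) in seen:
        return
    seen.add(id(obj))
    if isinstance(obj, types.CodeType):
        children = obj.co_consts
    elif isinstance(obj, (tuple, frozenset)):
        children = obj
    else:
        children = ()
    for child in children:
        walk(child, order, seen)
    order.append(obj)


def emit_c(code):
    """Emit placeholder C text: one comment per object, dependencies first."""
    order = []
    walk(code, order, set())
    lines = ["/* generated sketch: one static object per entry */"]
    for i, obj in enumerate(order):
        lines.append("/* obj_%d: %s */" % (i, type(obj).__name__))
    return "\n".join(lines) + "\n"


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        f.write("def f():\n    return (1, 'two')\n")
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "demo.pyc"))
    demo_code = load_code_from_pyc(pyc)
    c_source = emit_c(demo_code)
```

The children-first order matters because a static object can only point at objects that have already been declared.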

In experimenting with the prototype, I observed that simply calling stat() to ensure the frozen .py file hadn't changed on disk lost us about half the performance win from this approach.  I'm not much of a systems programmer, but I wonder if there are (system-proprietary?) library calls one could make to get the stat info for all files in a single directory all at once that might be faster overall.  (Of course, caching this information at startup might make for a crappy experience for people who edit Lib/*.py files while the interpreter is running.)
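For reference, the closest portable thing in the stdlib is `os.scandir()`. A sketch of taking one snapshot of a directory and comparing against it later follows; `snapshot_mtimes` and `changed_since` are names invented here. As far as the documentation goes, `DirEntry.stat()` on Windows can usually be answered from the directory listing itself, while on Unix it still costs a system call per entry, so whether this beats per-file stat() would need measuring on each platform. It also shows the staleness caveat: the snapshot only notices an edit when it is explicitly rechecked.

```python
import os
import tempfile


def snapshot_mtimes(directory):
    """Map file name -> st_mtime_ns for regular files, in one directory pass."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                result[entry.name] = entry.stat().st_mtime_ns
    return result


def changed_since(snapshot, directory):
    """Names whose mtime differs from the snapshot (or that are new)."""
    current = snapshot_mtimes(directory)
    return sorted(n for n in current if snapshot.get(n) != current[n])


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.py", "b.py"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x = 1\n")

    cached = snapshot_mtimes(tmp)          # taken once, e.g. at startup
    unchanged = changed_since(cached, tmp)

    # Simulate someone editing b.py later: push its mtime forward by one second.
    target = os.path.join(tmp, "b.py")
    st = os.stat(target)
    os.utime(target, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    edited = changed_since(cached, tmp)
```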

One more observation about the prototype: it doesn't know how to deal with any mutable types.  marshal.c can deal with list, dict, and set.  Does this matter?  ISTM the tree of objects under a code object will never have a reference to one of these mutable objects, so it's probably already fine.
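One way to spot-check that hunch is to walk everything reachable through `co_consts` and look for mutable containers. This is a sketch, not a proof: it only inspects the sample it is given, and `mutable_constants` is a name invented here. Even when the source contains list, dict, and set literals, the compiler builds those at runtime (or folds them into tuple/frozenset constants), so none of them should show up as constants.

```python
import types

MUTABLE = (list, dict, set, bytearray)


def mutable_constants(code):
    """Return any mutable objects reachable from a code object's constants."""
    found = []
    stack = [code]
    while stack:
        obj = stack.pop()
        if isinstance(obj, types.CodeType):
            stack.extend(obj.co_consts)      # nested functions, classes, literals
        elif isinstance(obj, (tuple, frozenset)):
            stack.extend(obj)                # immutable containers can nest
        elif isinstance(obj, MUTABLE):
            found.append(obj)
    return found


SAMPLE = '''
DATA = [1, 2, 3]
TABLE = {"a": 1}
def member(x):
    return x in {1, 2, 3}
'''

sample_code = compile(SAMPLE, "<sample>", "exec")
found = mutable_constants(sample_code)
```

Running the same walk over the code objects of the whole stdlib would be a more convincing check than one sample.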

Not sure what else I can tell you.  It gave us a measurable improvement in startup time, but it seemed fragile, and it was annoying to work with/on, so after hacking on it for a week (at the 2018 core dev sprint in Redmond WA) I put it aside and moved on to other projects.
