classification
Title: Store startup modules as C structures for 20%+ startup speed improvement
Type: enhancement
Stage: patch review
Components: Interpreter Core
Versions: Python 3.8

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: larry
Nosy List: 1ace, barry, eric.smith, eric.snow, larry, nascheme, ncoghlan, phsilva, xtreak
Priority: normal
Keywords: patch

Created on 2018-09-14 21:23 by larry, last changed 2020-12-18 18:11 by 1ace.

Pull Requests
URL      Status  Linked
PR 9320  open    larry, 2018-09-14 21:25
Messages (6)
msg325401 - Author: Larry Hastings (larry) * (Python committer) Date: 2018-09-14 21:23
This patch was sent to me privately by Jeethu Rao at Facebook.  It's a change they're working with internally to improve startup time.  What I've been told by Carl Shapiro at Facebook is that we have their blessing to post it publicly / merge it / build upon it for CPython.  Their patch was written for 3.6; I have massaged it to the point where it minimally works with 3.8.

What the patch does: it takes all the Python modules that are loaded as part of interpreter startup and deserializes the marshalled .pyc file into precreated objects stored as static C data.  You add this .c file to the Python build.  Then there's a patch to Python itself (about 250 lines iirc) that teaches it to load modules from these data structures.

I wrote a quick dumb test harness to compare this patch vs 3.8 stock.  It runs a command line 500 times and uses time.perf_counter to time the process.  On a fast quiescent laptop I observe an improvement of about 21%:

cmdline: ['./python', '-c', 'pass']
500 runs:

sm38
  average time 0.006302303705982922
          best 0.006055746000129147
         worst 0.00816565500008437

clean38
  average time 0.007969956444008858
          best 0.007829047999621253
         worst 0.008812210000542109

improvement 0.20924239043734505 %

cmdline: ['./python', '-c', 'import io']
500 runs:

sm38
  average time 0.006297688038004708
          best 0.005980765999993309
         worst 0.0072462130010535475

clean38
  average time 0.007996319670004595
          best 0.0078091849991324125
         worst 0.009175700999549008

improvement 0.21242667903482038 %
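
A harness of the kind described above could look roughly like the sketch below. This is not the harness used for the numbers in this message; the function name time_command and the use of subprocess.run are assumptions made for illustration, and only the 500-run count and time.perf_counter come from the description.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmdline, runs=500):
    """Run cmdline `runs` times and summarise the wall-clock timings.

    Hypothetical helper: each run is timed with time.perf_counter,
    covering process start through process exit.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmdline, check=True)
        timings.append(time.perf_counter() - start)
    return {
        "average": statistics.mean(timings),
        "best": min(timings),
        "worst": max(timings),
    }


if __name__ == "__main__":
    # A small run count keeps the demo quick; the message above used 500.
    result = time_command([sys.executable, "-c", "pass"], runs=20)
    for label, value in result.items():
        print(f"{label:>8} {value}")
```

To compare two builds, call time_command once per interpreter binary (for example a patched and an unpatched ./python) with the same command line.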


The downside of the patch: for these modules it ignores the Python files on disk--it doesn't even stat them.  If you add stat calls you lose half of the speed improvement.  I believe they added a work-around, where you can set a flag (command-line? environment variable? I don't know, I didn't go looking for it) that tells Python "don't use the frozen modules" and it loads all those files from disk.


I don't propose to merge the patch in its current state.  I think it would need a lot of work both in terms of "doing things the way Python does it" as well as just code smell (the serializer is implemented in both C and Python and jumps back and forth, also the build process for the serialized modules is pretty tiresome).

Is it worth working on?
msg325403 - Author: Larry Hastings (larry) * (Python committer) Date: 2018-09-14 21:34
I should add that there were two novel test failures in the regression test suite: test_module and test_site.
msg325407 - Author: Larry Hastings (larry) * (Python committer) Date: 2018-09-14 22:04
As Neil points out on python-dev, my "improvement" should have been multiplied by 100, like 20.924239043734505 % instead of 0.20924239043734505 %, etc.
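
The corrected figures can be reproduced from the average times reported in the first message. The sketch below assumes the improvement is the relative reduction in average time, (stock - patched) / stock, scaled to a percentage; the function name is hypothetical.

```python
def improvement_percent(patched, stock):
    """Relative reduction in average run time, as a percentage."""
    return (stock - patched) / stock * 100


# Average times copied from the sm38 / clean38 output in msg325401.
pass_case = improvement_percent(0.006302303705982922, 0.007969956444008858)
io_case = improvement_percent(0.006297688038004708, 0.007996319670004595)

print(f"-c pass:      {pass_case:.2f} %")
print(f"-c import io: {io_case:.2f} %")
```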
msg325410 - Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2018-09-14 22:37
I commented on python-dev but maybe it is better to keep discussion here.  Could we make the frozenmodules thing into a dynamically loaded module?  Then you could have support for end users making their own.  E.g. a command-line param that lists a set of DLLs from which to load frozen modules.  Then, if the command-line param is not provided, we load from .pyc as per default.  If it is provided, we dlopen (or whatever) the DLLs and give the nice performance benefits.

Like Guido, I want to see the toolchain to build the frozenmodules thing.
msg325745 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-19 10:56
Something else this would need is a different name that better distinguishes it from the existing frozen modules, which freeze the bytecode rather than the resulting module state. (That existing approach avoids the stat overhead, but still incurs the module-level bytecode execution overhead.)

My suggestion would be to use the terms "preexec builtin module" and "preexec extension module", as the concept is similar to precompiled bytecode, just taken a step further: actually executing the module in advance and caching the result, not just compiling it.

This means the preexecuted module can't have any module-level conditional logic that depends on runtime state if the resulting binary is going to remain portable across different environments, but there'd be a lot of standard library modules that could satisfy that constraint.
msg325872 - Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-09-20 13:38
python-dev thread with more discussion of the patch: https://mail.python.org/pipermail/python-dev/2018-September/155188.html

Note that my comment above was based on a misunderstanding of what the patch does: the module-level code still gets executed at import time; what is avoided is the unmarshalling cost.
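
To make the distinction concrete, here is a small sketch of the two steps involved when a module is loaded from a .pyc: unmarshalling the code object, then executing it. The module source is a made-up stand-in, and this only illustrates the regular marshal path, not the patch's C data structures.

```python
import marshal

# Hypothetical module source, standing in for a real startup module.
SOURCE = "def double(x):\n    return x * 2\n\nANSWER = double(21)\n"

# After its header, a .pyc file holds the marshalled code object.
code = compile(SOURCE, "<example>", "exec")
payload = marshal.dumps(code)

# Step 1: unmarshal the bytes back into a code object.
# This is the cost the patch avoids by storing the objects as static C data.
restored = marshal.loads(payload)

# Step 2: execute the module-level code to populate the namespace.
# This step still happens at import time, with or without the patch.
namespace = {}
exec(restored, namespace)

print(namespace["ANSWER"])
```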
History
Date                 User        Action  Args
2020-12-18 18:11:31  1ace        set     nosy: + 1ace
2019-08-22 03:52:16  phsilva     set     nosy: + phsilva
2019-08-20 23:40:39  barry       set     nosy: + barry
2018-09-20 13:38:22  ncoghlan    set     messages: + msg325872
2018-09-19 10:56:53  ncoghlan    set     nosy: + ncoghlan
                                         messages: + msg325745
2018-09-15 03:21:34  xtreak      set     nosy: + xtreak
2018-09-14 22:37:47  nascheme    set     nosy: + nascheme
                                         messages: + msg325410
2018-09-14 22:04:17  larry       set     messages: + msg325407
2018-09-14 21:34:59  larry       set     messages: + msg325403
2018-09-14 21:27:55  eric.snow   set     nosy: + eric.snow
2018-09-14 21:26:32  eric.smith  set     nosy: + eric.smith
2018-09-14 21:25:54  larry       set     keywords: + patch
                                         stage: needs patch -> patch review
                                         pull_requests: + pull_request8745
2018-09-14 21:23:50  larry       create