classification
Title: Speed up python startup by pre-warming the vm
Type: performance
Stage:
Components: Interpreter Core
Versions: Python 3.8

process
Status: open
Resolution:
Dependencies: 22257
Superseder:
Assigned To:
Nosy List: barry, cykerway, inada.naoki, ncoghlan, ronaldoussoren, terry.reedy, vstinner
Priority: normal
Keywords:

Created on 2018-07-31 17:31 by cykerway, last changed 2018-08-09 03:08 by inada.naoki.

Messages (14)
msg322800 - (view) Author: Cyker Way (cykerway) Date: 2018-07-31 17:31
I'm currently writing some shell tools with python3. While python can definitely do the job succinctly, there is one problem which made me feel I might have to switch to other languages or even pure bash scripts: python startup time.

Shell tools are used very often, interactively. Users do feel the lag when they hit enter after their command. I did 2 implementations in both python and pure bash; python takes about 500ms to run while bash is more than 10 times faster.

I'm not saying bash is better than python, but for this task bash, or even perl, is a better choice. However, I think there is an easy way to improve python, as I believe the lag is mostly due to its slow startup: pre-warm its vm.

I can think of 2 scenarios for python to do a better shell job:

1.  Run a universal python as a daemon, which reads scripts from a socket, runs them, and returns the results to a socket. Because it's running as a daemon, the startup time is avoided each time the user runs a script.

2.  Warm a python zygote during system boot. Every time a python script is run, fork from the zygote instead of cold-booting the vm. This is similar to the Android zygote approach.

I haven't done experiments to see whether there will be obstacles in implementing these scenarios. But I think this should become a priority because it's real and tangible, and other people may face the slow startup problem as well. If there's ongoing work on these, I'd be happy to have a look. But I don't think these scenarios have already been put into released versions of python.
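
A minimal sketch of the fork-from-a-warm-process idea in scenario 2, assuming a POSIX system where `os.fork` is available. The helper name `run_in_child` and the choice of `json` and `re` as the preloaded modules are hypothetical, and a real zygote would also need a way to receive requests (for example over a socket), which is left out here.

```python
import json  # imported once in the parent, so children start "warm"
import os
import re    # likewise preloaded; the child pays no import cost for it


def run_in_child(func):
    """Fork the already-initialised interpreter and run func() in the child.

    Returns (exit_status, output_bytes), where output_bytes is str(result)
    of func() sent back through a pipe.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: inherits every module the parent has already imported.
        os.close(read_fd)
        status = 0
        try:
            os.write(write_fd, str(func()).encode())
        except BaseException:
            status = 1
        finally:
            # Skip normal interpreter shutdown so parent state is untouched.
            os._exit(status)
    # Parent: collect the child's output until it closes the pipe.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, wait_status = os.waitpid(pid, 0)
    exit_status = os.WEXITSTATUS(wait_status) if os.WIFEXITED(wait_status) else -1
    return exit_status, b"".join(chunks)


if __name__ == "__main__":
    print(run_in_child(lambda: json.dumps({"ok": True})))
```

The parent never runs user code itself, so its state stays clean between requests; only the forked child is affected by whatever the script does.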
msg323091 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2018-08-03 22:18
We are aware that startup time is an issue, especially for quick scripts.  I don't know if your ideas have been considered, so I added a couple of people who might.  The python-ideas list would likely be a better place for discussion until there is some idea about a concrete patch.
msg323098 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-08-04 07:41
It isn't currently feasible to do anything along these lines, as the CPython runtime is highly configurable, so it's far from clear what, if anything, could be shared from run to run, and nor is it clear how the interpreter could check whether or not the current configuration settings matched those of the pre-warmed one.

However, the work taking place for PEP 432 (issue dependency added) will potentially make it possible to revisit this, as there may be a way to cache preconfigured interpreters in a fashion that means calculating the cache key from the current configuration and then loading the cached interpreter state is faster than imperatively initialising a fresh interpreter.

Even if it isn't possible to cache an entire interpreter state, there may at least be opportunities to optimise particular configuration substeps.
msg323183 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-08-06 08:07
> I did 2 implementations in both python and pure bash; python takes about 500ms to run while bash is more than 10 times faster.

VM startup + `import site` are done in 10~20ms on a common Linux machine.
It won't take 500ms.

While Python VM startup is not lightning fast, library import time is much slower than VM startup in many cases.  And since importing a library may have many important side effects, we can't "pre warm" it.

Have you tried the `PYTHONPROFILEIMPORTTIME=1` environment variable, added in Python 3.7?
https://dev.to/methane/how-to-speed-up-python-application-startup-time-nkf
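
A minimal sketch of how the import profile could be collected programmatically, assuming Python 3.7+. It uses the `-X importtime` command-line option, which is the equivalent of setting `PYTHONPROFILEIMPORTTIME=1`; the helper name `import_times` is hypothetical. The report is written to stderr, one line per imported module.

```python
import subprocess
import sys


def import_times(statement):
    """Run `statement` in a fresh interpreter with import profiling enabled.

    Returns a list of (module, self_us, cumulative_us) tuples, slowest
    cumulative time first.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", statement],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=True,
    )
    prefix = "import time:"
    rows = []
    for line in proc.stderr.splitlines():
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].split("|")
        if len(fields) != 3:
            continue
        try:
            self_us, cumulative_us = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # the header line has no numbers
        # Leading spaces in the module column only indicate nesting depth.
        rows.append((fields[2].strip(), self_us, cumulative_us))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for name, self_us, cumulative_us in import_times("import json")[:10]:
        print(f"{cumulative_us:>8} us  {name}")
```

Sorting by cumulative time puts the imports that dominate startup at the top, which is usually where to look first.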
msg323189 - (view) Author: Cyker Way (cykerway) Date: 2018-08-06 09:58
>   VM startup + `import site` are done in 10~20ms on a common Linux machine.

>   While Python VM startup is not lightning fast, library import time is much slower than VM startup in many cases.

In my tests, a helloworld python script generally takes about 30-40 ms, while a similar helloworld bash script takes about 3-4 ms. After adding some common library imports (`os`, `sys`, etc.), the python script's running time goes up to 60-70ms. I'm not sure how to compare this against bash because there are no library imports in bash. The 500ms (python) vs 50ms (bash) comparison is based on minimal implementations of the same simple job and meant to reflect the minimal amount of time needed for such a job in different languages. While it doesn't cover everything and may not even be fair enough, the result does match that of the helloworld test (10 times faster/slower). Plus, in your linked post, it shows importing pipenv takes about 700ms. Therefore I believe some hundreds of milliseconds are necessary for such scripts that do a simple but meaningful job.
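
A rough sketch of how such startup numbers can be measured reproducibly, assuming wall-clock time of a do-nothing interpreter run is a fair proxy. The helper name `startup_time` is hypothetical, and the results depend heavily on the machine and the installed environment, so no expected figures are given.

```python
import subprocess
import sys
import time


def startup_time(flags=(), repeat=5):
    """Best-of-`repeat` wall-clock seconds to run `python <flags> -c pass`."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        subprocess.run([sys.executable, *flags, "-c", "pass"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # -I isolates the run from the user environment, -S skips `import site`.
    print(f"default: {startup_time() * 1000:.1f} ms")
    print(f"-I -S:   {startup_time(('-I', '-S')) * 1000:.1f} ms")
```

Taking the minimum over several runs reduces noise from disk caching and other processes; the measurement also includes the cost of spawning the process itself.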

I understand many things can happen while importing a library. But for a specific program, its imports are usually fixed and very much likely the same between runs. That's why I believe a zygote/fork/snapshot feature would still be helpful to help avoid the startup delay.

Finally, for simple and quick user scripts, the 30-40 ms startup time without any import statements may not be a huge problem, but it's still tangible and makes the program feel not that sleek.
msg323191 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-08-06 10:12
> In my tests, a helloworld python script generally takes about 30-40 ms.
[snip]
> Finally, for simple and quick user scripts, the 30-40 ms startup time without any import statements may not be a huge problem, but it's still tangible and makes the program feel not that sleek.

What is your environment?
I optimized startup on Python 3.7, especially on macOS (it was very slow before 3.7).

And some legacy features (e.g. the legacy "namespace package" system from setuptools) make startup much slower, because they silently import some heavy libraries even when you just run "hello world".

PYTHONPROFILEIMPORTTIME will help to find them too.  And venv allows you to split such legacy tools out of your main Python environment.
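
Another way to spot silently imported modules, sketched here under the assumption that comparing `sys.modules` between a default run and a bare run (`-I -S`, i.e. isolated mode without `import site`) is informative enough. The helper name `modules_at_startup` is hypothetical.

```python
import subprocess
import sys


def modules_at_startup(*flags):
    """Return the set of module names loaded before any user code runs."""
    code = "import sys; print(' '.join(sorted(sys.modules)))"
    proc = subprocess.run(
        [sys.executable, *flags, "-c", code],
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return set(proc.stdout.split())


if __name__ == "__main__":
    default = modules_at_startup()
    bare = modules_at_startup("-I", "-S")
    # Modules pulled in by site, .pth files, or the user environment.
    for name in sorted(default - bare):
        print(name)
```

Anything unexpected in the difference (for example a large third-party package) is a candidate for the kind of legacy hook described above.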

> The 500ms (python) vs 50ms (bash) comparison is based on minimal implementations of the same simple job and meant to reflect the minimal amount of time needed for such a job in different languages. 

Could you give us an example script?

> Plus, in your linked post, it shows importing pipenv takes about 700ms. Therefore I believe some hundreds of milliseconds are necessary for such scripts that do a simple but meaningful job.

FYI, it compiles many regular expressions at startup time.
I want to add a lazy compilation API to the re module in 3.8.  (I'm waiting for bpo-21145 to be implemented.)
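
To illustrate what lazy compilation means, here is a small application-side sketch. The `LazyPattern` class is hypothetical and is not the API proposed for the `re` module or in bpo-21145; it only shows the idea of deferring `re.compile` until the pattern is first used.

```python
import re


class LazyPattern:
    """Wrap a regular expression and compile it on first use only."""

    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None  # nothing is compiled at import time

    def _get(self):
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return self._compiled

    def __getattr__(self, name):
        # Only called for attributes not found normally, e.g. search/match/sub,
        # which are delegated to the compiled pattern object.
        return getattr(self._get(), name)


# Module-level definition is cheap: no compilation happens here.
NUMBER = LazyPattern(r"\d+")

if __name__ == "__main__":
    print(NUMBER.search("abc123").group())
```

A module that defines dozens of such patterns at import time then pays the compilation cost only for the ones a given run actually touches.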
msg323194 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-08-06 10:28
> I understand many things can happen while importing a library. But for a specific program, its imports are usually fixed and very much likely the same between runs. That's why I believe a zygote/fork/snapshot feature would still be helpful to help avoid the startup delay.

I agree it can be useful.  It's different from your first post.  (Snapshotting the whole application, rather than just the Python VM.)
And this idea has been discussed on the python-dev mailing list several times.

But I think it can be implemented as a 3rd party tool at first.  It is better because (1) we can battle test the idea before adding it to stdlib, and (2) we can use the tool even for Python 3.7.
msg323195 - (view) Author: Cyker Way (cykerway) Date: 2018-08-06 10:39
It was tested on an x86_64 Linux system. The machine is not quite new but is OK for building and running python. The test script is actually a management tool for a larger project that is not publicly released, so I don't have the right to disclose it here. When tested on python 3.7 it did run faster than python 3.6 so there were indeed great improvements.

While optimizing standard libraries definitely makes the program start up faster, we should also note that each user program also has its own specific initialization procedure. This is in general out of control of language platform developers, but forking a user-initialized vm or using a snapshot chosen by the user still helps.
msg323234 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-08-07 04:39
* While this issue is "pre warming VM", VM startup is not a significant part of your 500ms.
* You're talking about application specific strategy now.  It's different from this issue.

And many ideas like yours are already discussed on ML, again and again.  I feel nothing is new or concrete.  It's not productive at all.

I want to close this issue.  Please give us more concrete idea or patch with target sample application you want to optimize.
msg323253 - (view) Author: Cyker Way (cykerway) Date: 2018-08-07 21:40
>   While this issue is "pre warming VM", VM startup is not a significant part of your 500ms.

10-20ms should be OK for shell scripts. But a fork is still faster.  

>   You're talking about application specific strategy now. It's different from this issue.

Actually, this issue is created to look for a generic approach that can optimize the running time for most, or even all, python scripts. Different scripts may import different modules, but this doesn't mean there isn't a method that works for all of them.

>   And many ideas like yours are already discussed on ML, again and again.

I browsed about 6-7 threads on python-dev. I think 2-3 of them provide great information. But I don't think any of them gives concrete solutions. So we are still facing this problem today.

>   I want to close this issue. Please give us more concrete idea or patch with target sample application you want to optimize.

As said I'm looking for a generic approach. So optimizing specific applications isn't really the goal of this issue (though work on specific modules still helps). I did implement a proof of concept (link: <https://github.com/cykerway/pyforkexec>) for the fork-exec startup approach. It's still very rough and elementary, but proves this approach has its value. As Nick said:

>   ...the CPython runtime is highly configurable, so it's far from clear what, if anything, could be shared from run to run...

What I hope is we can inspect these configurations and figure out the invariants. This would help us make a clean environment as the forking base. If this is impossible, I think we can still fork from a known interpreter state chosen by the user script author. You may close this issue if nobody has more to say, but I hope the fork-exec startup can be standardized one day as I believe, for quick scripts, however much you optimize the cold start it can't be faster than a fork.
msg323261 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-08-08 03:35
On Wed, Aug 8, 2018 at 6:40 AM Cyker Way <report@bugs.python.org> wrote:
>
> Cyker Way <cykerway@gmail.com> added the comment:
>
> >   While this issue is "pre warming VM", VM startup is not a significant part of your 500ms.
>
> 10-20ms should be OK for shell scripts. But a fork is still faster.
>
> >   You're talking about application specific strategy now. It's different from this issue.
>
> Actually, this issue is created to look for a generic approach that can optimize the running time for most, or even all, python scripts. Different scripts may import different modules, but this doesn't mean there isn't a method that works for all of them.
>

"Fork before loading any modules" and "fork after loading application"
are totally different.
It's not generic.  The former can be done by Python core, but I'm not sure
it's really helpful.
The latter can be done in some framework.  And it can be battle tested in a
3rd party CLI framework
before adding it to the Python stdlib.

> >   And many ideas like yours are already discussed on ML, again and again.
>
> I browsed about 6-7 threads on python-dev. I think 2-3 of them provide great information. But I don't think any of them gives concrete solutions. So we are still facing this problem today.
>

I didn't mean it's solved.  I meant many people have suggested the same idea again and again,
and I'm tired of reading such random ideas.
Python provides os.fork already.  People can use it.  CLI frameworks can use it.
So what should Python provide additionally?

> >   I want to close this issue. Please give us more concrete idea or patch with target sample application you want to optimize.
>
> As said I'm looking for a generic approach. So optimizing specific applications isn't really the goal of this issue (though work on specific modules still helps). I did implement a proof of concept (link: <https://github.com/cykerway/pyforkexec>) for the fork-exec startup approach. It's still very rough and elementary, but proves this approach has its value. As Nick said:
>

I doubt there is a generic and safe approach that fits in the stdlib.
For example, your PoC raises several points to consider:

* A different Python, venv, or application version may be listening on the unix socket.
* Where should the socket live?  What about socket permissions?
* Environment variables may differ.
* When the server is not used for a long time, it should exit automatically.
  The client should start the server if no server is listening.
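
One way to address the first and third points above, sketched under assumptions: derive a key from the interpreter and the relevant environment, so that a client only talks to a server started with a matching configuration (for example by putting the key in the socket file name). The helper name `server_key` and the set of tracked variables are hypothetical, and this is not how the PoC works.

```python
import hashlib
import os
import sys


def server_key(env=None, tracked_vars=("PYTHONPATH", "PYTHONHOME", "VIRTUAL_ENV")):
    """Return a short hex digest identifying this interpreter configuration.

    Two processes get the same key only if they use the same executable,
    Python version, prefix, sys.path and tracked environment variables.
    """
    env = os.environ if env is None else env
    parts = [
        sys.executable or "",
        sys.version,
        sys.prefix,
        os.pathsep.join(sys.path),
    ]
    parts += [f"{name}={env.get(name, '')}" for name in tracked_vars]
    data = "\0".join(parts).encode("utf-8", "surrogateescape")
    return hashlib.sha256(data).hexdigest()[:16]


if __name__ == "__main__":
    # Hypothetical use: one socket per configuration, e.g. "pyserver-<key>.sock".
    print(f"pyserver-{server_key()}.sock")
```

This does not settle socket location, permissions, or server lifetime; it only shows that a mismatch in Python, venv, or environment can be detected cheaply on the client side.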

I prefer battle-testing the idea as a 3rd party tool first.  Then we can
discuss whether
we should add it to the stdlib or not.

The "add it on PyPI first" approach has several benefits:

* It can be used with older Python versions.
* It can evolve more quickly than Python's 1.5-year release cycle.

> >   ...the CPython runtime is highly configurable, so it's far from clear what, if anything, could be shared from run to run...
>
> What I hope is we can inspect these configurations and figure out the invariants. This would help us make a clean environment as the forking base. If this is impossible, I think we can still fork from a known interpreter state chosen by the user script author. You may close this issue if nobody has more to say, but I hope the fork-exec startup can be standardized one day as I believe, for quick scripts, however much you optimize the cold start it can't be faster than a fork.
>

It relates only to the "former" (fork before loading application) approach.
I don't think it's really worth it.  The benefits will be very small compared to
its danger and complexity.
msg323266 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2018-08-08 07:29
This might be a useful feature but I think it would be better to develop this outside the stdlib, especially when the mechanism needs application specific code (such as preloading modules used by a specific script).

If/when such a tool has enough traction on PyPI we can always reconsider including it in the stdlib.

BTW. Python runs on a number of platforms where a forking server (option 2 in msg322800) is less useful. In particular Windows which doesn't have "fork" behaviour.
msg323295 - (view) Author: Cyker Way (cykerway) Date: 2018-08-08 18:17
I'm fine with stdlib, 3rd party tools, or whatever. My focus is to understand whether this idea can be correctly implemented on the python VM or not. I've been searching for similar implementations on standard JVM, but the results mostly come from research projects rather than industrial solutions. That being said, Android does have preloading implemented in its Dalvik/ART VM (which is more or less a variant of JVM). Cited from <https://source.android.com/devices/tech/dalvik/configure>:

>   The preloaded classes list is a list of classes the zygote will initialize on startup. This saves each app from having to run these class initializers separately, allowing them to start up faster and share pages in memory.

I was wondering what makes it difficult for standard JVM (eg. HotSpot) to have such feature and why Dalvik/ART is able to do it, and what would be the case for the python VM?

----

A few more words about my original vision: I was hoping to speed up python script execution using template VMs in which a list of selected modules is preloaded. For example, if you have one script for regex matching, and another for dir listing, then you can create 2 template VMs with `re` and `os` modules preloaded, respectively. The template VMs run as system services so that you can always fork from them to create something like a runtime version of *virtualenv* where only relevant modules are loaded. The preloaded modules can be standard modules or user modules. I don't really see what makes a difference here if the module is standard or not.

----

>   In particular Windows which doesn't have "fork" behaviour.

Forking helps the parent process keep a clean state since it basically does nothing after the fork. If the system doesn't natively support fork then the parent process can do the job by itself instead of forking a child process to do so. But additional work might be needed to remove the artifacts resulting from the execution of user script.
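
A minimal sketch of the no-fork variant described above, assuming the warm process runs the user script itself with `runpy.run_path` and then undoes the most obvious artifacts. The helper name `run_script_in_process` is hypothetical, and the cleanup is deliberately incomplete: it restores `sys.argv` and drops newly imported modules, but cannot undo other side effects such as changed environment variables, open files, or state inside extension modules.

```python
import runpy
import sys


def run_script_in_process(path, argv=()):
    """Run the script at `path` as __main__ inside the current interpreter.

    Returns the script's resulting globals. Modules that were preloaded
    before the call stay loaded; modules first imported by the script are
    removed from sys.modules afterwards.
    """
    saved_argv = sys.argv[:]
    preloaded = set(sys.modules)
    sys.argv = [path, *argv]
    try:
        return runpy.run_path(path, run_name="__main__")
    finally:
        sys.argv = saved_argv
        for name in set(sys.modules) - preloaded:
            del sys.modules[name]
```

This is the "additional work" in its simplest form; a fork gives the same isolation for free, which is why the forking design is easier to get right where it is available.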
msg323313 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2018-08-09 03:08
On Thu, Aug 9, 2018 at 3:17 AM Cyker Way <report@bugs.python.org> wrote:
>
> Cyker Way <cykerway@gmail.com> added the comment:
>
> I'm fine with stdlib, 3rd party tools, or whatever. My focus is to understand whether this idea can be correctly implemented on the python VM or not. I've been searching for similar implementations on standard JVM, but the results mostly come from research projects rather than industrial solutions. That being said, Android does have preloading implemented in its Dalvik/ART VM (which is more or less a variant of JVM). Cited from <https://source.android.com/devices/tech/dalvik/configure>:
>
> >   The preloaded classes list is a list of classes the zygote will initialize on startup. This saves each app from having to run these class initializers separately, allowing them to start up faster and share pages in memory.
>
> I was wondering what makes it difficult for standard JVM (eg. HotSpot) to have such feature and why Dalvik/ART is able to do it, and what would be the case for the python VM?
>

Many WSGI servers provide "pre-fork" for (1) faster worker process creation and
(2) sharing static memory.  So it's definitely possible.

Compared with the JVM, Python is a dynamic language.
For example:

import os

if not os.environ.get('XXX_NO_SPEEDUP'):
    from xxx._speedup import somefunc  # Load somefunc from the extension module
else:
    from xxx._util import somefunc  # Load somefunc from pure Python

Environment variables, configuration files, or even input from the keyboard or
some sensors may affect which modules are imported, unlike on the JVM.

So stricter restrictions on the application are required in Python's case.
It can't be applied generally, blindly and automatically from the VM side.
It should be implemented explicitly on the application side.
History
Date                 User            Action  Args
2018-08-09 03:08:55  inada.naoki     set     messages: + msg323313
2018-08-08 18:17:49  cykerway        set     messages: + msg323295
2018-08-08 07:29:54  ronaldoussoren  set     nosy: + ronaldoussoren
                                             messages: + msg323266
2018-08-08 03:35:40  inada.naoki     set     messages: + msg323261
2018-08-07 21:40:42  cykerway        set     messages: + msg323253
2018-08-07 04:39:32  inada.naoki     set     messages: + msg323234
2018-08-06 10:39:48  cykerway        set     messages: + msg323195
2018-08-06 10:28:39  inada.naoki     set     messages: + msg323194
2018-08-06 10:12:38  inada.naoki     set     messages: + msg323191
2018-08-06 09:58:27  cykerway        set     messages: + msg323189
2018-08-06 08:07:47  inada.naoki     set     messages: + msg323183
2018-08-06 07:53:26  inada.naoki     set     nosy: + inada.naoki
                                             versions: - Python 3.6, Python 3.7
2018-08-04 07:41:45  ncoghlan        set     dependencies: + PEP 432: Redesign the interpreter startup sequence
                                             messages: + msg323098
2018-08-04 00:37:25  barry           set     nosy: + barry
2018-08-03 22:18:06  terry.reedy     set     nosy: + vstinner, terry.reedy, ncoghlan
                                             type: enhancement -> performance
                                             messages: + msg323091
2018-07-31 17:31:24  cykerway        create