classification
Title: gc.freeze() - an API to mark objects as uncollectable
Type: resource usage
Stage: resolved
Components: Interpreter Core
Versions: Python 3.7

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, benjamin.peterson, brainfvck, davin, haypo, inada.naoki, lukasz.langa, nascheme, pitrou, rhettinger, serhiy.storchaka, tim.peters, yselivanov
Priority: normal
Keywords: needs review, patch

Created on 2017-09-23 00:21 by lukasz.langa, last changed 2017-10-16 21:39 by lukasz.langa. This issue is now closed.

Pull Requests
URL Status Linked
PR 3705 merged python-dev, 2017-09-23 01:43
PR 4013 merged lukasz.langa, 2017-10-16 21:28
Messages (19)
msg302780 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-09-23 00:21
When you're forking many worker processes off of a parent process, the resulting children are initially very cheap in memory.  They share memory pages with the base process until a write happens [1]_.

Sadly, the garbage collector in Python touches every object's PyGC_Head during a collection, even if that object stays alive, undoing all the copy-on-write wins.  Instagram disabled the GC completely for this reason [2]_.  This fixed the COW issue but made the processes more vulnerable to memory growth due to new cycles being silently introduced when the application code is changed by developers.  While we could fix the most glaring cases, it was hard to keep the memory usage at bay.  We came up with a different solution that fixes both issues.  It requires a new API to be added to CPython's garbage collector.


gc.freeze()
-----------

As soon as possible in the lifecycle of the parent process we disable the garbage collector.  Then we call a new API called `gc.freeze()` to move all currently tracked objects to a permanent generation.  They won't be considered in further collections.  This is okay since we are assuming that (almost?) all of the objects created until that point are module-level and thus useful for the entire lifecycle of the child process.

After calling `gc.freeze()` we call fork. Then, the child process is free to re-enable the garbage collector.

Why do we need to disable the collector on the parent process as soon as possible?  When the GC cleans up memory in the meantime, it leaves space in pages for new objects.  Those pages become shared after fork and as soon as the child process starts creating its own objects, they will likely be written to the shared pages, initiating a lot of copy-on-write activity.

In other words, we're wasting a bit of memory in the shared pages to save a lot of memory later (that would otherwise be wasted on copying entire pages after forking).
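
The sequence described above (disable early, warm up, freeze, fork, re-enable in the child) could look roughly like the sketch below.  It assumes a POSIX system and Python 3.7+; `warm_up`, `run_prefork` and `worker_main` are hypothetical names used only for illustration, not part of the patch.

```python
import gc
import os


def warm_up():
    """Hypothetical stand-in for importing the application and building caches."""
    return {"routes": [("/", "index")], "settings": {"debug": False}}


def run_prefork(num_workers, worker_main):
    """Fork num_workers children after freezing the parent's objects.

    Returns the list of child exit codes (0 on success, 1 on failure).
    """
    gc.disable()     # as early as possible, so no collection leaves free slots in pages
    app = warm_up()  # objects created here are expected to live for the whole process
    gc.freeze()      # move every tracked object to the permanent generation

    pids = []
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Child process: collections started from here on skip the
            # frozen objects, so their GC headers are not written to.
            exit_code = 1
            try:
                gc.enable()
                worker_main(app)
                exit_code = 0
            finally:
                os._exit(exit_code)  # never return into the parent's code path
        pids.append(pid)

    # Parent process: wait for every child and report its exit code.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]
```

A call such as `run_prefork(4, serve_requests)`, with a hypothetical `serve_requests(app)` callable, would start four workers.  The parent deliberately stays in the disabled-and-frozen state so that later forks see the same layout.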


Other attempts
--------------

We also tried moving the GC head to another place in memory.  This creates some indirection but cache locality on that segment is great so performance isn't really hurt.  However, this change introduces two new pointers (16 bytes) per object.  This doesn't sound like a lot but given millions of objects and tens of processes per box, this alone can cost hundreds of megabytes per host.  Memory that we wanted to save in the first place.  So that idea was scrapped.


Attribution
-----------

The original patch is by Zekun Li, with help from Jiahao Li, Matt Page, David Callahan, Carl S. Shapiro, and Chenyang Wu.


.. [1] https://en.wikipedia.org/wiki/Copy-on-write
.. [2] https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172
msg302790 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2017-09-23 17:05
This is only useful if the parent process has a lot of memory that's never used by the child processes right? Otherwise, you would lose via refcounting COWs.
msg302831 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-09-24 07:40
Nice idea!

I think it helps not only forking applications, by letting them share more memory,
but also large long-running applications.
They have many static objects that are tracked by the GC.
These make a full GC take a long time, and they fill the CPU cache with unused data.

For example, take a web worker that loads the application after fork (uWSGI's --lazy-app option).
Such an application can call `gc.freeze()` after loading the full application and before it starts processing requests.
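
A rough way to observe the effect on full collections in a single process is sketched below.  The baseline data and the helper names (`timed_full_collection`, `compare_collection_times`) are made up for illustration, and the timings depend on the machine, so the sketch only returns the numbers instead of asserting which one is smaller.

```python
import gc
import time


def timed_full_collection():
    """Run a full collection and return how long it took, in seconds."""
    start = time.perf_counter()
    gc.collect()
    return time.perf_counter() - start


def compare_collection_times(num_objects=100_000):
    # Hypothetical baseline data: long-lived lists, all tracked by the GC.
    baseline = [[i, str(i)] for i in range(num_objects)]

    before = timed_full_collection()  # has to traverse the whole baseline
    gc.freeze()                       # baseline moves to the permanent generation
    frozen = gc.get_freeze_count()
    after = timed_full_collection()   # ignores everything that was frozen
    gc.unfreeze()                     # undo (this also unfreezes anything frozen earlier)

    return {"before": before, "after": after, "frozen": frozen, "kept": len(baseline)}


if __name__ == "__main__":
    print(compare_collection_times())
```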
msg302832 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-09-24 07:51
AFAIK, the Python shutdown process runs a full GC.

Not touching the permanent generation makes shutdown faster.
On the other hand, there are some downsides:

* Some objects may not be freed during shutdown.  This looks like a "leak" to an application embedding the Python interpreter.
* Some __del__ methods may not be called.

Of course, collecting the permanent generation during shutdown doesn't make sense.
gc.freeze() is used for sharing more memory pages.  The shutdown process
shouldn't unshare them.

So I think these notable downsides should be documented.
msg302967 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2017-09-25 18:55
I think the basic idea makes a lot of sense, i.e. have a generation that is never collected.  An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3.  The collection would by default never collect generation 3.  Generation 3 would be equivalent to the frozen generation.  You could still force collection by calling gc.collect(3).  Whether that generation should be collected on shutdown would still be a question.

If this gets implemented, it will impact the memory bitmap based GC idea I have been prototyping.  Currently I am thinking of using two bits for each small GC object.  The bits would mean: 00 - untracked, 01 - gen 0, 10 - gen 1, 11 - gen 2.  With the introduction of a frozen generation, I would have to use another bit I think.
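
The bit budget mentioned above can be illustrated with a small Python model.  This is only a sketch of the encoding idea (two bits per object, packed four to a byte), not the prototype itself; `TwoBitMap` and `STATES` are hypothetical names.

```python
# The four states from the message fill a two-bit field completely.
STATES = {"untracked": 0b00, "gen0": 0b01, "gen1": 0b10, "gen2": 0b11}


class TwoBitMap:
    """Packs one 2-bit state per object slot, four slots per byte."""

    def __init__(self, num_slots):
        self.bits = bytearray((num_slots * 2 + 7) // 8)

    def set(self, index, state):
        byte, shift = divmod(index * 2, 8)  # shift is always 0, 2, 4 or 6
        cleared = self.bits[byte] & ~(0b11 << shift) & 0xFF
        self.bits[byte] = cleared | (state << shift)

    def get(self, index):
        byte, shift = divmod(index * 2, 8)
        return (self.bits[byte] >> shift) & 0b11


def bits_needed(num_states):
    """Minimum number of bits that can distinguish num_states values."""
    return max(1, (num_states - 1).bit_length())
```

With four states `bits_needed(4)` is 2, while adding a fifth "frozen" state gives `bits_needed(5) == 3`, which is the extra bit the message refers to.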

Another thought is maybe we don't actually need 3 generations as they are currently used.  We could have gen 0 which is collected frequently and gen 1 that is collected rarely.  The frozen objects could go into gen 2, which is not automatically collected or has a user-adjustable collection frequency.  Collection of gen 1 would not automatically move objects into gen 2.

I think issue 31105 (https://bugs.python.org/issue31105) is also related.  The current GC thresholds are not very good.  I've looked at what Go does, and its GC collection is based on a relative increase in memory usage.  Python could perhaps do something similar.  The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.
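
A heuristic based on relative growth could be modeled as below.  This is a toy sketch loosely inspired by percentage-based pacing; `RelativeGrowthTrigger` is a hypothetical class, not a CPython API, and it assumes the caller can supply heap sizes in bytes, which is exactly the accounting the message describes as tricky.

```python
class RelativeGrowthTrigger:
    """Request a collection once the heap has grown by growth_percent
    relative to what was still alive after the previous collection."""

    def __init__(self, growth_percent=100, initial_live_bytes=4 * 1024 * 1024):
        self.growth_percent = growth_percent
        self.live_bytes = initial_live_bytes

    @property
    def next_goal(self):
        # Heap size (in bytes) at which the next collection should start.
        return self.live_bytes + self.live_bytes * self.growth_percent // 100

    def should_collect(self, heap_bytes):
        return heap_bytes >= self.next_goal

    def record_collection(self, live_bytes_after):
        # Remember what survived; the next goal is relative to this value.
        self.live_bytes = live_bytes_after
```

Unlike a fixed count threshold, the goal scales with the size of the live heap, so a process with a large baseline collects less often.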
msg302969 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-09-25 19:02
I like the idea of a fourth generation that never gets collected.  This would have been useful for the original problem that inspired me to add the `generation` argument to `gc.collect()`.  The nice thing about this is just as you suggest: you could force a collection of that fourth generation with `gc.collect(3)`.

It's unfortunate that you'd have to add a bit to handle this, but maybe you're right that we only really need three generations.
msg302972 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-09-25 19:07
On 25/09/2017 at 20:55, Neil Schemenauer wrote:
> 
> I think the basic idea makes a lot of sense, i.e. have a generation that is never collected.  An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3.  The collection would by default never collect generation 3.  Generation 3 would be equivalent to the frozen generation.  You could still force collection by calling gc.collect(3).

API-wise it would sound better to have a separate gc.collect_frozen()...

Though I think a gc.unfreeze() that moves the frozen generation into the
oldest non-frozen generation would be useful too, at least for testing
and experimentation.
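
For experimentation, the round trip can be sketched as follows, assuming the `gc.freeze()` / `gc.unfreeze()` pair as shipped in Python 3.7.  `Node` and `frozen_cycle_demo` are hypothetical names; the weak reference is only there to observe whether the cycle is still alive.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.me = self  # reference cycle: only the cyclic GC can reclaim it


def frozen_cycle_demo():
    was_enabled = gc.isenabled()
    gc.disable()                 # keep automatic collections out of the demo

    node = Node()
    probe = weakref.ref(node)    # lets us check liveness without keeping it alive
    gc.freeze()                  # the cycle now sits in the permanent generation
    del node                     # unreachable, but the collector will not look at it

    gc.collect()
    alive_while_frozen = probe() is not None

    gc.unfreeze()                # back into the oldest regular generation
    gc.collect()
    alive_after_unfreeze = probe() is not None

    if was_enabled:
        gc.enable()
    return alive_while_frozen, alive_after_unfreeze
```

On CPython the function is expected to return `(True, False)`: the garbage cycle survives a full collection while frozen and is reclaimed once it has been unfrozen.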

> I think issue 31105 (https://bugs.python.org/issue31105) is also related.  The current GC thresholds are not very good.  I've looked at what Go does, and its GC collection is based on a relative increase in memory usage.  Python could perhaps do something similar.  The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.

Yeah... It's worse than that.  Take for example a bytearray object.  The
basic object (the PyByteArrayObject structure) is quite small.  But it
also has a separately-allocated payload that is deleted whenever
tp_dealloc is called.  The GC isn't aware of that payload.  Worse, the
payload can (and will) change size during the object's lifetime, without
the GC's knowledge about it ever being updated. (*)
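
The size change is easy to observe from Python.  The sketch below (with a hypothetical helper name, `payload_growth`) grows a bytearray and also returns `gc.get_threshold()`, whose values are thresholds on counts, not on bytes.

```python
import gc
import sys


def payload_growth(extra_bytes=1_000_000):
    buf = bytearray()
    size_empty = sys.getsizeof(buf)   # small object, no payload yet
    buf.extend(b"\0" * extra_bytes)   # the separately allocated payload grows
    size_grown = sys.getsizeof(buf)
    # The collector is driven by these three thresholds, which are counts;
    # nothing here reflects how many bytes the payload occupies.
    return size_empty, size_grown, gc.get_threshold()
```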

IMHO, the only reliable way to use memory footprint to drive the GC
heuristic would be to force all allocations into our own allocator, and
reconcile the GC with that allocator (instead of having the GC be its
own separate thing as is the case nowadays).

(*) And let's not talk about hairier cases, such as having multiple
memoryviews over the same very large object...

PS: every heuristic has its flaws.  As I noted on python-(dev|ideas),
full GC runtimes such as most Java implementations are well-known for
requiring careful tuning of GC parameters for "non-usual" workloads.  At
least reference counting makes CPython more robust in many cases.
msg303285 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-09-28 22:05
(My previous msg303283 was meant for bpo-11063; I removed it.  Sorry for the spam.)
msg303836 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-10-06 16:47
Alright Python people, I don't see anybody being against the idea on the thread.  Can we get a review of the linked PR?  I don't think it would be good form for me to accept it.
msg303841 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-06 18:03
What about msg302790?
msg304176 - (view) Author: Zekun Li (brainfvck) * Date: 2017-10-11 19:40
> This is only useful if the parent process has a lot of memory that's never used by the child processes right? Otherwise, you would lose via refcounting COWs.

What we saw in prod is that memory fragmentation caused by the GC is the main reason the shared memory shrinks.

We identified the fragmentation by doing a full collection before fork and then keeping the GC disabled: that alone causes a lot of copy-on-write in the child process.

This can't solve the copy-on-write caused by ref count, but we're thinking about freezing the ref count on those permanent objects too.

So this is useful if you did some warm-up work in parent process.

It could also speed up the GC if you have a large number of permanent objects.
msg304191 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-12 00:46
>> This is only useful if the parent process has a lot of memory that's never used by the child processes right? Otherwise, you would lose via refcounting COWs.
>
> What we saw in prod is that memory fragmentation caused by the GC is the main reason the shared memory shrinks.
>
> We identified the fragmentation by doing a full collection before fork and then keeping the GC disabled: that alone causes a lot of copy-on-write in the child process.

GC doesn't cause "memory fragmentation".
GC touches (writes to) the GC header and the refcount.  That causes the shared memory to shrink.
Maybe you misunderstand "memory fragmentation".

> This can't solve the copy-on-write caused by ref count, but we're thinking about freezing the ref count on those permanent objects too.

It may increase the cost of refcount operations, because it makes every INCREF and DECREF bigger.
Note that this only helps applications using gc.freeze().  It shouldn't slow down all other applications.

> So this is useful if you did some warm-up work in parent process.

I don't understand this statement.

> It could also speed up the GC if you have a large number of permanent objects.

Yes, this helps not only "prefork" applications, but also all long-running applications
that have large baseline data.
msg304192 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-12 00:51
As Instagram's report shows, disabling the cyclic GC really helps even though refcounting remains.
All applications have some cold data: modules and functions that are imported but never used.
msg304194 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-12 00:56
Should gc.freeze() do gc.collect() right before freezing?
Or should we document the `gc.collect(); gc.freeze();` idiom?

I don't like `gc.freeze(collect=False)`.
So if there is at least one use case for `gc.freeze()` without `gc.collect()`, I'm +1 on the current pull request's design (no implicit collection).


Other nitpicking: get_freeze_count() or get_frozen_count()?
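
Both idioms under discussion fit in a few lines.  The helper below is a hypothetical wrapper (`freeze_baseline` is not part of the `gc` module); it uses `gc.get_freeze_count()`, the spelling that ended up in Python 3.7, and leaves the question of whether collecting first is a good idea to the caller.

```python
import gc


def freeze_baseline(collect_first=False):
    """Freeze everything tracked so far and return the number of frozen objects.

    collect_first=True is the `gc.collect(); gc.freeze()` idiom;
    the default freezes without collecting.
    """
    if collect_first:
        gc.collect()  # drop existing garbage first (may leave free slots in pages)
    gc.freeze()
    return gc.get_freeze_count()
```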
msg304196 - (view) Author: Zekun Li (brainfvck) * Date: 2017-10-12 01:41
So what we did is:

We keep the GC **disabled** in the parent process and freeze after warm-up, then enable the GC in the child process.

The reason not to do a full collection is mentioned in the previous comments and the original ticket: it is what I called memory fragmentation.

Our observations:

* When we kept the GC disabled in both the parent and the child process and did a full collection before fork, the shared memory shrank a lot compared to doing no collection.
* A disabled GC has no way to touch the GC head and trigger copy-on-write.

Of course, enabling the GC makes the shared memory shrink even more.  But the former effect accounts for more of the loss than the latter.

So my understanding is that the GC frees some objects, which makes space in some memory pages available for allocation in the child process.  Allocating in those shared pages causes copy-on-write even without the GC.

Maybe this behavior has a better name, though?
msg304203 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-10-12 04:27
> So my understanding is that the GC frees some objects, which makes space in some memory pages available for allocation in the child process.  Allocating in those shared pages causes copy-on-write even without the GC.
>
> Maybe this behavior has a better name, though?

OK, now I get what you're talking about.
I don't know a proper name for it.  I'll call it a "memory hole" for now.

But I don't think the "memory hole" is a big problem, because we already have refcounts.
Say there are 100 function objects in one page, and 99 of them are never used.  When just one of them is called, the page is unshared.
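
The refcount effect is visible from Python: simply taking a new reference to an object changes its reference count, which lives in the object's own header and therefore in its memory page.  In the sketch below, `rarely_called` and `refcount_delta_from_new_reference` are hypothetical names.

```python
import sys


def rarely_called():
    return 42


def refcount_delta_from_new_reference(obj):
    """Return how much obj's refcount changes when one more reference is held."""
    before = sys.getrefcount(obj)
    holder = [obj]                  # one additional reference to obj
    after = sys.getrefcount(obj)
    del holder
    return after - before
```

For an ordinary function object such as `rarely_called` the delta is 1, so merely referencing it writes to the page it lives in, and in a forked child that write unshares the page.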

Solving the memory hole issue is easy: just stop allocating new objects from existing pages.
But I don't think it's worth it because of the refcount issue.

Instead of trying to "share most data", I recommend the "use a small number of processes" approach.

In my company, we don't use "prefork" but the "--lazy-app" option of uWSGI for graceful reloading (e.g. "afterfork").
But since we use nginx in front of uWSGI, the number of uWSGI workers is just 2 * CPU cores.  We can serve a massive number of clients from only 16~32 processes.

So I prefer optimizing normal memory usage.  It is good for all applications, not only "prefork" applications.

In this case, I'm +1 on the gc.freeze() proposal because it can be used for single-process applications.
msg304481 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-10-16 19:48
Based on Inada-san's, Antoine's, Neil's, and Barry's reviews, I'm merging the change to 3.7.
msg304482 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-10-16 19:49
New changeset c75edabbb65ca2bb29e51f8d1eb2c780e5890982 by Łukasz Langa (brainfvck) in branch 'master':
bpo-31558: Add gc.freeze() (#3705)
https://github.com/python/cpython/commit/c75edabbb65ca2bb29e51f8d1eb2c780e5890982
msg304483 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2017-10-16 21:39
New changeset c30b55b96c0967e3a8b3b86f25eb012a97f360a5 by Łukasz Langa in branch 'master':
bpo-31558: Update NEWS and ACKS (#4013)
https://github.com/python/cpython/commit/c30b55b96c0967e3a8b3b86f25eb012a97f360a5
History
Date  User  Action  Args
2017-10-16 21:39:10  lukasz.langa  set  messages: + msg304483
2017-10-16 21:28:47  lukasz.langa  set  pull_requests: + pull_request3987
2017-10-16 19:50:28  lukasz.langa  set  status: open -> closed
    resolution: fixed
    stage: patch review -> resolved
2017-10-16 19:49:47  lukasz.langa  set  messages: + msg304482
2017-10-16 19:48:12  lukasz.langa  set  messages: + msg304481
2017-10-12 04:27:13  inada.naoki  set  messages: + msg304203
2017-10-12 01:41:33  brainfvck  set  messages: + msg304196
2017-10-12 00:56:15  inada.naoki  set  messages: + msg304194
2017-10-12 00:51:06  inada.naoki  set  messages: + msg304192
2017-10-12 00:46:15  inada.naoki  set  messages: + msg304191
2017-10-11 19:40:28  brainfvck  set  nosy: + brainfvck
    messages: + msg304176
2017-10-06 18:03:24  serhiy.storchaka  set  nosy: + serhiy.storchaka
    messages: + msg303841
2017-10-06 16:47:19  lukasz.langa  set  messages: + msg303836
2017-09-28 22:05:40  haypo  set  messages: + msg303285
2017-09-28 22:05:02  haypo  set  messages: - msg303283
2017-09-28 22:03:49  haypo  set  messages: + msg303283
2017-09-25 19:07:42  pitrou  set  messages: + msg302972
2017-09-25 19:02:35  barry  set  messages: + msg302969
2017-09-25 18:55:51  nascheme  set  messages: + msg302967
2017-09-25 13:31:36  pitrou  set  nosy: + pitrou
2017-09-24 07:51:31  inada.naoki  set  messages: + msg302832
2017-09-24 07:49:39  rhettinger  set  nosy: + rhettinger
2017-09-24 07:40:47  inada.naoki  set  nosy: + inada.naoki
    messages: + msg302831
2017-09-23 22:22:23  rhettinger  set  nosy: + tim.peters, davin
2017-09-23 17:05:06  benjamin.peterson  set  nosy: + benjamin.peterson
    messages: + msg302790
2017-09-23 11:46:51  barry  set  nosy: + barry
2017-09-23 01:58:42  lukasz.langa  set  components: + Interpreter Core
2017-09-23 01:43:12  python-dev  set  keywords: + patch
    pull_requests: + pull_request3690
2017-09-23 00:21:29  lukasz.langa  create