
gc.freeze() - an API to mark objects as uncollectable #75739

Closed
ambv opened this issue Sep 23, 2017 · 19 comments
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@ambv
Contributor

ambv commented Sep 23, 2017

BPO 31558
Nosy @tim-one, @warsaw, @nascheme, @rhettinger, @jcea, @pitrou, @vstinner, @benjaminp, @methane, @ambv, @ericsnowcurrently, @serhiy-storchaka, @1st1, @applio, @brainfvck
PRs
  • bpo-31558: Add gc.freeze() #3705
  • bpo-31558: Update NEWS and ACKS #4013
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-10-16.19:50:28.576>
    created_at = <Date 2017-09-23.00:21:29.490>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'gc.freeze() - an API to mark objects as uncollectable'
    updated_at = <Date 2018-06-22.23:46:56.456>
    user = 'https://github.com/ambv'

    bugs.python.org fields:

    activity = <Date 2018-06-22.23:46:56.456>
    actor = 'eric.snow'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-10-16.19:50:28.576>
    closer = 'lukasz.langa'
    components = ['Interpreter Core']
    creation = <Date 2017-09-23.00:21:29.490>
    creator = 'lukasz.langa'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 31558
    keywords = ['patch', 'needs review']
    message_count = 19.0
    messages = ['302780', '302790', '302831', '302832', '302967', '302969', '302972', '303285', '303836', '303841', '304176', '304191', '304192', '304194', '304196', '304203', '304481', '304482', '304483']
    nosy_count = 15.0
    nosy_names = ['tim.peters', 'barry', 'nascheme', 'rhettinger', 'jcea', 'pitrou', 'vstinner', 'benjamin.peterson', 'methane', 'lukasz.langa', 'eric.snow', 'serhiy.storchaka', 'yselivanov', 'davin', 'brainfvck']
    pr_nums = ['3705', '4013']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue31558'
    versions = ['Python 3.7']

    @ambv
    Contributor Author

    ambv commented Sep 23, 2017

    When you're forking many worker processes off of a parent process, the resulting children are initially very cheap in memory. They share memory pages with the base process until a write happens [1]_.

    Sadly, the garbage collector in Python touches every object's PyGC_Head during a collection, even if that object stays alive, undoing all the copy-on-write wins. Instagram disabled the GC completely for this reason [2]_. This fixed the COW issue but made the processes more vulnerable to memory growth due to new cycles being silently introduced when the application code is changed by developers. While we could fix the most glaring cases, it was hard to keep the memory usage at bay. We came up with a different solution that fixes both issues. It requires a new API to be added to CPython's garbage collector.

    gc.freeze()

    As soon as possible in the lifecycle of the parent process we disable the garbage collector. Then we call a new API called gc.freeze() to move all currently tracked objects to a permanent generation. They won't be considered in further collections. This is okay since we are assuming that (almost?) all of the objects created until that point are module-level and thus useful for the entire lifecycle of the child process.

    After calling gc.freeze() we call fork. Then, the child process is free to re-enable the garbage collector.
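A minimal sketch of this workflow, assuming a Unix fork-based worker model (load_application, serve_requests, and the worker count are placeholders, not part of the patch):

```python
import gc
import os

def load_application():
    # placeholder: imports, config, cache warm-up, etc.
    pass

def serve_requests():
    # placeholder: the worker's request loop
    pass

gc.disable()        # parent: stop the collector before any cleanup punches
                    # holes into pages that will later be shared
load_application()
gc.freeze()         # everything tracked so far becomes permanent

for _ in range(8):  # placeholder worker count
    if os.fork() == 0:
        gc.enable()       # child: collect only post-fork garbage;
                          # the frozen parent objects stay untouched and shared
        serve_requests()
        os._exit(0)
```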

Why do we need to disable the collector in the parent process as soon as possible? When the GC cleans up memory in the meantime, it leaves space in pages for new objects. Those pages become shared after fork, and as soon as the child process starts creating its own objects, they will likely be written to the shared pages, triggering a lot of copy-on-write activity.

    In other words, we're wasting a bit of memory in the shared pages to save a lot of memory later (that would otherwise be wasted on copying entire pages after forking).

    Other attempts
    --------------

We also tried moving the GC head to another place in memory. This creates some indirection, but cache locality on that segment is great, so performance isn't really hurt. However, this change introduces two new pointers (16 bytes) per object. That doesn't sound like a lot, but given millions of objects and tens of processes per box, this alone can cost hundreds of megabytes per host, which is the memory we wanted to save in the first place. So that idea was scrapped.

    Attribution
    -----------

    The original patch is by Zekun Li, with help from Jiahao Li, Matt Page, David Callahan, Carl S. Shapiro, and Chenyang Wu.

    .. [1] https://en.wikipedia.org/wiki/Copy-on-write
    .. [2] https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

    @ambv ambv added 3.7 (EOL) end of life performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Sep 23, 2017
    @benjaminp
    Contributor

This is only useful if the parent process has a lot of memory that's never used by the child processes, right? Otherwise, you would lose via refcounting COWs.

    @methane
    Member

    methane commented Sep 24, 2017

    Nice idea!

I think it helps not only with sharing more memory in forking applications,
but also with long-running large applications.
There are many static objects tracked by the GC.
They make full GC passes long, and the CPU cache gets filled with unused data.

For example, a web worker that loads the application after forking (uWSGI's --lazy-app option)
can call gc.freeze() after loading the full application, before it starts processing requests.
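A sketch of that pattern under those assumptions (load_app and the stand-in imports are illustrative, not uWSGI API):

```python
import gc

def load_app():
    # placeholder: what uWSGI's --lazy-app runs in each worker after fork
    import json, wsgiref.simple_server  # stand-ins for real application imports
    return object()

app = load_app()

gc.freeze()  # baseline objects become permanent: full collections stop
             # scanning them, and their GC headers stop polluting the cache

# ... start processing requests ...
```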

    @methane
    Member

    methane commented Sep 24, 2017

AFAIK, the Python shutdown process runs a full GC.

Not touching the permanent generation would make shutdown faster.
On the other hand, there are some downsides:

• Some objects may not be freed during shutdown. That looks like a "leak" to applications embedding the Python interpreter.
• Some __del__ methods may never be called.

Of course, collecting the permanent generation during shutdown doesn't make sense:
gc.freeze() is used to keep memory pages shared, and the shutdown process
shouldn't unshare them.

So I think these notable downsides should be documented.
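A hedged illustration of the __del__ caveat; whether the finalizer actually runs depends on exactly the shutdown behavior being debated here:

```python
import gc

gc.disable()  # keep the cycle alive until we freeze it

class Resource:
    def __del__(self):
        print("finalized")  # may never print if shutdown skips frozen objects

r = Resource()
r.cycle = r   # reference cycle: only the cyclic GC can reclaim it
del r

gc.freeze()   # the unreachable cycle now sits in the permanent generation
# If the interpreter's final collection skips the permanent generation,
# "finalized" is never printed and the memory looks leaked to an embedder.
```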

    @nascheme
    Member

I think the basic idea makes a lot of sense, i.e. have a generation that is never collected. An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2, also have a generation 3 that is by default never collected. That fourth generation would be equivalent to the frozen generation. You could still force collection by calling gc.collect(3). Whether that generation should be collected on shutdown would still be a question.
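A sketch of how that alternative might look to a caller; the gc.collect(3) call is hypothetical, since CPython only has generations 0-2:

```python
import gc

gc.collect(0)    # cheap, frequent collection of the youngest generation
gc.collect(2)    # full collection; under the proposal this would still skip
                 # the extra never-collected generation
# gc.collect(3)  # hypothetical opt-in sweep of the frozen generation
```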

    If this gets implemented, it will impact the memory bitmap based GC idea I have been prototyping. Currently I am thinking of using two bits for each small GC object. The bits would mean: 00 - untracked, 01 - gen 0, 10 - gen 1, 11 - gen 2. With the introduction of a frozen generation, I would have to use another bit I think.

Another thought: maybe we don't actually need 3 generations as they are currently used. We could have gen 0, which is collected frequently, and gen 1, which is collected rarely. The frozen objects could go into gen 2, which would not be automatically collected or would have a user-adjustable collection frequency. Collection of gen 1 would not automatically move objects into gen 2.

I think bpo-31105 (https://bugs.python.org/issue31105) is also related. The current GC thresholds are not very good. I've looked at what Go does: its GC collection is driven by a relative increase in memory usage. Python could perhaps do something similar. The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.

    @warsaw
    Member

    warsaw commented Sep 25, 2017

    I like the idea of a gen 4 that never gets collected. This would have been useful for the original problem that inspired me to add the generation argument to gc.collect(). The nice thing about this, is just as you suggest: you could force a collection of gen 4 by gc.collect(3).

    It's unfortunate that you'd have to add a bit to handle this, but maybe you're right that we only really need three generations.

    @pitrou
    Member

    pitrou commented Sep 25, 2017

On 25/09/2017 at 20:55, Neil Schemenauer wrote:

> I think the basic idea makes a lot of sense, i.e. have a generation that is never collected. An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3. The collection would by default never collect generation 3. That fourth generation would be equivalent to the frozen generation. You could still force collection by calling gc.collect(3).

    API-wise it would sound better to have a separate gc.collect_frozen()...

    Though I think a gc.unfreeze() that moves the frozen generation into the
    oldest non-frozen generation would be useful too, at least for testing
    and experimentation.
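For the record, gc.unfreeze() and gc.get_freeze_count() did end up shipping alongside gc.freeze() in 3.7; a minimal round trip looks like this:

```python
import gc

gc.freeze()                   # move all currently tracked objects to the permanent generation
print(gc.get_freeze_count())  # how many objects are now frozen

gc.unfreeze()                 # move them back into the oldest regular generation
assert gc.get_freeze_count() == 0
```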

> I think bpo-31105 (https://bugs.python.org/issue31105) is also related. The current GC thresholds are not very good. I've looked at what Go does and the GC collection is based on a relative increase in memory usage. Python could perhaps do something similar. The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.

    Yeah... It's worse than that. Take for example a bytearray object. The
    basic object (the PyByteArrayObject structure) is quite small. But it
    also has a separately-allocated payload that is deleted whenever
    tp_dealloc is called. The GC isn't aware of that payload. Worse, the
    payload can (and will) change size during the object's lifetime, without
    the GC's knowledge about it ever being updated. (*)

    IMHO, the only reliable way to use memory footprint to drive the GC
    heuristic would be to force all allocations into our own allocator, and
    reconcile the GC with that allocator (instead of having the GC be its
    own separate thing as is the case nowadays).

    (*) And let's not talk about hairier cases, such as having multiple
    memoryviews over the same very large object...
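A small demonstration of the bytearray case (exact sizes vary by platform and version):

```python
import sys

buf = bytearray()
print(sys.getsizeof(buf))      # small: essentially the PyByteArrayObject itself

buf.extend(b"x" * 10_000_000)
print(sys.getsizeof(buf))      # the ~10 MB payload appeared, yet nothing the
                               # collector tracks about this object changed
```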

    PS: every heuristic has its flaws. As I noted on python-(dev|ideas),
    full GC runtimes such as most Java implementations are well-known for
    requiring careful tuning of GC parameters for "non-usual" workloads. At
    least reference counting makes CPython more robust in many cases.

    @vstinner
    Member

(My previous msg303283 was for bpo-11063; I removed it, sorry for the spam.)

    @ambv
    Contributor Author

    ambv commented Oct 6, 2017

    Alright Python people, I don't see anybody being against the idea on the thread. Can we get a review of the linked PR? I don't think it would be good form for me to accept it.

    @serhiy-storchaka
    Member

    What about msg302790?

    @brainfvck
    Mannequin

    brainfvck mannequin commented Oct 11, 2017

> This is only useful if the parent process has a lot of memory that's never used by the child processes, right? Otherwise, you would lose via refcounting COWs.

What we saw in prod is that memory fragmentation caused by the GC is the main reason shared memory shrinks.

We figured out the memory fragmentation by doing a full collection before forking while keeping the GC disabled afterwards: even that alone causes a bunch of copy-on-write in the child process.

This can't solve the copy-on-write caused by refcounting, but we're thinking about freezing the refcount on those permanent objects too.

So this is useful if you did some warm-up work in the parent process.

Also, it could speed up the GC if you have a large number of permanent objects.

    @methane
    Member

    methane commented Oct 12, 2017

> This is only useful if the parent process has a lot of memory that's never used by the child processes, right? Otherwise, you would lose via refcounting COWs.

> What we saw in prod is that memory fragmentation caused by the GC is the main reason shared memory shrinks.

> We figured out the memory fragmentation by doing a full collection before forking while keeping the GC disabled afterwards: even that alone causes a bunch of copy-on-write in the child process.

The GC doesn't cause "memory fragmentation".
The GC touches (writes) the GC header and the refcount, and that is what causes shared memory to shrink.
Maybe you're misunderstanding "memory fragmentation".

> This can't solve the copy-on-write caused by refcounting, but we're thinking about freezing the refcount on those permanent objects too.

It may increase the cost of refcount operations, because it makes every INCREF and DECREF bigger.
Note that this should only help applications using gc.freeze(); it shouldn't slow down all other applications.

> So this is useful if you did some warm-up work in the parent process.

I don't understand this statement.

> Also, it could speed up the GC if you have a large number of permanent objects.

Yes, this helps not only "prefork" applications, but also any long-running application
with large baseline data.

    @methane
    Member

    methane commented Oct 12, 2017

As Instagram's report shows, disabling the cyclic GC really helps even with refcounting in play.
All applications have some cold data: imported but never-used modules and functions.

    @methane
    Member

    methane commented Oct 12, 2017

Should gc.freeze() call gc.collect() right before freezing?
Or should we document the gc.collect(); gc.freeze() idiom?

I don't like gc.freeze(collect=False).
So if there is at least one use case for gc.freeze() without gc.collect(), I'm +1 on the former (the current pull request's) design.

Other nitpicking: get_freeze_count() or get_frozen_count()?
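For reference, the idiom written out; whether the collect step belongs inside freeze() is exactly the open question:

```python
import gc

gc.collect()                  # optionally reclaim existing cycles first...
gc.freeze()                   # ...then move the survivors to the permanent generation
print(gc.get_freeze_count())  # the accessor whose name is being bikeshedded above
```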

    @brainfvck
    Mannequin

    brainfvck mannequin commented Oct 12, 2017

So what we did is:

We keep the GC **disabled** in the parent process, freeze after warmup, and enable the GC in the child process.

The reason not to do a full collection is mentioned in the previous comments and the original ticket: (what I called) memory fragmentation.

The observation is: with the GC disabled in both the parent and the child process, doing a full collection before forking makes the shared memory shrink a lot compared to doing no collection, even though a disabled GC has no way to touch the GC heads and trigger copy-on-write.

Of course, enabling the GC makes the shared memory shrink even more, but the former accounts for more of the shrinkage than the latter.

So my understanding is that the GC frees some objects and makes some memory pages available for allocation in the child process. Allocating from those shared pages causes copy-on-write even without the GC.

Perhaps this behavior has a better name?

    @methane
    Member

    methane commented Oct 12, 2017

> So my understanding is that the GC frees some objects and makes some memory pages available for allocation in the child process. Allocating from those shared pages causes copy-on-write even without the GC.

> Perhaps this behavior has a better name?

OK, now I get what you're talking about.
I don't know a proper name for it; I'll call it a "memory hole" for now.

But I don't think the "memory hole" is a big problem, because we already have refcounts.
Say there are 100 function objects in one page and 99 of them are never used: as soon as the one remaining function is called, the page is unshared anyway.

Solving the memory hole issue is easy: just stop allocating new objects from existing pages.
But I don't think it's worth it, because of the refcount issue.

Instead of trying to "share most data", I recommend the "use a small number of processes" approach.

In my company, we don't use "prefork"; we use uWSGI's --lazy-app option for graceful reloading (i.e. "afterfork").
But since we run nginx in front of uWSGI, the number of uWSGI workers is just 2x the CPU cores. We can serve massive numbers of clients from only 16~32 processes.

So I prefer optimizing normal memory usage. That is good for all applications, not only "prefork" applications.

In that light, I'm +1 on the gc.freeze() proposal because it can also be used in single-process applications.

    @ambv
    Contributor Author

    ambv commented Oct 16, 2017

    Based on Inadasan's, Antoine's, Neil's, and Barry's review, I'm merging the change to 3.7.

    @ambv
    Contributor Author

    ambv commented Oct 16, 2017

    New changeset c75edab by Łukasz Langa (brainfvck) in branch 'master':
bpo-31558: Add gc.freeze() (#3705)
    c75edab

    @ambv ambv closed this as completed Oct 16, 2017
    @ambv
    Contributor Author

    ambv commented Oct 16, 2017

    New changeset c30b55b by Łukasz Langa in branch 'master':
bpo-31558: Update NEWS and ACKS (#4013)
    c30b55b

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022