Issue 31558
Created on 2017-09-23 00:21 by lukasz.langa, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Pull Requests

URL | Status | Linked
---|---|---
PR 3705 | merged | python-dev, 2017-09-23 01:43
PR 4013 | merged | lukasz.langa, 2017-10-16 21:28
Messages (19)
msg302780 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2017-09-23 00:21 | |
When you're forking many worker processes off of a parent process, the resulting children are initially very cheap in memory. They share memory pages with the base process until a write happens [1]_. Sadly, the garbage collector in Python touches every object's PyGC_Head during a collection, even if that object stays alive, undoing all the copy-on-write wins.

Instagram disabled the GC completely for this reason [2]_. This fixed the COW issue but made the processes more vulnerable to memory growth, due to new cycles being silently introduced when the application code is changed by developers. While we could fix the most glaring cases, it was hard to keep the memory usage at bay.

We came up with a different solution that fixes both issues. It requires a new API to be added to CPython's garbage collector.

gc.freeze()
-----------

As soon as possible in the lifecycle of the parent process we disable the garbage collector. Then we call a new API called `gc.freeze()` to move all currently tracked objects to a permanent generation. They won't be considered in further collections. This is okay since we are assuming that (almost?) all of the objects created until that point are module-level and thus useful for the entire lifecycle of the child process. After calling `gc.freeze()` we call fork. Then, the child process is free to re-enable the garbage collector.

Why do we need to disable the collector in the parent process as soon as possible? When the GC cleans up memory in the meantime, it leaves space in pages for new objects. Those pages become shared after fork, and as soon as the child process starts creating its own objects, they will likely be written to the shared pages, initiating a lot of copy-on-write activity. In other words, we're wasting a bit of memory in the shared pages to save a lot of memory later (that would otherwise be wasted on copying entire pages after forking).

Other attempts
--------------

We also tried moving the GC head to another place in memory. This creates some indirection, but cache locality on that segment is great, so performance isn't really hurt. However, this change introduces two new pointers (16 bytes) per object. This doesn't sound like a lot, but given millions of objects and tens of processes per box, this alone can cost hundreds of megabytes per host, memory that we wanted to save in the first place. So that idea was scrapped.

Attribution
-----------

The original patch is by Zekun Li, with help from Jiahao Li, Matt Page, David Callahan, Carl S. Shapiro, and Chenyang Wu.

.. [1] https://en.wikipedia.org/wiki/Copy-on-write
.. [2] https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172
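Below is a minimal sketch of the workflow described above, using the `gc.freeze()` API proposed here (merged in Python 3.7). `NUM_WORKERS` and `serve_forever()` are placeholders for the real worker count and request loop.

```python
import gc
import os

NUM_WORKERS = 4          # illustrative value

def serve_forever():     # stand-in for the real worker loop
    ...

# Parent: keep the collector disabled from the very start so that freed
# slots in shared pages are not reused before forking.
gc.disable()

# ... warm up: import modules, load configuration, build caches ...

gc.freeze()              # move everything tracked so far to the permanent generation

for _ in range(NUM_WORKERS):
    pid = os.fork()
    if pid == 0:
        # Child: re-enable collection; frozen objects are never scanned again,
        # so their PyGC_Head is not written and the pages stay shared.
        gc.enable()
        serve_forever()
        os._exit(0)
```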
|||
msg302790 - (view) | Author: Benjamin Peterson (benjamin.peterson) * | Date: 2017-09-23 17:05 | |
This is only useful if the parent process has a lot of memory that's never used by the child processes, right? Otherwise, you would lose via refcounting COWs.
|||
msg302831 - (view) | Author: Inada Naoki (methane) * | Date: 2017-09-24 07:40 | |
Nice idea! I think it helps not only with sharing more memory in forking applications, but also in long-running, large applications. There are many static objects that are tracked by the GC. They make full GC passes take a long time, and the CPU cache gets filled with unused data.

For example, a web worker that loads the application after fork (uWSGI's --lazy-app option) can call `gc.freeze()` after loading the full application, before it starts processing requests.
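A rough way to observe the effect described here, using the `gc.freeze()` API as it was eventually merged; the data set below is made up purely for illustration.

```python
import gc
import time

# Build a large "static" baseline, similar to an application that has
# finished loading all of its modules and data.
baseline = [{"id": i, "payload": [i] * 10} for i in range(200_000)]

def time_full_collection():
    start = time.perf_counter()
    gc.collect()
    return time.perf_counter() - start

before = time_full_collection()
gc.freeze()   # the baseline objects move to the permanent generation
after = time_full_collection()
print(f"full collection before freeze: {before:.4f}s, after freeze: {after:.4f}s")
```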
|||
msg302832 - (view) | Author: Inada Naoki (methane) * | Date: 2017-09-24 07:51 | |
AFAIK, the Python shutdown process runs a full GC. Not touching the permanent generation makes shutdown faster. On the other hand, there are some downsides:

* Some objects may not be freed during shutdown. That looks like a "leak" to applications embedding the Python interpreter.
* Some __del__ methods may not be called.

Of course, collecting the permanent generation during shutdown doesn't make sense: gc.freeze() is used for sharing more memory pages, and the shutdown process shouldn't unshare them. So I think these notable downsides should be documented.
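A small sketch of the semantics under discussion, using `gc.freeze()`/`gc.unfreeze()` as merged: a frozen cycle is never examined by the collector, so its `__del__` only runs once it is unfrozen (or, as noted above, possibly not at all at shutdown).

```python
import gc

class Leaky:
    def __del__(self):
        print("finalized")

obj = Leaky()
obj.self_ref = obj   # reference cycle keeps the object alive after del
del obj              # now only reachable through the cycle

gc.freeze()          # the unreachable cycle moves to the permanent generation
gc.collect()         # the frozen cycle is not examined; __del__ does not run

gc.unfreeze()        # move frozen objects back into the oldest generation
gc.collect()         # now the cycle is collected and "finalized" is printed
```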
|||
msg302967 - (view) | Author: Neil Schemenauer (nascheme) * | Date: 2017-09-25 18:55 | |
I think the basic idea makes a lot of sense, i.e. have a generation that is never collected. An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3. The collection would by default never collect generation 3. Generation 4 would be equivalent to the frozen generation. You could still force collection by calling gc.collect(3). Whether that generation should be collected on shutdown would still be a question.

If this gets implemented, it will impact the memory bitmap based GC idea I have been prototyping. Currently I am thinking of using two bits for each small GC object. The bits would mean: 00 - untracked, 01 - gen 0, 10 - gen 1, 11 - gen 2. With the introduction of a frozen generation, I would have to use another bit, I think.

Another thought is that maybe we don't actually need 3 generations as they are currently used. We could have gen 0 which is collected frequently and gen 1 that is collected rarely. The frozen objects could go into gen 2, which would not be automatically collected or would have a user-adjustable collection frequency. Collection of gen 1 would not automatically move objects into gen 2.

I think issue 31105 (https://bugs.python.org/issue31105) is also related. The current GC thresholds are not very good. I've looked at what Go does, and its GC collection is based on a relative increase in memory usage. Python could perhaps do something similar. The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.
|||
msg302969 - (view) | Author: Barry A. Warsaw (barry) * | Date: 2017-09-25 19:02 | |
I like the idea of a gen 4 that never gets collected. This would have been useful for the original problem that inspired me to add the `generation` argument to `gc.collect()`. The nice thing about this is, just as you suggest, that you could force a collection of gen 4 with `gc.collect(3)`. It's unfortunate that you'd have to add a bit to handle this, but maybe you're right that we only really need three generations.
|||
msg302972 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2017-09-25 19:07 | |
On 25/09/2017 at 20:55, Neil Schemenauer wrote:

> I think the basic idea makes a lot of sense, i.e. have a generation that is never collected. An alternative way to implement it would be to have an extra generation, e.g. rather than just 0, 1, 2 also have generation 3. The collection would by default never collect generation 3. Generation 4 would be equivalent to the frozen generation. You could still force collection by calling gc.collect(3).

API-wise it would sound better to have a separate gc.collect_frozen()... Though I think a gc.unfreeze() that moves the frozen generation into the oldest non-frozen generation would be useful too, at least for testing and experimentation.

> I think issue 31105 (https://bugs.python.org/issue31105) is also related. The current GC thresholds are not very good. I've looked at what Go does, and its GC collection is based on a relative increase in memory usage. Python could perhaps do something similar. The accounting of actual bytes allocated and deallocated is tricky because the *_Del/Free functions don't actually know how much memory is being freed, at least not in a simple way.

Yeah... It's worse than that. Take for example a bytearray object. The basic object (the PyByteArrayObject structure) is quite small. But it also has a separately-allocated payload that is deleted whenever tp_dealloc is called. The GC isn't aware of that payload. Worse, the payload can (and will) change size during the object's lifetime, without the GC's knowledge about it ever being updated. (*)

IMHO, the only reliable way to use memory footprint to drive the GC heuristic would be to force all allocations into our own allocator, and reconcile the GC with that allocator (instead of having the GC be its own separate thing, as is the case nowadays).

(*) And let's not talk about hairier cases, such as having multiple memoryviews over the same very large object...

PS: every heuristic has its flaws. As I noted on python-(dev|ideas), full GC runtimes such as most Java implementations are well-known for requiring careful tuning of GC parameters for "non-usual" workloads. At least reference counting makes CPython more robust in many cases.
|||
msg303285 - (view) | Author: STINNER Victor (vstinner) * | Date: 2017-09-28 22:05 | |
(My previous msg303283 was meant for bpo-11063; I removed it, sorry for the spam.)
|||
msg303836 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2017-10-06 16:47 | |
Alright Python people, I don't see anybody being against the idea on the thread. Can we get a review of the linked PR? I don't think it would be good form for me to accept it. |
|||
msg303841 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2017-10-06 18:03 | |
What about msg302790? |
|||
msg304176 - (view) | Author: Zekun Li (brainfvck) * | Date: 2017-10-11 19:40 | |
> This is only useful if the parent process has a lot of memory that's never used by the child processes, right? Otherwise, you would lose via refcounting COWs.

What we saw in prod is that memory fragmentation caused by the GC is the main reason the shared memory shrinks. We figured out the memory fragmentation by doing a full collection before fork and then keeping the GC disabled; that still causes a bunch of copy-on-writes in the child process.

This can't solve the copy-on-write caused by refcounts, but we're thinking about freezing the refcount on those permanent objects too.

So this is useful if you did some warm-up work in the parent process. Also, it could speed up the GC if you have a large number of permanent objects.
|||
msg304191 - (view) | Author: Inada Naoki (methane) * | Date: 2017-10-12 00:46 | |
>> This is only useful if the parent process has a lot of memory that's never used by the child processes, right? Otherwise, you would lose via refcounting COWs.
>
> What we saw in prod is that memory fragmentation caused by the GC is the main reason the shared memory shrinks.
>
> We figured out the memory fragmentation by doing a full collection before fork and then keeping the GC disabled; that still causes a bunch of copy-on-writes in the child process.

The GC doesn't cause "memory fragmentation". The GC touches (writes) the GC header and the refcount, and that causes the shared memory to shrink. Maybe you're misunderstanding "memory fragmentation".

> This can't solve the copy-on-write caused by refcounts, but we're thinking about freezing the refcount on those permanent objects too.

It may increase the cost of refcount operations, because it makes every INCREF and DECREF bigger. Note that this only helps applications using gc.freeze(); it shouldn't slow down all other applications.

> So this is useful if you did some warm-up work in the parent process.

I don't understand this statement.

> Also, it could speed up the GC if you have a large number of permanent objects.

Yes, this helps not only "prefork" applications, but also all long-running applications that have a large amount of baseline data.
|||
msg304192 - (view) | Author: Inada Naoki (methane) * | Date: 2017-10-12 00:51 | |
As in Instagram's report, disabling the cyclic GC really helps even though refcounting is still there. All applications have some cold data: modules and functions that are imported but never used.
|||
msg304194 - (view) | Author: Inada Naoki (methane) * | Date: 2017-10-12 00:56 | |
Should gc.freeze() do gc.collect() right before freezing? Or should we document the `gc.collect(); gc.freeze()` idiom?

I don't like `gc.freeze(collect=False)`. So if there is at least one use case for `gc.freeze()` without `gc.collect()`, I'm +1 on the former (the current pull request's) design.

Other nitpicking: get_freeze_count() or get_frozen_count()?
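For reference, the two-call idiom being discussed would look like this; `gc.get_freeze_count()` is the name that ended up in the merged API.

```python
import gc

# ... application warm-up: imports, configuration, caches ...

gc.collect()                   # reclaim cycles created during startup first
gc.freeze()                    # then move the survivors to the permanent generation
print(gc.get_freeze_count())   # number of objects now in the permanent generation
```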
|||
msg304196 - (view) | Author: Zekun Li (brainfvck) * | Date: 2017-10-12 01:41 | |
So what we did is: we keep the gc **disabled** in the parent process and freeze after warmup, then enable the gc in the child process. The reason not to do a full collection is mentioned in the previous comments/original ticket; I called it memory fragmentation.

The observation is:

- When we keep the gc disabled in both the parent and child process but do a full collection before fork, the shared memory shrinks a lot compared to doing no collection.
- There's no way for a disabled gc to touch the GC head and cause copy-on-write.

Of course, enabling the gc will make the shared memory shrink even more, but the former case accounts for more than the latter.

So my understanding is that the gc frees some objects and makes some memory pages become available for allocation in the child process. Allocation in those shared memory pages will cause copy-on-write even without the gc running.

Though maybe this behavior has a better name?
|||
msg304203 - (view) | Author: Inada Naoki (methane) * | Date: 2017-10-12 04:27 | |
> So my understanding is that the gc frees some objects and makes some memory pages become available for allocation in the child process. Allocation in those shared memory pages will cause copy-on-write even without the gc running.
>
> Though maybe this behavior has a better name?

OK, now I get what you're talking about. I don't know a proper name for it; I'll call it a "memory hole" for now.

But I don't think the "memory hole" is a big problem, because we already have refcounts. Say there are 100 function objects in one page, and 99 of them are never used. When just 1 of them is called, the page is unshared anyway.

Solving the memory hole issue is easy: just stop allocating new objects from existing pages. But I don't think it's worth it, because of the refcount issue.

Instead of trying to "share most data", I recommend the "use a small number of processes" approach. In my company, we don't use "prefork", but uWSGI's "--lazy-app" option for graceful reloading (e.g. "afterfork"). And since we use nginx in front of uWSGI, the number of uWSGI workers is just 2x the number of CPU cores. We can serve massive numbers of clients from only 16-32 processes.

So I prefer optimizing normal memory usage. That is good for all applications, not only "prefork" applications. In that light, I'm +1 on the gc.freeze() proposal because it can be used by single-process applications.
|||
msg304481 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2017-10-16 19:48 | |
Based on Inadasan's, Antoine's, Neil's, and Barry's review, I'm merging the change to 3.7. |
|||
msg304482 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2017-10-16 19:49 | |
New changeset c75edabbb65ca2bb29e51f8d1eb2c780e5890982 by Łukasz Langa (brainfvck) in branch 'master':
bpo-31558: Add gc.freeze() (#3705)
https://github.com/python/cpython/commit/c75edabbb65ca2bb29e51f8d1eb2c780e5890982
|||
msg304483 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2017-10-16 21:39 | |
New changeset c30b55b96c0967e3a8b3b86f25eb012a97f360a5 by Łukasz Langa in branch 'master':
bpo-31558: Update NEWS and ACKS (#4013)
https://github.com/python/cpython/commit/c30b55b96c0967e3a8b3b86f25eb012a97f360a5
History

Date | User | Action | Args
---|---|---|---
2022-04-11 14:58:52 | admin | set | github: 75739
2018-06-22 23:46:56 | eric.snow | set | nosy: + eric.snow
2018-06-22 14:27:30 | jcea | set | nosy: + jcea
2017-10-16 21:39:10 | lukasz.langa | set | messages: + msg304483
2017-10-16 21:28:47 | lukasz.langa | set | pull_requests: + pull_request3987
2017-10-16 19:50:28 | lukasz.langa | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-10-16 19:49:47 | lukasz.langa | set | messages: + msg304482
2017-10-16 19:48:12 | lukasz.langa | set | messages: + msg304481
2017-10-12 04:27:13 | methane | set | messages: + msg304203
2017-10-12 01:41:33 | brainfvck | set | messages: + msg304196
2017-10-12 00:56:15 | methane | set | messages: + msg304194
2017-10-12 00:51:06 | methane | set | messages: + msg304192
2017-10-12 00:46:15 | methane | set | messages: + msg304191
2017-10-11 19:40:28 | brainfvck | set | nosy: + brainfvck; messages: + msg304176
2017-10-06 18:03:24 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg303841
2017-10-06 16:47:19 | lukasz.langa | set | messages: + msg303836
2017-09-28 22:05:40 | vstinner | set | messages: + msg303285
2017-09-28 22:05:02 | vstinner | set | messages: - msg303283
2017-09-28 22:03:49 | vstinner | set | messages: + msg303283
2017-09-25 19:07:42 | pitrou | set | messages: + msg302972
2017-09-25 19:02:35 | barry | set | messages: + msg302969
2017-09-25 18:55:51 | nascheme | set | messages: + msg302967
2017-09-25 13:31:36 | pitrou | set | nosy: + pitrou
2017-09-24 07:51:31 | methane | set | messages: + msg302832
2017-09-24 07:49:39 | rhettinger | set | nosy: + rhettinger
2017-09-24 07:40:47 | methane | set | nosy: + methane; messages: + msg302831
2017-09-23 22:22:23 | rhettinger | set | nosy: + tim.peters, davin
2017-09-23 17:05:06 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg302790
2017-09-23 11:46:51 | barry | set | nosy: + barry
2017-09-23 01:58:42 | lukasz.langa | set | components: + Interpreter Core
2017-09-23 01:43:12 | python-dev | set | keywords: + patch; pull_requests: + pull_request3690
2017-09-23 00:21:29 | lukasz.langa | create |