This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: Implementing sub-generation steps in the gc
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: nanjekyejoannah, nascheme, pablogsal, tim.peters
Priority: normal Keywords:

Created on 2019-12-27 21:35 by pablogsal, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description
vanilla-vs-buffered.png pablogsal, 2019-12-28 23:33
diff-vanilla-buffered.png pablogsal, 2019-12-28 23:33
Messages (16)
msg358916 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-27 21:35
While I was re-reading The Garbage Collection Handbook [Jones, Hosking, Moss], I have been doing some benchmarks of different aspects of our garbage collector, taking statistics using the pyperformance suite as well as several "production" applications that heavily create objects. I have observed that our
collector presents a form of "generational nepotism" in some typical scenarios. The most common reason is
that when promotions from the youngest generations happen, some very young objects that just arrived in the generation are promoted because we have reached a threshold, and their deaths are delayed. These objects
would likely die immediately if the promotion were delayed a bit (think of a burst of temporary objects being created at once) or if the promotion could distinguish "age" with more granularity.
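A minimal way to watch this pattern from Python (assuming CPython's default settings; the numbers and the workload are purely illustrative):

    import gc

    gc.collect()                      # start with empty young generations
    print(gc.get_threshold())         # (700, 10, 10) on default settings

    burst = []
    for i in range(2000):             # burst of cyclic containers: generation 0
        node = []                     # overflows its threshold a few times and
        node.append(node)             # the survivors get promoted to generation 1
        burst.append(node)

    del burst                         # every cycle is now unreachable, but the
                                      # promoted ones must wait for a much rarer
                                      # older-generation collection
    print(gc.collect())               # a full collection finally reclaims them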

The book proposes several solutions to this problem, and one of the simpler ones, given the
architecture of our collector, is to use sub-generation steps:

> Promotion can be delayed by structuring a generation into two or more aging spaces. This
allows objects to be copied between the fromspace and tospace an arbitrary number of
times within the generation before they are promoted. In Lieberman and Hewitt's original
generational collector [1983], a generation is collected several times before all survivors
are eventually promoted en masse. In terms of the aging semispaces of Figure 9.2b, either
all live objects in fromspace are evacuated to tospace within this generation or all
are promoted to the next generation, depending on the age of the generation as a whole.

Basically, the difference between steps and generations is that while both segregate objects by age,
different generations are collected at different frequencies, whereas all the steps of a
generation are collected at the same time. By using steps in the youngest generation (where
most mutation occurs), and by reducing premature collection, the load on the write barrier
can be reduced while also controlling promotion, without the need for per-object age records.
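To make the "steps" idea concrete, here is a toy sketch (plain Python with made-up names, not a patch to gcmodule.c) of a generation structured into aging steps that are all collected together, with only the oldest step promoting its survivors:

    class SteppedGeneration:
        def __init__(self, n_steps=2):
            self.steps = [[] for _ in range(n_steps)]   # steps[0] is the youngest

        def allocate(self, obj):
            self.steps[0].append(obj)                   # new objects always enter step 0

        def collect(self, is_live, promote):
            # The whole generation is collected at once: every step is scanned.
            survivors = [[o for o in step if is_live(o)] for step in self.steps]
            for o in survivors[-1]:
                promote(o)                              # only the oldest step's survivors
                                                        # leave the generation
            # everything else ages by exactly one step; step 0 starts out empty again
            self.steps = [[]] + survivors[:-1]

With two steps, an object has to survive two collections of this generation before it is promoted, so a temporary that merely had bad timing still gets another chance to die young.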

--

What do you think about implementing sub-generation steps? Maybe only on the youngest generation?

A "lazy" way of implementing this (although without the semantic of "steps") is adding more intermediate generations with the same threshold (but this likely won't yield the same benefits).
msg358947 - (view) Author: Joannah Nanjekye (nanjekyejoannah) * (Python committer) Date: 2019-12-28 15:40
> The most common reason is that when promotions from the youngest generations happen,
> some very young objects that just arrived in the generation are promoted because we
> have reached a threshold, and their deaths are delayed.

What threshold is this? Is it based on the number of objects in the young generation, i.e. if the number of objects in the young generation reaches some value, then promotion occurs? If this is the case, we can use another scheme to start with before exploring steps IMHO.

Using Shaw’s bucket brigade scheme of semi-spaces or generational steps is good, but this means a generation with n steps guarantees n scavenges before an object is promoted to the next generation. The surviving objects are copied between these pairs of semispaces b times before they are promoted to the next step. An n-bucket scheme guarantees that objects will be promoted to the old generation if they have survived between nb and nb-1 scavenges. For example, in a 2-bucket scheme, objects are copied up to three times before being promoted to the next generation. This method is good but we will still incur the overhead of copying objects between the semi-spaces.

I have looked at a similar problem before for a different runtime and would instead suggest considering basing promotion on the number of collections survived by objects in the young generation. This requires using five bits from one of the two header words of each object to record its age. The objects that reach the collection threshold are promoted to the old generation. This gives young objects enough time to die, thereby reducing how often objects are promoted to the old generation, which in turn reduces the frequency of major collections and hence pause times.

I would only consider the generational semi-spaces after we have explored basing promotion on the age of an object in a generation. If this doesn't work, then we can use the bucket brigade scheme.
msg358949 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-28 16:24
> What threshold is this?

These are the different thresholds for the generations, which you can get using gc.get_threshold(). They relate to the number of objects in each generation (there are slightly different rules for the last generation).
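For reference, this is what the thresholds look like from Python (the numbers below are simply CPython's defaults):

    import gc

    print(gc.get_threshold())   # (700, 10, 10) by default: generation 0 is
                                # collected once the number of tracked allocations
                                # minus deallocations exceeds 700, generation 1
                                # after 10 generation-0 collections, generation 2
                                # after 10 generation-1 collections (plus the
                                # extra 25% long-lived-objects rule for the last
                                # generation)
    gc.set_threshold(700, 10, 10)   # the thresholds are tunable; the number of
                                    # generations is not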

> This method is good but we will still incur the overhead of copying objects between the semi-spaces.

I am not sure what you mean by "copy" here. Moving objects between generations or spaces is just a matter of updating the pointers of the linked lists. Objects are not "copied" but "moved".
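For illustration, here is a toy sketch (plain Python with made-up names; CPython does the equivalent in C through the objects' GC headers) of why moving every object of one generation into another is constant-time pointer surgery rather than a copy:

    class Node:
        def __init__(self):
            self.prev = self.next = self

    class GenList:                       # circular doubly linked list with a sentinel
        def __init__(self):
            self.head = Node()

        def append(self, node):
            tail = self.head.prev
            tail.next = node; node.prev = tail
            node.next = self.head; self.head.prev = node

        def splice_into(self, other):
            # Move every node of `self` onto the end of `other` in O(1):
            # no object is touched, only a handful of pointers are rewritten.
            if self.head.next is self.head:          # nothing to move
                return
            first, last = self.head.next, self.head.prev
            tail = other.head.prev
            tail.next = first; first.prev = tail
            last.next = other.head; other.head.prev = last
            self.head.prev = self.head.next = self.head   # now empty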

> I have looked at a similar problem before for a different runtime and would instead suggest considering basing promotion on the number of collections survived by objects in the young generation. This requires using five bits from one of the two header words of each object to record its age. The objects that reach the collection threshold are promoted to the old generation. This gives young objects enough time to die, thereby reducing how often objects are promoted to the old generation, which in turn reduces the frequency of major collections and hence pause times.

The reason I am proposing sub-generation steps is that making the object header bigger is too high a price to pay for this problem. We have recently gone to notable efforts to shave one word per object by complicating the code substantially using tagged pointers, so recording the age in the object seems like a step in the opposite direction.

> I would only consider the generational semi-spaces after we have explored basing promotion on the age of an object in a generation. If this doesn't work, then we can use the bucket brigade scheme.

As I mentioned before, recording per-object age will probably be a no-go (the solution to this problem will yield memory gains because objects won't suffer "generational nepotism", but making the object header bigger will annihilate those gains), so that is the reason I proposed generational sub-steps.
msg358950 - (view) Author: Joannah Nanjekye (nanjekyejoannah) * (Python committer) Date: 2019-12-28 17:14
> What threshold is this?

> These are the different thresholds for the generations, which you can get using gc.get_threshold(). They relate to the number of objects in each generation (there are slightly different rules for the last generation).

I think we are on the same wavelength in terms of the threshold. Promotion is currently based on the number of objects accumulated in the generation, if I interpreted your explanation right.


> I am not sure what you mean by "copy" here. Moving objects between generations or spaces is just a matter of updating the pointers of the linked lists. Objects are not "copied" but "moved".

Yes, in terms of implementation; I am thinking there must still be a cost in moving these objects. IIRC the GC handbook you referenced may have noted this too. I stand to be corrected on this, though.

> The reason I am proposing sub-generation steps is that making the object header bigger is too high a price to pay for this problem. We have recently gone to notable efforts to shave one word per object by complicating the code substantially using tagged pointers, so recording the age in the object seems like a step in the opposite direction.

I think we only need about 5 bits to store the age of each object; do we really need to increase the header back to two words per object?

> As I mentioned before, recording per-object age will probably be a no-go (the solution to this problem will yield memory gains because objects won't suffer "generational nepotism", but making the object header bigger will annihilate those gains), so that is the reason I proposed generational sub-steps.

I am actually in favour of Shaw’s steps scheme, though it also means that the more delay we want, the more steps we need to create and manage, while changing the age threshold for objects looks simpler than creating and managing steps.
msg358951 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-28 17:33
> I think we only need about 5 bits to store the age of each object; do we really need to increase the header back to two words per object?

5 bits *per object* is a lot because the cost scales with the number of objects. It would quickly obliterate any gain we get from some objects being deallocated sooner (which is what we are trying to achieve). Using sub-generation steps has a constant cost that is negligible.
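As a rough, illustrative calculation (the object count is made up, and it assumes the 5 bits cannot be squeezed into existing header bits, so in practice they round up to a whole extra word per tracked object):

    tracked_objects = 10_000_000                  # e.g. a large long-running service
    extra_bytes_per_object = 8                    # one extra 64-bit header word
    print(tracked_objects * extra_bytes_per_object / 2**20, "MiB")   # ~76 MiB of pure overhead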

> I am actually in favour of Shaw’s steps scheme, though it also means that the more delay we want, the more steps we need to create and manage, while changing the age threshold for objects looks simpler than creating and managing steps.

I would propose maybe only doing sub-steps for the first generation (as many papers suggest that is where the bigger gains are). I am not sure what you mean by "changing the age threshold for objects looks simpler than creating and managing steps". If you mean changing the generational threshold, I think that does not fix the problem, as it only delays or speeds up collections, but the problem is scale-invariant with the number of objects.
msg358955 - (view) Author: Joannah Nanjekye (nanjekyejoannah) * (Python committer) Date: 2019-12-28 18:40
> 5 bits *per object* is a lot because the cost scales with the number of objects. It would quickly obliterate any gain we get from some objects being deallocated sooner (which is what we are trying to achieve). Using sub-generation steps has a constant cost that is negligible.

I agree, but this may not require us to fall back to a two-word header per object, which was the original worry. If it is a worse tradeoff than managing steps, then I would suggest sticking to the steps. Again, in the end it is about what tradeoffs we are willing to work with.

> I am not sure what you mean by "changing the age threshold for objects looks simpler than creating and managing steps". If you mean changing the generational threshold, I think that does not fix the problem, as it only delays or speeds up collections, but the problem is scale-invariant with the number of objects.

I meant that whenever we want to increase the promotion delay, we need to add a step and manage it, which is more complex than tracking ages, where in case of a change we only change the age threshold, which looks simpler to implement.

> I would propose maybe only doing sub-steps for the first generation (as many papers suggest that is where the bigger gains are).

Correct, since the premise is that we just want to delay the promotion of young objects to an older generation so that they have enough time to die. The steps should reasonably be in the first generation.

How many semi-spaces are you thinking of having in this generation? Most systems have, and recommend, two.
msg358958 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-28 19:38
> How many semi-spaces are you thinking of having in this generation? Most systems have, and recommend, two.

I suppose we would need to experiment, but from what I have seen I think two or three :) What do you think?

> If it is a worse tradeoff than managing steps, then I would suggest sticking to the steps.

I think any memory increase that scales with the number of objects will be a big concern.

> we need to add a step and manage it, which is more complex than tracking ages, where in case of a change we only change the age threshold, which looks simpler to implement.

Exactly, but if by "tracking ages" you mean storing per-object ages, then the previous concern applies :( Adding sub-generation steps should not be much more complicated IMHO.
msg358959 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-28 19:53
Long ago I thought about adding "a buffer in front of" CPython's cyclic gc:  newly tracked items would be added there, and not participate in cyclic gc at all until a collection passed.

For example, break the youngest generation into parts A & B.  Newly tracked items are added to part A.  When a collection occurs, only part B participates from the youngest generation.  When the collection ends, part A is renamed to part B, and part A is marked empty (just a handful of pointer manipulations).
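A rough sketch of that bookkeeping (plain Python pseudocode with made-up names, not the real gcmodule.c structures):

    part_a = []          # newly tracked containers land here
    part_b = []          # containers that have already sat out one collection

    def track(obj):
        part_a.append(obj)                # brand-new objects skip the next collection

    def collect_youngest(scan):
        global part_a, part_b
        scan(part_b)                      # only part B participates this time;
                                          # its survivors move on as usual
        part_b = part_a                   # part A is "renamed" to part B ...
        part_a = []                       # ... and part A starts out empty again

A container that is alive when a collection happens to run but dies before the next one is then never scanned (or promoted) by the cyclic collector at all; refcounting reclaims it first.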

Seems a similar thrust, but a more extreme goal:  trying to maximize the number of new containers that can be reclaimed by refcounting alone without _ever_ being chased by cyclic gc.

Beyond that most objects "die young", which appears overwhelmingly truer than not across almost all Python programs, I'm not sure anything else about lifetime distribution is generally exploitable.

About "5 bits", no, we don't have 'em.  Even the relatively modest ugliness we have now makes it impossible to port Python to a word-addressed machine (I'm not sure any still exist!).  Nothing in C guarantees the last 2 or 3 bits of a pointer "are 0".  We're already using two words to store 2 persistent pointers, one persistent flag, a full-width transient reference count, and two transient flags.
msg358960 - (view) Author: Joannah Nanjekye (nanjekyejoannah) * (Python committer) Date: 2019-12-28 19:59
> I suppose we would need to experiment, but from what I have seen I think two or three :) What do you think?


IMO, I think experimenting with two steps is good enough.
msg358962 - (view) Author: Joannah Nanjekye (nanjekyejoannah) * (Python committer) Date: 2019-12-28 20:04
> About "5 bits", no, we don't have 'em.  Even the relatively modest > ugliness we have now makes it impossible to port Python to a word->addressed machine (I'm not sure any still exist!).  Nothing in C >guarantees the last 2 or 3 bits of a pointer "are 0".  We're already >using two words to store 2 persistent pointers, one persistent flag, a >full-width transient reference count, and two transient flags.

In view of this, +1 for avoiding the extra bits for age storage. The steps scheme will work best in this case.
msg358963 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-28 20:12
> For example, break the youngest generation into parts A & B.  Newly tracked items are added to part A.  When a collection occurs, only part B participates from the youngest generation.  When the collection ends, part A is renamed to part B, and part A is marked empty (just a handful of pointer manipulations).

Well, this seems simple enough to fully implement just to try it out to benchmark a bit. I will start doing some experiments with it.

> Beyond that most objects "die young", which appears overwhelmingly truer than not across almost all Python programs, I'm not sure anything else about lifetime distribution is generally exploitable.

In theory, sub-generation steps will make use of "most objects die young"; specifically, they will try to make it happen more often, as this issue can be summarized as "some objects that should die young don't, because they were in a collection at the wrong time and now they need to wait even longer".
msg358971 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-28 22:57
Oh, I don't expect it to help appreciably - which is why I never did it ;-)  It's aimed at something different, I think, than what you're after:  reducing the burden of cyclic gc on objects that would otherwise soon be reclaimed by refcounting anyway.  But such stuff is going to get reclaimed "soon" regardless, and saving it from getting scanned by gc at most once has limited potential.

You seem aimed more at reclaiming _cyclic_ trash sooner.  The hack I sketched would only help with that if a cycle in part B became "theoretically dead" before part A filled up enough to trigger another collection.  But I have no guess as to the distribution of cycle lifetimes, other than that it's bound to vary a lot across programs, and will be multi-modal.
msg358972 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-28 23:20
> Oh, I don't expect it to help appreciably - which is why I never did it ;-)  It's aimed at something different, I think, than what you're after:  reducing the burden of cyclic gc on objects that would otherwise soon be reclaimed by refcounting anyway.  But such stuff is going to get reclaimed "soon" regardless, and saving it from getting scanned by gc at most once has limited potential.

What I wanted to say is that this idea is still interesting to benchmark and experiment with on its own, especially because it is simple enough.

>You seem aimed more at reclaiming _cyclic_ trash sooner.  The hack I sketched would only help with that if a cycle in part B became "theoretically dead" before part A filled up enough to trigger another collection.  But I have no guess as to the distribution of cycle lifetimes, other than that it's bound to vary a lot across programs, and will be multi-modal.

That is right (I was also looking to solve the case in which an object is referred to by something in an older generation, is then promoted, and almost immediately the referent dies - or it was dead to begin with). I think evolving from your proposal into something in which the first generation is split into two "sub-generations" with the same threshold that are collected at the same time should not be very complicated.
msg358973 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-28 23:37
I have attached two files to the issue. One of them (vanilla-vs-buffered) shows the mean memory profile of running test_asyncio 1000 times with the buffered GC and without the buffered GC. The other file shows the difference between both curves.

test_asyncio is not a "production application" but it is the longest and least specific test that we have. I have experimented with some tests in the pyperformance suite and they show some similarities: some memory spikes seem to be smaller, and there is less memory usage in flat regions after collections.
msg358976 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2019-12-29 03:42
All right!  So, at a first look, "buffering" isn't an obvious disaster ;-)

I'm afraid nailing anything here is hard.  For example, I don't know what you did to "measure memory", but if you're using stats reported by the OS, that's fraught with dangers too.  Python's small-object allocator makes only weak attempts to return memory "to C", which in turn may or may not return memory to "the system".

One way to do better is to call `sys._debugmallocstats()` and stare at the output.  The "# bytes in allocated blocks" output line is an exact count of the number of bytes pymalloc is currently hanging on to for objects that have been allocated but not yet freed.  The point is that C - and the OS - have nothing to do with this value.  The downside:  objects > 512 bytes aren't included at all in this (pymalloc passes on requests for > 512 bytes to the system malloc(), and doesn't keep track of them).
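Getting at that number is a one-liner (sys._debugmallocstats() writes its report to stderr):

    import sys

    sys._debugmallocstats()   # dump pymalloc's internal statistics; compare the
                              # "# bytes in allocated blocks" line before and
                              # after the workload being measured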

Anyway, so far, so good!  Looks worth pursuing :-)
msg358977 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-12-29 03:52
> I'm afraid nailing anything here is hard.  For example, I don't know what you did to "measure memory", but if you're using stats reported by the OS, that's fraught with dangers too. 

I am interposing a custom allocator to track all the used blocks in all used pools in all arenas, and then monitoring all the mmap/malloc allocations. I get the same results if I don't use obmalloc (setting the PYTHONMALLOC=malloc env var), which is more straightforward as I can just interpose a malloc implementation using LD_PRELOAD or a simple custom allocator. I don't claim to be exact, but it is better than just checking the resident size.
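For anyone who just wants a ballpark figure from pure Python, the standard tracemalloc module gives a coarser version of the same kind of measurement (this is not what was used for the attached graphs):

    import tracemalloc

    tracemalloc.start()
    # ... run the workload being measured ...
    current, peak = tracemalloc.get_traced_memory()   # bytes currently allocated
    print(current, peak)                              # and the peak since start()
    tracemalloc.stop()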

I didn't want to use sys._debugmallocstats() so that printing the stats would not impact performance, as I was also checking that the code does not make the interpreter slower.
History
Date User Action Args
2022-04-11 14:59:24  admin            set     github: 83324
2019-12-29 03:52:57  pablogsal        set     messages: + msg358977
2019-12-29 03:42:29  tim.peters       set     messages: + msg358976
2019-12-28 23:37:38  pablogsal        set     messages: + msg358973
2019-12-28 23:33:17  pablogsal        set     files: + diff-vanilla-buffered.png
2019-12-28 23:33:08  pablogsal        set     files: + vanilla-vs-buffered.png
2019-12-28 23:20:24  pablogsal        set     messages: + msg358972
2019-12-28 22:57:03  tim.peters       set     messages: + msg358971
2019-12-28 20:12:34  pablogsal        set     messages: + msg358963
2019-12-28 20:04:35  nanjekyejoannah  set     messages: + msg358962
2019-12-28 19:59:45  nanjekyejoannah  set     messages: + msg358960
2019-12-28 19:53:07  tim.peters       set     messages: + msg358959
2019-12-28 19:38:31  pablogsal        set     messages: + msg358958
2019-12-28 18:40:21  nanjekyejoannah  set     messages: + msg358955
2019-12-28 17:33:21  pablogsal        set     messages: + msg358951
2019-12-28 17:14:38  nanjekyejoannah  set     messages: + msg358950
2019-12-28 16:24:35  pablogsal        set     messages: + msg358949
2019-12-28 15:40:53  nanjekyejoannah  set     nosy: + nanjekyejoannah
                                              messages: + msg358947
2019-12-27 21:35:04  pablogsal        create