This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Fixing Copy on Writes from reference counting and immortal objects
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, carljm, corona10, dino.viehland, eelizondo, eric.snow, gregory.p.smith, methane, nascheme, pablogsal, pitrou, remi.lapeyre, shihai1991, steve.dower, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2020-04-11 15:40 by eelizondo, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 19474 open eelizondo, 2020-04-11 16:51
PR 24828 open eric.snow, 2021-03-11 22:18
PR 31488 open eelizondo, 2022-02-22 05:13
PR 31489 open eelizondo, 2022-02-22 05:24
PR 31490 open eelizondo, 2022-02-22 05:28
PR 31491 open eelizondo, 2022-02-22 05:36
Messages (69)
msg366216 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-11 15:40
Copy on writes are a big problem in large Python applications that rely on multiple processes sharing the same memory.

With the implementation of `gc.freeze`, we've attenuated the problem by removing the CoW coming from the GC Head. However, reference counting still causes CoW.

This patch introduces Immortal Instances, which allow the user to bypass reference counting on specific objects and avoid CoW in forked processes.

Immortal Instances are especially useful for applications that cache heap objects in shared memory which live throughout the entire execution (e.g. modules, code, bytecode, etc.)
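A minimal illustration of the problem described above (not the patch itself): merely taking a reference to an object writes to `ob_refcnt` inside the object's header, so after fork() that in-place write dirties the otherwise shared, read-only page holding the object.

```python
import sys

# A module-level object that a pre-fork parent process might cache.
cached = "some large cached module-level value"

before = sys.getrefcount(cached)
alias = cached            # one new strong reference: ob_refcnt is written in place
after = sys.getrefcount(cached)

print(after - before)     # 1: the header changed even though the value did not
```

Under copy-on-write, that single-word write to the header is enough to copy the entire page, which is why read-mostly caches still fault.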
msg366221 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-04-11 22:37
Thanks for sharing!  It's good to have a patch implementing this so that others who might need it can try it out.

We experimented with an implementation of this on CPython 2.7 that we called Eternal Refcounts for YouTube many years ago.  For the same reason (saving memory in forked workers by not dirtying pages of known needed objects via refcount changes).

I don't have that old patch handy right now, but looking at yours it was the same concept: A magic refcount value to branch based on.

We wound up not deploying it or pushing for it in CPython because the CPU performance implications of adding a branch instruction to Py_INCREF and Py_DECREF were, unsurprisingly, quite high.  I'd have to go digging to find numbers, but I _believe_ it was on the order of a 10% slowdown on real-world application code.  (Microbenchmarks don't tell a good story; the Python performance test suite should.)

Given that most people's applications don't fork workers, I do not expect to see such an implementation ever become the default.  It is a very appropriate hack, but the use case is limited to applications that don't use threads, are on POSIX, and always fork() many workers.

Also note that this is an ABI change, as those INCREF and DECREF definitions are intentionally public .h file macros or inlines and thus included within any compiled code, rather than being an expensive function call elsewhere (vstinner's new alternate higher-level API proposal presumably changes this).  If an interpreter with this loaded a binary using the CPython API built against headers without it, your refcounts could get messed up.
msg366224 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-12 01:01
> the CPU performance implications of adding a branch instruction to Py_INCREF and Py_DECREF were, unsurprisingly, quite high.

Yeah, makes sense. I guess it really depends on the specific profile of your application.

For Instagram this was an overall net positive change and we still have it in prod. The amount of page faults from Copy on Writes was so large that reducing it resulted in a net CPU win (even with the added branching). And of course, a huge reduction in memory usage.


> Microbenchmarks don't tell a good story, the python performance test suite should

Agreed. I only added the Richards benchmark as a reference. I'm hoping someone can pick it up and have more concrete numbers on an average Python workload.


> Given that most people's applications don't fork workers, I do not expect to see such an implementation ever become the default.

In any case, I gated this change behind ifdefs. In case we don't have it by default, we can always enable it easily with a simple `-DPy_IMMORTAL_INSTANCES` flag to the compiler.

> Also note that this is an ABI change as those INCREF and DECREF definitions are intentionally public .h file

This has indeed been a bit of a pain for us as well. Due to how our tooling is set up, there's a small number of third-party libraries that are still causing Copy on Writes. Fortunately, that's the only drawback. Even if your immortalized object goes through an extension that has a different Py_{DECREF,INCREF} implementation, the reference count will still be so large that it will never be deallocated.
msg366325 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-13 19:23
I ran the pyperformance test suite with PGO + LTO + full CPU isolation on the speed.python.org machine, and these were the results:

+-------------------------+--------------------------------------+---------------------------------------+
| Benchmark               | master                               | immortal_refs_patch                   |
+=========================+======================================+=======================================+
| pidigits                | 289 ms                               | 290 ms: 1.01x slower (+1%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| pickle                  | 20.2 us                              | 20.3 us: 1.01x slower (+1%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| xml_etree_parse         | 253 ms                               | 258 ms: 1.02x slower (+2%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| json_loads              | 52.0 us                              | 53.1 us: 1.02x slower (+2%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| scimark_fft             | 708 ms                               | 723 ms: 1.02x slower (+2%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| python_startup_no_site  | 7.83 ms                              | 8.04 ms: 1.03x slower (+3%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| scimark_sparse_mat_mult | 9.28 ms                              | 9.57 ms: 1.03x slower (+3%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| unpickle                | 28.0 us                              | 28.9 us: 1.03x slower (+3%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| regex_effbot            | 4.70 ms                              | 4.86 ms: 1.03x slower (+3%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| xml_etree_iterparse     | 178 ms                               | 184 ms: 1.04x slower (+4%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| python_startup          | 11.5 ms                              | 12.0 ms: 1.04x slower (+4%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| scimark_monte_carlo     | 212 ms                               | 221 ms: 1.04x slower (+4%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| pathlib                 | 38.6 ms                              | 40.5 ms: 1.05x slower (+5%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| sqlite_synth            | 7.85 us                              | 8.26 us: 1.05x slower (+5%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| sqlalchemy_imperative   | 62.9 ms                              | 66.2 ms: 1.05x slower (+5%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| richards                | 138 ms                               | 145 ms: 1.05x slower (+5%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| crypto_pyaes            | 199 ms                               | 210 ms: 1.06x slower (+6%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| nbody                   | 249 ms                               | 265 ms: 1.06x slower (+6%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| raytrace                | 937 ms                               | 995 ms: 1.06x slower (+6%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| deltablue               | 13.4 ms                              | 14.2 ms: 1.06x slower (+6%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| tornado_http            | 276 ms                               | 295 ms: 1.07x slower (+7%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| scimark_lu              | 247 ms                               | 264 ms: 1.07x slower (+7%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| sqlalchemy_declarative  | 280 ms                               | 300 ms: 1.07x slower (+7%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| dulwich_log             | 143 ms                               | 153 ms: 1.07x slower (+7%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| chaos                   | 210 ms                               | 226 ms: 1.07x slower (+7%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| 2to3                    | 594 ms                               | 639 ms: 1.08x slower (+8%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| unpickle_list           | 6.60 us                              | 7.10 us: 1.08x slower (+8%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| float                   | 213 ms                               | 229 ms: 1.08x slower (+8%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| pyflate                 | 1.27 sec                             | 1.37 sec: 1.08x slower (+8%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| go                      | 481 ms                               | 522 ms: 1.09x slower (+9%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| pickle_pure_python      | 966 us                               | 1.05 ms: 1.09x slower (+9%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| meteor_contest          | 175 ms                               | 191 ms: 1.09x slower (+9%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| genshi_xml              | 148 ms                               | 161 ms: 1.09x slower (+9%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| chameleon               | 20.6 ms                              | 22.5 ms: 1.09x slower (+9%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| nqueens                 | 191 ms                               | 208 ms: 1.09x slower (+9%)            |
+-------------------------+--------------------------------------+---------------------------------------+
| telco                   | 14.6 ms                              | 16.0 ms: 1.09x slower (+9%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| logging_format          | 22.1 us                              | 24.2 us: 1.10x slower (+10%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| regex_compile           | 322 ms                               | 355 ms: 1.10x slower (+10%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| hexiom                  | 16.4 ms                              | 18.1 ms: 1.10x slower (+10%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| json_dumps              | 25.6 ms                              | 28.3 ms: 1.10x slower (+10%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| sympy_sum               | 397 ms                               | 439 ms: 1.11x slower (+11%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| sympy_integrate         | 42.7 ms                              | 47.2 ms: 1.11x slower (+11%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| django_template         | 120 ms                               | 133 ms: 1.11x slower (+11%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| sympy_str               | 633 ms                               | 702 ms: 1.11x slower (+11%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| xml_etree_generate      | 162 ms                               | 180 ms: 1.11x slower (+11%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| spectral_norm           | 244 ms                               | 272 ms: 1.11x slower (+11%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| xml_etree_process       | 134 ms                               | 149 ms: 1.11x slower (+11%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| logging_simple          | 19.3 us                              | 21.5 us: 1.11x slower (+11%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| mako                    | 28.2 ms                              | 31.5 ms: 1.12x slower (+12%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| fannkuch                | 853 ms                               | 954 ms: 1.12x slower (+12%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| sympy_expand            | 983 ms                               | 1.11 sec: 1.13x slower (+13%)         |
+-------------------------+--------------------------------------+---------------------------------------+
| genshi_text             | 64.0 ms                              | 72.9 ms: 1.14x slower (+14%)          |
+-------------------------+--------------------------------------+---------------------------------------+
| unpack_sequence         | 123 ns                               | 142 ns: 1.16x slower (+16%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| logging_silent          | 311 ns                               | 364 ns: 1.17x slower (+17%)           |
+-------------------------+--------------------------------------+---------------------------------------+
| unpickle_pure_python    | 581 us                               | 683 us: 1.17x slower (+17%)           |
+-------------------------+--------------------------------------+---------------------------------------+

Not significant (2): pickle_list; regex_dna; pickle_dict; scimark_sor; regex_v8
msg366326 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-13 19:31
System state
============

CPU: use 12 logical CPUs: 1,3,5,7,9,11,13,15,17,19,21,23
Perf event: Maximum sample rate: 1 per second
ASLR: Full randomization
Linux scheduler: Isolated CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23
Linux scheduler: RCU disabled on CPUs (12/24): 1,3,5,7,9,11,13,15,17,19,21,23
CPU Frequency: 0,2,4,6,8,10,12,14,16,18,20,22=min=1600 MHz, max=3333 MHz; 1,3,5,7,9,13,15,17,19,21,23=min=max=3333 MHz
Turbo Boost (MSR): CPU 1,3,5,7,9,11,13,15,17,19,21,23: disabled
IRQ affinity: irqbalance service: inactive
IRQ affinity: Default IRQ affinity: CPU 0,2,4,6,8,10,12,14,16,18,20,22
IRQ affinity: IRQ affinity: IRQ 0,2=CPU 0-23; IRQ 1,3-15,17,20,22-23,28-51=CPU 0,2,4,6,8,10,12,14,16,18,20,22
msg366328 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-13 19:35
If it's easy to set up, it might be interesting to see if there's a difference if (e.g.) all interned strings are marked as immortal, or all small ints. Or type objects. Maybe set a flag during initialization and treat everything created before that as immortal.

There's definitely going to be overhead, but assuming the branch is there we may as well use it to optimise our own operations.
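CPython already has de facto shared singletons of the kind Steve suggests as immortalization candidates: small ints in [-5, 256] are cached, and strings can be explicitly interned. A quick illustration:

```python
import sys

a, b = int('256'), int('256')
assert a is b                      # cached small int: one shared object
c, d = int('257'), int('257')
assert c is not d                  # outside the cache: fresh allocations

s1 = sys.intern('immortal-candidate')
s2 = sys.intern('immortal-candidate')
assert s1 is s2                    # interned strings are shared too
```

These objects are already shared process-wide and live for the interpreter's lifetime, so marking them immortal would only formalize behavior that callers already rely on.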
msg366329 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-13 20:06
Thanks, Eddie for sharing the patch. I think this is an interesting discussion: for instance, in the core dev sprint at Microsoft I recall that we had a quick discussion about having immortal references. 

After analyzing the patch, and even assuming that we add conditional compilation, here are some of the things I feel uneasy about:

- Anything that is touched by the immortal object will be leaked. This can also happen in obscure ways
  if reference cycles are created.

- As Eddie mentions in the PR, this does not fully cover all cases, as there are objects that only become tracked
  by the GC after they are modified (for instance, dicts and tuples that contain only immutable objects). Those
  objects will still participate in reference counting after they start to be tracked.

- As Gregory mentions, if immortal objects are handed to extension modules compiled with the other version of the
  macros, the reference count can be corrupted. This may break the garbage collector algorithm, which relies on the
  balance between the strong references between objects and their reference counts to compute the isolated cycles.
  This is without mentioning that it defeats the purpose of the patch in those cases, as a single modification of
  the reference count will trigger a copy of the whole memory page where the object lives.

- The patch modifies the very core mechanics of Python for the benefit of a restricted set of users (as Gregory
  mentions): no threads, POSIX, and forking.

- It is not clear to me that leaking memory is the best way to solve this problem compared to other solutions (like
  keeping the reference counts of the objects separate from them).

There is a trend right now (started by Victor and others) to try to remove immortal objects, such as static types, by transforming them into heap types. I am afraid that having the restriction that the interpreter and the GC need to be aware
of immortal types can considerably raise the maintenance costs of already complicated pieces of the CPython VM.
msg366330 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-13 20:10
> If it's easy to set up, it might be interesting to see if there's a difference if (e.g.) all interned strings are marked as immortal, or all small ints. Or type objects. Maybe set a flag during initialization and treat everything created before that as immortal.

I can look into it; the main problem is that every run of the pyperformance test suite takes several hours, even more with --full-pgo :(  I will try to set up some automation for that when I have some free cycles
msg366331 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-13 20:27
> There is a trend right now to try to remove immortal objects (started by Victor and others) like static types and transforming them to heap types.

This isn't actually about removing immortal objects, but removing *mutable* objects that would be shared between subinterpreters. Making some objects immortal, such as interned strings or stateless singletons, would actually benefit this work, as they could then be safely shared between subinterpreters.

But as you say, the added complexity here is starting to seem like it'll cost more than it's worth...
msg366341 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-13 22:45
The idea of immortal objects was proposed to solve the bpo-39511 problem.

I dislike immortal objects for different reasons. Here are some.

Gregory:
> We wound up not deploying it or pushing for it in CPython because the CPU performance implications of adding a branch instruction to Py_INCREF and Py_DECREF were, unsurprisingly, quite high.

Exactly. The Copy-on-Write problem affects a minority of users. If PR 19474 is enabled by default, all users will pay the overhead even if they never call fork() without exec().

1.17x slower on logging_silent or unpickle_pure_python is a very expensive price :-(


Pablo:
> Anything that is touched by the immortal object will be leaked. This can also happen in obscure ways if reference cycles are created.

gc.freeze() has a similar issue, no? This function also comes from Instagram specific use case ;-)


Steve:
> This isn't actually about removing immortal objects, but removing *mutable* objects that would be shared between subinterpreters. Making some objects immortal, such as interned strings or stateless singletons, would actually benefit this work, as they could then be safely shared between subinterpreters.

From what I understood, we simply cannot share any object (mutable or not) between two subinterpreters. Otherwise, we would need to put a lock on all Py_INCREF/Py_DECREF operations, which would kill the performance of running multiple interpreters in parallel. See the bpo-39511 discussion.
msg366344 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-13 22:51
> gc.freeze() has a similar issue, no? This function also comes from Instagram specific use case ;-)

Not really, because this patch is much more impactful. gc.freeze() moves objects into a permanent generation, making them not participate in the GC, so the only leaks are cycle-based. With this patch, immortal objects will leak every strong reference they hold, even if they don't participate in cycles (but also if they do).
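For reference, this is the gc.freeze() behavior being contrasted with the patch: frozen objects move to a "permanent generation" that collections skip, but they still participate in reference counting and can be unfrozen.

```python
import gc

gc.collect()
gc.freeze()                  # every currently tracked object becomes permanent
frozen = gc.get_freeze_count()
gc.unfreeze()                # move them back into the oldest generation
```

So gc.freeze() removes the CoW traffic from the GC header only; the refcount field in each object header is still written on every incref/decref, which is the gap this issue is about.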
msg366345 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-13 22:54
I would be more interested in an experiment that moves ob_refcnt outside PyObject to solve the Copy-on-Write issue, especially to measure the impact on performance.
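A sketch of that experiment's shape (a hypothetical design, not an actual patch): reference counts live in a side table keyed by object address instead of inside the object, so incref/decref writes land in a dedicated region and never dirty the pages holding the object bodies.

```python
side_refcnt = {}                      # stand-in for an off-object refcount array

def side_incref(obj):
    side_refcnt[id(obj)] = side_refcnt.get(id(obj), 0) + 1

def side_decref(obj):
    side_refcnt[id(obj)] -= 1
    if side_refcnt[id(obj)] == 0:
        del side_refcnt[id(obj)]      # the object itself would be freed here

o = object()
side_incref(o)
side_incref(o)
```

The obvious cost of any real version is an extra indirection (and likely cache miss) on every refcount operation, plus the memory to map objects to their counts.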
msg366387 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-14 13:55
>> This isn't actually about removing immortal objects, but removing *mutable*
>> objects that would be shared between subinterpreters. Making some objects
>> immortal, such as interned strings or stateless singletons, would actually
>> benefit this work, as they could then be safely shared between
>> subinterpreters.
> 
> From what I understood, we simply cannot share any object (mutable or not)
> between two subinterpreters. Otherwise, we would need to put a lock on all
> Py_INCREF/Py_DECREF operations, which would kill the performance of running
> multiple interpreters in parallel. See the bpo-39511 discussion.

That thread consists of everyone else telling you what I've just told you. Not sure that me telling you as well is going to help ;)
msg366388 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-14 14:23
> I run the pyperformance test suite with PGO + LTO + full cpu isolation in the speed.python.org machine and these were the results:

Be mindful that synthetic benchmarks are probably a gentle case for branch prediction.  Real-world performance on irregular workloads might be worse still (or perhaps better, who knows :-)).
msg366390 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-14 14:25
As a separate discussion, I would be interested to know whether the original use case (Eddie's) could be satisfied differently.  It probably doesn't belong to this issue, though.

(Eddie, if you want to discuss this, feel free to e-mail me privately. I think Davin Potts - co-maintainer of multiprocessing who recently added some facilities around shared memory - might be interested as well.)
msg366402 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-14 17:02
> Anything that is touched by the immortal object will be leaked. This can also happen in obscure ways if reference cycles are created.

I think this is simply expected behavior if you choose to create immortal objects, and not really an issue. How could you have an immortal object that doesn't keep its strong references alive?

> this does not fully cover all cases as objects that become tracked by the GC after they are modified (for instance, dicts and tuples that only contain immutable objects). Those objects will still participate in reference counting after they start to be tracked.

I think the last sentence here is not quite right. An immortalized object will never start participating in reference counting again after it is immortalized.

There are two cases. If at the time of calling `immortalize_heap()` you have a non-GC-tracked object that is also not reachable from any GC-tracked container, then it will not be immortalized at all, so will be unaffected. This is a side effect of the PR using the GC to find objects to immortalize.

If the non-GC-tracked object is reachable from a GC-tracked object (I believe this is by far the more common case), then it will be immortalized. If it later becomes GC-tracked, it will start participating in GC (but the immortal bit causes it to appear to the GC to have a very high reference count, so GC will never collect it or any cycle it is part of), but that will not cause it to start participating in reference counting again.

> if immortal objects are handed to extension modules compiled with the other version of the macros, the reference count can be corrupted

I think the word "corrupted" makes this sound worse than it is in practice. What happens is just that the object is still effectively immortal (because the immortal bit is a very high bit), but the copy-on-write benefit is lost for the objects touched by old extensions.

> 1.17x slower on logging_silent or unpickle_pure_python is a very expensive price

Agreed. It seems the only way this makes sense is under an ifdef and off by default. CPython does a lot of that for debug features; this might be the first case of doing it for a performance feature?

> I would be more interested by an experiment to move ob_refcnt outside PyObject to solve the Copy-on-Write issue

It would certainly be interesting to see results of such an experiment. We haven't tried that for refcounts, but in the work that led to `gc.freeze()` we did try relocating the GC header to a side location. We abandoned that because the memory overhead of adding a single indirection pointer to every PyObject was too large to even consider the option further. I suspect that this memory overhead issue and/or likely cache locality problems will make moving refcounts outside PyObject look much worse for performance than this immortal-instances patch does.
msg366403 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-14 17:12
> An immortalized object will never start participating in reference counting again after it is immortalized.

Well, "passed to an extension compiled with no-immortal headers" is an exception to this.

But for the "not GC tracked but later becomes GC tracked" case, it will not re-enter reference counting, only the GC.
msg366404 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-14 17:26
> This may break the garbage collector algorithm that relies on the balance between strong references between objects and its reference count to do the calculation of the isolated cycles.

I don't think it really breaks anything. What happens is that the immortal object appears to the GC to have a very large reference count, even after adjusting for within-cycle references. So cycles including an immortal object are always kept alive, which is exactly the behavior one should expect from an immortal object.
msg366406 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2020-04-14 17:30
I think there's other cases of performance related features being hidden under an ifdef.  Computed gotos show up that way, although probably more because it's a compiler extension that's not supported everywhere.  Pymalloc is also very similar in that it implies an ABI change as well.

I wonder if it would be worth it to introduce an ABI flag for this as well?  On the one hand, it is a slightly different contract; on the other hand, extensions that don't support the immortalization actually work just fine.
msg366407 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-14 17:42
> I think this is simply expected behavior if you choose to create immortal objects, and not really an issue. How could you have an immortal object that doesn't keep its strong references alive?

I think I didn't express myself very well: I am saying that I perceive this as a problem because when you immortalize the heap at a given time, you will also keep alive anything that these objects acquire strong references to afterwards. The immortal property will be "infecting" other objects as strong references are created.

> I think the last sentence here is not quite right. An immortalized object will never start participating in reference counting again after it is immortalized.

Yeah, but the objects I mention will not be immortalized. These are objects that were not tracked when the immortalization was done, were not reachable from tracked objects at that moment, and become tracked afterwards (for instance, a dictionary that contained only immutable objects when the immortalization was done). And this always has a large impact: a single object that participates in refcounting will copy an entire page of memory. Although I agree this is rare, it is something that can happen.
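The lazy-tracking behavior behind this case can be seen directly: CPython leaves a dict that holds only atomic values untracked, and only starts tracking it once it gains a container element.

```python
import gc

d = {'a': 1, 'b': 'text'}
untracked_at_first = gc.is_tracked(d)   # False: it cannot be part of a cycle

d['c'] = []                             # now it can participate in a cycle
tracked_now = gc.is_tracked(d)          # True: the GC starts tracking it
```

A dict like this that was unreachable from any tracked object at immortalization time would be skipped by the GC-based immortalization pass, then resume normal refcounting writes later.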

> I think the word "corrupted" makes this sound worse than it is in practice. What happens is just that the object is still effectively immortal (because the immortal bit is a very high bit), but the copy-on-write benefit is lost for the objects touched by old extensions.

Yeah, I agree that corrupted may sound a bit more dramatic than it should but I think this is still a problem because when it happens (as mentioned before) it affects entire pages and not the single object that we consider.

> But for the "not GC tracked but later becomes GC tracked" case, it will not re-enter reference counting, only the GC.

But if said objects (isolated and untracked before and now tracked) acquire strong references to immortal objects, those objects will be visited when the gc starts calculating the isolated cycles and that requires a balanced reference count to work.
msg366408 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-14 17:50
There hasn't been much said about which kind of objects would be immortal. Could this be a creation-time option? Or a type object (similar to a no-op tp_dealloc)?

If it's something that is known when the object is created, then some of the "infection" concern can be reduced - but if you wanted to arbitrarily make arbitrary objects immortal dynamically then it's a different matter.

Another benefit of knowing at creation time is that immortal objects could be allocated to a different page (or potentially even to read-only shared memory, if that was appropriate).
msg366412 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-14 17:51
> There hasn't been much said about which kind of objects would be immortal.

The current patch makes *all objects that are tracked by the gc* immortal, independently of their type (unless I am missing something).
msg366413 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-14 17:52
> all objects that are tracked by the gc

also, all the objects that are reachable from objects tracked by the gc.
msg366416 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-04-14 18:09
Marking everything as immortal/eternal after you've loaded your common code & data but before you do your forking is the norm for this kind of serving system.

You already presumably have a defined lifetime for the processes so they'll likely be set to die within N hours or days.  Memory leaks of too many things persisting is a non-issue in this kind of system.

The alternative of trying to pick and choose exactly what sticks around (and everything it references) is a maintenance nightmare and an exercise in futility.
msg366422 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-14 18:26
> But if said objects (isolated and untracked before and now tracked) acquire strong references to immortal objects, those objects will be visited when the gc starts calculating the isolated cycles and that requires a balanced reference count to work.

I was thinking about this a bit more and because the refcount bit is too high, the GC algorithm will be "fine" in the sense that won't break. It will just treat immortal objects as referenced from outside any cycle isolate (because the refcount hopefully will never reach 0), which is consistent with the fact that they are in the permanent generation.
msg366423 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-14 18:30
I think the concerns about "perfect" behavior in corner cases are in general irrelevant here.

In the scenarios where this optimization matters, there is no qualitative change that occurs at 100% coverage. Preventing 99% of CoW is 99% as good as preventing 100% :) So the fact that a few objects here and there in special cases could still trigger CoW just doesn't matter; it's still a massive improvement over the status quo. (That said, I wouldn't _mind_ improving the coverage, e.g. if you can suggest a better way to find all heap objects instead of using the GC.)

And similarly, gps is right that the concern that immortal objects can keep other objects alive (even via references added after immortalization) is a non-issue in practice. There really is no other behavior one could prefer or expect instead.

> if said objects (isolated and untracked before and now tracked) acquire strong references to immortal objects, those objects will be visited when the gc starts calculating the isolated cycles and that requires a balanced reference count to work.

I'm not sure what you mean here by "balanced ref count" or by "work" :) What will happen anytime an immortal object gets into the GC, for any reason, is that the GC will "subtract" cyclic references and see that the immortal object still has a large refcount even after that adjustment, and so it will keep the immortal object and any cycle it is part of alive. This behavior is correct and should be fully expected; nothing breaks. It doesn't matter at all to the GC that this large refcount is "fictional," and it doesn't break the GC algorithm, it results only in the desired behavior of maintaining immortality of immortal objects.

It is perhaps slightly weird that this behavior falls out of the immortal bit being a high bit rather than being more explicit. I did do some experimentation with trying to explicitly prevent immortal instances from ever entering GC, but it turned out to be hard to do that in an efficient way. And motivation to do it is low, because there's nothing wrong with the behavior in the existing PR.
msg366428 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-04-14 18:39
> I'm not sure what you mean here by "balanced ref count" or by "work" :) What will happen anytime an immortal object gets into the GC, for any reason, is that the GC will "subtract" cyclic references and see that the immortal object still has a large refcount even after that adjustment, and so it will keep the immortal object and any cycle it is part of alive. This behavior is correct and should be fully expected; nothing breaks. It doesn't matter at all to the GC that this large refcount is "fictional," and it doesn't break the GC algorithm, it results only in the desired behavior of maintaining immortality of immortal objects.


Yep, that is right. I think there was a race condition between my previous message and yours :)

I think what was confusing me in this line of reasoning is that I was not taking into account that the immortal bit is a very high one, making the refcount gigantic. I was treating it mentally like a flag without factoring the implications of such big reference count.
msg366456 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-14 22:55
This feature seems to be driven by the Instagram use case. Are there other users who would benefit from it? Is it a common use case to load big data and then fork to use the preloaded data?

PR 19474 has a maintenance cost:

* new configuration option
* new macro
* need a buildbot to check that the option is not broken
* document the change
* etc.

There is even now a question about using a different ABI flag.

It's not the common case to build a custom Python manually for a specific use case. Most users use a binary shipped by their operating system.

I'm not sure that it's worth it for Python to maintain such a special build.

Maybe bpo-39511 would be a better motivation to support immortal objects.

But I'm also concerned by the huge impact on performance :-(
msg366518 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-15 13:58
> Is it a common use case to load big data and then fork to use preloaded data?

A lot of the "big data" in question here is simply lots of Python module/class/code objects resulting from importing lots of Python modules.

And yes, this "pre-fork" model is extremely common for serving Python web applications; it is the way most Python web application servers work. We already have an example in this thread of another large Python web application (YouTube) that had similar needs and considered a similar approach.
msg366519 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-15 14:05
Carl:
> A lot of the "big data" in question here is simply lots of Python module/class/code objects resulting from importing lots of Python modules.
>
> And yes, this "pre-fork" model is extremely common for serving Python web applications; it is the way most Python web application servers work. (...)

I would be interested to hear the answer to Antoine's question, which is basically: why not use the multiprocessing fork server?
msg366520 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-15 14:23
I'll note that this "extremely common" model can break as soon as you have hidden worker threads somewhere (this can happen in a third-party library).

For example, the libhdfs library is not fork-safe.
msg366536 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2020-04-15 18:24
Eddie mentions in the PR about using memory arenas to contain immortal objects.  I think it could be worth investigating further.  With the current PR, the immortal status is dependent on the value of the refcnt field of the object.  Using immortal arenas might be faster because you can check if an object is immortal just based on its address (no data dependency on refcnt value).  

The fastest would be to create an immortal block as part of the BSS (uninitialized data).  Then, within incref/decref you use the location and size of the immortal arena (compile time constants) to test if the object is immortal.  Maybe could just check if the high bits of an object address match a constant mask (if immortal region is aligned and is power of 2 in size).  Slightly slower but more flexible would be to make the immortal arena size and location global variables.  That way, you can set the size of the region on startup.  Also would be more flexible in terms of ABI compatibility.  That would introduce one or two global loads within incref/decref but those values would almost certainly be in L1 cache.

Even more flexible would be to use a memory map to mark which arenas are immortal.  See my radix tree implementation for obmalloc:

https://bugs.python.org/issue37448

I would guess the radix tree lookups are too expensive to put in incref/decref.  Should probably test that though.

I had started doing an experiment with the arena approach before I noticed Eddie's comment about it.  I would like to see his version.  Here is a sketch of mine (not working yet):

- change new_arena() to initially allocate from an "immortal memory" region.  There are multiple ways to allocate that (BSS/uninitialized data, aligned_alloc(), etc).

- add a _PyMem_enable_immortal() call to switch obmalloc from using immortal arenas to regular ones.  Need to mess with some obmalloc data structures to switch arenas (usedpools, usable_arenas, unused_arena_objects).

- change incref/decref to check if immortal status has been enabled and if object address falls within immortal region.  If so, incref/decref don't do anything.

By default, the immortal arenas could be enabled on Python startup.  Call _PyMem_enable_immortal after startup but before running user code.  There could be a command-line option to disable the automatic call to _PyMem_enable_immortal() so that users like Instagram can do their pre-fork initialization before calling it.

Next step down the rabbit-hole could be to use something like Jeethu Rao's frozen module serializer and dump out all of the startup objects and put them into the BSS:

 https://bugs.python.org/issue34690
msg366561 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-15 21:50
> I would be interested to hear the answer to Antoine's question, which is basically: why not use the multiprocessing fork server?

Concretely, because for a long time we have used the uWSGI application server and it manages forking worker processes (among other things), and AFAIK nobody has yet proposed trying to replace that with something built around the multiprocessing module. I'm actually not aware of any popular Python WSGI application server built on top of the multiprocessing module (but some may exist).

What problem do you have in mind that the fork server would solve? How is it related to this issue? I looked at the docs and don't see that it does anything to help sharing Python objects' memory between forked processes without CoW.
msg366565 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-15 22:09
> What problem do you have in mind that the fork server would solve?

My understanding is that the fork server solves the *general* problem of running arbitrary code before fork by making sure that only CPython/multiprocessing gets to run code before fork.

In this specific case where you're carefully running only forkable code at startup, it presumably won't help.
msg366566 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-15 22:14
Steve is right.
msg366573 - (view) Author: Carl Meyer (carljm) * Date: 2020-04-16 01:30
Makes sense. Yes, caution is required about what code runs before fork, but forkserver's solution for that would be a non-starter for us, since it would ensure that we can share basically no memory at all between worker processes.
msg366577 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-16 04:14
I was able to get an equal (maybe slightly better) performance by following the advice from Steve and Neil and skipping reference counting of instances that we know are immortal.

It seems that by getting performance parity, we should be able to ease most of the concerns raised so far.

I've updated the PR to now immortalize:
* All static types (i.e: PyType_Type, etc.)
* All small ints (-5 to 256)
* The following Singletons: PyTrue, PyFalse, PyNone
* And the heap after the runtime is initialized (in pymain_main)

A quick caveat: immortalizing the runtime heap caused ~6 tests to fail, since they depend on shutdown behavior which this now changes. In the PR I commented out the call to `_PyGC_ImmortalizeHeap` in `pymain_main` to pass the CI. Ignoring these failing tests for now, these are the benchmark numbers that I got from pyperformance:


Baseline (master branch):
unpack_sequence: Mean +- std dev: 76.0 ns +- 4.9 ns
richards: Mean +- std dev: 116 ms +- 8 ms
fannkuch: Mean +- std dev: 764 ms +- 24 ms
pidigits: Mean +- std dev: 261 ms +- 7 ms

Immortalizing known immortals objects (Latest PR)
unpack_sequence: Mean +- std dev: 74.7 ns +- 5.1 ns
richards: Mean +- std dev: 112 ms +- 5 ms
fannkuch: Mean +- std dev: 774 ms +- 24 ms
pidigits: Mean +- std dev: 262 ms +- 11 ms

Only adding immortal branch (The commit that Pablo benchmarked)
unpack_sequence: Mean +- std dev: 93.1 ns +- 5.7 ns
richards: Mean +- std dev: 124 ms +- 4 ms
fannkuch: Mean +- std dev: 861 ms +- 26 ms
pidigits: Mean +- std dev: 269 ms +- 7 ms


The performance of Immortal Objects by default seems to now be within error of the master branch.
msg366578 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-16 04:27
Neil:
> The fastest would be to create an immortal block as part of the BSS (uninitialized data).

That's an interesting idea, definitely worth exploring and we can probably get some perf win out of it. And yes, using the frozen modules is definitely a step forward and we can leverage that to move these instances into the rodata section of the binary.
 
> I had started doing an experiment with the arena approach before I noticed Eddie's comment about it.  I would like to see his version.

What I had written up is slightly different from what you mentioned. I was mostly concerned about having a small object that we did not reach through the GC roots. If such a small object got a reference count bump, the whole arena would copy-on-write.

I added a commit to the PR with this Arena Immortalization so that you could take a look at the implementation: https://github.com/python/cpython/pull/19474/commits/b29c8ffd3faf99fc5c9885d2a4c6c3c6d5768c8c

The idea is to walk all the arena's pools to mark them as immortal by using a new word added to pool_header. This word would help us identify if the pool is immortal on every pyalloc and pydealloc.

I still get some tests breaking with this; I haven't tried to debug it yet, though.
msg366579 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-16 04:42
After the added optimizations, and assuming that running this more rigorously (i.e. PGO + LTO + full CPU isolation + the speed.python.org machine) also shows the PR as perf neutral, what are people's thoughts so far?

Would this be enough to consider adding this change by default?

Are there still any concerns regarding correctness? It seems like most of the questions raised so far have already been addressed.
msg366593 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-04-16 12:15
A big -1 to this.

You are asking the whole world to take a hit on both performance and memory use, in order to save Instagram memory.

The PR uses the term "immortal" everywhere. There is only one reference to copy-on-write, in a comment. Yet this issue is about making object headers immutable.
Immortality and object header immutability are not the same.

If object header immutability is to be a requirement, that needs a PEP.

If it is not requirement, but immortality is, then make the obvious improvement of changing the branchy code

if (!(obj->refcnt & IMMORTAL_BIT)) {
   obj->refcnt++;
}

to the branchless

obj->refcnt += ((obj->refcnt & IMMORTAL_BIT) == 0);

Immortality has advantages because it allows saturating reference counting and thus smaller object headers, but it is not the same as making the object header immutable.
msg366605 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-16 15:10
Mark:
> You are asking the whole world to take a hit on both performance and memory use.

That's an explicit non-goal. That's why the code was guarded behind an optional compilation mode.

This should be added by default if and only if it is neutral or an improvement on performance. By the way, the latest perf numbers suggest that we are now at perf parity / within error (pending official confirmation from running this on the speed.python.org machines).

Also, can you expand on how this is a hit on memory?


> There is only one reference to copy-on-write in a comment. Yet this issue about making object headers immutable.

The PR summary expands on Copy on Writes. If you think this belongs in-code, let me know and I can update the PR.


> then make the obvious improvement of changing the branchy code

That is strictly slower. The current version makes Py_INCREF and Py_DECREF cheaper for known immortal instances (i.e the heap after runtime init). This skips increasing and decreasing the refcount as well as the refcount check for deallocation. Using the proposed branch-less version makes all Py_INCREFs and Py_DECREFs more expensive.

I ran a couple of benchmarks with the branch-less version to highlight this:

Branch-less version:
unpack_sequence: Mean +- std dev: 98.2 ns +- 4.9 ns
richards: Mean +- std dev: 130 ms +- 5 ms
fannkuch: Mean +- std dev: 894 ms +- 18 ms


Branch version:
unpack_sequence: Mean +- std dev: 82.7 ns +- 3.9 ns
richards: Mean +- std dev: 123 ms +- 4 ms
fannkuch: Mean +- std dev: 838 ms +- 25 ms


> Immortality has advantages because it allows saturating reference counting and thus smaller object headers, but it is not the same as making the object header immutable.

In its current form, this change is not trying to do a strict immutability of the header. We can't guarantee that third-party extensions won't mutate the instance. Instead, this change tries to maintain an instance's immutability as long as it can.

If the issue is with naming, I can easily rename this to something else :)
msg366623 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-04-16 18:41
Do you have more details on your comparison?
I'm somewhat puzzled how a version that does no more work and has no jumps is slower.

The memory hit is not immediate, but making the object header immutable prevents changes like https://github.com/markshannon/cpython/commit/c73407e7b5d1a0fc794d55c6bcc91cfdc958f6c4
which would potentially save the two-word-per-object overhead that the cyclic GC currently imposes.
msg366624 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-16 19:13
> I'm somewhat puzzled how a version that does no more work and has no jumps is slower.

Oh, I think I know why there's some confusion. I've updated the PR from the initial version (which is probably the one that you saw). The branching version does less work in Py_INCREF and Py_DECREF for all instances marked with the immortal bit, by exiting early.

In the latest change, I added the immortal bit to a bunch of "known" immortal objects. For instance:
* All static types (i.e: PyType_Type, etc.)
* All small ints (-5 to 256)
* The following Singletons: PyTrue, PyFalse, PyNone
* And the heap after the runtime is initialized (in pymain_main)

Example 1)
```
PyObject _Py_NoneStruct = {
  _PyObject_EXTRA_INIT
#ifdef Py_IMMORTAL_OBJECTS
  _Py_IMMORTAL_BIT,
#else
  1,
#endif  /* Py_IMMORTAL_OBJECTS */
  &_PyNone_Type
};
```

Example 2)
```
static int
pymain_main(_PyArgv *args)
{
    PyStatus status = pymain_init(args);
    if (_PyStatus_IS_EXIT(status)) {
        pymain_free();
        return status.exitcode;
    }
    if (_PyStatus_EXCEPTION(status)) {
        pymain_exit_error(status);
    }

#ifdef Py_IMMORTAL_OBJECTS
    /* Most of the objects alive at this point will stay alive throughout the
     * lifecycle of the runtime. Immortalize to avoid the GC and refcnt costs */
    _PyGC_ImmortalizeHeap();
#endif  /* Py_IMMORTAL_OBJECTS */
    return Py_RunMain();
```


Therefore, you are now making Py_INCREF and Py_DECREF cheaper for things like  `Py_RETURN_NONE` and a bunch of other instances.

Let me know if that explains it! I could also send you patch of the branch-less version so you can compare them.



> but making the object header immutable prevents changes like

Why would it prevent it? These changes are not mutually exclusive, you can still have an immortal bit by:
1) Using a bit from `gc_bits`. Currently you only need 2 bits for the GC. Even with a `short` you'll have space for the immortal bit.
2) Using a bit from the ob_refcnt. Separately, using this allows us to experiment with a branch-less and test-less code by using saturated adds. For example:

```
/* Branch-less incref with saturated add (x86, GCC-style asm) */
#define PY_REFCNT_MAX ((int)(((unsigned int)-1) >> 1))
#define _Py_INCREF(op) ({                                 \
    __asm__ (                                             \
        "addl $0x1, %[refcnt]\n\t"                        \
        "cmovol %[refcnt_max], %[refcnt]"                 \
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt)     \
        : [refcnt_max] "r" (PY_REFCNT_MAX)                \
    );})
```
```
msg366625 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2020-04-16 19:26
> which would potentially save the two word per object overhead

Btw, I misread. I thought `gc_bits` referred to the bits used by the GC in the reference count. In any case, you can still use a bit in the reference count :)
msg366628 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2020-04-16 20:51
A few thoughts

First, the PR is well done.  The changes are pretty minimal and are well localized.  Thanks Eddie for writing clean code and responding to review requests.

My feeling is that, in current state, this still should not get merged or at least not get enabled by default.  The main consideration is what is the cost of this new feature vs what is the benefit.

If the immortalize heap function is not called by default, all Python users pay a performance hit for this feature.  Even after recent PR improvements, if you don't immortalize the heap, Python is slower. 

Only a small fraction of Python users will benefit from this feature.  I think the "pre-fork, CoW exploiting" application architecture is more common than some people would realize (it is a successful way to achieve multi-processing with CPython, making good use of multiple CPU cores).  However, I'm pretty sure it is used by a very tiny fraction of all Python programs.

If we do immortalize by default (e.g. on Python startup), the semantics of the refcnt field are subtly changed.  I suspect some programs will be broken as a result of this change.  A clue about what might go wrong comes from the unicode object code.  E.g. the resize_inplace() function in unicodeobject.c.  It is possible that a non-trivial amount of other 3rd party code will make assumptions about refcnt that will be broken by this change.  That code could be fixed but it is a matter of cost vs benefit.

If it is disabled by default with a build flag we have an ABI compatibility problem.  I.e. incref/decref are inlined functions/macros.  Also, as mentioned above, these kinds of build options tend to make maintenance and testing harder.

The frozen GC generation did not have these kinds of issues.  If people don't use it, there is basically no cost.  I think it was a good idea to merge the frozen generation feature.  There is a small amount of ongoing maintenance cost but that's acceptable.

Regarding Mark's comment about immutability vs immutable object headers, I would frame the goal of the PR differently.  Like the GC frozen generation, this immutable instance change is about providing a way to opt-out of Python's default garbage collection mechanism.  The frozen generation was to opt-out of the mark-and-sweep style cyclic collection.  This PR is about opting-out of the default Python reference counting.  

Put that way, could we consider this PR as an incremental step in a process of making it possible to have a non-reference count based GC for CPython?  E.g. perhaps we could eventually have a mixture of mark-and-sweep and reference counted GC, in order to support the handle (HPy) API.  If we want to move towards that, we want CPython code to stop making assumptions about the refcnt field, like the unicodeobject.c file currently does.

I think pitching the PR in that way makes it more compelling.  Merge it not because it helps a tiny class of Python applications.  Instead, merge it because it helps find unwarranted assumptions about how the Python GC works.  Once we get bad assumptions cleaned up in 3rd party code, we have more of a chance of pursuing an alternative GC strategy for CPython.
msg366629 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2020-04-16 20:56
> immutability vs immutable object headers

Sorry, I meant to write "immortal vs immutable".
msg366744 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2020-04-18 23:23
I resurrected an old thread on Discourse that seems quite relevant to this PR:

https://discuss.python.org/t/switching-from-refcounting-to-libgc/1641/35?u=nas
msg366818 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-04-20 12:44
Eddie,
How did you compare the branching vs. non-branching implementations?  I have read the code in the PR. What is the non-branching version that you used to compare it against?

Why not use the sign bit as the immortal bit? It simplifies the test considerably, making it faster, and doesn't need any assembly code.

Are you suggesting that making a set of common objects immortal will make things faster? Have you tested it?
(I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write, a guaranteed L1 hit.)
msg366825 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-20 13:33
> I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write (A guaranteed L1 hit)

If that were true then Spectre and Meltdown wouldn't have been so interesting :)

Pipelining processors are going to speculatively execute both paths, and will skip the write much more quickly than by doing it, and meanwhile nobody should have tried to read the value so it hasn't had to block for that path. I'm not aware of any that detect no-op writes and skip synchronising across cores - the dirty bit of the cache line is just set unconditionally.

Benchmarking already showed that the branching version is faster. It's possible that "refcount += (refcount & IMMORTAL) ? 0 : 1" could generate different code (should be mov,test,lea,cmovz rather than mov,and,add,mov or mov,and,jz,add,mov), but it's totally reasonable for a branch to be faster than unconditionally modifying memory.
msg366826 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-20 13:35
> Benchmarking already showed that the branching version is faster.

But micro-benchmarks may tell you things which are not true in the real-world (for example an easily-predicted branch).
msg366828 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-04-20 13:40
> But micro-benchmarks may tell you things which are not true in the real-world (for example an easily-predicted branch)

This is true, though this branch should be more easily predictable because INCREF/DECREF are inlined. If there were only one function doing the branch it'd be different.

Wouldn't even surprise me if PGO made different optimisations to this in certain functions. Anything that normally operates on a code object, for example, is almost always going to take the branch when everything is imported early and before immortalisation.

We also have the real world app that is Instagram. If they say that this has shown an improvement in their app, I'm inclined to believe that it's not reliant on a microbenchmark (other controlled circumstances, sure, but not a microbenchmark ;) ).
msg366834 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-20 14:54
> We also have the real world app that is Instagram.

I don't think Instagram is a single app. What is Python used for at Instagram?
msg366835 - (view) Author: Rémi Lapeyre (remi.lapeyre) * Date: 2020-04-20 14:58
>> We also have the real world app that is Instagram.

> I don't think Instagram is a single app. What is Python used for at Instagram?

According to their engineering blog (https://instagram-engineering.com/static-analysis-at-scale-an-instagram-story-8f498ab71a0c):

> Our server app is a monolith, one big codebase of several million lines and a few thousand Django endpoints [1], all loaded up and served together. A few services have been split out of the monolith, but we don’t have any plans to aggressively break it up.
msg366837 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2020-04-20 14:59
Wow.  I see, thank you.
msg366852 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-04-20 16:17
On 20/04/2020 2:33 pm, Steve Dower wrote:
>> I would expect that the negative impact on branch predictability would easily outweigh the cost of the memory write (A guaranteed L1 hit)
> 
> If that were true then Spectre and Meltdown wouldn't have been so interesting :)

I really don't follow your logic here.

> 
> Pipelining processors are going to speculatively execute both paths, and will skip the write much more quickly than by doing it, and meanwhile nobody should have tried to read the value so it hasn't had to block for that path. I'm not aware of any that detect no-op writes and skip synchronising across cores - the dirty bit of the cache line is just set unconditionally.

Processors only execute the one path, speculatively or otherwise, that's 
why branch prediction is so important.
(And is important for the Spectre attack. If what you said were true, 
the Spectre attack wouldn't need to pre-bias the branch predictor.)
I'm assuming a modern Intel processor.

> 
> Benchmarking already showed that the branching version is faster. It's possible that "refcount += (refcount & IMMORTAL) ? 0 : 1" could generate different code (should be mov,test,lea,cmovz rather than mov,and,add,mov or mov,and,jz,add,mov), but it's totally reasonable for a branch to be faster than unconditionally modifying memory.

It hasn't been shown that it is faster. I'm not claiming that it isn't, 
just that I haven't seen the evidence.

msg379644 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-26 06:48
I'm a big -1 too. But I am interested in the Instagram usage.

* What percentage of the heap is still CoW-ed even with gc.freeze()?
* When does the CoW happen: during execution, or at shutdown?
* Which types cause the CoW?

I have two ideas to reduce CoW:

* A fast shutdown option.
  Currently, Python tries to free as many objects as possible during finalization. That causes a lot of CoW. It is especially bad for applications using multiprocessing.Pool for parallel processing of large data.

* Use the "legacy" Unicode representation for large strings.
  Unicode data is cold and immutable, but the refcount is not. Currently, most unicode objects are "compact": the object header and the character data share one memory block. We may be able to use the "legacy" representation and put the cold data on a separate page.

In the long term, a multi-process-friendly `pyc` format could share even more memory between processes, including processes not created by prefork.
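For context, the gc.freeze() half of the prefork pattern discussed above can be sketched as follows. This is a minimal illustration of the public gc API, not Instagram's actual setup:

```python
import gc

# Warm-up phase: import modules, build caches, etc.
cache = {i: str(i) * 4 for i in range(1000)}

gc.collect()  # reclaim warm-up garbage first, so it is not frozen
gc.freeze()   # move every surviving object into the permanent generation

assert gc.get_freeze_count() > 0  # warm-up objects now skipped by the GC

# ... os.fork() workers here: the collector never examines frozen objects,
# so it never writes to their gc headers (one source of CoW; refcount
# updates, the subject of this issue, are a separate source) ...

gc.unfreeze()  # for the demo: move them back into the oldest generation
assert gc.get_freeze_count() == 0
```

As the messages above note, this removes the GC-header writes but not the refcount writes, which is the gap Immortal Instances aim to close.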
msg379871 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-29 13:06
> Fast shutdown option

You can use os._exit(0).
msg379917 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-30 05:07
>> Fast shutdown option
>
> You can use os._exit(0).

Yes. Instagram uses it as `atexit.register(os._exit, 0)`.

https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

I think this hack could be supported by the multiprocessing module in some way.
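The `atexit.register(os._exit, 0)` trick can be demonstrated in a child process so the demonstrating script itself survives. Registering `os._exit(0)` makes the process terminate before interpreter finalization, skipping the per-object teardown that would otherwise dirty CoW-shared pages in a forked worker (a minimal sketch, assuming a standard CPython):

```python
import subprocess
import sys

# Code for a child interpreter: it registers os._exit(0) so that, at exit,
# the process terminates immediately instead of walking the heap and
# freeing every object during finalization.
child_code = """
import atexit, os
atexit.register(os._exit, 0)
data = [str(i) * 10 for i in range(1000)]  # heap that is never torn down
"""

result = subprocess.run([sys.executable, "-c", child_code])
assert result.returncode == 0  # the child exited cleanly via os._exit(0)
```

The trade-off is that `os._exit()` also skips `__del__` methods, `finally` blocks in daemon threads, and buffered-I/O flushing, which is why it is a hack rather than a default.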
msg388525 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2021-03-11 22:25
FYI, I've just put up a PR [1] similar to Eddie's, with a focus on object immortality rather than immutability.  However, if Py_IMMORTAL_CONST_REFCOUNTS is defined then it should have the behavior Eddie is after.

[1] https://github.com/python/cpython/pull/24828
msg388526 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2021-03-11 22:27
Note that my PR does make the builtin types immortal (at least those initialized in _PyTypes_Init()).  However, I did not make the singletons or other candidate objects immortal in the PR.  That can be done separately.
msg388760 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2021-03-15 18:55
While the eventual solution may overlap, my interests are divergent enough from those here that I've created a separate issue: bpo-43503.
msg410592 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-14 19:39
[data] I finally dug up the old YouTube doc at work with their findings. Mostly just posting this here for future public reference if anyone wants. Nothing surprising.

When YouTube experimented with a modified 2.7 adding "eternal refcounts" in 2015, they saw a 3-5% CPU performance regression (not the 10% I had in mind).

Their version of this simply set a high bit on the refcount as the indicator and added the obvious conditional into the Py_INCREF/Py_DECREF macros.

Unsurprisingly in line with what others have since found. For their preforked server and decision of what to mark eternal before forking, it saved them 10% ram (fewer copy on writes). The -ram vs +cpu +maintenance cost tradeoff wound up not being worthwhile to them though. Their motivation for trying was entirely COW memory savings.

=== CPython 2.7 object.h modification they used:

```
+#ifdef GOOGLE_ETERNAL_REFCOUNT_SUPPORT
+
+#define PY_ETERNAL_REFCOUNT (PY_SSIZE_T_MAX / 2)
+
+#define Py_IS_ETERNAL(op) (                       \
+  ((PyObject*)(op))->ob_refcnt >= PY_ETERNAL_REFCOUNT)
+
+#define Py_SET_ETERNAL(op)                                \
+  do {                                                    \
+      ((PyObject*)(op))->ob_refcnt = PY_ETERNAL_REFCOUNT; \
+      if (PyObject_IS_GC(op)) {                           \
+        PyObject_GC_UnTrack(op);                          \
+      }                                                   \
+  } while (0)
+
+#define Py_INCREF(op) (                           \
+      Py_IS_ETERNAL(op)                           \
+        ?  PY_ETERNAL_REFCOUNT                    \
+        :  (_Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA \
+           ((PyObject*)(op))->ob_refcnt++)        \
+  )
+
+#define Py_DECREF(op)                                   \
+    do {                                                \
+        if (Py_IS_ETERNAL(op)) break;                   \
+        if (_Py_DEC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
+        --((PyObject*)(op))->ob_refcnt != 0)            \
+            _Py_CHECK_REFCNT(op)                        \
+        else                                            \
+        _Py_Dealloc((PyObject *)(op));                  \
+    } while (0)
+
+#else
```
msg410613 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-01-15 00:12
I just want to say that GOOGLE_ETERNAL_REFCOUNT_SUPPORT is a cool name :-D I love "eternal refcount"!
msg412956 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2022-02-10 00:03
@Eddie, what can I do to push this forward?  FYI, in addition to the python-dev thread a few weeks back, I've brought the matter up with the steering council. [1]

Also, if we can get back to performance-neutral (currently at about 4% slower) then there would be a lot less controversy.  Even at 2% it may be enough.


[1] https://github.com/python/steering-council/issues/103
msg413080 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2022-02-11 16:08
@eric.snow great to hear about this update! I'll start looking at some of the techniques that we talked about to improve performance, I'm optimistic that we'll be able to close down the gap to 2%.
msg413116 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-02-12 06:30
I think making more objects immortal by default will reduce the gap, although I am not sure it can reach 2%. (I guess 3%, and I think that is an acceptable gap.)

* Code attributes (contents of co_consts, co_names, etc...) in deep frozen modules.
  * only if subinterpreters share them.
* Statically allocated strings (previously _Py_IDENTIFIER)

To reduce the gap further, we need to reduce Python stack operations in ceval in some way.
msg413693 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-02-22 05:42
All of these optimizations should be disabled by default:

* They will cause leaks when Python is embedded.
* Even for the plain python command, they will break __del__ and weakref callbacks.
msg413694 - (view) Author: Eddie Elizondo (eelizondo) * Date: 2022-02-22 05:59
It seems that we are back on track with perf being back to neutral!

I've created 4 new PRs. Each with an optimization applied on top of the baseline introduction of instance immortalization.

The main PR 19474 currently stands at around a 4% regression; after rebasing past Eric's PR 30928 it went down to 3%.

This is the list of optimizations that I used to get some performance back:
* Immortalizing Interned Strings (PR 31488): 1% regression
* Immortalizing Runtime Heap After Startup (PR 31489): 0% regression
* Immortalizing Modules After Import (PR 31490): 1% regression
* Combined Optimizations (PR 31491): 0% improvement

All the PRs contain the results of the pyperformance benchmarks and they should each stand on their own in case we want to go for a subset of these optimizations rather than all of them. Make sure to look at each PR to read the implementation details.

For testing, I made sure all the tests pass in my local environment for every PR, though I do expect some failures in non-Linux environments. I'll be fixing those over the next couple of days.
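As background for the interned-strings optimization: interned strings are already process-wide singletons, which is what makes them natural immortalization candidates — every holder of an equal string shares one object, and therefore one refcount field that would otherwise be written to from many places:

```python
import sys

# Build two equal but distinct strings at runtime, then intern both.
a = sys.intern("immortal" * 4)
b = sys.intern("".join(["immortal"] * 4))

# sys.intern guarantees one canonical object per string value, so every
# user of this value shares a single refcount field.
assert a is b
```

Because that single shared object lives for the whole process anyway, skipping its refcount updates changes no observable behavior while keeping its memory page clean across a fork.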
msg413717 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2022-02-22 16:42
Thanks for the updates, Eddie.
History
Date User Action Args
2022-04-11 14:59:29  admin  set  github: 84436
2022-02-22 16:42:58  eric.snow  set  messages: + msg413717
2022-02-22 05:59:39  eelizondo  set  messages: + msg413694
2022-02-22 05:42:28  methane  set  messages: + msg413693
2022-02-22 05:36:34  eelizondo  set  pull_requests: + pull_request29621
2022-02-22 05:28:25  eelizondo  set  pull_requests: + pull_request29620
2022-02-22 05:24:28  eelizondo  set  pull_requests: + pull_request29619
2022-02-22 05:13:22  eelizondo  set  pull_requests: + pull_request29618
2022-02-12 06:30:26  methane  set  messages: + msg413116
2022-02-11 16:08:12  eelizondo  set  messages: + msg413080
2022-02-10 00:03:34  eric.snow  set  messages: + msg412956
2022-01-18 13:57:55  vstinner  set  title: Fixing Copy on Writes from reference counting -> Fixing Copy on Writes from reference counting and immortal objects
2022-01-15 00:12:52  vstinner  set  messages: + msg410613
2022-01-14 19:39:46  gregory.p.smith  set  messages: + msg410592
2021-03-15 18:55:32  eric.snow  set  messages: + msg388760
2021-03-11 22:27:34  eric.snow  set  messages: + msg388526
2021-03-11 22:25:37  eric.snow  set  messages: + msg388525
2021-03-11 22:18:03  eric.snow  set  nosy: + eric.snow; pull_requests: + pull_request23592
2020-10-30 05:07:37  methane  set  messages: + msg379917
2020-10-29 13:06:01  vstinner  set  messages: + msg379871
2020-10-26 06:49:04  methane  set  versions: + Python 3.10, - Python 3.9
2020-10-26 06:48:46  methane  set  nosy: + methane; messages: + msg379644
2020-04-20 16:17:43  Mark.Shannon  set  messages: + msg366852
2020-04-20 14:59:24  pitrou  set  messages: + msg366837
2020-04-20 14:58:35  remi.lapeyre  set  nosy: + remi.lapeyre; messages: + msg366835
2020-04-20 14:54:56  pitrou  set  messages: + msg366834
2020-04-20 13:40:36  steve.dower  set  messages: + msg366828
2020-04-20 13:35:02  pitrou  set  messages: + msg366826
2020-04-20 13:33:01  steve.dower  set  messages: + msg366825
2020-04-20 12:44:14  Mark.Shannon  set  messages: + msg366818
2020-04-18 23:23:23  nascheme  set  messages: + msg366744
2020-04-16 20:56:51  nascheme  set  messages: + msg366629
2020-04-16 20:51:57  nascheme  set  messages: + msg366628
2020-04-16 19:26:37  eelizondo  set  messages: + msg366625
2020-04-16 19:13:30  eelizondo  set  messages: + msg366624
2020-04-16 18:41:10  Mark.Shannon  set  messages: + msg366623
2020-04-16 15:10:46  eelizondo  set  messages: + msg366605
2020-04-16 12:15:11  Mark.Shannon  set  nosy: + Mark.Shannon; messages: + msg366593
2020-04-16 04:42:53  eelizondo  set  messages: + msg366579
2020-04-16 04:27:55  eelizondo  set  messages: + msg366578
2020-04-16 04:14:39  eelizondo  set  messages: + msg366577
2020-04-16 01:30:06  carljm  set  messages: + msg366573
2020-04-15 22:14:23  pitrou  set  messages: + msg366566
2020-04-15 22:09:33  steve.dower  set  messages: + msg366565
2020-04-15 21:50:53  carljm  set  messages: + msg366561
2020-04-15 18:24:24  nascheme  set  messages: + msg366536
2020-04-15 14:23:58  pitrou  set  messages: + msg366520
2020-04-15 14:05:23  vstinner  set  messages: + msg366519
2020-04-15 13:58:12  carljm  set  messages: + msg366518
2020-04-14 22:55:03  vstinner  set  messages: + msg366456
2020-04-14 18:39:14  pablogsal  set  messages: + msg366428
2020-04-14 18:30:02  carljm  set  messages: + msg366423
2020-04-14 18:26:11  pablogsal  set  messages: + msg366422
2020-04-14 18:09:20  gregory.p.smith  set  messages: + msg366416
2020-04-14 17:52:37  pablogsal  set  messages: + msg366413
2020-04-14 17:51:32  pablogsal  set  messages: + msg366412
2020-04-14 17:50:27  steve.dower  set  nosy: + dino.viehland
2020-04-14 17:50:14  steve.dower  set  messages: + msg366408
2020-04-14 17:42:58  pablogsal  set  nosy: - dino.viehland; messages: + msg366407
2020-04-14 17:30:24  dino.viehland  set  nosy: + dino.viehland; messages: + msg366406
2020-04-14 17:26:59  carljm  set  messages: + msg366404
2020-04-14 17:12:21  carljm  set  messages: + msg366403
2020-04-14 17:02:27  carljm  set  nosy: + carljm; messages: + msg366402
2020-04-14 14:25:48  pitrou  set  messages: + msg366390
2020-04-14 14:23:58  pitrou  set  messages: + msg366388
2020-04-14 13:55:00  steve.dower  set  messages: + msg366387
2020-04-14 11:44:34  shihai1991  set  nosy: + shihai1991
2020-04-13 22:54:02  vstinner  set  messages: + msg366345
2020-04-13 22:51:55  pablogsal  set  messages: + msg366344
2020-04-13 22:45:41  vstinner  set  nosy: + vstinner; messages: + msg366341
2020-04-13 20:27:52  steve.dower  set  messages: + msg366331
2020-04-13 20:10:08  pablogsal  set  messages: + msg366330
2020-04-13 20:06:53  pablogsal  set  messages: + msg366329
2020-04-13 19:35:15  steve.dower  set  messages: + msg366328
2020-04-13 19:31:50  pablogsal  set  messages: + msg366326
2020-04-13 19:23:27  pablogsal  set  messages: + msg366325
2020-04-13 10:36:09  steve.dower  set  nosy: + steve.dower
2020-04-12 14:22:59  corona10  set  nosy: + corona10
2020-04-12 01:01:33  eelizondo  set  messages: + msg366224
2020-04-11 22:37:58  gregory.p.smith  set  type: enhancement; messages: + msg366221; nosy: + gregory.p.smith
2020-04-11 18:03:35  corona10  set  nosy: + pablogsal
2020-04-11 18:01:03  pablogsal  set  nosy: + tim.peters, nascheme, pitrou
2020-04-11 16:51:07  eelizondo  set  keywords: + patch; stage: patch review; pull_requests: + pull_request18829
2020-04-11 15:40:21  eelizondo  create