This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: add support for watching writes to selected dictionaries
Type: Stage: patch review
Components: C API Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Dennis Sweeney, Mark.Shannon, barry, brandtbucher, carljm, dino.viehland, gregory.p.smith, gvanrossum, itamaro
Priority: normal Keywords: patch

Created on 2022-03-01 22:19 by carljm, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 31787 open carljm, 2022-03-09 14:31
Messages (22)
msg414307 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-01 22:19
CPython extensions providing optimized execution of Python bytecode (e.g. the Cinder JIT), or even CPython itself (e.g. the faster-cpython project) may wish to inline-cache access to frequently-read and rarely-changed namespaces, e.g. module globals. Rather than requiring a dict version guard on every cached read, the best-performing way to do this is is to mark the dictionary as “watched” and set a callback on writes to watched dictionaries. This optimizes the cached-read fast-path at a small cost to the (relatively infrequent and usually less perf sensitive) write path.

We have an implementation of this in Cinder ( https://docs.google.com/document/d/1l8I-FDE1xrIShm9eSNJqsGmY_VanMDX5-aK_gujhYBI/edit#heading=h.n2fcxgq6ypwl ), used already by the Cinder JIT and its specializing interpreter. We would like to make the Cinder JIT available as a third-party extension to CPython ( https://docs.google.com/document/d/1l8I-FDE1xrIShm9eSNJqsGmY_VanMDX5-aK_gujhYBI/ ), and so we are interested in adding dict watchers to core CPython.

The intention in this issue is not to add any specific optimization or cache (yet); just the ability to mark a dictionary as “watched” and set a write callback.

The callback will be global, not per-dictionary (no extra function pointer stored in every dict). CPython will track only one global callback; it is a well-behaved client’s responsibility to check if a callback is already set when setting a new one, and daisy-chain to the previous callback if so. Given that multiple clients may mark dictionaries as watched, a dict watcher callback may receive events for dictionaries that were marked as watched by other clients, and should handle this gracefully.

There is no provision in the API for “un-watching” a watched dictionary; such an API could not be used safely in the face of potentially multiple dict-watching clients.

The Cinder implementation marks dictionaries as watched using the least bit of the dictionary version (so version increments by 2); this also avoids any additional memory usage for marking a dict as watched.

Initial proposed API, comments welcome:

// Mark given dictionary as "watched" (global callback will be called if it is modified)
void PyDict_Watch(PyObject* dict);

// Check if given dictionary is already watched
int PyDict_IsWatched(PyObject* dict);

typedef enum {
  PYDICT_EVENT_CLEARED,
  PYDICT_EVENT_DEALLOCED,
  PYDICT_EVENT_MODIFIED
} PyDict_WatchEvent;

// Callback to be invoked when a watched dict is cleared, dealloced, or modified.
// In clear/dealloc case, key and new_value will be NULL. Otherwise, new_value will be the
// new value for key, NULL if key is being deleted.
typedef void(*PyDict_WatchCallback)(PyDict_WatchEvent event, PyObject* dict, PyObject* key, PyObject* new_value);

// Set new global watch callback; supply NULL to clear callback
void PyDict_SetWatchCallback(PyDict_WatchCallback callback);

// Get existing global watch callback
PyDict_WatchCallback PyDict_GetWatchCallback();

The callback will be called immediately before the modification to the dict takes effect, thus the callback will also have access to the prior state of the dict.
msg414330 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-03-02 06:53
At first quick glance, this makes sense and the API looks reasonable.

Question: what happens on interpreter shutdown?

Shutdown obviously finalized and clears out most all dicts.  I guess the C callback simply gets called for each of these?  That makes sense.  Just wondering if there are any ramifications of that.  The callback is in C so it shouldn't have issues with this.

A pyperformance suite run on an interpreter modified to support this but having no callbacks registered would be useful.  (basically judging if there is measurable overhead added by the watched bit check - I doubt it'll be noticeable in most code)
msg414462 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-03 19:45
Thanks gps! Working on a PR and will collect pyperformance data as well.

We haven't observed any issues in Cinder with the callback just being called at shutdown, too, but if there are problems with that it should be possible to just have CPython clear the callback at shutdown time.
msg414466 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-03-03 20:15
> CPython will track only one global callback; it is a well-behaved client’s responsibility to check if a callback is already set when setting a new one, and daisy-chain to the previous callback if so.

Hm, this is a bit scary. Could we (or others) end up with unguarded stale caches if some buggy extension forgets to chain the calls correctly?

Core CPython seems most at risk of this, since we would most likely be registered first.
msg414467 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-03-03 20:17
Also, when you say "only one global callback": does that mean per-interpreter, or per-process?
msg414468 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-03 20:31
> Could we (or others) end up with unguarded stale caches if some buggy extension forgets to chain the calls correctly?

Yes. I can really go either way on this. I initially opted for simplicity in the core support at the cost of asking a bit more of clients, on the theory that a) there are lots of ways for a buggy C extension to cause crashes with bad use of the C API, and b) I don't expect there to be very many extensions using this API. But it's also true that the consequences of a mistake here could be hard to debug (and easily blamed to the wrong place), and there might turn out to be more clients for dict-watching than I expect! If the consensus is to prefer CPython tracking an array of callbacks instead, we can try that.

> when you say "only one global callback": does that mean per-interpreter, or per-process?

Good question! The currently proposed API suggests per-process, but it's not a question I've given a lot of thought to yet; open to suggestions. It seems like in general the preference is to avoid global state and instead tie things to an interpreter instance? I'll need to do a bit of research to understand exactly how that would affect the implementation. Doesn't seem like it should be a problem, though it might make the lookup at write time to see if we have a callback a bit slower.
msg414470 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-03-03 20:44
Per interpreter seems best.

If someone using this feature writes a buggy implementation of a callback that doesn't chain reliably, that is a bug in their code and all of the fallout from that is "just" a bug to be fixed in said code.

Think of it like a C signal handler, the OS doesn't chain for you - it is the signal handler installer's responsibility to chain to the previous one.  Not an unusual pattern for callback hooks in C.

We already have such global C callback hooks in SetProfile and SetTrace. https://docs.python.org/3/c-api/init.html#profiling-and-tracing

Those don't even provide for chaining.  We could do the same here and not let a hook be set more than once instead of providing a Get API to allow for manual chaining.  That seems less useful.
msg414531 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-03-04 14:41
Why so coarse?

Getting a notification for every change of a global in module, is likely to make use the use of global variables extremely expensive.

```
var = 0

CONST = 1
def foo(...):
    ...
```

I may well want to be notified if `foo` or `CONST` gets modified, but performance could suffer badly if we make a callback every time `var` is changed.

--------------

What happens if a watched dictionary is modified in a callback?

--------------

How do you plan to implement this? Steal a bit from `ma_version_tag` or replace `ma_version_tag`?
If you replace `ma_version_tag` this could actually speed things up a tad.

You'd probably need a PEP to replace PEP 509, but I think this may need a PEP anyway.
msg414540 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-03-04 17:39
> Why so coarse?

> Getting a notification for every change of a global in module, is likely to make use the use of global variables extremely expensive.

Perhaps a compromise is possible here: one global group/chain of callbacks registered for all dictionaries in an interpreter seems reasonable (to keep overhead low), but we could reduce the number of times it’s actually called by, say, tagging specific values of specific dictionaries to be watched.

For example, we could just tag the low bit of any pointer in a dictionary’s values that we want to be notified of changes to. Something like that seems like it could really keep the cost of watching down without sacrificing power.
msg414551 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-04 21:24
Thanks for the feedback!

> Why so coarse?

Simplicity of implementation is a strong advantage, all else equal :) And the coarse version is a) at least somewhat proven as useful and usable already by Cinder / Cinder JIT, and b) clearly doable without introducing memory or noticeable CPU overhead to unwatched dicts. Do you have thoughts about how you'd do a more granular version without overhead?

> Getting a notification for every change of a global in module, is likely to make use the use of global variables extremely expensive.

It's possible. We haven't ever observed this as an issue in practice, but we may have just not observed enough workloads with heavy writes to globals. I'd like to verify this problem with a real representative benchmark before making design decisions based on it, though. Calling a callback that is uninterested in a particular key doesn't need to be super-expensive if the callback is reasonably written, and this expense would occur only on the write path, for cases where the `global` keyword is used to rebind a global. I don't think it's common for idiomatic Python code to write to globals in perf-sensitive paths. Let's see how this shows up in pyperformance, if we try running it with all module globals dicts watched.

> For example, we could just tag the low bit of any pointer in a dictionary’s values that we want to be notified of changes to

Would you want to tag the value, or the key? If value, does that mean if the value is changed it would revert to unwatched unless you explicitly watched the new value?

I'm a bit concerned about the performance overhead this would create for use of dicts outside the write path, e.g. the need to mask off the watch bit of returned value pointers on lookup.

> What happens if a watched dictionary is modified in a callback?

It may be best to document that this isn't supported; it shouldn't be necessary or advisable for the intended uses of dict watching. That said, I think it should work fine if the callback can handle re-entrancy and doesn't create infinite recursion. Otherwise, I think it's a case of "you broke it, you get to keep all the pieces."

> How do you plan to implement this? Steal a bit from `ma_version_tag`

We currently steal the low bit from the version tag in Cinder; my plan was to keep that approach.

> You'd probably need a PEP to replace PEP 509, but I think this may need a PEP anyway.

I'd prefer to avoid coupling this to removal of the version tag. Then we get into issues of backward compatibility that this proposal otherwise avoids.

I don't think the current proposal is of a scope or level of user impact that should require a PEP, but I'm happy to write one if needed.
msg414803 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-09 14:38
Draft PR is up for consideration. Perf data in https://gist.github.com/carljm/987a7032ed851a5fe145524128bdb67a

Overall it seems like the base implementation is perf neutral -- maybe a slight impact on the pickle benchmarks? With all module global dicts (uselessly) watched, there are a few more benchmarks with small regressions, but also some with small improvements (just noise I guess?) -- overall still pretty close to neutral.

Comments welcome!
msg414804 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2022-03-09 16:30
A curiosity: have you considered watching dict keys rather than whole dicts?

That way, changing global values would not have to de-optimize, only adding new global keys would.

Indexing into dict values array wouldn't be as efficient as embedding direct jump targets in JIT-generated machine code, but as long as we're not doing that, maybe watching the keys is a happy medium?
msg414808 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-09 17:57
Hi Dennis, thanks for the questions!

> A curiosity: have you considered watching dict keys rather than whole dicts?

There's a bit of discussion of this above. A core requirement is to avoid any memory overhead and minimize CPU overhead on unwatched dicts. Additional memory overhead seems like a nonstarter, given the sheer number of dict objects that can exist in a large Python system. The CPU overhead for unwatched dicts in the current PR consists of a single added `testb` and `jne` (for checking if the dict is watched), in the write path only; I think that's effectively the minimum possible.

It's not clear to me how to implement per-key watching under this constraint. One option Brandt mentioned above is to steal the low bit of a `PyObject` pointer; in theory we could do this on `me_key` to implement per-key watching with no memory overhead. But then we are adding bit-masking overhead on every dict read and write. I think we really want the implementation here to be zero-overhead in the dict read path.

Open to suggestions if I've missed a good option here!

> That way, changing global values would not have to de-optimize, only adding new global keys would.

> Indexing into dict values array wouldn't be as efficient as embedding direct jump targets in JIT-generated machine code, but as long as we're not doing that, maybe watching the keys is a happy medium?

But we are doing that, in the Cinder JIT. Dict watching here is intentionally exposed for use by extensions, including hopefully  in future the Cinder JIT as an installable extension. We burn exact pointer values for module globals into generated JIT code and deopt if they change (we are close to landing a change to code-patch instead of deopting.) This is quite a bit more efficient in the hot path than having to go through a layer of indirection.

I don't want to assume too much about how dict watching will be used in future, or go for an implementation that limits its future usefulness. The current PR is quite flexible and can be used to implement a variety of caching strategies. The main downside of dict-level watching is that a lot of notifications will be fired if code does a lot of globals-rebinding in modules where globals are watched, but this doesn't appear to be a problem in practice, either in our workloads or in pyperformance. It seems likely that a workable strategy if this ever was observed to be a problem would be to notice at runtime that globals are being re-bound frequently in a particular module and just stop watching that module's globals.
msg414810 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-09 19:41
> have you considered watching dict keys rather than whole dicts?

Just realized that I misunderstood this suggestion; you don't mean per-key watching necessarily, you just mean _not_ notifying on dict values changes. Now I understand better how that connects to the second part of your comment! But yeah, I don't want this limitation on dict watching use cases.
msg414839 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-03-10 10:52
There are three kinds of changes that we might want to watch (that I can think of right now):

1. Any change.
   Rather coarse and potentially expensive. Used by Cinder.
2. A new key being added (or a change to the keys version as a proxy).
   Useful for detect shadowing of builtins by module globals, would save the keys version check for the module dict in `LOAD_GLOBAL_BUILTINS` and a version check in some `LOAD_METHOD` specializations.
3. The value corresponding to a particular key.
   With this we could effectively convert both `LOAD_GLOBAL` specializations into a constant, given an effective way to de-optimize.


One way to support the three cases above would be to replace the dict version with a pointer to a data structure describing what it watched.
If the pointer is `NULL`, then nothing is being watched.
The data structure would need 2 bits to cover cases 1 and 2, and 1 bit (or byte) for each key in the dict keys (or case 1 could be implemented by setting all the bits for case 3).
msg414877 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-10 20:40
Thanks for outlining the use cases. They make sense.

The current PR provides a flexible generic API that fully supports all three of those use cases (use cases 2 and 3 are strict subsets of use case 1.) Since the callback is called before the dict is modified, all the necessary information is available to the callback to decide whether the event is interesting to it or not.

The question is how much of the bookkeeping to classify events as "interesting" or "uninteresting" should be embedded in the core dispatch vs being handled by the callback.

One reason to prefer keeping this logic in the callback is that with potentially multiple chained callbacks in play, the filtering logic must always exist in the callback, regardless. E.g. if callback A wants to watch only keys-version changes to dict X, but callback B wants to watch all changes to it, events will fire for all changes, and callback A must still disregard "uninteresting" events that it may receive (just like it may receive events for dicts it never asked to watch at all.) So providing API for different "levels" of watching means that the "is this event interesting to me" predicate must effectively be duplicated both in the callback and in the watch level chosen.

The proposed rationale for this complexity and duplication is the idea that filtering out uninteresting events at dispatch will provide better performance. But this is hypothetical: it assumes the existence of perf-bottleneck code paths that repeatedly rebind globals. The only benchmark workload with this characteristic that I know of is pystone, and it is not even part of the pyperformance suite, I think precisely because it is not representative of real-world code patterns. And even assuming that we do need to optimize for such code, it's also not obvious that it will be noticeably cheaper in practice to filter on the dispatch side.

It may be more useful to focus on API. If we get the API right, internal implementation details can always be adjusted in future if a different implementation can be shown to be noticeably faster for relevant use cases. And if we get existing API right, we can always add new API if we have to. I don't think anything about the proposed simple API precludes adding `PyDict_WatchKeys` as an additional feature, if it turns out to be necessary.

One modification to the simple proposed API that should improve the performance (and ease of implementation) of use case #2 would be to split the current `PyDict_EVENT_MODIFIED` into two separate event types: `PyDict_EVENT_MODIFIED` and `PyDict_EVENT_NEW_KEY`. Then the callback-side event filtering for use case #2 would just be `event == PyDict_EVENT_NEW_KEY` instead of requiring a lookup into the dict to see whether the key was previously set or not.
msg414882 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-10 23:40
I've updated the PR to split `PyDict_EVENT_MODIFIED` into separate `PyDict_EVENT_ADDED`, `PyDict_EVENT_MODIFIED`, and `PyDict_EVENT_DELETED` event types. This allows callbacks only interested in e.g. added keys (case #2) to more easily and cheaply skip uninteresting events.
msg414898 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-03-11 11:14
You might not like global variables, they may not show up much in benchmarks, but people do use them. I suspect a lot of jupyter notebooks have quite a few global variables.


There should not be much of a slowdown for this code when watching `CONST`:

CONST = ... # watched
var = 0
for _ in range(LARGE_NUMBER):
    var += 1
CONST = ... # trigger event.
msg414899 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-03-11 11:20
Another use of this is to add watch points in debuggers.

To that end, it would better if the callback were a Python object.

The overhead is relatively small if using the vectorcall protocol.
If the call overhead matters that much, there is something wrong as way too many callbacks are happening.
msg415253 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-15 15:16
> There should not be much of a slowdown for this code when watching `CONST`:

How and when (and based on what data?) would the adaptive interpreter make the decision that for this code sample the key `CONST`, but not the key `var`, should be watched in the module globals dict? It's easy to contrive an example in which it's beneficial to watch one key but not another, but this is practically irrelevant unless it's also feasible for an optimizer to consistently make the right decision about which key(s) to watch.

The code sample also suggests that the module globals dict for a module is being watched while that module's own code object is being executed. In module body execution, writing to globals (vs reading them) is relatively much more common, compared to any other Python code execution context, and it's much less common for the same global to be read many times. Given this, how frequently would watching module globals dictionaries during module body execution be a net win at all? Certainly cases can be contrived in which it would be, but it seems unlikely that it would be a net win overall. And again, unless the optimizer can reliably (and in advance, since module bodies are executed only once) distinguish the cases where it's a win, it seems the example is not practically relevant.

> Another use of this is to add watch points in debuggers.
> To that end, it would better if the callback were a Python object.

It is easy to create a C callback that delegates to a Python callable if someone wants to implement this use case, so the vectorcall overhead is paid only when needed. The core API doesn't need to be made more complex for this, and there's no reason to impose any overhead at all on low-level interpreter-optimization use cases.
msg415261 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2022-03-15 16:16
Let me give you an example.

#module eggs

eggs_var = 0 # a variable, maybe a counter or similar
EGGS_CONST # a constant

#module spam

import eggs

spam_var # Another variable

def foo():
    use(eggs.EGGS_CONST)

-------------

We will want to treat `eggs.EGGS_CONST` as a constant.
To do that we need to be notified if `spam.eggs` or `eggs.EGGS_CONST` changes, but we do not want be notified whenever `spam.spam_var` or `eggs.eggs_var` changes.

This might not be necessary for us right now, but we will want to implement optimizations over larger regions than a single bytecode in 3.12.
msg415268 - (view) Author: Carl Meyer (carljm) * Date: 2022-03-15 17:17
Thanks for the extended example.

I think in order for this example to answer the question I asked, a few more assumptions should be made explicit:

1) Either `spam_var` and/or `eggs_var` are frequently re-bound to new values in a hot code path somewhere. (Given the observations above about module-level code, we should assume for a relevant example this takes place in a function that uses `global spam_var` or `global eggs_var` to allow such rebinding.)

2) But `spam_var` and `eggs_var` are not _read_ in any hot code path anywhere, because if they were, then the adaptive interpreter would be just as likely to decide to watch them as it is to watch `EGGS_CONST`, in which case any benefit of per-key watching in this example disappears. (Keep in mind that with possibly multiple watchers around, "unwatching" anything on the dispatch side is never an option, so we can't say that the adaptive interpreter would decide to unwatch the frequently-re-bound keys after it observes them being re-bound. It can  always "unwatch" them in the sense of no longer being interested in them in its callback, though.)

It is certainly possible that this case could occur, where some module contains both a frequently-read-but-not-written global and also a global that is re-bound using `global` keyword in a hot path, but rarely read. But it doesn't seem warranted to pre-emptively add a lot of complexity to the API in order to marginally improve the performance of this quite specific case, unsupported by any benchmark or sample workload demonstrating it.

> This might not be necessary for us right now

I think it's worth keeping in mind that `PyDict_WatchKey` API can always be added later without disturbing or changing semantics of the `PyDict_Watch` API added here.
History
Date User Action Args
2022-04-11 14:59:56adminsetgithub: 91052
2022-03-15 17:17:40carljmsetmessages: + msg415268
2022-03-15 16:16:21Mark.Shannonsetmessages: + msg415261
2022-03-15 15:16:50carljmsetmessages: + msg415253
2022-03-11 11:20:47Mark.Shannonsetmessages: + msg414899
2022-03-11 11:14:53Mark.Shannonsetmessages: + msg414898
2022-03-10 23:40:14carljmsetmessages: + msg414882
2022-03-10 20:40:35carljmsetmessages: + msg414877
2022-03-10 10:52:50Mark.Shannonsetmessages: + msg414839
2022-03-09 19:41:34carljmsetmessages: + msg414810
2022-03-09 17:57:21carljmsetmessages: + msg414808
2022-03-09 17:25:01gvanrossumsetnosy: + gvanrossum
2022-03-09 16:30:41Dennis Sweeneysetnosy: + Dennis Sweeney
messages: + msg414804
2022-03-09 14:38:29carljmsetmessages: + msg414803
2022-03-09 14:31:47carljmsetkeywords: + patch
stage: patch review
pull_requests: + pull_request29891
2022-03-04 21:24:30carljmsetmessages: + msg414551
2022-03-04 17:39:37brandtbuchersetmessages: + msg414540
2022-03-04 14:41:06Mark.Shannonsetmessages: + msg414531
2022-03-03 20:44:21gregory.p.smithsetmessages: + msg414470
2022-03-03 20:31:55carljmsetmessages: + msg414468
2022-03-03 20:17:12brandtbuchersetmessages: + msg414467
2022-03-03 20:15:00brandtbuchersetmessages: + msg414466
2022-03-03 20:13:56gregory.p.smithsetnosy: + Mark.Shannon
2022-03-03 20:10:16brandtbuchersetnosy: + brandtbucher
2022-03-03 19:45:45carljmsetmessages: + msg414462
2022-03-02 06:53:40gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg414330
2022-03-02 01:37:40barrysetnosy: + barry
2022-03-01 22:26:06carljmsettitle: add support for watching writes to selecting dictionaries -> add support for watching writes to selected dictionaries
2022-03-01 22:19:05carljmcreate