This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Classification
Title: Add jump table for certain safe match-case statements
Type: performance
Stage: patch review
Components: Interpreter Core
Versions: Python 3.11

Process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: AndersMunch, Dennis Sweeney, Kshitiz17, Mark.Shannon, brandtbucher, corona10, kj, rhettinger, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2021-06-01 21:58 by Dennis Sweeney, last changed 2022-04-11 14:59 by admin.

Files
File name: matchperf.py (uploaded by Dennis Sweeney, 2021-06-12 22:29)

Pull Requests
PR 26697 (open), Dennis Sweeney, 2021-06-12 20:25
Messages (20)
msg394874 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-01 21:58
Anecdotally, a few people I've talked to have expected that match-case statements would improve performance in the same way that switch-cases sometimes do in C-like languages. While this is impossible in general without changing semantics (since, for example, __eq__ can have side effects), it would be nice if a jump table were used in certain safe cases.
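
As a minimal illustration of why this is unsafe in general (a hypothetical subject type, not from this report), an __eq__ with a visible side effect must still run once for each case that is tried:

class Loud(int):
    # hypothetical subject whose __eq__ has an observable side effect
    def __eq__(self, other):
        print("comparing!")
        return int.__eq__(self, other)
    __hash__ = int.__hash__  # defining __eq__ would otherwise unset __hash__

match Loud(20):
    case 10:
        print("ten")
    case 20:
        print("twenty")

# Prints "comparing!" twice before "twenty"; a jump table that hashed the
# subject instead would skip those calls, changing observable behavior.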

My idea was to implement this whenever the list of cases begins with 2 or more of the following in a row:

    (1) ints
    (2) strings
    (3) None
    (4) OR (`|`) patterns of any of the above
    (5?) potentially later add support for tuples

Example:

    match x:
        case 10:             # safe
            print("A")
        case 20:             # safe
            print("B")
        case 30 | 31:        # safe
            print("C")
        case 40:             # safe
            print("D")
        case MyClass(10, 20): # unsafe
            print("E")
        case []:              # unsafe
            print("F")
        case whatever:        # unsafe
            print("G")


This would compile to something like

    LOAD_FAST (x)
    LOAD_CONST 16            ({10: 16, 20: 38, 30: 80, 31: 80, 40: 102})
    ATTEMPT_SAFE_MATCH 110

    ... all the rest of the code is the same ...

Where the new opcode would be something like

case TARGET(ATTEMPT_SAFE_MATCH): {
    PyObject *jump_dict = POP();
    PyObject *x = TOP();

    if (PyLong_CheckExact(x) ||
        PyUnicode_CheckExact(x) ||
        Py_Is(x, Py_None))
    {
        // borrowed reference; no decref of the result needed
        PyObject *boxed = PyDict_GetItemWithError(jump_dict, x);
        Py_DECREF(jump_dict);
        if (boxed == NULL) {
            if (PyErr_Occurred()) {
                goto error;
            }
            // no entry for this value: jump past the table-backed cases
            JUMPBY(oparg);
            DISPATCH();
        }
        assert(PyLong_CheckExact(boxed));
        int target = (int)PyLong_AsLong(boxed);
        JUMPBY(target);
        DISPATCH();
    }
    else {
        // fall back to general matching code
        Py_DECREF(jump_dict);
        DISPATCH();
    }
}

I can work on an implementation in the next couple of weeks.
msg394878 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-06-01 23:48
Big +1 from me.  This would make use of match-case more compelling.
msg394882 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-02 01:38
Seems like a good idea as long as we're careful about the implementation. I've just jotted down a few notes here:

- We should probably break the table upon encountering a guard.

- We probably can't get away with storing a dictionary in the code object's constants, since the entire code object needs to be hashable. We could store a tuple of key-value pairs and either (a) build the dictionary each time we enter the block, or (b) stash it somewhere. If we stash it:
  - Should we defer building it until the first time we hit the block, or earlier than that?
  - Where would we stash it? (A build-once sketch follows these notes.)

- We should probably have some heuristic, perhaps more restrictive than "two or more of the following in a row". I'm not entirely sure that loading/building a dictionary, hashing an integer/string/None, looking it up in a dictionary, and then either (a) recovering from the error, or (b) converting the Python integer value into a C long and jumping on it is faster than the current implementation for the first two cases. I know it's really fast, but I'm not *sure* it will be an obvious win yet. Ideally, the best-case scenario (match on first case) would be just as fast as it is currently. I'm probably willing to speed up the average case at the expense of the best case, though.

- I'm not sure we need to limit this to leading cases. What's stopping us from having one of these jump tables in the middle of a match block (or even having more than one scattered throughout)? Unless I'm missing something, this should work anytime we observe contiguous "simple" cases.
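
A rough pure-Python sketch of the build-once-and-stash idea from the notes above (all names hypothetical):

# The compiler emits a hashable tuple of (key, jump_offset) pairs; the
# interpreter builds the real dict lazily, the first time this match site
# runs, and reuses it afterwards.
_JUMP_TABLES = {}  # hypothetical stash, keyed by (code object id, offset)

def jump_table_for(code_id, offset, pairs):
    site = (code_id, offset)
    table = _JUMP_TABLES.get(site)
    if table is None:
        table = dict(pairs)       # linear build cost, paid once per site
        _JUMP_TABLES[site] = table
    return table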
msg394884 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-02 02:47
At first I thought of matches with only the int/str/None cases and then I saw the easy generalization, but yeah, each contiguous block seems better.

One solution to the apparent dilemma between code-object hashability and lookup speed: writing the equivalent of this in C:

class HashableDict(dict):
    def __eq__(self, other):
        return tuple(self.items()) == tuple(other.items())
    def __hash__(self):
        return CONSTANT ^ hash(tuple(self.items()))
    def __getnewargs__(self):
        return (tuple(self.items()),)

    # forbid all mutating methods
    __setitem__ = __delitem__ = update = __ior__ = None  # etc.

(Or maybe using composition rather than inheritance, sketched below.)
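
A sketch of that composition variant (hypothetical, not from any patch):

class FrozenMap:
    # hashable read-only wrapper around a dict; no mutating methods exposed
    __slots__ = ("_d", "_hash")

    def __init__(self, pairs):
        self._d = dict(pairs)
        self._hash = hash(tuple(self._d.items()))

    def __getitem__(self, key):
        return self._d[key]  # lookups delegate to the wrapped dict

    def __eq__(self, other):
        return isinstance(other, FrozenMap) and self._d == other._d

    def __hash__(self):
        return self._hash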

Maybe using the HAMT in /Python/hamt.c would suffice instead.
msg394887 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-02 03:35
Hm. I’m not sure if the HAMT makes much sense. We don’t really care about efficient mutation or copies... constant-time lookups seem to be overwhelmingly the most valuable factor here.

I personally think that creating some sort of hashable dict and teaching marshal how to serialize/deserialize it could be a promising path forward. I imagine its actual design could mirror dict (sort of like set/frozenset).

That might be a rather drastic step though. Perhaps it’s better to first prove that building a mapping at runtime is too slow.
msg394889 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-02 04:46
I agree that we should benchmark for what the value of "two" should be ;) . Maybe only match statements with >=20 consecutive simple cases end up being worth it. I would assume that if "case 1, case 2, ..., case 20" are all in a row then "case 1" wouldn't be *that* much more likely than "case 20", so a slowdown on case 1 wouldn't matter too much if the average gets better.

I'm under the impression that match-case statements would frequently be used only once per function call. If I understand your proposal, that would mean that calling a function with a N-case constant-pattern match would require N hashes and N insertions into a dict (including memory allocation), followed by O(1) lookup. Wouldn't that eliminate any hope of making such a function O(1)? Unless there's a way to cache the populated dict somewhere else?

I suggested HAMT because it's a well-tested pre-existing fast immutable mapping that already implements __hash__, so it would be safe to store in co_consts already populated. Re-implementing all of the dict logic seems like an unnecessary pain, which is why I was suggesting either the HAMT or a thin wrapper around dict, not a re-implementation of a new hash table. I was hoping to do the minimum amount of disruption possible to get reasonable O(1) performance, and then seeing where that leaves us.
msg394891 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-02 06:46
> If I understand your proposal, that would mean that calling a function with a N-case constant-pattern match would require N hashes and N insertions into a dict (including memory allocation), followed by O(1) lookup. Wouldn't that eliminate any hope of making such a function O(1)? Unless there's a way to cache the populated dict somewhere else?

Yeah, that was the idea sketched out above. Basically, if we go with a vanilla dictionary here, we're going to need to build it at runtime. It only really makes sense to do that once, the first time we need it. Then we stash it away... uh... somewhere. As you note, it doesn't make much sense to spend linear time building a constant-time jump table if it's only going to be used once. :)

Maybe somebody familiar with the constantly-evolving opcaches could chime in if this is the sort of thing that could benefit from that infrastructure. Basically, we would need a separate cache per bytecode offset, per code object. My understanding is that we don't have anything like that yet.

(A quick-and-dirty prototype could probably store them at "hidden" local names like ".table0", ".table1", ".table2", etc. I know comprehensions do something like that.)

> Re-implementing all of the dict logic seems like an unnecessary pain, which is why I was suggesting either the HAMT or a thin wrapper around dict, not a re-implementation of a new hash table.

Well, I don't think we'd need to do any of that. I believe set and frozenset share the same core design and routines, but differ only in the interfaces provided by the objects themselves. I imagine we could do something similar with a hypothetical _PyFrozenDict_Type... copy most of the type definition, dropping all of the methods except mp_subscript, tp_hash, and a few other members. That would probably be all we needed to get this design up and running for a proof-of-concept.

A lot of work goes into making dicts fast (especially for things like strings), so it would be nice for a new type to be able to benefit from those incremental improvements.

> I was hoping to do the minimum amount of disruption possible to get reasonable O(1) performance, and then seeing where that leaves us.

Agreed, but the HAMT mapping has logarithmic time complexity for lookups, right? Maybe for realistic cases the coefficients make it basically equivalent, but on the surface it seems more promising to try reusing the dict guts instead.
msg394892 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-02 07:03
FWIW PEP 603 includes a graph (Figure 2) of dict.get versus hamt.get performance as the mappings grow, and it seems the hamt is roughly 1.4x slower on inputs of sizes between 10 and 1000.
msg394975 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-03 02:47
For anyone curious, I had some free time today and took a stab at creating a minimal _frozendict type (sharing as much implementation with dict as possible) to see how difficult it would be.

It was actually much simpler than I thought... just a few dozen lines of code, including marshal support. If you'd like to see it, you can check out my "frozendict" branch here:

https://github.com/python/cpython/compare/main...brandtbucher:frozendict

For testing purposes, I've exposed it in the _testcapi module. It has the same constructor signature as dict, and basically only supports lookups:

>>> from _testcapi import _frozendict
>>> fd = _frozendict({1: 2, "3": 4, None: 5})
>>> fd
<_frozendict object at 0x7f4e127e4ef0>
>>> fd[1]
2
>>> fd[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 3
>>> import marshal
>>> marshal.loads(marshal.dumps(fd)) == fd
True

I'm not gonna lie... I really like it. It sidesteps any runtime construction/caching issues that using a normal dict would have, but with all of the same performance characteristics. It also seems like it would not introduce much of a maintenance burden, since it has very little of its own code.

Anyways, food for thought. It was a fun experiment at the very least.
msg394977 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-03 03:27
Very interesting -- that is shorter than I thought also!

If we really wanted to optimize the tar out of this, we could probably find a way to re-use just the PyDictKeysObject to find the index into a C-array of integer jump targets and not have to worry about unboxing.
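
A pure-Python model of that idea (names and offsets hypothetical): the keys map to small slot indices, and a flat tuple of raw jump targets stands in for what would be a C array of ints.

KEY_TO_SLOT = {10: 0, 20: 1, 30: 2, 31: 2, 40: 3}  # constants from msg394874
JUMP_TARGETS = (16, 38, 80, 102)                   # raw bytecode offsets

def find_target(subject):
    # one shared lookup structure; no boxed integer values to unpack
    slot = KEY_TO_SLOT.get(subject)
    return None if slot is None else JUMP_TARGETS[slot]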
msg395019 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-03 16:45
I'm hoping we can get something close to that for free by just waiting... the faster-cpython folks are working on a tagged-pointer implementation of integers as we speak.

I've been looking for a new project, so I'd love to help work on this issue (if you're open to it). The pattern compiler is a bit of a beast, and I like to think I have a better-than-average understanding of it. ;)
msg395069 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-04 09:42
Hi Brandt,

I would welcome your collaboration/mentorship in whatever capacity makes sense. I sent an email to brandt@python.org from sweeney.dennis650@gmail.com.
msg395265 - (view) Author: Anders Munch (AndersMunch) Date: 2021-06-07 14:08
Are you sure you want to do this?

This optimisation is not applicable if the matched values are given symbolic names.  You would be encouraging people to write bad code with lots of literals, for speed.
msg395278 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-07 19:30
> Are you sure you want to do this?

No, not yet. But there has been some positive feedback here, and it seems like an interesting project to prototype. :)

> This optimisation is not applicable if the matched values are given symbolic names. You would be encouraging people to write bad code with lots of literals, for speed.

I had the same concern initially, but I'm honestly not losing too much sleep over it. Literals already get lots of optimizations that symbolic constants don't, and to me it seems a bit silly to avoid optimizing literals here when it is quite straightforward (and powerful) to do so. There are also many cases where using symbolic constants for certain literals doesn't make much sense.

I'd be okay not loudly publicizing this change if it lands. After all, I'm the guy who spent the better part of a year trying to convince people that this is Not A Switch Statement. :)

(Although, on the flip side, maybe it will help appease the people who have been asking for one all these years.)
msg395717 - (view) Author: Dennis Sweeney (Dennis Sweeney) * (Python committer) Date: 2021-06-12 22:29
Here are some benchmarks:

PS C:\Users\sween\Source\Repos\cpython2\cpython> .\python.bat -m pyperf compare_to .\main.json .\PR-26697.json
Running Release|x64 interpreter...
1: Mean +- std dev: [main] 117 us +- 4 us -> [PR-26697] 122 us +- 3 us: 1.04x slower
2: Mean +- std dev: [main] 202 us +- 11 us -> [PR-26697] 180 us +- 7 us: 1.12x faster
3: Mean +- std dev: [main] 320 us +- 11 us -> [PR-26697] 243 us +- 13 us: 1.32x faster
4: Mean +- std dev: [main] 474 us +- 15 us -> [PR-26697] 306 us +- 15 us: 1.55x faster
5: Mean +- std dev: [main] 625 us +- 27 us -> [PR-26697] 341 us +- 6 us: 1.83x faster
6: Mean +- std dev: [main] 806 us +- 24 us -> [PR-26697] 404 us +- 20 us: 1.99x faster
7: Mean +- std dev: [main] 1.01 ms +- 0.04 ms -> [PR-26697] 461 us +- 19 us: 2.19x faster
8: Mean +- std dev: [main] 1.22 ms +- 0.03 ms -> [PR-26697] 538 us +- 20 us: 2.27x faster
9: Mean +- std dev: [main] 1.47 ms +- 0.05 ms -> [PR-26697] 607 us +- 27 us: 2.42x faster
10: Mean +- std dev: [main] 1.75 ms +- 0.07 ms -> [PR-26697] 666 us +- 36 us: 2.62x faster
msg395732 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-06-13 08:30
How common are match-case statements with literal integers? I expect that in most cases such magic numbers are not used directly, but as named constants.
msg395738 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021-06-13 09:54
> How common are match-case statements with literal integers?

Nothing is common yet because the feature hasn't been released ;-)

I suspect that if match-case were optimized for integer constants, they would be used.  In discussions about case statements, there are two groups, one looking for a prettier if-elif-else chain and the other looking for a fast vectored jump.  This optimization is aimed at the latter group -- they would be happy to have this opportunity to make faster code, even if it means using integer constants.
msg395797 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-06-14 13:45
This is going in the wrong direction.

Rather than add more complex instructions for use only by pattern matching, we need to simplify the individual operations and re-use existing instructions.
That way pattern matching can benefit from the general performance improvements that we are making.


If you are serious about improving the performance of pattern matching, you need to do the following in the compiler:
1. Generate a decision tree.
2. Improve that decision tree so that fewer decisions are required.
3. Generate bytecode from that tree that uses lower level operations.

E.g.

match x:
    case [1]:
        A
    case [2]:
        B

The initial decision tree looks like:

if is_sequence(x):
    if len(x) == 1:
        if x[0] == 1:
            A
if is_sequence(x):
    if len(x) == 1:
        if x[0] == 2:
            B

which can be improved to:

if is_sequence(x):
    if len(x) == 1:
        if x[0] == 1:
            A
        elif x[0] == 2:
            B

For a sequence of integer constants, introducing the test `type(x) == int` at the start would allow you to convert the linear sequence of tests into a tree, reducing `n` tests to roughly `log(n) + 1` tests.
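
As a pure-Python illustration of that reduction, reusing the integer constants from the first message (names hypothetical):

import bisect

CONSTS = (10, 20, 30, 31, 40)         # sorted case constants
HANDLERS = ("A", "B", "C", "C", "D")  # one handler label per constant

def dispatch(x):
    if type(x) is int:                # the single up-front type test
        i = bisect.bisect_left(CONSTS, x)
        if i < len(CONSTS) and CONSTS[i] == x:
            return HANDLERS[i]        # about log(n) + 1 tests instead of n
    return None                       # fall through to the remaining cases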
msg395809 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-14 16:34
Sorry, I’ve been out on vacation this weekend. I didn’t realize that there was already a PR for this… I’m honestly not sure that it’s totally ready yet.

While I absolutely agree that compiling efficient decision trees lies in our future, it certainly seems to me that using *both* strategies (decision trees and jump tables) would create the most efficient branching structure. In effect, our control flow would be a rose tree.

In your example, Mark, we could perform the checks for the integers 1 and 2 simultaneously, right?

Or consider a slightly more complicated example (simplified from PEP 636):

match …:
    case ["quit"]: …
    case ["look"]: …
    case ["get", obj]: …
    case ["go", direction]: …

We could compile this to something like:

- Check for Sequence.
- Multi-way jump on length.
- Multi-way jump on first item.

…which looks ideal to me.
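
A pure-Python model of that two-level dispatch (handler names hypothetical):

from collections.abc import Sequence

# First level keys on the sequence length, second level on the first item.
TABLE = {
    1: {"quit": "handle_quit", "look": "handle_look"},
    2: {"get": "handle_get", "go": "handle_go"},
}

def dispatch(subject):
    # sequence patterns deliberately exclude str/bytes subjects
    if isinstance(subject, Sequence) and not isinstance(subject, (str, bytes)):
        by_first = TABLE.get(len(subject))
        if by_first is not None:
            return by_first.get(subject[0])
    return None  # no table entry: fall through to the remaining cases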

I’m also confident that the jumping ops could be broken up into fairly primitive instructions. A multi-way branch would just look something like:

(subject on TOS)
LOAD_CONST(<some hashable defaultdict>)
ROT_TWO
BINARY_SUBSCR
JUMP_FORWARD_TOS

…where only JUMP_FORWARD_TOS is new.

> For a sequence of integer constants, introducing the test `type(x) == int` at the start would allow you to convert the linear sequence of tests into a tree. Reducing `n` tests to `ln(n) + 1` tests.

Sorry, I’m having a bit of difficulty following this part. Is it something like the strategy I just described?
msg395813 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021-06-14 17:02
Also (because some of the people who might be interested are nosied on this issue), I’ve been working a bit on general performance benchmarks for our pattern-matching implementation:

https://github.com/brandtbucher/patmaperformance

I still need something that hits sequence patterns hard, but I think these will help give us a good baseline for measuring future perf work in this area.

(There’s also the perf script at the bottom of Lib/test/test_patma.py, but the test cases it runs on probably are much less representative of “real” code.)
History
Date User Action Args
2022-04-11 14:59:46  admin  set  github: 88449
2021-06-14 17:02:35  brandtbucher  set  messages: + msg395813
2021-06-14 16:34:00  brandtbucher  set  messages: + msg395809
2021-06-14 13:45:33  Mark.Shannon  set  nosy: + Mark.Shannon; messages: + msg395797
2021-06-13 09:54:55  rhettinger  set  messages: + msg395738
2021-06-13 08:30:51  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg395732
2021-06-12 22:29:13  Dennis Sweeney  set  files: + matchperf.py; messages: + msg395717
2021-06-12 20:25:55  Dennis Sweeney  set  keywords: + patch; stage: patch review; pull_requests: + pull_request25282
2021-06-09 17:07:25  corona10  set  nosy: + corona10
2021-06-07 19:30:42  brandtbucher  set  messages: + msg395278
2021-06-07 14:08:45  AndersMunch  set  nosy: + AndersMunch; messages: + msg395265
2021-06-04 09:42:57  Dennis Sweeney  set  messages: + msg395069
2021-06-03 16:45:57  brandtbucher  set  messages: + msg395019
2021-06-03 05:54:19  Kshitiz17  set  nosy: + Kshitiz17
2021-06-03 03:27:53  Dennis Sweeney  set  messages: + msg394977
2021-06-03 02:47:54  brandtbucher  set  messages: + msg394975
2021-06-02 14:02:00  kj  set  nosy: + kj
2021-06-02 07:03:36  Dennis Sweeney  set  messages: + msg394892
2021-06-02 06:46:42  brandtbucher  set  messages: + msg394891
2021-06-02 04:46:22  Dennis Sweeney  set  messages: + msg394889
2021-06-02 03:35:37  brandtbucher  set  messages: + msg394887
2021-06-02 02:47:58  Dennis Sweeney  set  messages: + msg394884
2021-06-02 01:38:43  brandtbucher  set  messages: + msg394882
2021-06-02 01:15:31  brandtbucher  set  nosy: + brandtbucher
2021-06-01 23:48:57  rhettinger  set  nosy: + rhettinger; messages: + msg394878
2021-06-01 21:58:12  Dennis Sweeney  create