This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Enable informing callee it's awaited via vector call flag
Type: performance Stage: needs patch
Components: Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: dino.viehland Nosy List: Mark.Shannon, brandtbucher, carljm, corona10, dino.viehland, gvanrossum, itamaro, v2m
Priority: normal Keywords:

Created on 2022-03-09 00:37 by dino.viehland, last changed 2022-04-11 14:59 by admin.

Messages (5)
msg414782 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2022-03-09 00:37
The idea here is to add a new flag to the vectorcall nargs that indicates the call is being awaited: _Py_AWAITED_CALL_MARKER.  This flag lets the callee know that it is being awaited and can therefore be eagerly evaluated.  When the call is eagerly evaluated the callee can potentially avoid various amounts of overhead.  For a coroutine, the function can avoid creating the coroutine object and instead return a singleton instance of a wait handle indicating eager execution has occurred:
https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L6617

This gives a small win by reducing the overhead of allocating the coroutine object.
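The eager-execution idea can be sketched in pure Python (the names `WaitHandle` and `await_eagerly` are hypothetical illustrations, not the actual Cinder API): advance the coroutine one step, and if it finishes without suspending, return a completed wait handle carrying the result instead of the coroutine object.

```python
class WaitHandle:
    """Hypothetical stand-in for Cinder's singleton wait handle:
    signals that an awaited call completed synchronously."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


def await_eagerly(coro):
    """Advance the coroutine one step.  If it finishes without
    suspending, return a WaitHandle carrying the result; otherwise
    return the (already-started) coroutine for normal scheduling."""
    try:
        coro.send(None)
    except StopIteration as exc:
        return WaitHandle(exc.value)
    return coro


async def add(a, b):
    return a + b  # never suspends, so it completes eagerly


result = await_eagerly(add(1, 2))
print(isinstance(result, WaitHandle), result.value)  # True 3
```

In the real implementation the wait handle is a singleton whose value slot is overwritten per call, which is what makes the allocation savings possible; the sketch allocates one per call only for clarity.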

For something like gather, much more significant wins can be achieved.  If all of the inputs have already been computed, the creation of tasks and their scheduling onto the event loop can be elided.  An example implementation of this is available in Cinder: https://github.com/facebookincubator/cinder/blob/cinder/3.8/Modules/_asynciomodule.c#L7103

Again, the gather implementation uses the singleton wait-handle object to return the value, indicating that the computation completed synchronously.

We've used this elsewhere in Cinder as well - for example, an "AsyncLazyValue" lazily performs a one-time computation of a value and caches it.  The common case then becomes that the value is already available, and the await can be performed without allocating any intermediate values.
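A minimal pure-Python sketch of the AsyncLazyValue pattern described above (the class name comes from the message; the implementation details are illustrative, not Cinder's): the first await runs the computation, and later awaits return the cached value. An await-aware callee could serve the cached case without allocating a coroutine at all.

```python
import asyncio

_UNSET = object()  # sentinel: "not computed yet"


class AsyncLazyValue:
    """Compute a value once on first await; cache it for later awaits."""

    def __init__(self, factory):
        self._factory = factory  # async callable performing the computation
        self._value = _UNSET

    async def get(self):
        if self._value is _UNSET:            # slow path: first await
            self._value = await self._factory()
        return self._value                   # fast path: cached value


async def main():
    calls = 0

    async def expensive():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)  # stands in for real I/O
        return 42

    lazy = AsyncLazyValue(expensive)
    a = await lazy.get()
    b = await lazy.get()  # served from cache; factory not re-run
    return a, b, calls


print(asyncio.run(main()))  # (42, 42, 1)
```

Note that this sketch is not safe against two concurrent first awaits (both would run the factory); a production version would guard the slow path with a shared future.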
msg415301 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-03-15 21:47
The link https://github.com/facebookincubator/cinder/blob/cinder/3.8/Python/ceval.c#L6617 points to something that I wouldn't associate with the subject. @Dino, could you provide a new link (preferably a permalink)?

FWIW rather than dynamically checking what the next opcode is, maybe we could use a super-instruction for CALL + GET_AWAITABLE? (Understanding that there are a bunch of different CALL opcodes.)

The gather code you link to is all in C. Is rewriting gather in C the only way to benefit from this speedup? (I guess you could just add your gather impl to the existing _asynciomodule.c, in the same or a separate PR.)
msg415433 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2022-03-17 17:27
Doh, sorry about that link, this one goes to a specific commit: https://github.com/facebookincubator/cinder/blob/6863212ada4b569c15cd95c4e7a838f254c8ccfb/Python/ceval.c#L6642

I do think a new opcode is a good way to go, and that could just be emitted by the compiler when it recognizes the pattern.  I think we mainly avoided that because we had some issues around performance testing when we updated the bytecode version, and the cost of peeking at the next opcode was negligible, but with improved call performance in 3.11 that may not be the case anymore.

It's probably possible to keep most of gather in Python if necessary; there'd still need to be a C wrapper which could flow the awaited flag in, and the wait-handle creation would need to be possible from Python (which slightly scares me).  There's probably also a perf win from the C implementation - I'll see if @v2m has any data on that.
msg415640 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-03-20 22:54
I will wait until there is a draft PR to review, or until you ping me.

--Guido (mobile)
msg415795 - (view) Author: Vladimir Matveev (v2m) * Date: 2022-03-22 16:50
- introducing dedicated opcodes for each kind of awaited call is definitely an option. In fact, the first implementation used it; however, as Dino mentioned, it was more of a logistical issue (there were several spots that produced .pyc files, so the compiler needed to be up to date across all of them).
- there was some perf win from rewriting `gather` in C; however, the main reason for us to do it was the ability to be await-aware, which we made available only in C (returning a completed wait handle is not exposed to Python either) to reduce the scope. Providing the ability to follow the await-aware protocol (read the indicator that the call is awaited + return a completed wait handle for eagerly completed calls) in pure Python is definitely possible.
- to provide some context on why it was beneficial: a typical implementation of an endpoint in IG is an async function that in turn calls into numerous other async functions to generate an output.
  - `gather` is used all over the place when there is no sequential dependency between calls
  - the number of unique pieces of data ultimately fetched by async calls is not very big, i.e. the same fragment of information can be requested by different async calls, which makes memoization a very attractive strategy to reduce I/O and heavyweight computations.
  - memoized pieces of data are effectively represented by completed futures, and it is very common to have `gather` accept either a memoized value or a coroutine object that will complete synchronously by awaiting a memoized value.
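The memoized-futures case in the last bullet can be sketched with plain asyncio (the `eager_gather` helper is a hypothetical illustration of the shortcut, not Cinder's implementation): when every input is an already-completed future, the results can be collected directly, bypassing task creation and event-loop scheduling entirely.

```python
import asyncio


async def eager_gather(*awaitables):
    """If every input is an already-completed future, return results
    immediately; otherwise fall back to the normal task-based gather."""
    if all(isinstance(a, asyncio.Future) and a.done() for a in awaitables):
        return [a.result() for a in awaitables]  # no tasks, no scheduling
    return await asyncio.gather(*awaitables)     # slow path


async def main():
    loop = asyncio.get_running_loop()

    # memoized values, represented as completed futures
    f1, f2 = loop.create_future(), loop.create_future()
    f1.set_result("a")
    f2.set_result("b")
    fast = await eager_gather(f1, f2)  # fast path: no tasks created

    async def fetch():
        await asyncio.sleep(0)
        return "c"

    slow = await eager_gather(fetch())  # falls back to asyncio.gather
    return fast, slow


print(asyncio.run(main()))  # (['a', 'b'], ['c'])
```

The real await-aware `gather` goes further: it also runs coroutine arguments eagerly and only falls back to task creation for the ones that actually suspend, which is what the benchmark below measures.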

Before making `gather` await-aware, it always had to follow the standard process and convert awaitables into tasks that are queued into the event loop for execution. In our workload, task creation/queueing added a noticeable overhead. With await-aware `gather` we can execute coroutine objects eagerly and, if they were not suspended, bypass task creation entirely.
```
import asyncio
import time

async def step(i):
    if i == 0:
        return
    await asyncio.gather(*[step(i - 1) for _ in range(6)])

async def main():
    t0 = time.perf_counter()
    await step(6)
    t1 = time.perf_counter()
    print(f"{t1 - t0} s")

N = 0
def create_task(loop, coro):
    global N
    N += 1
    return asyncio.Task(coro, loop=loop)

loop = asyncio.get_event_loop()
loop.set_task_factory(create_task)
loop.run_until_complete(main())
print(f"{N} tasks created")

# Cinder
# 0.028410961851477623 s
# 1 tasks created

# Python 3.8
# 1.2157012447714806 s
# 55987 tasks created
```
History
Date User Action Args
2022-04-11 14:59:57  admin          set     github: 91121
2022-03-22 16:50:02  v2m            set     messages: + msg415795
2022-03-20 22:54:07  gvanrossum     set     messages: + msg415640
2022-03-20 17:00:20  corona10       set     nosy: + corona10
2022-03-17 17:27:01  dino.viehland  set     messages: + msg415433
2022-03-15 21:47:47  gvanrossum     set     messages: + msg415301
2022-03-09 17:49:55  v2m            set     nosy: + v2m
2022-03-09 17:11:45  brandtbucher   set     nosy: + brandtbucher
2022-03-09 00:37:24  dino.viehland  create