
classification
Title: Add a PEP578 audit hook for Asyncio loop stalls
Type: enhancement
Stage: patch review
Components: asyncio
Versions: Python 3.11
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: asvetlov, christian.heimes, orf, steve.dower, yselivanov
Priority: normal
Keywords: patch

Created on 2021-05-08 12:19 by orf, last changed 2022-04-11 14:59 by admin.

Pull Requests
PR 25990, status: open, linked by orf on 2021-05-08 12:46
Messages (6)
msg393251 - Author: Tom Forbes (orf) * Date: 2021-05-08 12:19
Detecting and monitoring loop stalls in a production asyncio application is more difficult than it could be.

First you must enable debug mode for the entire loop, then you need to watch for warnings emitted via the asyncio logger. This makes it hard to report loop stalls to monitoring systems via something like statsd.
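For illustration, the workaround described above can be sketched roughly as follows. This is a minimal sketch, not code from the linked PR: `StallCollector`, `blocking_task` and `run_with_stall_detection` are hypothetical names, and matching on the words "took" and "seconds" assumes the wording of the slow-callback warning that asyncio emits in debug mode.

```python
import asyncio
import logging
import time


class StallCollector(logging.Handler):
    """Hypothetical helper: collects asyncio slow-callback warnings."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.stalls = []

    def emit(self, record):
        # Assumption: in debug mode asyncio logs a warning of the form
        # "Executing <handle> took N seconds" for slow callbacks.
        message = record.getMessage()
        if "took" in message and "seconds" in message:
            self.stalls.append(message)


async def blocking_task():
    # A blocking call inside a coroutine stalls the whole event loop.
    time.sleep(0.2)


def run_with_stall_detection():
    collector = StallCollector()
    logger = logging.getLogger("asyncio")
    logger.addHandler(collector)
    loop = asyncio.new_event_loop()
    try:
        loop.set_debug(True)  # slow-callback warnings require debug mode
        loop.slow_callback_duration = 0.05  # threshold in seconds
        loop.run_until_complete(blocking_task())
    finally:
        loop.close()
        logger.removeHandler(collector)
    return collector.stalls
```

A monitoring integration would then have to parse these log messages to extract the duration, which is the indirection the issue is about.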

Ideally asyncio callbacks would always be timed, and an audit event would always be triggered when a callback exceeds a particular threshold. If debug mode is enabled, a warning would be logged as well.
msg393253 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-05-08 12:25
Are you proposing to use PEP 578 for monitoring the event loop?
msg393255 - Author: Tom Forbes (orf) * Date: 2021-05-08 12:33
I don't see why we shouldn't use PEP 578 for this - the events provide rich monitoring information about what a Python process is "doing" with an easy, central way to register callbacks to receive these events and shovel them off to a monitoring solution.

Is there that much of a difference between monitoring the number of files, sockets, emails or even web browsers opened and the number of times an asyncio application has stalled?

The alternative would be to make loop stalls some kind of hookable event, which just seems like reinventing `sys.audit()`.
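A rough sketch of what consuming such an event through PEP 578 could look like. The event name `example.asyncio.stall`, the argument layout and the `run_timed` helper are all assumptions made for illustration; they are not taken from the linked PR or from asyncio itself. Only `sys.audit()` and `sys.addaudithook()` are real APIs here.

```python
import sys
import time

# Hypothetical event name; the real name (if any) is not stated in this thread.
STALL_EVENT = "example.asyncio.stall"

stalls = []


def stall_hook(event, args):
    # Audit hooks receive every audit event in the process,
    # so filter by name first and keep the hook cheap.
    if event == STALL_EVENT:
        callback_name, duration = args
        stalls.append((callback_name, duration))


# Note: hooks added this way cannot be removed again.
sys.addaudithook(stall_hook)


def run_timed(callback, threshold=0.1):
    """Hypothetical stand-in for the loop timing a single callback."""
    start = time.monotonic()
    result = callback()
    duration = time.monotonic() - start
    if duration >= threshold:
        name = getattr(callback, "__name__", repr(callback))
        sys.audit(STALL_EVENT, name, duration)
    return result
```

In a real deployment the hook body would forward the duration to a metrics client instead of appending to a list.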
msg393263 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-05-08 15:58
Fundamentally I don't have an issue with the audit hook. My only concern would be if there's anything that an application may do to _respond_ to a stall (e.g. is this valuable for applying backpressure? etc.)

If it's purely diagnostic, and there's nothing you'd do in production when it happens, then an audit hook is perfect.
msg393267 - Author: Tom Forbes (orf) * Date: 2021-05-08 16:11
Actually reacting to a stall would require something more and probably should be done at some point.

But this is purely about monitoring - in our use case we'd send a metric via statsd that would be used to correlate stalls against other service level metrics. This seems pretty critical when running a large number of asyncio applications in production, because currently you can only _infer_ that a stall is happening, and it's hard to trace the cause across service boundaries. An event hook that receives the loop and the handle would be ideal for this.
msg415542 - Author: Andrew Svetlov (asvetlov) * (Python committer) Date: 2022-03-19 12:00
I am still not convinced that audit events should be used.

Maybe supporting an explicit pair of callbacks (on_start() + on_finish()), with `None` as a fast-and-cheap "do nothing" flag, is a better alternative for catching stalled coroutines?
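One possible reading of this suggestion, sketched under assumptions: `run_callback` and `StallWatcher` are hypothetical names, and the hook signatures (each hook receiving the callback being run) are invented for illustration. No such API exists in asyncio.

```python
import time


def run_callback(callback, on_start=None, on_finish=None):
    """Hypothetical: run a callback with optional hooks around it."""
    if on_start is None and on_finish is None:
        # Fast path: `None` means "do nothing", so no timing overhead.
        return callback()
    if on_start is not None:
        on_start(callback)
    try:
        return callback()
    finally:
        if on_finish is not None:
            on_finish(callback)


class StallWatcher:
    """Hypothetical consumer that records callbacks exceeding a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.stalls = []
        self._started = None

    def on_start(self, callback):
        self._started = time.monotonic()

    def on_finish(self, callback):
        duration = time.monotonic() - self._started
        if duration >= self.threshold:
            self.stalls.append((callback, duration))
```

Compared with an audit hook, this design lets the application opt in per loop and pay nothing when both hooks are `None`, at the cost of adding new API surface.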
History
Date User Action Args
2022-04-11 14:59:45  admin  set  github: 88241
2022-03-19 12:00:57  asvetlov  set  messages: + msg415542
2021-05-08 16:11:02  orf  set  messages: + msg393267
2021-05-08 15:58:00  steve.dower  set  messages: + msg393263
2021-05-08 12:46:47  orf  set  keywords: + patch; stage: patch review; pull_requests: + pull_request24642
2021-05-08 12:33:37  orf  set  messages: + msg393255
2021-05-08 12:25:11  christian.heimes  set  nosy: + christian.heimes, steve.dower; messages: + msg393253
2021-05-08 12:19:46  orf  create