Issue 44075: Add a PEP578 audit hook for Asyncio loop stalls

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/88241

classification

Title:	Add a PEP578 audit hook for Asyncio loop stalls
Type:	enhancement	Stage:	patch review
Components:	asyncio	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	asvetlov, christian.heimes, orf, steve.dower, yselivanov
Priority:	normal	Keywords:	patch

Created on 2021-05-08 12:19 by orf, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 25990	open	orf, 2021-05-08 12:46

Messages (6)
msg393251 - (view)	Author: Tom Forbes (orf) *	Date: 2021-05-08 12:19
Detecting and monitoring loop stalls in a production asyncio application is more difficult than it could be. First you must enable debug mode for the entire loop, then you need to look for warnings emitted via the asyncio logger. This makes it hard to send loop stalls to monitoring systems via something like statsd. Ideally, asyncio callbacks would always be timed, and an audit event would always be triggered when a callback exceeds a particular threshold. If debug mode is enabled, a warning would also be logged.
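For illustration only, here is a minimal sketch of the existing workaround described above: enabling debug mode, lowering `loop.slow_callback_duration`, and capturing the warnings from the `asyncio` logger. The `StallCollector` class and the helper function names are hypothetical, and matching on the message text is an assumption about the current wording of the warning, which is one reason this approach is fragile.

```python
import asyncio
import logging
import time


class StallCollector(logging.Handler):
    """Hypothetical handler that keeps slow-callback warnings from asyncio."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.stalls = []

    def emit(self, record):
        message = record.getMessage()
        # Assumption: debug mode logs "Executing <handle> took N.NNN seconds".
        if message.startswith("Executing ") and " took " in message:
            self.stalls.append(message)


async def blocking_work():
    # Deliberately block the event loop to provoke a stall warning.
    time.sleep(0.05)


def run_with_stall_detection():
    collector = StallCollector()
    logger = logging.getLogger("asyncio")
    logger.addHandler(collector)
    try:
        loop = asyncio.new_event_loop()
        try:
            loop.set_debug(True)                # debug mode for the entire loop
            loop.slow_callback_duration = 0.01  # threshold in seconds
            loop.run_until_complete(blocking_work())
        finally:
            loop.close()
    finally:
        logger.removeHandler(collector)
    return collector.stalls
```

Calling `run_with_stall_detection()` should return at least one captured warning string, which would then have to be parsed before it could be forwarded to a metrics system.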
msg393253 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2021-05-08 12:25
Are you proposing to use PEP 578 for monitoring the event loop?
msg393255 - (view)	Author: Tom Forbes (orf) *	Date: 2021-05-08 12:33
I don't see why we shouldn't use PEP 578 for this - the events provide rich monitoring information about what a Python process is "doing" with an easy, central way to register callbacks to receive these events and shovel them off to a monitoring solution. Is there that much of a difference between monitoring the number of files, sockets, emails or even web browsers opened and the number of times an asyncio application has stalled? The alternative would be to make the loop stalling some kind of hookable event, which just seems like reinventing `sys.audit()`.
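As a rough sketch of what consuming such an event could look like, the snippet below registers a hook with `sys.addaudithook()` and filters for a stall event. The event name `example.asyncio.stall` and its single `duration` argument are hypothetical, since the thread does not state what the patch emits; the `sys.audit()` call at the end only stands in for the event loop raising the event.

```python
import sys

# Hypothetical event name and payload; not taken from the patch.
STALL_EVENT = "example.asyncio.stall"

stall_durations = []


def stall_hook(event, args):
    # Audit hooks receive every audited event, so filter early and stay cheap.
    if event != STALL_EVENT:
        return
    (duration,) = args
    # A real hook might forward this value to a metrics client instead.
    stall_durations.append(duration)


# Note: audit hooks cannot be removed once they have been added.
sys.addaudithook(stall_hook)

# Stand-in for the event loop reporting a 250 ms stall.
sys.audit(STALL_EVENT, 0.25)
```

After this runs, `stall_durations` contains the reported duration, with no dependency on debug mode or log parsing.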
msg393263 - (view)	Author: Steve Dower (steve.dower) *	Date: 2021-05-08 15:58
Fundamentally I don't have an issue with the audit hook. My only concern would be if there's anything that an application may do to _respond_ to a stall (e.g. is this valuable for applying backpressure? etc.) If it's purely diagnostic, and there's nothing you'd do in production when it happens, then an audit hook is perfect.
msg393267 - (view)	Author: Tom Forbes (orf) *	Date: 2021-05-08 16:11
Actually reacting to a stall would require something more and probably should be done at some point. But this is purely about monitoring - in our use case we'd send a metric via statsd that would be used to correlate stalls against other service level metrics. This seems pretty critical when running a large number of asyncio applications in production because you can currently only _infer_ that a stall is happening, and it's hard to trace the cause across service boundaries. An event hook that receives the loop and the handle would be ideal for this.
msg415542 - (view)	Author: Andrew Svetlov (asvetlov) *	Date: 2022-03-19 12:00
I am still not convinced that audit events should be used. Maybe supporting an explicit pair of callbacks (on_start() + on_finish()), with `None` as a fast-and-cheap "do nothing" flag, is a better alternative for catching stalled coroutines?
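One possible reading of the callback-pair idea, sketched outside of asyncio: a runner that takes optional `on_start` and `on_finish` hooks and skips all timing work when both are `None`. The function `run_callback` and the hook signatures are hypothetical and are not an existing asyncio API.

```python
import time


def run_callback(callback, on_start=None, on_finish=None):
    """Run callback, notifying the optional hooks around it (hypothetical API)."""
    if on_start is None and on_finish is None:
        # Fast path: no hooks registered, so no timing overhead.
        return callback()
    if on_start is not None:
        on_start(callback)
    started = time.monotonic()
    try:
        return callback()
    finally:
        if on_finish is not None:
            # Report the elapsed time so the hook can decide what counts as a stall.
            on_finish(callback, time.monotonic() - started)


events = []
result = run_callback(
    lambda: 42,
    on_start=lambda cb: events.append("start"),
    on_finish=lambda cb, elapsed: events.append(("finish", elapsed)),
)
```

Unlike an audit hook, a pair like this could be scoped to a single loop and swapped out at runtime, at the cost of adding a new API surface.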

History
Date	User	Action	Args
2022-04-11 14:59:45	admin	set	github: 88241
2022-03-19 12:00:57	asvetlov	set	messages: + msg415542
2021-05-08 16:11:02	orf	set	messages: + msg393267
2021-05-08 15:58:00	steve.dower	set	messages: + msg393263
2021-05-08 12:46:47	orf	set	keywords: + patch stage: patch review pull_requests: + pull_request24642
2021-05-08 12:33:37	orf	set	messages: + msg393255
2021-05-08 12:25:11	christian.heimes	set	nosy: + christian.heimes, steve.dower messages: + msg393253
2021-05-08 12:19:46	orf	create