Issue 40352: SocketHandler silently drops log messages on re-connect

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/84532

classification

Title:	SocketHandler silently drops log messages on re-connect
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Oleg Nykolyn, remi.lapeyre, serhiy.storchaka, vinay.sajip
Priority:	normal	Keywords:	patch

Created on 2020-04-21 13:02 by Oleg Nykolyn, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description	Edit
Archive.zip	Oleg Nykolyn, 2020-04-21 13:02	test client and server to reproduce the issue

Pull Requests
URL	Status	Linked	Edit
PR 22061	open	serhiy.storchaka, 2020-09-02 09:12

Messages (8)
msg366920 - (view)	Author: Oleg Nykolyn (Oleg Nykolyn)	Date: 2020-04-21 13:02
Hi, I've faced this issue when using logging.handlers.SocketHandler AWS TCP balancer. AWS balancer uses 60 second time-out by default (max 4000s), thus resulting in lots of closed sockets during inactive periods. SocketHandler.send() drops current message on any socket errors, so only next message gets logged. I've tried to reproduce this using Lib unit tests, but didn't find any easy way to close() socket on test server side. Sample client/server scripts attached, server output: Got connection from: ('127.0.0.1', 58044) Got message: b'Message #1\n' Got message: b'Message #2\n' Got connection from: ('127.0.0.1', 58045) Got message: b'Message #5\n' Server closes incoming connection is 2 seconds, client looses messages #3 and #4.
msg366925 - (view)	Author: Rémi Lapeyre (remi.lapeyre) *	Date: 2020-04-21 16:12
On one hand it's bad messages get lost, one the other retrying to send the message would take a lot of time and make `SocketHandler` very slow. Maybe we could had the record to a very short queue so we can retry to send it with the next message?
msg366929 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-04-21 16:41
If sending log messages is not guaranteed, we could use UDP. But when we use TCP it is expected that log messages will not be lost.
msg376190 - (view)	Author: Vinay Sajip (vinay.sajip) *	Date: 2020-09-01 13:29
> But when we use TCP it is expected that log messages will not be lost. I wouldn't go that far: logging is not a primary program function (i.e. the library or application should work exactly the same if logging were to be disabled). For the situation where you absolutely don't want to lose messages (apparently not that common a case - the relevant code is over 15 years old, and I can't remember this coming up before), you could either subclass SocketHandler to buffer messages, or use e.g. a MemoryHandler in conjunction with a SocketHandler.
msg376217 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-09-02 08:17
In this particular case we do not need a buffer. It is enough to reopen the socket and send the current message again.
msg376229 - (view)	Author: Vinay Sajip (vinay.sajip) *	Date: 2020-09-02 11:32
> It is enough to reopen the socket and send the current message again. A common reason for connection failure in SocketHandler is the other end going offline for some reason. The offline period can often be measured in seconds to hours, or even longer. The current strategy is to retry connection, but on the next logging call, and with an exponential backoff. Otherwise, if the remote end goes down and you keep retrying to connect on the same call, the handler would keep trying to connect and failing, and this could slow things down.
msg376844 - (view)	Author: Vinay Sajip (vinay.sajip) *	Date: 2020-09-13 20:24
@Oleg In the interests of clarity, can you please give more detail about the network topology and sequence of events in your use case? Where the machine with the SocketHandler is, where the socket server is that it's sending to, where the TCP balancer comes into it, what exactly the timeout is for, and what is the precise cause of the socket errors {e.g. whether a failure occurs after a connection has been made and some events have been successfully logged - if so, what exactly causes the failure)?
msg378625 - (view)	Author: Oleg Nykolyn (Oleg Nykolyn)	Date: 2020-10-14 15:36
There are multiple servers running in Kubrnetes cluster - API servers based on Django, celery workers, etc. All of them send logs to AWS TCP balancer, which acts as balancer for vector service[1], which send logs to Elasticsearch. Basically we have following logging pipeline: python-based services -> AWS TCP network balancer -> vector -> Elasticsearch. AWS network balancer has an option called "Idle timeout" with max value of 3600 seconds[2]. Log messages are logged successfully at first, but fail(one message gets lost on re-connect) if there is gap between messages, corresponding to "Idle timeout". 1: https://github.com/timberio/vector 2: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout

History
Date	User	Action	Args
2022-04-11 14:59:29	admin	set	github: 84532
2020-10-14 15:36:25	Oleg Nykolyn	set	messages: + msg378625
2020-09-13 20:24:09	vinay.sajip	set	messages: + msg376844
2020-09-02 11:32:11	vinay.sajip	set	messages: + msg376229
2020-09-02 09:12:46	serhiy.storchaka	set	keywords: + patch stage: patch review pull_requests: + pull_request21151
2020-09-02 08:17:45	serhiy.storchaka	set	messages: + msg376217
2020-09-01 13:29:57	vinay.sajip	set	messages: + msg376190
2020-04-21 16:41:20	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg366929
2020-04-21 16:12:16	remi.lapeyre	set	nosy: + remi.lapeyre messages: + msg366925
2020-04-21 15:19:11	serhiy.storchaka	set	nosy: + vinay.sajip
2020-04-21 13:02:07	Oleg Nykolyn	create