Title: SocketHandler silently drops log messages on re-connect
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Oleg Nykolyn, remi.lapeyre, serhiy.storchaka, vinay.sajip
Priority: normal Keywords: patch

Created on 2020-04-21 13:02 by Oleg Nykolyn, last changed 2020-10-14 15:36 by Oleg Nykolyn.

File name Uploaded Description Edit Oleg Nykolyn, 2020-04-21 13:02 test client and server to reproduce the issue
Pull Requests
URL Status Linked Edit
PR 22061 open serhiy.storchaka, 2020-09-02 09:12
Messages (8)
msg366920 - (view) Author: Oleg Nykolyn (Oleg Nykolyn) Date: 2020-04-21 13:02

I've faced this issue when using logging.handlers.SocketHandler AWS TCP balancer. AWS balancer uses 60 second time-out by default (max 4000s), thus resulting in lots of closed sockets during inactive periods.
SocketHandler.send() drops current message on any socket errors, so only next message gets logged.
I've tried to reproduce this using Lib unit tests, but didn't find any easy way to close() socket on test server side.
Sample client/server scripts attached, server output:

Got connection from:  ('', 58044)
Got message:  b'Message #1\n'
Got message:  b'Message #2\n'
Got connection from:  ('', 58045)
Got message:  b'Message #5\n'

Server closes incoming connection is 2 seconds, client looses messages #3 and #4.
msg366925 - (view) Author: RĂ©mi Lapeyre (remi.lapeyre) * Date: 2020-04-21 16:12
On one hand it's bad messages get lost, one the other retrying to send the message would take a lot of time and make `SocketHandler` very slow.

Maybe we could had the record to a very short queue so we can retry to send it with the next message?
msg366929 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-04-21 16:41
If sending log messages is not guaranteed, we could use UDP. But when we use TCP it is expected that log messages will not be lost.
msg376190 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2020-09-01 13:29
> But when we use TCP it is expected that log messages will not be lost.

I wouldn't go that far: logging is not a primary program function (i.e. the library or application should work exactly the same if logging were to be disabled). For the situation where you absolutely don't want to lose messages (apparently not that common a case - the relevant code is over 15 years old, and I can't remember this coming up before), you could either subclass SocketHandler to buffer messages, or use e.g. a MemoryHandler in conjunction with a SocketHandler.
msg376217 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-09-02 08:17
In this particular case we do not need a buffer. It is enough to reopen the socket and send the current message again.
msg376229 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2020-09-02 11:32
> It is enough to reopen the socket and send the current message again.

A common reason for connection failure in SocketHandler is the other end going offline for some reason. The offline period can often be measured in seconds to hours, or even longer. The current strategy is to retry connection, but on the next logging call, and with an exponential backoff.

Otherwise, if the remote end goes down and you keep retrying to connect on the same call, the handler would keep trying to connect and failing, and this could slow things down.
msg376844 - (view) Author: Vinay Sajip (vinay.sajip) * (Python committer) Date: 2020-09-13 20:24
@Oleg In the interests of clarity, can you please give more detail about the network topology and sequence of events in your use case? Where the machine with the SocketHandler is, where the socket server is that it's sending to, where the TCP balancer comes into it, what exactly the timeout is for, and what is the precise cause of the socket errors {e.g. whether a failure occurs after a connection has been made and some events have been successfully logged - if so, what exactly causes the failure)?
msg378625 - (view) Author: Oleg Nykolyn (Oleg Nykolyn) Date: 2020-10-14 15:36
There are multiple servers running in Kubrnetes cluster - API servers based on Django, celery workers, etc. All of them send logs to AWS TCP balancer, which acts as balancer for vector service[1], which send logs to Elasticsearch.
Basically we have following logging pipeline: python-based services -> AWS TCP network balancer -> vector -> Elasticsearch.
AWS network balancer has an option called "Idle timeout" with max value of 3600 seconds[2].
Log messages are logged successfully at first, but fail(one message gets lost on re-connect) if there is gap between messages, corresponding to "Idle timeout".

Date User Action Args
2020-10-14 15:36:25Oleg Nykolynsetmessages: + msg378625
2020-09-13 20:24:09vinay.sajipsetmessages: + msg376844
2020-09-02 11:32:11vinay.sajipsetmessages: + msg376229
2020-09-02 09:12:46serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request21151
2020-09-02 08:17:45serhiy.storchakasetmessages: + msg376217
2020-09-01 13:29:57vinay.sajipsetmessages: + msg376190
2020-04-21 16:41:20serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg366929
2020-04-21 16:12:16remi.lapeyresetnosy: + remi.lapeyre
messages: + msg366925
2020-04-21 15:19:11serhiy.storchakasetnosy: + vinay.sajip
2020-04-21 13:02:07Oleg Nykolyncreate