Author ghost43
Recipients asvetlov, christian.heimes, ghost43, yselivanov
Date 2021-05-04.17:19:01
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1620148744.77.0.874589771286.issue44036@roundup.psfhosted.org>
In-reply-to
Content
This is about a potential DOS vector that can get an asyncio server serving SSL connections to enter a busy loop and hang. To recover the server (python process) needs to be restarted.

See downstream report at https://github.com/spesmilo/electrumx/issues/92

ElectrumX is server software that serves JSON-RPC requests either directly over plaintext TCP or over SSL/TLS (no HTTP involved). The server code is written using asyncio. The most popular client is also written using python+asyncio. However, there are other server implementations (e.g. in Rust), and other clients (e.g. in javascript, scala, java). The servers are part of a federated network and peer with each other.

In the last 1+ year, server operators have started reporting that the python process (of ElectrumX) sporadically hangs: it uses 100% CPU and stops serving clients or accepting new connections. The only way we could find to recover is restarting the server (the python process); waiting for even days did not help, python was stuck. The hang seemed to mostly happen on busy servers that serve several thousands of concurrent sessions, and even on those only happened e.g. once a month. So the hang was difficult to reproduce.

Nevertheless we observed that the hang only happens if it is the asyncio server that handles SSL, i.e. if the server operator put nginx in front of the python process and handles SSL termination in nginx, they would be unaffected. One server operator whose server at one point hanged multiple times a day confirmed this, and reported that nginx subsequently started logging lines like this:
```
2021/01/11 02:28:30 [crit] 21#21: *437620 SSL_shutdown() failed (SSL: error:14094123:SSL routines:ssl3_read_bytes:application data after close notify) while proxying connection, client: REDACTED, server: 0.0.0.0:50002, upstream: "127.0.0.1:50213", bytes from/to client:81/205, bytes from/to upstream:205/0
```

Over these last months, the hang has been reproduced on many different python versions by different people, e.g. 3.7.1, 3.7.5, 3.8.5, 3.9.1, 3.9.4.

A few days ago, many servers hanged almost simultaneously, and when restarted, they would soon hang again at most a few hours later. Presumably someone either flooded the network with buggy clients, or it might have been an attack. Anyway, this allowed me to look into this better. I set up gdb and waited. This was on ubuntu 20.04 with python 3.8.5 and openssl=1.1.1f-1ubuntu2.3.

It seems the event loop thread is stuck in a busy loop.
The deepest common stack frame looks to be at
https://github.com/python/cpython/blob/v3.8.5/Lib/asyncio/sslproto.py#L675

```
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 535, in feed_appdata

  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 2627, in _run_once
--Type <RET> for more, q to quit, c to continue without paging--
  File "/usr/lib/python3.8/asyncio/base_events.py", line 826, in run_forever
    None, getaddr_func, host, port, family, type, proto, flags)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.8/asyncio/runners.py", line 299, in run
  File "/usr/local/bin/electrumx_server", line 35, in main
    asyncio.run(controller.run())
  File "/usr/local/bin/electrumx_server", line 43, in <module>
    main()
```
```
(gdb) py-list
 530            except (SystemExit, KeyboardInterrupt):
 531                raise
 532            except BaseException as e:
 533                self._fatal_error(e, 'SSL error in data received')
 534                return
>535
 536            for chunk in ssldata:
 537                self._transport.write(chunk)
 538
 539            for chunk in appdata:
 540                if chunk:
```
```
(gdb) py-locals
self = <_SSLPipe(_context=<SSLContext at remote 0x7f0a30dc00c0>, _server_side=True, _server_hostname=None, _state='WRAPPED', _incoming=<_ssl.MemoryBIO at remote 0x7f09cc0925b0>, _outgoing=<_ssl.MemoryBIO at remote 0x7f09b919c430>, _sslobj=<SSLObject(_sslobj=<_ssl._SSLSocket at remote 0x7f09cc8c1630>) at remote 0x7f09db428910>, _need_ssldata=False, _handshake_cb=<method at remote 0x7f09c0177a00>, _shutdown_cb=None) at remote 0x7f09db43f160>
data = b'{"jsonrpc": "2.0", "result": null, "id": 18}\n'
offset = 0
ssldata = [b'\x17\x03\x03\x00\x13<\x8fh\x93\xd3\xd7M\xb9\xbd\xb6\xad\x08\x1bZ\x857\x0c\xe4\xac']
view = <memoryview at remote 0x7f09d295be80>
exc_errno = 5
```

```
(gdb) cont
Continuing.
^C
Thread 1 "electrumx_serve" received signal SIGINT, Interrupt.
memory_getbuf (self=0x7f09b69a3880, view=0x7ffcb6829200, flags=0)
    at ../Objects/memoryobject.c:1441
1441    ../Objects/memoryobject.c: No such file or directory.
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.8/ssl.py", line 897, in write
    return self._sslobj.write(data)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 262, in feed_appdata
    offset += self._sslobj.write(view[offset:])
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
--Type <RET> for more, q to quit, c to continue without paging--
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 2627, in _run_once
  File "/usr/lib/python3.8/asyncio/base_events.py", line 826, in run_forever
    None, getaddr_func, host, port, family, type, proto, flags)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.8/asyncio/runners.py", line 299, in run
  File "/usr/local/bin/electrumx_server", line 35, in main
    asyncio.run(controller.run())
  File "/usr/local/bin/electrumx_server", line 43, in <module>
    main()
(gdb)
(gdb) cont
Continuing.
^C
(gdb) py-bt
Traceback (most recent call first):
  <built-in method len of module object at remote 0x7f0a34f8e0e0>
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 535, in feed_appdata

  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
--Type <RET> for more, q to quit, c to continue without paging--
  File "/usr/lib/python3.8/asyncio/base_events.py", line 2627, in _run_once
  File "/usr/lib/python3.8/asyncio/base_events.py", line 826, in run_forever
    None, getaddr_func, host, port, family, type, proto, flags)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.8/asyncio/runners.py", line 299, in run
  File "/usr/local/bin/electrumx_server", line 35, in main
    asyncio.run(controller.run())
  File "/usr/local/bin/electrumx_server", line 43, in <module>
    main()
    main()
(gdb) cont
Continuing.
^C
Thread 1 "electrumx_serve" received signal SIGINT, Interrupt.
PySlice_New (start=0, stop=None, step=0x0) at ../Objects/sliceobject.c:123
123     ../Objects/sliceobject.c: No such file or directory.
(gdb)
Continuing.
^C
Thread 1 "electrumx_serve" received signal SIGINT, Interrupt.
0x00000000005d4390 in long_richcompare (
    self=<SSLErrorNumber(_value_=5, _name_='SSL_ERROR_SYSCALL', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f0a34e119d0>, __doc__='An enumeration.', __module__='ssl', _member_names_=['SSL_ERROR_SSL', 'SSL_ERROR_WANT_READ', 'SSL_ERROR_WANT_WRITE', 'SSL_ERROR_WANT_X509_LOOKUP', 'SSL_ERROR_SYSCALL', 'SSL_ERROR_ZERO_RETURN', 'SSL_ERROR_WANT_CONNECT', 'SSL_ERROR_EOF', 'SSL_ERROR_INVALID_ERROR_CODE'], _member_map_={'SSL_ERROR_SSL': <SSLErrorNumber(_value_=1, _name_='SSL_ERROR_SSL', __objclass__=<...>) at remote 0x7f0a3487fe00>, 'SSL_ERROR_WANT_READ': <SSLErrorNumber(_value_=2, _name_='SSL_ERROR_WANT_READ', __objclass__=<...>) at remote 0x7f0a3487fe80>, 'SSL_ERROR_WANT_WRITE': <SSLErrorNumber(_value_=3, _name_='SSL_ERROR_WANT_WRITE', __objclass__=<...>) at remote 0x7f0a3487ff00>, 'SSL_ERROR_WANT_X509_LOOKUP': <SSLErrorNumber(_value_=4, _name_='SSL_ERROR_WANT_X509_LOOKUP', __objclass__=<...>) at remote 0x7f0a3487ff80>, 'SSL_ERROR_SYSCALL': <...>, 'SSL_ERROR_ZERO_RETURN': <SSLErrorNumber(_value_=6, _nam...(truncated), other=5, op=2) at ../Objects/longobject.c:3083
3083    ../Objects/longobject.c: No such file or directory.
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 526, in feed_appdata
    return
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 2627, in _run_once
--Type <RET> for more, q to quit, c to continue without paging--
(gdb)
(gdb) cont
Continuing.
^C
Thread 1 "electrumx_serve" received signal SIGINT, Interrupt.
0x00000000004f7fa0 in builtin_len (module=<optimized out>, obj=<optimized out>)
    at ../Objects/longobject.c:61
61      ../Objects/longobject.c: No such file or directory.
(gdb) py-bt
Traceback (most recent call first):
  <built-in method len of module object at remote 0x7f0a34f8e0e0>
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 535, in feed_appdata

  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
--Type <RET> for more, q to quit, c to continue without paging--
(gdb) cont
Continuing.
^C
Thread 1 "electrumx_serve" received signal SIGINT, Interrupt.
PyType_IsSubtype (b=0x907020 <_PyNone_Type>, a=0x90a860 <PyUnicode_Type>)
    at ../Objects/typeobject.c:1370
1370    ../Objects/typeobject.c: No such file or directory.
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 268, in feed_appdata
    if exc.reason == 'PROTOCOL_IS_SHUTDOWN':
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 2627, in _run_once
--Type <RET> for more, q to quit, c to continue without paging--
(gdb) cont
Continuing.
^C
Thread 1 "electrumx_serve" received signal SIGINT, Interrupt.
__GI___errno_location () at errno-loc.c:25
25      errno-loc.c: No such file or directory.
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.8/ssl.py", line 897, in write
    return self._sslobj.write(data)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 262, in feed_appdata
    offset += self._sslobj.write(view[offset:])
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 675, in _process_write_backlog
    ssldata, offset = self._sslpipe.feed_appdata(data, offset)
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 599, in _write_appdata
    self._process_write_backlog()
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 387, in write
    self._ssl_protocol._write_appdata(data)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/rawsocket.py", line 118, in write
    self._asyncio_transport.write(framed_message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 153, in _send_message
    await self.transport.write(message)
  File "/usr/local/lib/python3.8/dist-packages/aiorpcx/session.py", line 496, in _throttled_request
    await self._send_message(message)
  <built-in method run of Context object at remote 0x7f09e30eb6c0>
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
```

Note that around line https://github.com/python/cpython/blob/v3.8.5/Lib/asyncio/sslproto.py#L262,
exc_errno = 5 corresponds to `ssl.SSL_ERROR_SYSCALL`.

Considering the gdb backtraces above, it seems to me the execution is stuck in that while loop, always getting `ssl.SSL_ERROR_SYSCALL` and suppressing it.
Note that this code is not async, so the whole event loop is blocked.
I have patched my python interpreter to not suppress that exception but instead re-raise it, and recompiled; and the hangs have gone away; I can no longer reproduce them.

I used the following patch on top of 3.9.4 (but note the code in this file has not changed for years):

```
 Lib/asyncio/sslproto.py | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Lib/asyncio/sslproto.py b/Lib/asyncio/sslproto.py
index cad25b2653..f67898a777 100644
--- a/Lib/asyncio/sslproto.py
+++ b/Lib/asyncio/sslproto.py
@@ -267,8 +267,7 @@ class _SSLPipe(object):
                 if exc.reason == 'PROTOCOL_IS_SHUTDOWN':
                     exc_errno = exc.errno = ssl.SSL_ERROR_WANT_READ
                 if exc_errno not in (ssl.SSL_ERROR_WANT_READ,
-                                     ssl.SSL_ERROR_WANT_WRITE,
-                                     ssl.SSL_ERROR_SYSCALL):
+                                     ssl.SSL_ERROR_WANT_WRITE):
                     raise
                 self._need_ssldata = (exc_errno == ssl.SSL_ERROR_WANT_READ)
 
```

Frankly I don't understand why the `ssl.SSL_ERROR_SYSCALL` error is special cased in the try-except, hence this might break something else. The git blame or the python docs were of no help either (to understand that exception). However with this patch my server seems to be working correctly and is no longer getting stuck.

Unfortunately I don't have code that triggers the hang.
History
Date User Action Args
2021-05-04 17:19:04ghost43setrecipients: + ghost43, christian.heimes, asvetlov, yselivanov
2021-05-04 17:19:04ghost43setmessageid: <1620148744.77.0.874589771286.issue44036@roundup.psfhosted.org>
2021-05-04 17:19:04ghost43linkissue44036 messages
2021-05-04 17:19:01ghost43create