
Author jcristharif
Date 2021-11-16.18:51:39
https://bugs.python.org/issue7946 describes an issue with how the current GIL interacts with mixing IO- and CPU-bound work. Quoting that issue:

> when an I/O bound thread executes an I/O call,
> it always releases the GIL.  Since the GIL is released, a CPU bound
> thread is now free to acquire the GIL and run.  However, if the I/O
> call completes immediately (which is common), the I/O bound thread
> immediately stalls upon return from the system call.  To get the GIL
> back, it now has to go through the timeout process to force the
> CPU-bound thread to release the GIL again.

This issue can come up in any application where IO- and CPU-bound work are mixed (we've found it to be a cause of performance issues in https://dask.org, for example). Fixing the general problem is tricky and likely requires changes to the GIL's internals, but in the specific case of asyncio running in one thread with CPU work happening in background threads, there may be a simpler fix - don't release the GIL if we don't have to.

Asyncio relies on nonblocking socket operations, which by definition shouldn't block. As such, releasing the GIL shouldn't be needed for many operations (`send`, `recv`, ...) on `socket.socket` objects provided they're in nonblocking mode (as suggested in https://bugs.python.org/issue7946#msg99477). Likewise, dropping the GIL can be avoided when calling `select` on `selectors.BaseSelector` objects with a timeout of 0 (making it a non-blocking call).
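For illustration, here's what those two cases look like from Python (a minimal sketch of the user-visible API; the patch itself changes the C implementation of these calls, not their behavior):

```python
import selectors
import socket

# A connected pair of sockets in nonblocking mode: send/recv on these
# either complete immediately or raise BlockingIOError, so (per the
# proposal) the interpreter wouldn't need to drop the GIL around them.
a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

a.send(b"ping")  # completes immediately; the buffer has room

# select() with timeout=0 is a nonblocking poll - the second case
# where releasing the GIL could be skipped.
sel = selectors.DefaultSelector()
sel.register(b, selectors.EVENT_READ)
ready = sel.select(timeout=0)
assert ready  # the data sent above is already readable
data = b.recv(4)
print(data)  # b'ping'

sel.close()
a.close()
b.close()
```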

I've made a patch (https://github.com/jcrist/cpython/tree/keep-gil-for-fast-syscalls) with these two changes, and ran a benchmark (attached) to evaluate the effect of background threads with and without the patch. The benchmark starts an asyncio server in one process and a number of clients in a separate process. A number of background threads that just spin are started in the server process (configurable by the `-t` flag, default 0), then the server is loaded to measure the requests per second (RPS).
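The background threads are plain Python busy-loops, roughly along these lines (a sketch only - the actual attached benchmark script may differ):

```python
import threading
import time

def spin(stop: threading.Event, counts: list) -> None:
    # Never blocks and never waits on IO, so it only yields the GIL
    # when the interpreter forces a switch at the switch interval.
    n = 0
    while not stop.is_set():
        n += 1
    counts.append(n)

stop = threading.Event()
counts: list = []
t = threading.Thread(target=spin, args=(stop, counts))
start = time.perf_counter()
t.start()
time.sleep(0.1)
stop.set()
t.join()
elapsed = time.perf_counter() - start
print(f"Spinner spun {counts[0] / elapsed:.2e} cycles/second")
```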

Here are the results:

```
# Main branch
$ python bench.py -c1 -t0
Benchmark: clients = 1, msg-size = 100, background-threads = 0
16324.2 RPS
$ python bench.py -c1 -t1
Benchmark: clients = 1, msg-size = 100, background-threads = 1
Spinner spun 1.52e+07 cycles/second
97.6 RPS
$ python bench.py -c2 -t0
Benchmark: clients = 2, msg-size = 100, background-threads = 0
31308.0 RPS
$ python bench.py -c2 -t1
Benchmark: clients = 2, msg-size = 100, background-threads = 1
Spinner spun 1.52e+07 cycles/second
96.2 RPS
$ python bench.py -c10 -t0
Benchmark: clients = 10, msg-size = 100, background-threads = 0
47169.6 RPS
$ python bench.py -c10 -t1
Benchmark: clients = 10, msg-size = 100, background-threads = 1
Spinner spun 1.54e+07 cycles/second
95.4 RPS

# With this patch
$ ./python bench.py -c1 -t0
Benchmark: clients = 1, msg-size = 100, background-threads = 0
18201.8 RPS
$ ./python bench.py -c1 -t1
Benchmark: clients = 1, msg-size = 100, background-threads = 1
Spinner spun 9.03e+06 cycles/second
194.6 RPS
$ ./python bench.py -c2 -t0
Benchmark: clients = 2, msg-size = 100, background-threads = 0
34151.8 RPS
$ ./python bench.py -c2 -t1
Benchmark: clients = 2, msg-size = 100, background-threads = 1
Spinner spun 8.72e+06 cycles/second
729.6 RPS
$ ./python bench.py -c10 -t0
Benchmark: clients = 10, msg-size = 100, background-threads = 0
53666.6 RPS
$ ./python bench.py -c10 -t1
Benchmark: clients = 10, msg-size = 100, background-threads = 1
Spinner spun 5e+06 cycles/second
21838.2 RPS
```

A few comments on the results:

- On the main branch, any GIL contention sharply decreases the number of RPS an asyncio server can handle, regardless of the number of clients. This makes sense - every socket operation releases the GIL, and the server thread then has to wait to reacquire it (up to the switch interval), rinse and repeat. So if every request requires 1 `recv` and 1 `send`, a server with background GIL contention is stuck at a maximum of `1 / (2 * switchinterval)`, or 100 RPS with the default configuration. This effectively prioritizes the background thread over the IO thread, since the IO thread releases the GIL very frequently and the background thread never does voluntarily.

- With the patch, we still see a performance degradation, but it is less severe and improves with the number of clients. With these changes the asyncio thread only releases the GIL when doing a blocking poll for new IO events (or when the switch interval is hit). Under low load (1 client), the IO thread becomes idle more frequently and releases the GIL. Under higher load, the event loop usually still has work to do at the end of a cycle and issues a `selector.select` call with a timeout of 0 (nonblocking), avoiding releasing the GIL at all during that loop - note the nonlinear effect of adding more clients. Since the IO thread still releases the GIL sometimes, the background thread still holds the GIL a larger percentage of the time than the IO thread, but the imbalance is less severe than without the patch.
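A quick check of that ceiling arithmetic, using CPython's default switch interval of 5 ms (the value returned by `sys.getswitchinterval()` by default):

```python
# Default thread switch interval in CPython is 5 ms.
switch_interval = 0.005

# One recv + one send per request, each potentially stalling the IO
# thread for a full switch interval while the spinner holds the GIL:
ceiling = 1 / (2 * switch_interval)
print(ceiling)  # 100.0 RPS - consistent with the ~96-98 RPS measured above
```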

I have also tested this patch on a Dask cluster running some real-world problems and found that it did improve performance where IO was throttled due to GIL contention.