Message 273989 - Python tracker

Author	kevinconway
Recipients	gvanrossum, kevinconway, vstinner, yselivanov
Date	2016-08-31.01:57:21
Message-id	<1472608644.33.0.983787454696.issue27906@psf.upfronthosting.co.za>

Content
My organization noticed this issue after launching several asyncio services that would receive either a sustained high number of incoming connections or regular bursts of traffic. Our monitoring showed a loss of between 4% and 6% of all incoming requests. On the client side we see a socket read error "Connection reset by peer". On the asyncio side, with debug turned on, we see nothing.

After further investigation we determined that asyncio was not calling 'accept()' on the listening socket fast enough. To test this further we put together several hello-world style examples and put them under load. I've attached the project we used to test. Included are three Docker files that run the services under different configurations: one runs the service as an aiohttp service, another uses the aiohttp worker behind gunicorn, and the third runs the aiohttp service with the proposed asyncio patch in place. For our testing we used 'wrk' to generate traffic and collect data on the OS/socket errors.

For anyone attempting to recreate our experiments, we ran three test batteries against the service for each endpoint using:

wrk --duration 30s --timeout 10s --latency --threads 2 --connections 10 <URL>
wrk --duration 30s --timeout 10s --latency --threads 2 --connections 100 <URL>
wrk --duration 30s --timeout 10s --latency --threads 2 --connections 1000 <URL>

The endpoints most valuable for us to test were the ones that replicated some of our production logic:

<URL>/ # Hello World
<URL>/sleep?time=100 # Every request is delayed by 100 milliseconds and returns an HTML message.
<URL>/blocking/inband # Every request performs a bcrypt hash with complexity 10, doing the CPU-blocking work on the event loop thread.
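
To illustrate why the in-band endpoint is the interesting case, here is a minimal sketch (not code from the attached project) that measures how long the event loop goes without running other callbacks while CPU-bound work executes on the loop thread. The names `cpu_bound_hash`, `ticker`, and `measure_stall` are hypothetical, stdlib PBKDF2 is used as a stand-in for bcrypt, and the executor variant is included only for contrast and is an assumption rather than something the test project is known to contain.

```python
import asyncio
import hashlib
import time


def cpu_bound_hash(password):
    # Stand-in for bcrypt (assumption): stdlib PBKDF2, used only to burn CPU.
    return hashlib.pbkdf2_hmac("sha256", password, b"example-salt", 200_000)


async def ticker(gaps):
    """Record the time between consecutive wake-ups of a 1 ms timer."""
    last = time.perf_counter()
    while True:
        await asyncio.sleep(0.001)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now


async def measure_stall(inband):
    """Return (longest gap seen by the ticker, duration of the hashing work)."""
    gaps = []
    task = asyncio.create_task(ticker(gaps))
    await asyncio.sleep(0.01)  # let the ticker start running
    start = time.perf_counter()
    if inband:
        # Runs on the event loop thread: nothing else can be scheduled,
        # including the callback that would accept new connections.
        cpu_bound_hash(b"secret")
    else:
        # Contrast case: hand the work to the default thread pool.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, cpu_bound_hash, b"secret")
    work = time.perf_counter() - start
    await asyncio.sleep(0.01)  # give the ticker a chance to observe the gap
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps), work


if __name__ == "__main__":
    for mode in (True, False):
        stall, work = asyncio.run(measure_stall(mode))
        print(f"inband={mode}: longest loop stall {stall:.4f}s, work took {work:.4f}s")
```

In the in-band case the longest stall is at least as long as the hashing work itself, and that is the window during which pending connections sit in the listen queue unaccepted.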

Our results varied based on the available CPU cycles, but we consistently recreated the socket read errors from production using the above tests.

Our proposed solution, attached as a patch file, is to put the socket.accept() call in a loop that is bounded by the listening socket's backlog. We use the backlog value as an upper bound to prevent the reverse situation of starving active coroutines while the event loop continues to accept new connections without yielding. With the proposed patch in place our loss rate disappeared.
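
The patch itself is attached to the issue and is not reproduced here. As a rough illustration of the idea only, a bounded accept loop over a non-blocking listening socket might look like the sketch below. This is plain socket code, not asyncio's actual implementation. `accept_batch`, `demo`, and the `batch_limit` parameter are hypothetical names introduced for the example; in the proposal the bound is the backlog itself.

```python
import select
import socket


def accept_batch(listener, limit):
    """Accept at most `limit` pending connections from a non-blocking listener.

    Stops early as soon as the kernel's accept queue is empty, so a single
    readiness event can drain a burst without looping forever.
    """
    accepted = []
    for _ in range(limit):
        try:
            conn, addr = listener.accept()
        except (BlockingIOError, InterruptedError):
            break  # nothing left to accept right now
        conn.setblocking(False)
        accepted.append((conn, addr))
    return accepted


def demo(n_clients=5, backlog=16, batch_limit=None):
    """Queue n_clients loopback connections and drain them in batches.

    Returns the number of connections accepted per readiness event.
    Keep n_clients <= backlog so that every connect() completes.
    """
    if batch_limit is None:
        batch_limit = backlog  # the bound proposed in the issue
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(backlog)
    listener.setblocking(False)
    clients = [socket.create_connection(listener.getsockname())
               for _ in range(n_clients)]
    batch_sizes = []
    try:
        while sum(batch_sizes) < n_clients:
            # Wait for readability, as the event loop's selector would.
            readable, _, _ = select.select([listener], [], [], 5.0)
            if not readable:
                break  # timed out; give up rather than hang
            batch = accept_batch(listener, batch_limit)
            batch_sizes.append(len(batch))
            for conn, _addr in batch:
                conn.close()
    finally:
        for client in clients:
            client.close()
        listener.close()
    return batch_sizes


if __name__ == "__main__":
    print(demo())               # bound equal to the backlog
    print(demo(batch_limit=1))  # one accept() per readiness event
```

With `batch_limit=1` the sketch mimics accepting a single connection per readiness event, so a burst of N connections needs N trips through the selector. With the bound set to the backlog, the same burst can be drained in far fewer trips while still returning control to other work after a fixed number of accepts.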

For further comparison, we reviewed the socket accept logic in Twisted, against which we ran similar tests and encountered no loss. We found that Twisted already runs the socket accept in a bounded loop to prevent this issue (https://github.com/twisted/twisted/blob/trunk/src/twisted/internet/tcp.py#L1028).

History
Date	User	Action	Args
2016-08-31 01:57:24	kevinconway	set	recipients: + kevinconway, gvanrossum, vstinner, yselivanov
2016-08-31 01:57:24	kevinconway	set	messageid: <1472608644.33.0.983787454696.issue27906@psf.upfronthosting.co.za>
2016-08-31 01:57:24	kevinconway	link	issue27906 messages
2016-08-31 01:57:22	kevinconway	create