
Socket accept exhaustion during high TCP traffic #72093

Closed
kevinconway mannequin opened this issue Aug 31, 2016 · 18 comments


kevinconway mannequin commented Aug 31, 2016

BPO 27906
Nosy @gvanrossum, @vstinner, @1st1
Files
  • testservice.zip: Service example for testing
  • multi-accept.patch: Patchfile with proposed solution.
  • multi-accept-2.patch: Patch file with test
  • multi-accept-3.patch: Patch with test and comment
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = None
    created_at = <Date 2016-08-31.01:57:24.161>
    labels = ['expert-asyncio']
    title = 'Socket accept exhaustion during high TCP traffic'
    updated_at = <Date 2017-04-01.05:47:54.133>
    user = 'https://bugs.python.org/kevinconway'

    bugs.python.org fields:

    activity = <Date 2017-04-01.05:47:54.133>
    actor = 'serhiy.storchaka'
    assignee = 'yselivanov'
    closed = False
    closed_date = None
    closer = None
    components = ['asyncio']
    creation = <Date 2016-08-31.01:57:24.161>
    creator = 'kevinconway'
    dependencies = []
    files = ['44286', '44287', '44289', '44321']
    hgrepos = []
    issue_num = 27906
    keywords = ['patch']
    message_count = 16.0
    messages = ['273989', '273990', '273991', '273992', '273993', '273996', '274121', '274143', '276498', '276499', '276501', '276526', '276582', '276584', '276586', '276592']
    nosy_count = 5.0
    nosy_names = ['gvanrossum', 'vstinner', 'python-dev', 'yselivanov', 'kevinconway']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue27906'
    versions = ['Python 3.4', 'Python 3.5', 'Python 3.6']


    kevinconway mannequin commented Aug 31, 2016

    My organization noticed this issue after launching several asyncio services that would receive either a sustained high number of incoming connections or regular bursts of traffic. Our monitoring showed a loss of between 4% and 6% of all incoming requests. On the client side we see a socket read error "Connection reset by peer". On the asyncio side, with debug turned on, we see nothing.

    After some more investigation we determined asyncio was not calling 'accept()' on the listening socket fast enough. To further test this we put together several hello-world type examples and put them under load. I've attached the project we used to test. Included are three docker files that will run the services under different configurations. One runs the service as an aiohttp service, the other uses the aiohttp worker behind gunicorn, and the third runs the aiohttp service with the proposed asyncio patch in place. For our testing we used 'wrk' to generate traffic and collect data on the OS/socket errors.

For anyone attempting to recreate our experiments, we ran three test batteries against the service for each endpoint using:

    wrk --duration 30s --timeout 10s --latency --threads 2 --connections 10 <URL>
    wrk --duration 30s --timeout 10s --latency --threads 2 --connections 100 <URL>
    wrk --duration 30s --timeout 10s --latency --threads 2 --connections 1000 <URL>

    The endpoints most valuable for us to test were the ones that replicated some of our production logic:

    <URL>/ # Hello World
    <URL>/sleep?time=100 # Every request is delayed by 100 milliseconds and returns an HTML message.
    <URL>/blocking/inband # Every request performs a bcrypt with complexity 10 and performs the CPU blocking work on the event loop thread.

    Our results varied based on the available CPU cycles, but we consistently recreate the socket read errors from production using the above tests.

    Our proposed solution, attached as a patch file, is to put the socket.accept() call in a loop that is bounded by the listening socket's backlog. We use the backlog value as an upper bound to prevent the reverse situation of starving active coroutines while the event loop continues to accept new connections without yielding. With the proposed patch in place our loss rate disappeared.
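The core of the idea can be sketched outside asyncio's internals as a plain readiness callback (the function and names below are illustrative, not the actual patch):

```python
def accept_ready(sock, backlog, handle):
    """Drain the accept queue when the listening socket is readable.

    Accepting only once per readiness event can fall behind a burst of
    connections; looping until EWOULDBLOCK, bounded by the listen
    backlog, drains the queue without starving other callbacks.
    """
    for _ in range(backlog):
        try:
            conn, addr = sock.accept()
        except (BlockingIOError, InterruptedError):
            return  # queue drained; wait for the next readiness event
        conn.setblocking(False)
        handle(conn, addr)
```

The bound matters: without it, a server under sustained load could keep accepting forever and never yield back to the event loop.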

For further comparison, we reviewed the socket accept logic in Twisted, ran similar tests against it, and encountered no loss. We found that Twisted already runs the socket accept in a bounded loop to prevent this issue (https://github.com/twisted/twisted/blob/trunk/src/twisted/internet/tcp.py#L1028).

    @kevinconway kevinconway mannequin added the topic-asyncio label Aug 31, 2016

    kevinconway mannequin commented Aug 31, 2016

    Attaching the patch file

@gvanrossum (Member)

    Thanks -- IIRC this was even brought up during asyncio's early implementation phase, but we didn't have enough experience to warrant the extra logic. It seems like now is the time to do this!

    I hope you won't need it for 3.4, because that version is out of support. However I would like to have this upstream in https://github.com/python/asyncio.

    Is there any chance you can write a unit test for this functionality? Or is the only way to test it really to do a load test?


    kevinconway mannequin commented Aug 31, 2016

    I'll dig into the existing asyncio unit tests and see what I can come up with. I'm not sure, yet, exactly what I might test for.

    The variables involved with reproducing the error are mostly environmental. CPU speed of the host, amount of CPU bound work happening in handler coroutines, and the rate of new connections are the major contributors we've identified. I'm not sure how I might simulate those in a unit test.

Would it be sufficient to add a test that ensures _accept_connection calls .accept() on the listening socket 'backlog' number of times in the event there are no OS errors?
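A mock-based test along those lines might look like the following sketch, written against a simplified stand-in for the accept loop rather than asyncio's real internals:

```python
import unittest
from unittest import mock


def accept_connections(sock, backlog, on_conn):
    """Illustrative stand-in for the patched accept loop."""
    for _ in range(backlog):
        try:
            conn, addr = sock.accept()
        except (BlockingIOError, InterruptedError):
            break  # queue drained before reaching the bound
        on_conn(conn, addr)


class AcceptLoopTest(unittest.TestCase):
    def test_accept_called_backlog_times_without_os_errors(self):
        # A Mock's accept() never raises, simulating a queue that is
        # never exhausted, so the loop must stop at the backlog bound.
        sock = mock.Mock()
        sock.accept.return_value = (mock.Mock(), ("127.0.0.1", 1234))
        accept_connections(sock, backlog=100, on_conn=lambda c, a: None)
        self.assertEqual(sock.accept.call_count, 100)
```

This avoids the environmental variables entirely: the mock stands in for a saturated listen queue, and the assertion pins down the bounded-loop behavior.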

@gvanrossum (Member)

Yeah, the tests are often full of mocks, and I trust that the test suite you wrote verifies that this fixes the problem. The unit tests are more to ensure that future changes to the code won't accidentally break the code you wrote (which is why we should really have test coverage, but that's another story).


    kevinconway mannequin commented Aug 31, 2016

    I've added a unit test to the patch that asserts sock.accept() is called the appropriate number of times.

    Worth a note, I had to adjust one of the existing tests to account for the new backlog argument. There is a default value for the argument to preserve backwards compat for any callers, but the mock used in the test was not tolerant of having an extra arg available.


    kevinconway mannequin commented Sep 1, 2016

    Added a comment to the .accept() loop with a reference to the issue.

@gvanrossum (Member)

    I'll try to get to this during the core dev sprint next week.

@gvanrossum (Member)

    Sorry, we didn't get to this. @yury, do you think you could get to this before b1 goes out? It's a pure optimization (and a good one!).


    1st1 commented Sep 15, 2016

> @yury, do you think you could get to this before b1 goes out? It's a pure optimization (and a good one!).

    LGTM. Will commit tomorrow.

    @1st1 1st1 self-assigned this Sep 15, 2016
@gvanrossum (Member)

    (Of course I meant b2.)

@vstinner (Member)

> On the asyncio side, with debug turned on, we see nothing.

Hum, I'm concerned by this issue. I would expect that in debug mode a log message would be emitted, maybe something like "accept queue saturated! we may be losing incoming connections".

    Does someone see a way to log a message in such case? Maybe only in debug mode?

I spent time making sure that asyncio logs network-related events to help with understanding what is happening. It's hard to decide which events should be logged, because logging everything is likely to flood the logs.

For example, _SelectorSocketTransport._read_ready() logs an event for:

• "EOF" (when recv() returns an empty string)
• the transport pausing reading
• the transport resuming reading


    1st1 commented Sep 15, 2016

    > On the asyncio side, with debug turned on, we see nothing.

> Does someone see a way to log a message in such case? Maybe only in debug mode?

    I'm not sure we can do anything here. @kevinconway, did you try to find out if it's possible to log this somehow?

@gvanrossum (Member)

Maybe with the new code it could log a message if it doesn't exit the loop via a break? Because that would mean the listen buffer is full. It would encourage people to configure a larger number.
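This suggestion maps naturally onto Python's `for`/`else`: the `else` clause runs exactly when the loop does not exit via `break`. A sketch, with hypothetical names rather than asyncio's actual code:

```python
import logging

logger = logging.getLogger("accept_loop")


def accept_ready(sock, backlog, handle):
    for _ in range(backlog):
        try:
            conn, addr = sock.accept()
        except (BlockingIOError, InterruptedError):
            break  # queue drained before hitting the bound
        handle(conn, addr)
    else:
        # Never saw EWOULDBLOCK: the listen queue may still hold
        # pending connections, i.e. the listen buffer looks full.
        logger.warning(
            "accepted %d connections without draining the queue; "
            "consider increasing the listen backlog", backlog)
```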


    python-dev mannequin commented Sep 15, 2016

    New changeset 1dcfafed3cb0 by Yury Selivanov in branch '3.5':
    Issue bpo-27906: Fix socket accept exhaustion during high TCP traffic.
    https://hg.python.org/cpython/rev/1dcfafed3cb0

    New changeset 62948164ff94 by Yury Selivanov in branch '3.6':
    Merge 3.5 (issue bpo-27906)
    https://hg.python.org/cpython/rev/62948164ff94

    New changeset ac13bf1967b5 by Yury Selivanov in branch 'default':
    Merge 3.6 (issue bpo-27906)
    https://hg.python.org/cpython/rev/ac13bf1967b5

@vstinner (Member)

Guido van Rossum:

> Maybe with the new code it could log a message if it doesn't exit the loop via a break? Because that would mean the listen buffer is full. It would encourage people to configure a larger number.

Hum, yes, maybe. But I would suggest logging at most one such message per server, rather than filling the logs with the same message when the server is stressed.
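A once-per-server rate limit like this can be a single boolean flag on the server object. A sketch with hypothetical names, not asyncio's actual implementation:

```python
import logging

logger = logging.getLogger("accept_loop")


class AcceptLoop:
    """Bounded accept loop that warns at most once per server."""

    def __init__(self, sock, backlog):
        self.sock = sock
        self.backlog = backlog
        self._saturation_warned = False

    def accept_ready(self, handle):
        for _ in range(self.backlog):
            try:
                conn, addr = self.sock.accept()
            except (BlockingIOError, InterruptedError):
                break  # queue drained before hitting the bound
            handle(conn, addr)
        else:
            if not self._saturation_warned:
                self._saturation_warned = True  # never repeat for this server
                logger.warning(
                    "listen backlog of %d looks saturated; consider "
                    "increasing it", self.backlog)
```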

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@kumaraditya303 (Contributor)

    This seems fixed, anything left to do here?

    @kumaraditya303 kumaraditya303 added the pending The issue will be closed if no feedback is provided label Sep 17, 2022
@gvanrossum (Member)

    Since a PR was merged that claims to address this, I will just close the issue and remove the pending label.

    @gvanrossum gvanrossum removed the pending The issue will be closed if no feedback is provided label Sep 18, 2022