test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

vstinner · 2018-07-16T22:19:02Z

BPO	34130
Nosy	@pitrou, @vstinner, @njsmith, @1st1
PRs	bpo-34130: Fix test_signal.test_socket() #8326 bpo-34130: Fix test_signal.test_warn_on_full_buffer() #8327 [3.7] bpo-34130: Fix 2 race conditions in test_signal #8329 [3.6] bpo-34130: Fix test_signal.test_socket() (GH-8326) #8330

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-07-18.17:03:02.895>
created_at = <Date 2018-07-16.22:19:01.981>
labels = ['3.8', '3.7', 'tests']
title = 'test_signal: test_warn_on_full_buffer() failed on AppVeyor'
updated_at = <Date 2018-07-18.17:03:02.893>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2018-07-18.17:03:02.893>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2018-07-18.17:03:02.895>
closer = 'vstinner'
components = ['Tests']
creation = <Date 2018-07-16.22:19:01.981>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 34130
keywords = ['patch']
message_count = 21.0
messages = ['321775', '321780', '321781', '321783', '321815', '321873', '321874', '321875', '321876', '321877', '321879', '321880', '321881', '321884', '321889', '321890', '321892', '321893', '321894', '321897', '321898']
nosy_count = 4.0
nosy_names = ['pitrou', 'vstinner', 'njs', 'yselivanov']
pr_nums = ['8326', '8327', '8329', '8330']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue34130'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

vstinner · 2018-07-16T22:19:02Z

test_signal.test_warn_on_full_buffer() failed on AppVeyor.

https://ci.appveyor.com/project/python/cpython/build/3.8build19372

======================================================================
FAIL: test_warn_on_full_buffer (test.test_signal.WakeupSocketSignalTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "C:\projects\cpython\lib\test\test_signal.py", line 538, in test_warn_on_full_buffer
    assert_python_ok('-c', code)
  File "C:\projects\cpython\lib\test\support\script_helper.py", line 157, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "C:\projects\cpython\lib\test\support\script_helper.py", line 143, in _assert_python
    res.fail(cmd_line)
  File "C:\projects\cpython\lib\test\support\script_helper.py", line 84, in fail
    err))
AssertionError: Process return code is 1
command line: ['C:\\projects\\cpython\\PCbuild\\win32\\python.exe', '-X', 'faulthandler', '-I', '-c', 'if 1:\n        import errno\n        import signal\n        import socket\n        import sys\n        import time\n        import _testcapi\n        from test.support import captured_stderr\n\n        signum = signal.SIGINT\n\n        # This handler will be called, but we intentionally won\'t read from\n        # the wakeup fd.\n        def handler(signum, frame):\n            pass\n\n        signal.signal(signum, handler)\n\n        read, write = socket.socketpair()\n        read.setblocking(False)\n        write.setblocking(False)\n\n        # Fill the send buffer\n        try:\n            while True:\n                write.send(b"x")\n        except BlockingIOError:\n            pass\n\n        # By default, we get a warning when a signal arrives\n        signal.set_wakeup_fd(write.fileno())\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        # And also if warn_on_full_buffer=True\n        signal.set_wakeup_fd(write.fileno(), warn_on_full_buffer=True)\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        # But not if warn_on_full_buffer=False\n        signal.set_wakeup_fd(write.fileno(), warn_on_full_buffer=False)\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if err != "":\n            raise AssertionError("got unexpected output %r" % (err,))\n\n        # And then check the default again, to make sure warn_on_full_buffer\n        # settings don\'t leak across calls.\n        signal.set_wakeup_fd(write.fileno())\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        ']
stdout:
---

stderr:
---

Traceback (most recent call last):
  File "<string>", line 39, in <module>
AssertionError
---

The test passed when run again sequentially ("Re-running test 'test_signal' in verbose mode").

vstinner · 2018-07-17T00:09:38Z

The test failed at:

# Fill the send buffer
try:
    while True:
        write.send(b"x")
except BlockingIOError:
    pass

    # By default, we get a warning when a signal arrives
    signal.set_wakeup_fd(write.fileno())

    with captured_stderr() as err:
        _testcapi.raise_signal(signum)

    err = err.getvalue()
    if ('Exception ignored when trying to {action} to the signal wakeup fd'
        not in err):
        raise AssertionError(err) # <~~~~~~ HERE

vstinner · 2018-07-17T00:16:34Z

Traceback (most recent call last):
  File "<string>", line 39, in <module>
AssertionError

According to the traceback, the captured stderr ('err' variable) is an empty string.

The test uses test.support.capture_stderr() which replaces sys.stderr.

The signal module calls PySys_WriteStderr(msg) which calls sys.stderr.write(msg). If the Python call fails, msg is supposed to be written into the C stderr stream. sys.stderr.flush() is not called, but it shouldn't be needed, since test.support.capture_stderr() replace sys.stderr with a io.StringIO object.

njsmith · 2018-07-17T00:43:00Z

Huh, that's weird. My first thought was some kind of race condition, but... raise_signal uses raise(), which on Windows should be invoking the signal handler synchronously, so the warning should definitely be printed before raise_signal() returns.

Could the warning be trapped in some buffer? That would be weird too, usually stderr and warnings should not be buffered.

Of course putting a retry loop around the test is an option if we can't figure out how to fix it properly.

vstinner · 2018-07-17T11:29:24Z

I succeeded to reproduce the issue on Windows, but I also got a failure of test_signal.test_socket() as if the C signal handle has not been called immediately by _testcapi.raise_signal() and so the socket didn't contain the byte of the signal number. It seems quite easy to make test_signal fail on Windows.

vstinner · 2018-07-18T10:09:54Z

On Windows 10, when I run "python -m test test_signal -v -F" in 3 terminals in parallel, sometimes test_socket() fails.

I debugged test_socket() and I validated that the C signal handler called send() and that the signal byte (signal number) has been sent properly (send() returns 1) when the bug occurs. I added an internal "static size_t written = 0;" variable which is incremented when send() succeed and I added a signal.get_written() method to get the value after read.recv(1) fails in the unit test. But recv() fails with BlockingIOError, as if the single byte has not be transfered from the write end to the read end of the TCP socket pair.

On Windows, a socket pair is a pair of two TCP sockets connected on the local link (IPv4 127.0.0.1). It's not a UNIX socket as on Linux.

Questions.

(*) Is the socket pair properly connected when the bug occurs? socket.socketpair() sets the socket as non-blocking to call connect() and ignores when connect() fails with BlockingIOError or InterruptedError (EINTR). I modified the Windows implementation of socketpair() (in socket.py) to use a blocking call and the bug occurs even if connect() is made in blocking mode. I also checked that csock.getpeername() == lsock.getsockname(). So the socket is connected.

(*) Does the socket pair use buffering somewhere? Both socekts are set to non-blocking mode by test_signal.test_socket(). By default, SO_SNDBUF is 64 KiB (65 536 bytes) for the write end of the socket pair, and SO_RCVBUF is 64 KiB (65 536 bytes) for the read end of the socket pair.

It *seems* like adding the following call to test_socket() works around the bug:

write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

I'm not sure if it's "correct" to use SO_SNDBUF=0. I'm not sure that SO_SNDBUF=0 really *fix* the bug: it's a race condition, so it's hard to check if it's really fixed or not.

vstinner · 2018-07-18T10:16:59Z

write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

Instead of that, I also tried:

read.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
write.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

But I'm able to reproduce the bug with TCP_NODELAY=1 on the write end. (The option should have no impact on the read end, but I added it anyway, just in case, I still reproduce the bug.)

MSDN documentation for TCP_NODELAY:

"Disables the Nagle algorithm for send coalescing.This socket option is included for backward compatibility with Windows Sockets 1.1 "

https://docs.microsoft.com/en-us/windows/desktop/api/winsock/nf-winsock-setsockopt

vstinner · 2018-07-18T10:18:02Z

Note: socket.socketpair() has been implemented for Windows in pure Python in Python 3.5: bpo-18643.

vstinner · 2018-07-18T10:28:24Z

Oh, I found a test written by Nathaniel in trio source code and he wrote that "if SNDBUF = 0 freezes, so that's useless" :-)

https://github.com/python-trio/trio/blob/master/notes-to-self/socketpair-buffering.py

vstinner · 2018-07-18T10:38:34Z

Using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)", I ran a stress test on test_socket() in 3 different terminals and I failed to reproduce the bug, whereas the same test (in 3 terminals) usually reproduces the bug in less than 30 seconds.

Using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)", I still reproduce the bug.

It seems like Windows behaves differently with SO_SNDBUF=0 as if the socket has really no kernel buffer and so respects user preference for SO_SNDBUF=0 ;-)

vstinner · 2018-07-18T10:48:02Z

I found two different articles on microsoft.com which confirms that SO_SNDBUF=0 is valid and is intended to completely disable kernel buffering:

"If SO_SNDBUF=0, each send() or WriteFile() must wait for an acknowledgement from the other side before returning. (That is the only way for the protocol to implement retransmissions)"

vstinner · 2018-07-18T12:37:02Z

I ran "python -m test test_signal -v -F" 4 times in parallel (in 4 terminals), and I reproduced test_warn_on_full_buffer() failure after one hour: 2 terminals got the failures (the 2 others are still running).

I ran this test using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)" fix on test_socket().

vstinner · 2018-07-18T13:14:53Z

I modified test_warn_on_full_buffer() to fill the socket pair using chunks of 64 KiB instead of sending b'x' byte per byte, just to make the test faster.

Using two terminals, I ran "python -m test test_signal -m test_warn_on_full_buffer -F": I reproduced the bug in 2 minutes.

Ok. The problem is in that is that the code responsible to full the socketpair is not reliable. *Sometimes*, send() blocks, whereas if I retry *immediately*, the send() complete (the socketpair is not full).

The issue seems to be related to the test_socket() failure where send() is "asynchronous".

vstinner · 2018-07-18T14:26:59Z

I wrote a simple fix for test_signal.test_socket(): PR 8326.

Just configure the read end of the socket pair as blocking, so read.recv(1) blocks until we get the signal handler.

This fix is simple and should be more portable than my previous attempt of workaround:

write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

vstinner · 2018-07-18T15:16:03Z

I tried to modify the test_warn_on_full_buffer() code filling the socketpair buffer:

iterate on multiple chunk sizes: 2**16, 2**10, 2**5, 1 bytes
first try in non-blocking mode then retry with a timeout of 100 ms
I also tried to play with SO_SNDBUF and SO_RCVBUF socket options

The problem is that I failed to fail a reliable way to fill the socketpair buffer.

I will stick with Nathaniel's very simple idea: retry once if the test fails. I implemented exactly this idea in my PR 8327 (with additional minor changes to enhance the test).

vstinner · 2018-07-18T15:44:08Z

New changeset 99bb6df by Victor Stinner in branch 'master':
bpo-34130: Fix test_signal.test_socket() (GH-8326)
99bb6df

vstinner · 2018-07-18T16:29:58Z

New changeset 686b4b5 by Victor Stinner in branch 'master':
bpo-34130: Fix test_signal.test_warn_on_full_buffer() (GH-8327)
686b4b5

vstinner · 2018-07-18T16:48:44Z

Python 2.7 is not impacted: it doesn't have test_socket() nor test_warn_on_full_buffer() tests in test_signal.

Python 3.6 only has test_socket() in test_signal.

vstinner · 2018-07-18T16:49:30Z

New changeset 296e572 by Victor Stinner in branch '3.7':
[3.7] bpo-34130: Fix 2 race conditions in test_signal (GH-8329)
296e572

vstinner · 2018-07-18T17:02:09Z

New changeset e13180b by Victor Stinner (Miss Islington (bot)) in branch '3.6':
bpo-34130: Fix test_signal.test_socket() (GH-8326) (GH-8330)
e13180b

vstinner · 2018-07-18T17:03:03Z

Ok, test_signal should now be more reliable on Windows in 3.6, 3.7 and master branches.

If test_signal.test_warn_on_full_buffer() fails again, we should just increase the timeout of 50 ms to 100 ms or more.

vstinner added 3.8 only security fixes tests Tests in the Lib/test dir labels Jul 16, 2018

vstinner added the 3.7 (EOL) end of life label Jul 18, 2018

vstinner closed this as completed Jul 18, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

vstinner commented Jul 16, 2018

vstinner commented Jul 16, 2018

vstinner commented Jul 17, 2018

vstinner commented Jul 17, 2018

njsmith commented Jul 17, 2018

vstinner commented Jul 17, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

Comments

vstinner commented Jul 16, 2018

vstinner commented Jul 16, 2018

vstinner commented Jul 17, 2018

vstinner commented Jul 17, 2018

njsmith commented Jul 17, 2018

vstinner commented Jul 17, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018

vstinner commented Jul 18, 2018