classification
Title: test_signal: test_warn_on_full_buffer() failed on AppVeyor
Type: Stage: resolved
Components: Tests Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: njs, pitrou, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2018-07-16 22:19 by vstinner, last changed 2018-07-18 17:03 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 8326 merged vstinner, 2018-07-18 14:25
PR 8327 merged vstinner, 2018-07-18 15:12
PR 8329 merged vstinner, 2018-07-18 16:34
PR 8330 merged miss-islington, 2018-07-18 16:48
Messages (21)
msg321775 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-16 22:19
test_signal.test_warn_on_full_buffer() failed on AppVeyor.

https://ci.appveyor.com/project/python/cpython/build/3.8build19372

======================================================================
FAIL: test_warn_on_full_buffer (test.test_signal.WakeupSocketSignalTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\cpython\lib\test\test_signal.py", line 538, in test_warn_on_full_buffer
    assert_python_ok('-c', code)
  File "C:\projects\cpython\lib\test\support\script_helper.py", line 157, in assert_python_ok
    return _assert_python(True, *args, **env_vars)
  File "C:\projects\cpython\lib\test\support\script_helper.py", line 143, in _assert_python
    res.fail(cmd_line)
  File "C:\projects\cpython\lib\test\support\script_helper.py", line 84, in fail
    err))
AssertionError: Process return code is 1
command line: ['C:\\projects\\cpython\\PCbuild\\win32\\python.exe', '-X', 'faulthandler', '-I', '-c', 'if 1:\n        import errno\n        import signal\n        import socket\n        import sys\n        import time\n        import _testcapi\n        from test.support import captured_stderr\n\n        signum = signal.SIGINT\n\n        # This handler will be called, but we intentionally won\'t read from\n        # the wakeup fd.\n        def handler(signum, frame):\n            pass\n\n        signal.signal(signum, handler)\n\n        read, write = socket.socketpair()\n        read.setblocking(False)\n        write.setblocking(False)\n\n        # Fill the send buffer\n        try:\n            while True:\n                write.send(b"x")\n        except BlockingIOError:\n            pass\n\n        # By default, we get a warning when a signal arrives\n        signal.set_wakeup_fd(write.fileno())\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        # And also if warn_on_full_buffer=True\n        signal.set_wakeup_fd(write.fileno(), warn_on_full_buffer=True)\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        # But not if warn_on_full_buffer=False\n        signal.set_wakeup_fd(write.fileno(), warn_on_full_buffer=False)\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if err != "":\n            raise AssertionError("got unexpected output %r" % (err,))\n\n        # And then check the default again, to make sure warn_on_full_buffer\n        # settings don\'t leak across calls.\n        signal.set_wakeup_fd(write.fileno())\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        ']
stdout:
---
---
stderr:
---
Traceback (most recent call last):
  File "<string>", line 39, in <module>
AssertionError
---
----------------------------------------------------------------------

The test passed when run again sequentially ("Re-running test 'test_signal' in verbose mode").
msg321780 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-17 00:09
The test failed at:

    # Fill the send buffer
    try:
        while True:
            write.send(b"x")
    except BlockingIOError:
        pass

    # By default, we get a warning when a signal arrives
    signal.set_wakeup_fd(write.fileno())

    with captured_stderr() as err:
        _testcapi.raise_signal(signum)

    err = err.getvalue()
    if ('Exception ignored when trying to {action} to the signal wakeup fd'
        not in err):
        raise AssertionError(err) # <~~~~~~ HERE
msg321781 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-17 00:16
Traceback (most recent call last):
  File "<string>", line 39, in <module>
AssertionError

According to the traceback, the captured stderr ('err' variable) is an empty string.

The test uses test.support.capture_stderr() which replaces sys.stderr.

The signal module calls PySys_WriteStderr(msg) which calls sys.stderr.write(msg). If the Python call fails, msg is supposed to be written into the C stderr stream. sys.stderr.flush() is not called, but it shouldn't be needed, since test.support.capture_stderr() replace sys.stderr with a io.StringIO object.
msg321783 - (view) Author: Nathaniel Smith (njs) * (Python committer) Date: 2018-07-17 00:42
Huh, that's weird. My first thought was some kind of race condition, but... raise_signal uses raise(), which on Windows should be invoking the signal handler synchronously, so the warning should definitely be printed before raise_signal() returns.

Could the warning be trapped in some buffer? That would be weird too, usually stderr and warnings should not be buffered.

Of course putting a retry loop around the test is an option if we can't figure out how to fix it properly.
msg321815 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-17 11:29
I succeeded to reproduce the issue on Windows, but I also got a failure of test_signal.test_socket() as if the C signal handle has not been called immediately by _testcapi.raise_signal() and so the socket didn't contain the byte of the signal number. It seems quite easy to make test_signal fail on Windows.
msg321873 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 10:09
On Windows 10, when I run "python -m test test_signal -v -F" in 3 terminals in parallel, sometimes test_socket() fails.

I debugged test_socket() and I validated that the C signal handler called send() and that the signal byte (signal number) has been sent properly (send() returns 1) when the bug occurs. I added an internal "static size_t written = 0;" variable which is incremented when send() succeed and I added a signal.get_written() method to get the value after read.recv(1) fails in the unit test. But recv() fails with BlockingIOError, as if the single byte has not be transfered from the write end to the read end of the TCP socket pair.

On Windows, a socket pair is a pair of two TCP sockets connected on the local link (IPv4 127.0.0.1). It's not a UNIX socket as on Linux.

Questions.

(*) Is the socket pair properly connected when the bug occurs? socket.socketpair() sets the socket as non-blocking to call connect() and ignores when connect() fails with BlockingIOError or InterruptedError (EINTR). I modified the Windows implementation of socketpair() (in socket.py) to use a blocking call and the bug occurs even if connect() is made in blocking mode. I also checked that csock.getpeername() == lsock.getsockname(). So the socket is connected.

(*) Does the socket pair use buffering somewhere? Both socekts are set to non-blocking mode by test_signal.test_socket(). By default, SO_SNDBUF is 64 KiB (65 536 bytes) for the write end of the socket pair, and SO_RCVBUF is 64 KiB (65 536 bytes) for the read end of the socket pair.

It *seems* like adding the following call to test_socket() works around the bug:

write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

I'm not sure if it's "correct" to use SO_SNDBUF=0. I'm not sure that SO_SNDBUF=0 really *fix* the bug: it's a race condition, so it's hard to check if it's really fixed or not.
msg321874 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 10:16
> write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

Instead of that, I also tried:

read.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
write.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

But I'm able to reproduce the bug with TCP_NODELAY=1 on the write end. (The option should have no impact on the read end, but I added it anyway, just in case, I still reproduce the bug.)

MSDN documentation for TCP_NODELAY:

"Disables the Nagle algorithm for send coalescing.This socket option is included for backward compatibility with Windows Sockets 1.1 "

https://docs.microsoft.com/en-us/windows/desktop/api/winsock/nf-winsock-setsockopt
msg321875 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 10:18
Note: socket.socketpair() has been implemented for Windows in pure Python in Python 3.5: bpo-18643.
msg321876 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 10:28
Oh, I found a test written by Nathaniel in trio source code and he wrote that "if SNDBUF = 0 freezes, so that's useless" :-)

https://github.com/python-trio/trio/blob/master/notes-to-self/socketpair-buffering.py
msg321877 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 10:38
Using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)", I ran a stress test on test_socket() in 3 different terminals and I failed to reproduce the bug, whereas the same test (in 3 terminals) usually reproduces the bug in less than 30 seconds.

Using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)", I still reproduce the bug.

It seems like Windows behaves differently with SO_SNDBUF=0 as if the socket has really no kernel buffer and so respects user preference for SO_SNDBUF=0 ;-)
msg321879 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 10:48
I found two different articles on microsoft.com which confirms that SO_SNDBUF=0 is valid and is intended to completely disable kernel buffering:
* https://support.microsoft.com/en-gb/help/181611/socket-overlapped-i-o-versus-blocking-nonblocking-mode
* https://blogs.msdn.microsoft.com/larryosterman/2004/08/06/the-consequences-of-ignoring-nagling-and-delayed-acks/

"If SO_SNDBUF=0, each send() or WriteFile() must wait for an acknowledgement from the other side before returning. (That is the only way for the protocol to implement retransmissions)"
msg321880 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 12:37
I ran "python -m test test_signal -v -F" 4 times in parallel (in 4 terminals), and I reproduced test_warn_on_full_buffer() failure after one hour: 2 terminals got the failures (the 2 others are still running).

I ran this test using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)" fix on test_socket().
msg321881 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 13:14
I modified test_warn_on_full_buffer() to fill the socket pair using chunks of 64 KiB instead of sending b'x' byte per byte, just to make the test faster.

Using two terminals, I ran "python -m test test_signal -m test_warn_on_full_buffer -F": I reproduced the bug in 2 minutes.

Ok. The problem is in that is that the code responsible to full the socketpair is not reliable. *Sometimes*, send() blocks, whereas if I retry *immediately*, the send() complete (the socketpair is not full).

The issue seems to be related to the test_socket() failure where send() is "asynchronous".
msg321884 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 14:26
I wrote a simple fix for test_signal.test_socket(): PR 8326.

Just configure the read end of the socket pair as blocking, so read.recv(1) blocks until we get the signal handler.

This fix is simple and should be more portable than my previous attempt of workaround:

write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)
msg321889 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 15:16
I tried to modify the test_warn_on_full_buffer() code filling the socketpair buffer:

* iterate on multiple chunk sizes: 2**16, 2**10, 2**5, 1 bytes
* first try in non-blocking mode then retry with a timeout of 100 ms
* I also tried to play with SO_SNDBUF and SO_RCVBUF socket options

The problem is that I failed to fail a reliable way to fill the socketpair buffer.

I will stick with Nathaniel's very simple idea: retry once if the test fails. I implemented exactly this idea in my PR 8327 (with additional minor changes to enhance the test).
msg321890 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 15:44
New changeset 99bb6df66a42625367c4f38e6802c8bb527baf4a by Victor Stinner in branch 'master':
bpo-34130: Fix test_signal.test_socket() (GH-8326)
https://github.com/python/cpython/commit/99bb6df66a42625367c4f38e6802c8bb527baf4a
msg321892 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 16:29
New changeset 686b4b5ff219ed66714f3b811815776dafadc23b by Victor Stinner in branch 'master':
bpo-34130: Fix test_signal.test_warn_on_full_buffer() (GH-8327)
https://github.com/python/cpython/commit/686b4b5ff219ed66714f3b811815776dafadc23b
msg321893 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 16:48
Python 2.7 is not impacted: it doesn't have test_socket() nor test_warn_on_full_buffer() tests in test_signal.

Python 3.6 only has test_socket() in test_signal.
msg321894 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 16:49
New changeset 296e5720921f1837dfd09a16457ac5bba7cb1676 by Victor Stinner in branch '3.7':
[3.7] bpo-34130: Fix 2 race conditions in test_signal (GH-8329)
https://github.com/python/cpython/commit/296e5720921f1837dfd09a16457ac5bba7cb1676
msg321897 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 17:02
New changeset e13180bc4c40a9ac300590c62a6efbcaf10ae6df by Victor Stinner (Miss Islington (bot)) in branch '3.6':
bpo-34130: Fix test_signal.test_socket() (GH-8326) (GH-8330)
https://github.com/python/cpython/commit/e13180bc4c40a9ac300590c62a6efbcaf10ae6df
msg321898 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-18 17:03
Ok, test_signal should now be more reliable on Windows in 3.6, 3.7 and master branches.

If test_signal.test_warn_on_full_buffer() fails again, we should just increase the timeout of 50 ms to 100 ms or more.
History
Date User Action Args
2018-07-18 17:03:02vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg321898

stage: patch review -> resolved
2018-07-18 17:02:09vstinnersetmessages: + msg321897
2018-07-18 16:49:30vstinnersetmessages: + msg321894
2018-07-18 16:48:44vstinnersetmessages: + msg321893
versions: + Python 3.6, Python 3.7
2018-07-18 16:48:12miss-islingtonsetpull_requests: + pull_request7867
2018-07-18 16:34:51vstinnersetpull_requests: + pull_request7866
2018-07-18 16:29:57vstinnersetmessages: + msg321892
2018-07-18 15:44:07vstinnersetmessages: + msg321890
2018-07-18 15:16:02vstinnersetmessages: + msg321889
2018-07-18 15:12:51vstinnersetpull_requests: + pull_request7863
2018-07-18 14:26:59vstinnersetmessages: + msg321884
2018-07-18 14:25:35vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request7862
2018-07-18 13:14:53vstinnersetmessages: + msg321881
2018-07-18 12:37:02vstinnersetmessages: + msg321880
2018-07-18 10:48:02vstinnersetmessages: + msg321879
2018-07-18 10:38:34vstinnersetmessages: + msg321877
2018-07-18 10:28:24vstinnersetmessages: + msg321876
2018-07-18 10:18:01vstinnersetmessages: + msg321875
2018-07-18 10:16:58vstinnersetnosy: + yselivanov
messages: + msg321874
2018-07-18 10:09:54vstinnersetmessages: + msg321873
2018-07-17 11:29:23vstinnersetmessages: + msg321815
2018-07-17 00:43:00njssetmessages: + msg321783
2018-07-17 00:16:35vstinnersetmessages: + msg321781
2018-07-17 00:09:38vstinnersetmessages: + msg321780
2018-07-16 22:19:02vstinnercreate