Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

Closed
vstinner opened this issue Jul 16, 2018 · 21 comments
Closed

test_signal: test_warn_on_full_buffer() failed on AppVeyor #78311

vstinner opened this issue Jul 16, 2018 · 21 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir

Comments

@vstinner
Copy link
Member

BPO 34130
Nosy @pitrou, @vstinner, @njsmith, @1st1
PRs
  • bpo-34130: Fix test_signal.test_socket() #8326
  • bpo-34130: Fix test_signal.test_warn_on_full_buffer() #8327
  • [3.7] bpo-34130: Fix 2 race conditions in test_signal #8329
  • [3.6] bpo-34130: Fix test_signal.test_socket() (GH-8326) #8330
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-07-18.17:03:02.895>
    created_at = <Date 2018-07-16.22:19:01.981>
    labels = ['3.8', '3.7', 'tests']
    title = 'test_signal: test_warn_on_full_buffer() failed on AppVeyor'
    updated_at = <Date 2018-07-18.17:03:02.893>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2018-07-18.17:03:02.893>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-07-18.17:03:02.895>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2018-07-16.22:19:01.981>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34130
    keywords = ['patch']
    message_count = 21.0
    messages = ['321775', '321780', '321781', '321783', '321815', '321873', '321874', '321875', '321876', '321877', '321879', '321880', '321881', '321884', '321889', '321890', '321892', '321893', '321894', '321897', '321898']
    nosy_count = 4.0
    nosy_names = ['pitrou', 'vstinner', 'njs', 'yselivanov']
    pr_nums = ['8326', '8327', '8329', '8330']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue34130'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @vstinner
    Copy link
    Member Author

    test_signal.test_warn_on_full_buffer() failed on AppVeyor.

    https://ci.appveyor.com/project/python/cpython/build/3.8build19372

    ======================================================================
    FAIL: test_warn_on_full_buffer (test.test_signal.WakeupSocketSignalTests)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "C:\projects\cpython\lib\test\test_signal.py", line 538, in test_warn_on_full_buffer
        assert_python_ok('-c', code)
      File "C:\projects\cpython\lib\test\support\script_helper.py", line 157, in assert_python_ok
        return _assert_python(True, *args, **env_vars)
      File "C:\projects\cpython\lib\test\support\script_helper.py", line 143, in _assert_python
        res.fail(cmd_line)
      File "C:\projects\cpython\lib\test\support\script_helper.py", line 84, in fail
        err))
    AssertionError: Process return code is 1
    command line: ['C:\\projects\\cpython\\PCbuild\\win32\\python.exe', '-X', 'faulthandler', '-I', '-c', 'if 1:\n        import errno\n        import signal\n        import socket\n        import sys\n        import time\n        import _testcapi\n        from test.support import captured_stderr\n\n        signum = signal.SIGINT\n\n        # This handler will be called, but we intentionally won\'t read from\n        # the wakeup fd.\n        def handler(signum, frame):\n            pass\n\n        signal.signal(signum, handler)\n\n        read, write = socket.socketpair()\n        read.setblocking(False)\n        write.setblocking(False)\n\n        # Fill the send buffer\n        try:\n            while True:\n                write.send(b"x")\n        except BlockingIOError:\n            pass\n\n        # By default, we get a warning when a signal arrives\n        signal.set_wakeup_fd(write.fileno())\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        # And also if warn_on_full_buffer=True\n        signal.set_wakeup_fd(write.fileno(), warn_on_full_buffer=True)\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        # But not if warn_on_full_buffer=False\n        signal.set_wakeup_fd(write.fileno(), warn_on_full_buffer=False)\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if err != "":\n            raise AssertionError("got unexpected output %r" % (err,))\n\n        # And then check the default again, to make sure warn_on_full_buffer\n        # settings don\'t leak across calls.\n        signal.set_wakeup_fd(write.fileno())\n\n        with captured_stderr() as err:\n            _testcapi.raise_signal(signum)\n\n        err = err.getvalue()\n        if (\'Exception ignored when trying to send to the signal wakeup fd\'\n            not in err):\n            raise AssertionError(err)\n\n        ']
    stdout:
    ---
    

    stderr:
    ---

    Traceback (most recent call last):
      File "<string>", line 39, in <module>
    AssertionError
    ---
    

    The test passed when run again sequentially ("Re-running test 'test_signal' in verbose mode").

    @vstinner vstinner added 3.8 only security fixes tests Tests in the Lib/test dir labels Jul 16, 2018
    @vstinner
    Copy link
    Member Author

    The test failed at:

    # Fill the send buffer
    try:
        while True:
            write.send(b"x")
    except BlockingIOError:
        pass
    
        # By default, we get a warning when a signal arrives
        signal.set_wakeup_fd(write.fileno())
    
        with captured_stderr() as err:
            _testcapi.raise_signal(signum)
        err = err.getvalue()
        if ('Exception ignored when trying to {action} to the signal wakeup fd'
            not in err):
            raise AssertionError(err) # <~~~~~~ HERE

    @vstinner
    Copy link
    Member Author

    Traceback (most recent call last):
      File "<string>", line 39, in <module>
    AssertionError

    According to the traceback, the captured stderr ('err' variable) is an empty string.

    The test uses test.support.capture_stderr() which replaces sys.stderr.

    The signal module calls PySys_WriteStderr(msg) which calls sys.stderr.write(msg). If the Python call fails, msg is supposed to be written into the C stderr stream. sys.stderr.flush() is not called, but it shouldn't be needed, since test.support.capture_stderr() replace sys.stderr with a io.StringIO object.

    @njsmith
    Copy link
    Contributor

    njsmith commented Jul 17, 2018

    Huh, that's weird. My first thought was some kind of race condition, but... raise_signal uses raise(), which on Windows should be invoking the signal handler synchronously, so the warning should definitely be printed before raise_signal() returns.

    Could the warning be trapped in some buffer? That would be weird too, usually stderr and warnings should not be buffered.

    Of course putting a retry loop around the test is an option if we can't figure out how to fix it properly.

    @vstinner
    Copy link
    Member Author

    I succeeded to reproduce the issue on Windows, but I also got a failure of test_signal.test_socket() as if the C signal handle has not been called immediately by _testcapi.raise_signal() and so the socket didn't contain the byte of the signal number. It seems quite easy to make test_signal fail on Windows.

    @vstinner
    Copy link
    Member Author

    On Windows 10, when I run "python -m test test_signal -v -F" in 3 terminals in parallel, sometimes test_socket() fails.

    I debugged test_socket() and I validated that the C signal handler called send() and that the signal byte (signal number) has been sent properly (send() returns 1) when the bug occurs. I added an internal "static size_t written = 0;" variable which is incremented when send() succeed and I added a signal.get_written() method to get the value after read.recv(1) fails in the unit test. But recv() fails with BlockingIOError, as if the single byte has not be transfered from the write end to the read end of the TCP socket pair.

    On Windows, a socket pair is a pair of two TCP sockets connected on the local link (IPv4 127.0.0.1). It's not a UNIX socket as on Linux.

    Questions.

    (*) Is the socket pair properly connected when the bug occurs? socket.socketpair() sets the socket as non-blocking to call connect() and ignores when connect() fails with BlockingIOError or InterruptedError (EINTR). I modified the Windows implementation of socketpair() (in socket.py) to use a blocking call and the bug occurs even if connect() is made in blocking mode. I also checked that csock.getpeername() == lsock.getsockname(). So the socket is connected.

    (*) Does the socket pair use buffering somewhere? Both socekts are set to non-blocking mode by test_signal.test_socket(). By default, SO_SNDBUF is 64 KiB (65 536 bytes) for the write end of the socket pair, and SO_RCVBUF is 64 KiB (65 536 bytes) for the read end of the socket pair.

    It *seems* like adding the following call to test_socket() works around the bug:

    write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

    I'm not sure if it's "correct" to use SO_SNDBUF=0. I'm not sure that SO_SNDBUF=0 really *fix* the bug: it's a race condition, so it's hard to check if it's really fixed or not.

    @vstinner
    Copy link
    Member Author

    write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

    Instead of that, I also tried:

    read.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    write.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    But I'm able to reproduce the bug with TCP_NODELAY=1 on the write end. (The option should have no impact on the read end, but I added it anyway, just in case, I still reproduce the bug.)

    MSDN documentation for TCP_NODELAY:

    "Disables the Nagle algorithm for send coalescing.This socket option is included for backward compatibility with Windows Sockets 1.1 "

    https://docs.microsoft.com/en-us/windows/desktop/api/winsock/nf-winsock-setsockopt

    @vstinner
    Copy link
    Member Author

    Note: socket.socketpair() has been implemented for Windows in pure Python in Python 3.5: bpo-18643.

    @vstinner
    Copy link
    Member Author

    Oh, I found a test written by Nathaniel in trio source code and he wrote that "if SNDBUF = 0 freezes, so that's useless" :-)

    https://github.com/python-trio/trio/blob/master/notes-to-self/socketpair-buffering.py

    @vstinner
    Copy link
    Member Author

    Using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)", I ran a stress test on test_socket() in 3 different terminals and I failed to reproduce the bug, whereas the same test (in 3 terminals) usually reproduces the bug in less than 30 seconds.

    Using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)", I still reproduce the bug.

    It seems like Windows behaves differently with SO_SNDBUF=0 as if the socket has really no kernel buffer and so respects user preference for SO_SNDBUF=0 ;-)

    @vstinner
    Copy link
    Member Author

    I found two different articles on microsoft.com which confirms that SO_SNDBUF=0 is valid and is intended to completely disable kernel buffering:

    "If SO_SNDBUF=0, each send() or WriteFile() must wait for an acknowledgement from the other side before returning. (That is the only way for the protocol to implement retransmissions)"

    @vstinner
    Copy link
    Member Author

    I ran "python -m test test_signal -v -F" 4 times in parallel (in 4 terminals), and I reproduced test_warn_on_full_buffer() failure after one hour: 2 terminals got the failures (the 2 others are still running).

    I ran this test using "write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)" fix on test_socket().

    @vstinner
    Copy link
    Member Author

    I modified test_warn_on_full_buffer() to fill the socket pair using chunks of 64 KiB instead of sending b'x' byte per byte, just to make the test faster.

    Using two terminals, I ran "python -m test test_signal -m test_warn_on_full_buffer -F": I reproduced the bug in 2 minutes.

    Ok. The problem is in that is that the code responsible to full the socketpair is not reliable. *Sometimes*, send() blocks, whereas if I retry *immediately*, the send() complete (the socketpair is not full).

    The issue seems to be related to the test_socket() failure where send() is "asynchronous".

    @vstinner
    Copy link
    Member Author

    I wrote a simple fix for test_signal.test_socket(): PR 8326.

    Just configure the read end of the socket pair as blocking, so read.recv(1) blocks until we get the signal handler.

    This fix is simple and should be more portable than my previous attempt of workaround:

    write.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)

    @vstinner
    Copy link
    Member Author

    I tried to modify the test_warn_on_full_buffer() code filling the socketpair buffer:

    • iterate on multiple chunk sizes: 2**16, 2**10, 2**5, 1 bytes
    • first try in non-blocking mode then retry with a timeout of 100 ms
    • I also tried to play with SO_SNDBUF and SO_RCVBUF socket options

    The problem is that I failed to fail a reliable way to fill the socketpair buffer.

    I will stick with Nathaniel's very simple idea: retry once if the test fails. I implemented exactly this idea in my PR 8327 (with additional minor changes to enhance the test).

    @vstinner
    Copy link
    Member Author

    New changeset 99bb6df by Victor Stinner in branch 'master':
    bpo-34130: Fix test_signal.test_socket() (GH-8326)
    99bb6df

    @vstinner
    Copy link
    Member Author

    New changeset 686b4b5 by Victor Stinner in branch 'master':
    bpo-34130: Fix test_signal.test_warn_on_full_buffer() (GH-8327)
    686b4b5

    @vstinner
    Copy link
    Member Author

    Python 2.7 is not impacted: it doesn't have test_socket() nor test_warn_on_full_buffer() tests in test_signal.

    Python 3.6 only has test_socket() in test_signal.

    @vstinner vstinner added the 3.7 (EOL) end of life label Jul 18, 2018
    @vstinner
    Copy link
    Member Author

    New changeset 296e572 by Victor Stinner in branch '3.7':
    [3.7] bpo-34130: Fix 2 race conditions in test_signal (GH-8329)
    296e572

    @vstinner
    Copy link
    Member Author

    New changeset e13180b by Victor Stinner (Miss Islington (bot)) in branch '3.6':
    bpo-34130: Fix test_signal.test_socket() (GH-8326) (GH-8330)
    e13180b

    @vstinner
    Copy link
    Member Author

    Ok, test_signal should now be more reliable on Windows in 3.6, 3.7 and master branches.

    If test_signal.test_warn_on_full_buffer() fails again, we should just increase the timeout of 50 ms to 100 ms or more.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants