Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic freeze in test_interrupted_write_retry_text #67868

Closed
pitrou opened this issue Mar 16, 2015 · 14 comments
Closed

Sporadic freeze in test_interrupted_write_retry_text #67868

pitrou opened this issue Mar 16, 2015 · 14 comments
Labels
tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@pitrou
Copy link
Member

pitrou commented Mar 16, 2015

BPO 23680
Nosy @pitrou, @vstinner, @gerritholl, @vadmium
Files
  • write-retry.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-03-16.15:53:27.284>
    labels = ['type-bug', 'tests']
    title = 'Sporadic freeze in test_interrupted_write_retry_text'
    updated_at = <Date 2017-01-04.15:32:49.190>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2017-01-04.15:32:49.190>
    actor = 'Gerrit.Holl'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Tests']
    creation = <Date 2015-03-16.15:53:27.284>
    creator = 'pitrou'
    dependencies = []
    files = ['43146']
    hgrepos = []
    issue_num = 23680
    keywords = ['patch', 'buildbot']
    message_count = 13.0
    messages = ['238215', '238216', '238217', '238218', '238221', '238224', '238225', '238226', '238227', '238229', '266690', '267099', '284642']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'vstinner', 'neologix', 'Gerrit.Holl', 'python-dev', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue23680'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']

    @pitrou
    Copy link
    Member Author

    pitrou commented Mar 16, 2015

    Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest). Corresponding strace is:

    write(1, "test_interrupted_write_retry_tex"..., 66) = 66
    rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
    pipe2([4, 5], O_CLOEXEC) = 0
    rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
    fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument)
    lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    alarm(1) = 0
    brk(0x3360000) = 0x3360000
    --- SIGALRM (Alarm clock) @ 0 (0) ---
    rt_sigreturn(0x2640030) = 53870592
    write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305

    A successful run looks like this:

    write(1, "test_interrupted_write_retry_tex"..., 66) = 66
    rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
    pipe2([4, 5], O_CLOEXEC) = 0
    rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
    fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
    ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument)
    lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
    alarm(1) = 0
    brk(0x2f60000) = 0x2f60000
    brk(0x3360000) = 0x3360000
    write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305) = 65536
    --- SIGALRM (Alarm clock) @ 0 (0) ---
    rt_sigreturn(0x2640030) = 65536
    rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
    alarm(1) = 0
    write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = ? ERESTARTSYS (To be restarted)
    --- SIGALRM (Alarm clock) @ 0 (0) ---
    rt_sigreturn(0x2640030) = -1 EINTR (Interrupted system call)
    clone(child_stack=0x7fd1a8852ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SE
    TTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd1a88539d0, tls=0x7fd1a8853700, child_tidptr=0x7fd1a88539d0) = 3611
    futex(0x24b5140, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
    futex(0x9a1904, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1563, {1426520501, 20155000}, ffffffff) = 0
    futex(0x9a1940, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
    futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1) = 0
    write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = 4128769
    futex(0x9a1904, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9a1900, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x2631c50, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
    close(5) = 0
    close(4) = 0
    rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
    write(1, "ok\n", 3) = 3

    @pitrou pitrou added tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error labels Mar 16, 2015
    @vstinner
    Copy link
    Member

    The bug only occurs in Python 3.5, right?

    Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest).

    How do you run the test suite? Is your system heavily loaded? Is it "fast"?

    This is an obvious race condition in the test if SIGARLM is send before write() is called:

                signal.alarm(1)
                # Expected behaviour:
                # - first raw write() is partial (because of the limited pipe buffer
                #   and the first alarm)
                # - second raw write() returns EINTR (because of the second alarm)
                # - subsequent write()s are successful (either partial or complete)
                self.assertEqual(N, wio.write(item * N))

    1 second should be enough :-)

    --

    Or it is maybe a regression caused by the changeset 5b63010be19e of issue bpo-23285 (PEP-475).

    @pitrou
    Copy link
    Member Author

    pitrou commented Mar 16, 2015

    The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems).

    @vstinner
    Copy link
    Member

    The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems).

    Yep. It remembers me my old idea to make "sleep configurable" in tests: issue bpo-20910.

    Most of the time, 1 second is enough. But on such very slow setup, it's annoying to get random failures because of race conditions. It would be better to use longer timeout (ex: 5 seconds), without making tests longer on other buildbots.

    @pitrou
    Copy link
    Member Author

    pitrou commented Mar 16, 2015

    Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. item * (support.PIPE_MAX_SIZE // len(item) + 1)) before the alarm(1) call?

    Or what if alarm_interrupt is simply set up to retrigger the signal? e.g. instead of:

        def alarm_interrupt(self, sig, frame):
            1/0

    write:

        def alarm_interrupt(self, sig, frame):
            signal.alarm(1)
            1/0

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 16, 2015

    New changeset a18f7508649b by Victor Stinner in branch 'default':
    Issue bpo-23680: Reduce risk of race condition in check_interrupted_write_retry()
    https://hg.python.org/cpython/rev/a18f7508649b

    @pitrou
    Copy link
    Member Author

    pitrou commented Mar 16, 2015

    Also, I think there another issue in that test. It uses N = support.PIPE_MAX_SIZE but I think you want N = support.PIPE_MAX_SIZE + 1 if you want the raw write() to block.

    There is another test that can have a race condition: check_interrupted_write().

    @vstinner
    Copy link
    Member

    2015-03-16 17:41 GMT+01:00 Antoine Pitrou report@bugs.python.org:

    Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. item * (support.PIPE_MAX_SIZE // len(item) + 1)) before the alarm(1) call?

    Good idea, the first strace shows that SIGALRM was received while
    Python was creating the large data.

    I made this simple change. Can you tell me if my change fixes the issue?

    Or what if alarm_interrupt is simply set up to retrigger the signal?

    It may make the test more reliable and I don't see how it can fix the
    issue showed in the strace. Feel free to make the test more reliable
    with such change if you want.

    @vstinner
    Copy link
    Member

    Also, I think there another issue in that test. It uses N = support.PIPE_MAX_SIZE but I think you want N = support.PIPE_MAX_SIZE + 1 if you want the raw write() to block.

    PIPE_MAX_SIZE is much larger than the effictive size of a pipe on
    Linux. It's defined as:

    # A constant likely larger than the underlying OS pipe buffer size, to
    # make writes blocking.
    # Windows limit seems to be around 512 B, and many Unix kernels have a
    # 64 KiB pipe buffer size or 16 * PAGE_SIZE: take a few megs to be sure.
    # (see issue python/cpython#62035 for a discussion of this number).
    PIPE_MAX_SIZE = 4 * 1024 * 1024 + 1

    I don't think that PIPE_MAX_SIZE+1 makes a difference here.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 16, 2015

    New changeset 10acab2d4a88 by Victor Stinner in branch 'default':
    Issue bpo-23680: Reduce risk of race condition in check_interrupted_write() of
    https://hg.python.org/cpython/rev/10acab2d4a88

    @vadmium
    Copy link
    Member

    vadmium commented May 30, 2016

    I found a relatively recent case of this failing on a buildbot:

    http://buildbot.python.org/all/builders/AMD64%20FreeBSD%209.x%203.x/builds/4339/steps/test/logs/stdio
    == CPython 3.6.0a1+ (default:eaee5aed6fbc, May 27 2016, 05:52:21) [GCC 4.2.1 20070831 patched [FreeBSD]]
    0:19:34 [401/401/1] test_io crashed
    Timeout (0:15:00)!
    Thread 0x0000000801807400 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3808 in check_interrupted_write_retry
    File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3834 in test_interrupted_write_retry_text

    Line 3808 is the write() call, so it could be hanging if the second signal is delivered just before Python makes the write() syscall. Perhaps it might help to use setitimer() so that even if a signal is delivered at an inconvenient moment, the write() call will still be interrupted by a later signal, and read thread will still be spawned.

    Also see bpo-22331 for an idea to avoid the EBADF hack.

    @vadmium
    Copy link
    Member

    vadmium commented Jun 3, 2016

    Here is my suggested change to use setitimer(). I also closed the pipe, which means there is no need to time out the read, speeding the test up a bit more.

    @gerritholl
    Copy link
    Mannequin

    gerritholl mannequin commented Jan 4, 2017

    I experience this problem when trying to build/test Python 3.6 on the JASMIN Analysis Platform which runs Red Hat Enterprise Linux Server release 6.8 on a machine with 48 × Intel(R) Xeon(R) CPU E7-4860 v2 @ 2.60GHz, 2 TiB RAM, and a PanFSⓇ distributed file system. I experience this problem regardless whether I build/test Python 3.6 from within a PanFS filesystem or otherwise.

    ./python -m test -v test_io
    == CPython 3.6.0 (default, Jan 4 2017, 15:08:07) [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)]
    == Linux-2.6.32-642.6.2.el6.x86_64-x86_64-with-redhat-6.8-Santiago little-endian
    == hash algorithm: siphash24 64bit
    == cwd: /dev/shm/gerrit/Python-3.6.0/build/test_python_34342
    == encodings: locale=UTF-8, FS=utf-8
    Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0)
    Run tests sequentially
    0:00:00 [1/1] test_io
    […snip many lines…]
    test_interrupted_write_retry_text (test.test_io.CSignalsTest) ...

    after which there is no more output.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @kumaraditya303
    Copy link
    Contributor

    Closing as it is out of date now and our CI and buildbots are green.

    @kumaraditya303 kumaraditya303 closed this as not planned Won't fix, can't repro, duplicate, stale May 27, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants