Sporadic freeze in test_interrupted_write_retry_text #67868

pitrou · 2015-03-16T15:53:27Z

BPO	23680
Nosy	@pitrou, @vstinner, @gerritholl, @vadmium
Files	write-retry.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-03-16.15:53:27.284>
labels = ['type-bug', 'tests']
title = 'Sporadic freeze in test_interrupted_write_retry_text'
updated_at = <Date 2017-01-04.15:32:49.190>
user = 'https://github.com/pitrou'

bugs.python.org fields:

activity = <Date 2017-01-04.15:32:49.190>
actor = 'Gerrit.Holl'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Tests']
creation = <Date 2015-03-16.15:53:27.284>
creator = 'pitrou'
dependencies = []
files = ['43146']
hgrepos = []
issue_num = 23680
keywords = ['patch', 'buildbot']
message_count = 13.0
messages = ['238215', '238216', '238217', '238218', '238221', '238224', '238225', '238226', '238227', '238229', '266690', '267099', '284642']
nosy_count = 6.0
nosy_names = ['pitrou', 'vstinner', 'neologix', 'Gerrit.Holl', 'python-dev', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue23680'
versions = ['Python 2.7', 'Python 3.5', 'Python 3.6']

pitrou · 2015-03-16T15:53:27Z

Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest). Corresponding strace is:

write(1, "test_interrupted_write_retry_tex"..., 66) = 66
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
pipe2([4, 5], O_CLOEXEC) = 0
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument)
lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
alarm(1) = 0
brk(0x3360000) = 0x3360000
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x2640030) = 53870592
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305

A successful run looks like this:

write(1, "test_interrupted_write_retry_tex"..., 66) = 66
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
pipe2([4, 5], O_CLOEXEC) = 0
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument)
lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
alarm(1) = 0
brk(0x2f60000) = 0x2f60000
brk(0x3360000) = 0x3360000
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305) = 65536
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x2640030) = 65536
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
alarm(1) = 0
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x2640030) = -1 EINTR (Interrupted system call)
clone(child_stack=0x7fd1a8852ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SE
TTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd1a88539d0, tls=0x7fd1a8853700, child_tidptr=0x7fd1a88539d0) = 3611
futex(0x24b5140, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x9a1904, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1563, {1426520501, 20155000}, ffffffff) = 0
futex(0x9a1940, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1) = 0
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = 4128769
futex(0x9a1904, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9a1900, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2631c50, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
close(5) = 0
close(4) = 0
rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
write(1, "ok\n", 3) = 3

vstinner · 2015-03-16T16:01:04Z

The bug only occurs in Python 3.5, right?

Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest).

How do you run the test suite? Is your system heavily loaded? Is it "fast"?

This is an obvious race condition in the test if SIGARLM is send before write() is called:

            signal.alarm(1)
            # Expected behaviour:
            # - first raw write() is partial (because of the limited pipe buffer
            #   and the first alarm)
            # - second raw write() returns EINTR (because of the second alarm)
            # - subsequent write()s are successful (either partial or complete)
            self.assertEqual(N, wio.write(item * N))

1 second should be enough :-)

--

Or it is maybe a regression caused by the changeset 5b63010be19e of issue bpo-23285 (PEP-475).

pitrou · 2015-03-16T16:18:05Z

The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems).

vstinner · 2015-03-16T16:28:12Z

The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems).

Yep. It remembers me my old idea to make "sleep configurable" in tests: issue bpo-20910.

Most of the time, 1 second is enough. But on such very slow setup, it's annoying to get random failures because of race conditions. It would be better to use longer timeout (ex: 5 seconds), without making tests longer on other buildbots.

pitrou · 2015-03-16T16:41:09Z

Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. item * (support.PIPE_MAX_SIZE // len(item) + 1)) before the alarm(1) call?

Or what if alarm_interrupt is simply set up to retrigger the signal? e.g. instead of:

    def alarm_interrupt(self, sig, frame):
        1/0

write:

    def alarm_interrupt(self, sig, frame):
        signal.alarm(1)
        1/0

python-dev · 2015-03-16T16:58:38Z

New changeset a18f7508649b by Victor Stinner in branch 'default':
Issue bpo-23680: Reduce risk of race condition in check_interrupted_write_retry()
https://hg.python.org/cpython/rev/a18f7508649b

pitrou · 2015-03-16T17:00:17Z

Also, I think there another issue in that test. It uses N = support.PIPE_MAX_SIZE but I think you want N = support.PIPE_MAX_SIZE + 1 if you want the raw write() to block.

There is another test that can have a race condition: check_interrupted_write().

vstinner · 2015-03-16T17:00:59Z

2015-03-16 17:41 GMT+01:00 Antoine Pitrou report@bugs.python.org:

Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. item * (support.PIPE_MAX_SIZE // len(item) + 1)) before the alarm(1) call?

Good idea, the first strace shows that SIGALRM was received while
Python was creating the large data.

I made this simple change. Can you tell me if my change fixes the issue?

Or what if alarm_interrupt is simply set up to retrigger the signal?

It may make the test more reliable and I don't see how it can fix the
issue showed in the strace. Feel free to make the test more reliable
with such change if you want.

vstinner · 2015-03-16T17:05:03Z

Also, I think there another issue in that test. It uses N = support.PIPE_MAX_SIZE but I think you want N = support.PIPE_MAX_SIZE + 1 if you want the raw write() to block.

PIPE_MAX_SIZE is much larger than the effictive size of a pipe on
Linux. It's defined as:

# A constant likely larger than the underlying OS pipe buffer size, to
# make writes blocking.
# Windows limit seems to be around 512 B, and many Unix kernels have a
# 64 KiB pipe buffer size or 16 * PAGE_SIZE: take a few megs to be sure.
# (see issue python/cpython#62035 for a discussion of this number).
PIPE_MAX_SIZE = 4 * 1024 * 1024 + 1

I don't think that PIPE_MAX_SIZE+1 makes a difference here.

python-dev · 2015-03-16T17:07:42Z

New changeset 10acab2d4a88 by Victor Stinner in branch 'default':
Issue bpo-23680: Reduce risk of race condition in check_interrupted_write() of
https://hg.python.org/cpython/rev/10acab2d4a88

vadmium · 2016-05-30T12:41:44Z

I found a relatively recent case of this failing on a buildbot:

http://buildbot.python.org/all/builders/AMD64%20FreeBSD%209.x%203.x/builds/4339/steps/test/logs/stdio
== CPython 3.6.0a1+ (default:eaee5aed6fbc, May 27 2016, 05:52:21) [GCC 4.2.1 20070831 patched [FreeBSD]]
0:19:34 [401/401/1] test_io crashed
Timeout (0:15:00)!
Thread 0x0000000801807400 (most recent call first):
File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3808 in check_interrupted_write_retry
File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3834 in test_interrupted_write_retry_text

Line 3808 is the write() call, so it could be hanging if the second signal is delivered just before Python makes the write() syscall. Perhaps it might help to use setitimer() so that even if a signal is delivered at an inconvenient moment, the write() call will still be interrupted by a later signal, and read thread will still be spawned.

Also see bpo-22331 for an idea to avoid the EBADF hack.

vadmium · 2016-06-03T13:48:12Z

Here is my suggested change to use setitimer(). I also closed the pipe, which means there is no need to time out the read, speeding the test up a bit more.

gerritholl · 2017-01-04T15:32:49Z

I experience this problem when trying to build/test Python 3.6 on the JASMIN Analysis Platform which runs Red Hat Enterprise Linux Server release 6.8 on a machine with 48 × Intel(R) Xeon(R) CPU E7-4860 v2 @ 2.60GHz, 2 TiB RAM, and a PanFSⓇ distributed file system. I experience this problem regardless whether I build/test Python 3.6 from within a PanFS filesystem or otherwise.

./python -m test -v test_io
== CPython 3.6.0 (default, Jan 4 2017, 15:08:07) [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)]
== Linux-2.6.32-642.6.2.el6.x86_64-x86_64-with-redhat-6.8-Santiago little-endian
== hash algorithm: siphash24 64bit
== cwd: /dev/shm/gerrit/Python-3.6.0/build/test_python_34342
== encodings: locale=UTF-8, FS=utf-8
Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0)
Run tests sequentially
0:00:00 [1/1] test_io
[…snip many lines…]
test_interrupted_write_retry_text (test.test_io.CSignalsTest) ...

after which there is no more output.

kumaraditya303 · 2022-05-27T11:13:49Z

Closing as it is out of date now and our CI and buildbots are green.

pitrou added tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error labels Mar 16, 2015

ezio-melotti transferred this issue from another repository Apr 10, 2022

kumaraditya303 closed this as not planned Won't fix, can't repro, duplicate, stale May 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sporadic freeze in test_interrupted_write_retry_text #67868

Sporadic freeze in test_interrupted_write_retry_text #67868

pitrou commented Mar 16, 2015

pitrou commented Mar 16, 2015

vstinner commented Mar 16, 2015

pitrou commented Mar 16, 2015

vstinner commented Mar 16, 2015

pitrou commented Mar 16, 2015

python-dev mannequin commented Mar 16, 2015

pitrou commented Mar 16, 2015

vstinner commented Mar 16, 2015

vstinner commented Mar 16, 2015

python-dev mannequin commented Mar 16, 2015

vadmium commented May 30, 2016

vadmium commented Jun 3, 2016

gerritholl mannequin commented Jan 4, 2017

kumaraditya303 commented May 27, 2022

Sporadic freeze in test_interrupted_write_retry_text #67868

Sporadic freeze in test_interrupted_write_retry_text #67868

Comments

pitrou commented Mar 16, 2015

pitrou commented Mar 16, 2015

vstinner commented Mar 16, 2015

pitrou commented Mar 16, 2015

vstinner commented Mar 16, 2015

pitrou commented Mar 16, 2015

python-dev mannequin commented Mar 16, 2015

pitrou commented Mar 16, 2015

vstinner commented Mar 16, 2015

vstinner commented Mar 16, 2015

python-dev mannequin commented Mar 16, 2015

vadmium commented May 30, 2016

vadmium commented Jun 3, 2016

gerritholl mannequin commented Jan 4, 2017

kumaraditya303 commented May 27, 2022