classification
Title: Sporadic freeze in test_interrupted_write_retry_text
Type: behavior Stage: patch review
Components: Tests Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Gerrit.Holl, martin.panter, neologix, pitrou, python-dev, vstinner
Priority: normal Keywords: buildbot, patch

Created on 2015-03-16 15:53 by pitrou, last changed 2017-01-04 15:32 by Gerrit.Holl.

Files
File name Uploaded Description Edit
write-retry.patch martin.panter, 2016-06-03 13:48 review
Messages (13)
msg238215 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-03-16 15:53
Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest). Corresponding strace is:


write(1, "test_interrupted_write_retry_tex"..., 66) = 66
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
pipe2([4, 5], O_CLOEXEC)                = 0
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument)
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
alarm(1)                                = 0
brk(0x3360000)                          = 0x3360000
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x2640030)                 = 53870592
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305


A successful run looks like this:

write(1, "test_interrupted_write_retry_tex"..., 66) = 66
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
pipe2([4, 5], O_CLOEXEC)                = 0
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument)
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
lseek(5, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
alarm(1)                                = 0
brk(0x2f60000)                          = 0x2f60000
brk(0x3360000)                          = 0x3360000
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305) = 65536
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x2640030)                 = 65536
rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
alarm(1)                                = 0
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x2640030)                 = -1 EINTR (Interrupted system call)
clone(child_stack=0x7fd1a8852ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SE
TTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd1a88539d0, tls=0x7fd1a8853700, child_tidptr=0x7fd1a88539d0) = 3611
futex(0x24b5140, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x9a1904, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1563, {1426520501, 20155000}, ffffffff) = 0
futex(0x9a1940, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1)  = 0
write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = 4128769
futex(0x9a1904, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9a1900, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x2631c50, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
close(5)                                = 0
close(4)                                = 0
rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0
write(1, "ok\n", 3)                     = 3
msg238216 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-16 16:01
The bug only occurs in Python 3.5, right?

> Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest).

How do you run the test suite? Is your system heavily loaded? Is it "fast"?

This is an obvious race condition in the test if SIGARLM is send before write() is called:

            signal.alarm(1)
            # Expected behaviour:
            # - first raw write() is partial (because of the limited pipe buffer
            #   and the first alarm)
            # - second raw write() returns EINTR (because of the second alarm)
            # - subsequent write()s are successful (either partial or complete)
            self.assertEqual(N, wio.write(item * N))

1 second should be enough :-)

--

Or it is maybe a regression caused by the changeset 5b63010be19e of issue #23285 (PEP 475).
msg238217 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-03-16 16:18
The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems).
msg238218 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-16 16:28
> The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems).

Yep. It remembers me my old idea to make "sleep configurable" in tests: issue #20910.

Most of the time, 1 second is enough. But on such very slow setup, it's annoying to get random failures because of race conditions. It would be better to use longer timeout (ex: 5 seconds), without making tests longer on other buildbots.
msg238221 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-03-16 16:41
Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. `item * (support.PIPE_MAX_SIZE // len(item) + 1)`) before the alarm(1) call?

Or what if alarm_interrupt is simply set up to retrigger the signal? e.g. instead of:

    def alarm_interrupt(self, sig, frame):
        1/0

write:

    def alarm_interrupt(self, sig, frame):
        signal.alarm(1)
        1/0
msg238224 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-03-16 16:58
New changeset a18f7508649b by Victor Stinner in branch 'default':
Issue #23680: Reduce risk of race condition in check_interrupted_write_retry()
https://hg.python.org/cpython/rev/a18f7508649b
msg238225 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-03-16 17:00
Also, I think there another issue in that test. It uses `N = support.PIPE_MAX_SIZE` but I think you want `N = support.PIPE_MAX_SIZE + 1` if you want the raw write() to block.

There is another test that can have a race condition: check_interrupted_write().
msg238226 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-16 17:00
2015-03-16 17:41 GMT+01:00 Antoine Pitrou <report@bugs.python.org>:
> Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. `item * (support.PIPE_MAX_SIZE // len(item) + 1)`) before the alarm(1) call?

Good idea, the first strace shows that SIGALRM was received while
Python was creating the large data.

I made this simple change. Can you tell me if my change fixes the issue?

> Or what if alarm_interrupt is simply set up to retrigger the signal?

It may make the test more reliable and I don't see how it can fix the
issue showed in the strace. Feel free to make the test more reliable
with such change if you want.
msg238227 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-03-16 17:05
> Also, I think there another issue in that test. It uses `N = support.PIPE_MAX_SIZE` but I think you want `N = support.PIPE_MAX_SIZE + 1` if you want the raw write() to block.

PIPE_MAX_SIZE is much larger than the effictive size of a pipe on
Linux. It's defined as:

# A constant likely larger than the underlying OS pipe buffer size, to
# make writes blocking.
# Windows limit seems to be around 512 B, and many Unix kernels have a
# 64 KiB pipe buffer size or 16 * PAGE_SIZE: take a few megs to be sure.
# (see issue #17835 for a discussion of this number).
PIPE_MAX_SIZE = 4 * 1024 * 1024 + 1

I don't think that PIPE_MAX_SIZE+1 makes a difference here.
msg238229 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-03-16 17:07
New changeset 10acab2d4a88 by Victor Stinner in branch 'default':
Issue #23680: Reduce risk of race condition in check_interrupted_write() of
https://hg.python.org/cpython/rev/10acab2d4a88
msg266690 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-05-30 12:41
I found a relatively recent case of this failing on a buildbot:

http://buildbot.python.org/all/builders/AMD64%20FreeBSD%209.x%203.x/builds/4339/steps/test/logs/stdio
== CPython 3.6.0a1+ (default:eaee5aed6fbc, May 27 2016, 05:52:21) [GCC 4.2.1 20070831 patched [FreeBSD]]
0:19:34 [401/401/1] test_io crashed
Timeout (0:15:00)!
Thread 0x0000000801807400 (most recent call first):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3808 in check_interrupted_write_retry
  File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3834 in test_interrupted_write_retry_text

Line 3808 is the write() call, so it could be hanging if the second signal is delivered just before Python makes the write() syscall. Perhaps it might help to use setitimer() so that even if a signal is delivered at an inconvenient moment, the write() call will still be interrupted by a later signal, and read thread will still be spawned.

Also see Issue 22331 for an idea to avoid the EBADF hack.
msg267099 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-03 13:48
Here is my suggested change to use setitimer(). I also closed the pipe, which means there is no need to time out the read, speeding the test up a bit more.
msg284642 - (view) Author: Gerrit Holl (Gerrit.Holl) * Date: 2017-01-04 15:32
I experience this problem when trying to build/test Python 3.6 on the [JASMIN Analysis Platform](http://www.jasmin.ac.uk/services/jasmin-analysis-platform/) which runs Red Hat Enterprise Linux Server release 6.8 on a machine with 48 × Intel(R) Xeon(R) CPU E7-4860 v2 @ 2.60GHz, 2 TiB RAM, and a [PanFSⓇ](http://www.panasas.com/products/panfs) distributed file system.  I experience this problem regardless whether I build/test Python 3.6 from within a PanFS filesystem or otherwise.

./python -m test -v test_io
== CPython 3.6.0 (default, Jan 4 2017, 15:08:07) [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)]
==   Linux-2.6.32-642.6.2.el6.x86_64-x86_64-with-redhat-6.8-Santiago little-endian
==   hash algorithm: siphash24 64bit
==  cwd: /dev/shm/gerrit/Python-3.6.0/build/test_python_34342
==  encodings: locale=UTF-8, FS=utf-8
Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0)
Run tests sequentially
0:00:00 [1/1] test_io
[…snip many lines…]
test_interrupted_write_retry_text (test.test_io.CSignalsTest) ...

after which there is no more output.
History
Date User Action Args
2017-01-04 15:32:49Gerrit.Hollsetnosy: + Gerrit.Holl
messages: + msg284642
2016-06-03 13:48:13martin.pantersetfiles: + write-retry.patch
versions: + Python 2.7
messages: + msg267099

keywords: + patch
stage: patch review
2016-05-30 12:41:43martin.pantersetversions: + Python 3.6
nosy: + martin.panter

messages: + msg266690

keywords: + buildbot
2015-03-16 17:07:42python-devsetmessages: + msg238229
2015-03-16 17:05:02vstinnersetmessages: + msg238227
2015-03-16 17:00:59vstinnersetmessages: + msg238226
2015-03-16 17:00:17pitrousetmessages: + msg238225
2015-03-16 16:58:38python-devsetnosy: + python-dev
messages: + msg238224
2015-03-16 16:41:08pitrousetmessages: + msg238221
2015-03-16 16:28:12vstinnersetmessages: + msg238218
2015-03-16 16:18:05pitrousetmessages: + msg238217
2015-03-16 16:01:04vstinnersetmessages: + msg238216
2015-03-16 15:53:35pitrousettype: behavior
components: + Tests
2015-03-16 15:53:27pitroucreate