Issue23680
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2015-03-16 15:53 by pitrou, last changed 2022-04-11 14:58 by admin.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
write-retry.patch | martin.panter, 2016-06-03 13:48 | review |
Messages (13) | |||
---|---|---|---|
msg238215 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2015-03-16 15:53 | |
Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest). Corresponding strace is: write(1, "test_interrupted_write_retry_tex"..., 66) = 66 rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0 pipe2([4, 5], O_CLOEXEC) = 0 rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0 fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument) lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) alarm(1) = 0 brk(0x3360000) = 0x3360000 --- SIGALRM (Alarm clock) @ 0 (0) --- rt_sigreturn(0x2640030) = 53870592 write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305 A successful run looks like this: write(1, "test_interrupted_write_retry_tex"..., 66) = 66 rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0 pipe2([4, 5], O_CLOEXEC) = 0 rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0 fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 fstat(5, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 ioctl(5, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffd0944d40) = -1 EINVAL (Invalid argument) lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) lseek(5, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) alarm(1) = 0 brk(0x2f60000) = 0x2f60000 brk(0x3360000) = 0x3360000 write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4194305) = 65536 --- SIGALRM (Alarm clock) @ 0 (0) --- rt_sigreturn(0x2640030) = 65536 rt_sigaction(SIGALRM, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0 alarm(1) = 0 write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = ? ERESTARTSYS (To be restarted) --- SIGALRM (Alarm clock) @ 0 (0) --- rt_sigreturn(0x2640030) = -1 EINTR (Interrupted system call) clone(child_stack=0x7fd1a8852ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SE TTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd1a88539d0, tls=0x7fd1a8853700, child_tidptr=0x7fd1a88539d0) = 3611 futex(0x24b5140, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0x9a1904, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1563, {1426520501, 20155000}, ffffffff) = 0 futex(0x9a1940, FUTEX_WAIT_PRIVATE, 2, NULL) = 0 futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1) = 0 write(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4128769) = 4128769 futex(0x9a1904, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9a1900, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x9a1940, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x2631c50, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 close(5) = 0 close(4) = 0 rt_sigaction(SIGALRM, {SIG_DFL, [], SA_RESTORER, 0x7fd1ada2d0a0}, {0x444400, [], SA_RESTORER, 0x7fd1ada2d0a0}, 8) = 0 write(1, "ok\n", 3) = 3 |
|||
msg238216 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-03-16 16:01 | |
The bug only occurs in Python 3.5, right? > Sometimes the test suite freezes in test_interrupted_write_retry_text (test.test_io.CSignalsTest). How do you run the test suite? Is your system heavily loaded? Is it "fast"? This is an obvious race condition in the test if SIGARLM is send before write() is called: signal.alarm(1) # Expected behaviour: # - first raw write() is partial (because of the limited pipe buffer # and the first alarm) # - second raw write() returns EINTR (because of the second alarm) # - subsequent write()s are successful (either partial or complete) self.assertEqual(N, wio.write(item * N)) 1 second should be enough :-) -- Or it is maybe a regression caused by the changeset 5b63010be19e of issue #23285 (PEP 475). |
|||
msg238217 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2015-03-16 16:18 | |
The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems). |
|||
msg238218 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-03-16 16:28 | |
> The system is rather slow (a shared VPS instance). In the trace you can SIGALRM is triggered before the first write() call (or so it seems). Yep. It remembers me my old idea to make "sleep configurable" in tests: issue #20910. Most of the time, 1 second is enough. But on such very slow setup, it's annoying to get random failures because of race conditions. It would be better to use longer timeout (ex: 5 seconds), without making tests longer on other buildbots. |
|||
msg238221 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2015-03-16 16:41 | |
Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. `item * (support.PIPE_MAX_SIZE // len(item) + 1)`) before the alarm(1) call? Or what if alarm_interrupt is simply set up to retrigger the signal? e.g. instead of: def alarm_interrupt(self, sig, frame): 1/0 write: def alarm_interrupt(self, sig, frame): signal.alarm(1) 1/0 |
|||
msg238224 - (view) | Author: Roundup Robot (python-dev) | Date: 2015-03-16 16:58 | |
New changeset a18f7508649b by Victor Stinner in branch 'default': Issue #23680: Reduce risk of race condition in check_interrupted_write_retry() https://hg.python.org/cpython/rev/a18f7508649b |
|||
msg238225 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2015-03-16 17:00 | |
Also, I think there another issue in that test. It uses `N = support.PIPE_MAX_SIZE` but I think you want `N = support.PIPE_MAX_SIZE + 1` if you want the raw write() to block. There is another test that can have a race condition: check_interrupted_write(). |
|||
msg238226 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-03-16 17:00 | |
2015-03-16 17:41 GMT+01:00 Antoine Pitrou <report@bugs.python.org>: > Note that PIPE_MAX_SIZE can be large. Perhaps move the memory allocation (i.e. `item * (support.PIPE_MAX_SIZE // len(item) + 1)`) before the alarm(1) call? Good idea, the first strace shows that SIGALRM was received while Python was creating the large data. I made this simple change. Can you tell me if my change fixes the issue? > Or what if alarm_interrupt is simply set up to retrigger the signal? It may make the test more reliable and I don't see how it can fix the issue showed in the strace. Feel free to make the test more reliable with such change if you want. |
|||
msg238227 - (view) | Author: STINNER Victor (vstinner) * | Date: 2015-03-16 17:05 | |
> Also, I think there another issue in that test. It uses `N = support.PIPE_MAX_SIZE` but I think you want `N = support.PIPE_MAX_SIZE + 1` if you want the raw write() to block. PIPE_MAX_SIZE is much larger than the effictive size of a pipe on Linux. It's defined as: # A constant likely larger than the underlying OS pipe buffer size, to # make writes blocking. # Windows limit seems to be around 512 B, and many Unix kernels have a # 64 KiB pipe buffer size or 16 * PAGE_SIZE: take a few megs to be sure. # (see issue #17835 for a discussion of this number). PIPE_MAX_SIZE = 4 * 1024 * 1024 + 1 I don't think that PIPE_MAX_SIZE+1 makes a difference here. |
|||
msg238229 - (view) | Author: Roundup Robot (python-dev) | Date: 2015-03-16 17:07 | |
New changeset 10acab2d4a88 by Victor Stinner in branch 'default': Issue #23680: Reduce risk of race condition in check_interrupted_write() of https://hg.python.org/cpython/rev/10acab2d4a88 |
|||
msg266690 - (view) | Author: Martin Panter (martin.panter) * | Date: 2016-05-30 12:41 | |
I found a relatively recent case of this failing on a buildbot: http://buildbot.python.org/all/builders/AMD64%20FreeBSD%209.x%203.x/builds/4339/steps/test/logs/stdio == CPython 3.6.0a1+ (default:eaee5aed6fbc, May 27 2016, 05:52:21) [GCC 4.2.1 20070831 patched [FreeBSD]] 0:19:34 [401/401/1] test_io crashed Timeout (0:15:00)! Thread 0x0000000801807400 (most recent call first): File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3808 in check_interrupted_write_retry File "/usr/home/buildbot/python/3.x.koobs-freebsd9/build/Lib/test/test_io.py", line 3834 in test_interrupted_write_retry_text Line 3808 is the write() call, so it could be hanging if the second signal is delivered just before Python makes the write() syscall. Perhaps it might help to use setitimer() so that even if a signal is delivered at an inconvenient moment, the write() call will still be interrupted by a later signal, and read thread will still be spawned. Also see Issue 22331 for an idea to avoid the EBADF hack. |
|||
msg267099 - (view) | Author: Martin Panter (martin.panter) * | Date: 2016-06-03 13:48 | |
Here is my suggested change to use setitimer(). I also closed the pipe, which means there is no need to time out the read, speeding the test up a bit more. |
|||
msg284642 - (view) | Author: Gerrit Holl (Gerrit.Holl) * | Date: 2017-01-04 15:32 | |
I experience this problem when trying to build/test Python 3.6 on the [JASMIN Analysis Platform](http://www.jasmin.ac.uk/services/jasmin-analysis-platform/) which runs Red Hat Enterprise Linux Server release 6.8 on a machine with 48 × Intel(R) Xeon(R) CPU E7-4860 v2 @ 2.60GHz, 2 TiB RAM, and a [PanFSⓇ](http://www.panasas.com/products/panfs) distributed file system. I experience this problem regardless whether I build/test Python 3.6 from within a PanFS filesystem or otherwise. ./python -m test -v test_io == CPython 3.6.0 (default, Jan 4 2017, 15:08:07) [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] == Linux-2.6.32-642.6.2.el6.x86_64-x86_64-with-redhat-6.8-Santiago little-endian == hash algorithm: siphash24 64bit == cwd: /dev/shm/gerrit/Python-3.6.0/build/test_python_34342 == encodings: locale=UTF-8, FS=utf-8 Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0) Run tests sequentially 0:00:00 [1/1] test_io […snip many lines…] test_interrupted_write_retry_text (test.test_io.CSignalsTest) ... after which there is no more output. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:13 | admin | set | github: 67868 |
2017-01-04 15:32:49 | Gerrit.Holl | set | nosy:
+ Gerrit.Holl messages: + msg284642 |
2016-06-03 13:48:13 | martin.panter | set | files:
+ write-retry.patch versions: + Python 2.7 messages: + msg267099 keywords: + patch stage: patch review |
2016-05-30 12:41:43 | martin.panter | set | versions:
+ Python 3.6 nosy: + martin.panter messages: + msg266690 keywords: + buildbot |
2015-03-16 17:07:42 | python-dev | set | messages: + msg238229 |
2015-03-16 17:05:02 | vstinner | set | messages: + msg238227 |
2015-03-16 17:00:59 | vstinner | set | messages: + msg238226 |
2015-03-16 17:00:17 | pitrou | set | messages: + msg238225 |
2015-03-16 16:58:38 | python-dev | set | nosy:
+ python-dev messages: + msg238224 |
2015-03-16 16:41:08 | pitrou | set | messages: + msg238221 |
2015-03-16 16:28:12 | vstinner | set | messages: + msg238218 |
2015-03-16 16:18:05 | pitrou | set | messages: + msg238217 |
2015-03-16 16:01:04 | vstinner | set | messages: + msg238216 |
2015-03-16 15:53:35 | pitrou | set | type: behavior components: + Tests |
2015-03-16 15:53:27 | pitrou | create |