Issue19293
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2013-10-19 04:20 by David.Edelsohn, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
issue19293.patch | David.Edelsohn, 2013-10-19 15:40 | Skip some signal tests in test_events.py | review | |
asyncio_signal.diff | neologix, 2013-10-19 19:53 | review | ||
sockpair.diff | gvanrossum, 2013-10-21 17:08 | |||
issue19293.patch2 | David.Edelsohn, 2013-10-22 02:47 | review | ||
issue19293.patch3 | David.Edelsohn, 2013-10-22 03:55 | review | ||
aixfix2.diff | gvanrossum, 2013-10-22 04:27 | review |
Messages (72) | |||
---|---|---|---|
msg200364 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 04:20 | |
test_asyncio times out after one hour on AIX and leaves a process consuming 100% of a thread. [145/382/3] test_asyncio Timeout (1:00:00)! Thread 0x00000001: File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/selectors.py", line 265 in select File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 576 in _run_once File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 153 in run_forever File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/asyncio/base_events.py", line 172 in run_until_complete File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/test_asyncio/test_events.py", line 1012 in test_subprocess_interactive File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/case.py", line 571 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/case.py", line 610 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 117 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/suite.py", line 79 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/unittest/runner.py", line 168 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/support/__init__.py", line 1661 in _run_suite File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/support/__init__.py", line 1695 in run_unittest File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/test_asyncio/__init__.py", line 26 in test_main File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 1276 in runtest_inner File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 965 in runtest File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 761 in main File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/test/__main__.py", line 3 in <module> File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/runpy.py", line 73 in _run_code File "/home/shager/cpython-buildarea/3.x.edelsohn-aix-ppc64/build/Lib/runpy.py", line 160 in _run_module_as_main make: 1254-004 The error code from the last command is 1. |
|||
msg200372 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 05:07 | |
Is that traceback upside down compared to normal Python tracebacks? |
|||
msg200374 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 05:12 | |
Assuming it is, it is hanging in the poll() syscall, here: http://hg.python.org/cpython/file/f3a2dab1623b/Lib/selectors.py#265 Could you see if it works better with select()? Comment out lines 402, 403 in that file. |
|||
msg200426 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 15:40 | |
Commenting out poll in selectors.py does not help. AIX has its own epoll/kqueue efficient I/O event polling feature "pollset", which I will open as a separate enhancement request issue. Some of the subprocess tests in test_asyncio/test_events.py related to signals are hanging on AIX: test_subprocess_interactive, test_subprocess_kill and test_subprocess_send_signal all hang. This probably is related to Issue #18238 and the tests skipped in test_signal.py. The attached patch skips those tests on AIX. |
|||
msg200433 - (view) | Author: Roundup Robot (python-dev) ![]() |
Date: 2013-10-19 16:10 | |
New changeset 41a841fac7fb by Guido van Rossum in branch 'default': Disable some subprocess tests that hang on AIX. http://hg.python.org/cpython/rev/41a841fac7fb |
|||
msg200435 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 16:18 | |
Thanks for the investigation and the patch. As to the deeper issue, I'm not sure I understand what's going on -- is this a bug in Python or in AIX? Is there a better work-around than disabling the test? The issue 18238 that you mention seems to be closed by disabling tests too, but that seems unsatisfactory. Is the real issue that select() and poll() on AIX cannot be interrupted by signals? That sounds really sad. |
|||
msg200437 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-19 16:25 | |
Does AIX have an equivalent to Linux's strace? If yes, having a sample output would be helpful. Otherwise, maybe just adding manual debug traces to PollSelector.select() would help to see which events are reported. |
|||
msg200438 - (view) | Author: Stefan Krah (skrah) * ![]() |
Date: 2013-10-19 16:26 | |
I think the failure is not only on AIX. The FreeBSD bots had similar failures, and here is one on PPC Linux: http://buildbot.python.org/all/builders/PPC64%20PowerLinux%203.x/builds/881/steps/test/logs/stdio Timeout (1:00:00)! Thread 0x0000008074e46670: File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/selectors.py", line 313 in select File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 576 in _run_once File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 153 in run_forever File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/asyncio/base_events.py", line 172 in run_until_complete File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/test_asyncio/test_events.py", line 1078 in test_subprocess_close_after_finish File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/case.py", line 571 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/case.py", line 610 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 117 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/suite.py", line 79 in __call__ File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/unittest/runner.py", line 168 in run File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/support/__init__.py", line 1661 in _run_suite File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/support/__init__.py", line 1695 in run_unittest File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/test_asyncio/__init__.py", line 26 in test_main File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 1276 in runtest_inner File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 965 in runtest File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 761 in main File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/test/__main__.py", line 3 in <module> File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/runpy.py", line 73 in _run_code File "/home/shager/cpython-buildarea/3.x.edelsohn-powerlinux-ppc64/build/Lib/runpy.py", line 160 in _run_module_as_main make: *** [buildbottest] Error 1 |
|||
msg200440 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 16:40 | |
[Off-topic: Heck, another one of those upside-down tracebacks. WHY???] I'll need a lot of help from folks on the affected platforms in debugging this, since these tests work fine when I run them on the platforms to which I have access (Ubuntu and OS X). I would also like to point out http://code.google.com/p/tulip/issues/detail?id=67 (and the thread referenced in my last message there) as well as http://code.google.com/p/tulip/issues/detail?id=68. Perhaps there are clues there. I hope that we'll be able to sort this out before the beta release. |
|||
msg200449 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 17:36 | |
My understanding of the issue is POSIX and The Open Group specification are ambiguous about the handling of Unix socket API calls when interrupted by a signal. Linux implements a more liberal, accommodating behavior and the Python Async I/O implementation is assuming those semantics. AIX inherited a lot from OSF, Mac OS X/Darwin inherited a lot from FreeBSD, both of which adopted more "unforgiving behavior". The following web page seems to summarize the issue: http://www.madore.org/~david/computers/connect-intr.html I don't know if the description still is accurate for BSDs, but AIX likely implemented more pedantic, unforgiving semantics. The standard seems to allow the behavior seen in AIX and FreeBSD, regardless of whether it is less useful / pleasant. From a pedantic standpoint, I think this is a bug in the Python implementation of Async I/O because it assumes behavior (implemented by Linux) not required by an implementation that conforms to the POSIX and Open Group specifications. |
|||
msg200452 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 17:47 | |
Also, the Python buildbot for AIX is running on a POWER AIX 7.1 system at OSU OSL that is a member of the "GCC Compile Farm" cluster on which FOSS developers can obtain free accounts. If you want access to an AIX system for testing, it's available. |
|||
msg200456 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-19 18:03 | |
Wait, I think there's a misunderstanding here. POSIX does indeed leave undefined the behavior of syscalls when a signal is received: whether or not it fails with EINTR depends on the operating system, the SA_RESTART flag, etc. But the tests we're talking about send signals whose default actions are to terminate the process (e.g. SIGTERM, SIGHUP, etc): I cannot beleive AIX doesn't terminate the processes in these situations, I think it's a bug in asyncio which prevent the termination from being detected (David, you can just use "ps" to check that the processes are indeed terminated). |
|||
msg200461 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 18:09 | |
Check that which processes are terminated? The test eventually timed out after 1 hour and the process running echo.py continued to run. |
|||
msg200463 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 18:11 | |
Actually, I think we're talking about the actual select() syscal not returning when SIGCHLD is received and handled. Running the C-level handler isn't enough, the syscall must return with EINTR so the Python handler can run (which then writes a byte to the self-pipe to ensure select() returns, but that seems too late). Maybe twiddling with SA_RESTART can fix this? |
|||
msg200466 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 18:17 | |
poll() vs select() does not make a difference. Using the default poll(), if I run python in GDB, test_process_interactive hangs in poll(). The top of the traceback looks like: #0 0xd02389d4 in __fd_poll () from /usr/lib/libc.a(shr.o) #1 0xd717e604 in poll (__listptr=0x2023e150, __nfds=1, __timeout=-1) at /usr/include/sys/poll.h:107 #2 0xd71800f8 in poll_poll (self=0x30696d10, args=0x304723b4) at /home/dje/src/cpython/Modules/selectmodule.c:543 |
|||
msg200470 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-19 18:24 | |
> Guido van Rossum added the comment: > > Actually, I think we're talking about the actual select() syscal not > returning when SIGCHLD is received and handled. Running the C-level handler > isn't enough, the syscall must return with EINTR so the Python handler can > run (which then writes a byte to the self-pipe to ensure select() returns, > but that seems too late). Maybe twiddling with SA_RESTART can fix this? Hum, I just check asyncio's code, and it's not doing what I thought it was doing. SA_RESTART won't help: PyOS_setsig() already doesn't set it, and also calls siginterrupt() to ensure the syscall will be interrupted, but as can been seen here, that's not portable. The correct way to do this, IMO, would be to simply change BaseSelectorEventLoop._read_from_self() to read the signal number from the wakeup FD (Python's signal handler writes the signal number to this FD): then, we could call directly call the relevant signal handler. This way, this would succeed even if select()/poll()/whatever doesn't fail with EINTR (it'll also be a bit simpler/faster). I thought that's how it was done in tulip! |
|||
msg200475 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2013-10-19 18:44 | |
> [Off-topic: Heck, another one of those upside-down tracebacks. WHY???] These are tracebacks emitted by the faulthandler module. They don't use the normal traceback printing machinery because it is impossible to do so in the conditions where those tracebacks are printed (e.g. from a signal handler). |
|||
msg200487 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-19 19:53 | |
Here's a patch changing asyncio to not rely on a Python-level signal handler at all: it reads the signal number from the self-pipe. It should therefore work even if the select()/poll()/etc syscall doesn't fail with EINTR, and incidentally require only one wakup per signal. Also, it gets rid of BaseEventLoop._add_callback_signalsafe(). I just had to update a test which was timing-dependent (basically, before when a signal was received you had to add the handler to the ready list and issue a self-write to wakeup the main loop, now the handler is added to the ready list and the loop is automatically woken up). This should fix the PPC Linux and FreeBSD failures. I'm not so sure about AIX, since the error looks different. |
|||
msg200494 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-19 20:35 | |
The patch cleans up a lot of the other failures for asyncio on AIX, but does not fix the hang. |
|||
msg200505 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-19 22:09 | |
> The patch cleans up a lot of the other failures for asyncio on AIX Great. > but does not fix the hang. I was expecting this :-) Note that the hang is really strange, especially since the first run_until_complete() succeeds. Without more information, it'll be really hard to debug. If you don't have an strace equivalent, then the only option is to add ad-hoc debug statements to see: - if the data gets written to the pipe - if data is read from the pipe There's something strange in the stack you posted above: there seems to be a single FD in the poll() set, whereas there should be at least 3 (stdin, stdout and stderr). |
|||
msg200510 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 22:50 | |
I guess it's a shame that the two issues got merged, since now it looks like they may have to be unmerged again. I sent a few comments to neologix' patch that should be easy to address; for AIX I'm afraid we're still looking for more info from David. |
|||
msg200513 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2013-10-19 23:13 | |
The exact same hanging seems to happen under Ubuntu 8.04 LTS: http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8803/steps/test/logs/stdio [176/383] test_asyncio Timeout (1:00:00)! Thread 0x404e48c0: File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/selectors.py", line 313 in select File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 576 in _run_once File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 153 in run_forever File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/asyncio/base_events.py", line 172 in run_until_complete File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_asyncio/test_events.py", line 1081 in test_subprocess_close_after_finish File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 571 in run File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/case.py", line 610 in __call__ File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__ File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__ File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 117 in run File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/suite.py", line 79 in __call__ File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/unittest/runner.py", line 168 in run File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 1661 in _run_suite File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/support/__init__.py", line 1695 in run_unittest File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_asyncio/__init__.py", line 31 in test_main File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 1276 in runtest_inner File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 965 in runtest File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 761 in main File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/regrtest.py", line 1560 in main_in_temp_cwd File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/__main__.py", line 3 in <module> File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 73 in _run_code File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/runpy.py", line 160 in _run_module_as_main |
|||
msg200514 - (view) | Author: Larry Hastings (larry) * ![]() |
Date: 2013-10-19 23:14 | |
There's also a test that hangs for 1 hour on the Ubuntu x86 buildbot. Sample failure: http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/8803/ |
|||
msg200516 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 23:26 | |
If CF's patch fixes the Ubuntu hang, commit it! We can iterate on my nits post-alpha. |
|||
msg200518 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-19 23:34 | |
> Guido van Rossum added the comment: > > Actually, I think we're talking about the actual select() syscal not > returning when SIGCHLD is received and handled. Running the C-level handler > isn't enough, the syscall must return with EINTR so the Python handler can > run (which then writes a byte to the self-pipe to ensure select() returns, > but that seems too late). I just realized the following: there's no need for EINTR to break out of select(), that's what the self-pipe is for: since the write-end is registered as wakeup FD, and the read-end is registered in the selector, when the C-level signal handler is executed, it writes to the wakeup FD, therefore the select() syscall should return with the read-end ready, or more likely fail with EINTR. And then, the Python-level signal handler should be executed (from the main loop), scheduling a callback, and re-trigerring a wakup through the self-pipe. If this doesn't work as-is, I only see two explanations: - the C-level signal handler doesn't get called, which seems impossible - the python-level signal handler doesn't get executed before the main loop re-enters select(), but since the new GIL signal handlers are called at each eval loop In short, I'm puzzled. I can tell why my patch should work, but I'm not sure why the current approach doesn't. |
|||
msg200520 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 23:42 | |
Hm... Maybe the wakeup_fd machinery is broken on some systems? Has anybody been able to confirm that CF's patch fixes the hang on any non-AIX system? If so, maybe we should commit first, understand later... |
|||
msg200521 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2013-10-19 23:43 | |
I don't have access to an Ubuntu 8.04 box, but if David Bolen can test the patch it would be great. |
|||
msg200525 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-19 23:55 | |
Separate question for David: if CF's patch were committed, would we still need your patch to disable three subprocess-related tests on AIX? (That can be dealt with after the alpha release though -- in fact I propose not to wait for AIX if we can get the other platforms to stop hanging.) |
|||
msg200530 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2013-10-20 01:29 | |
RHEL 6 too: http://buildbot.python.org/all/builders/x86%20RHEL%206%203.x/builds/2911/steps/test/logs/stdio |
|||
msg200531 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-20 01:31 | |
AIX has an equivalent to strace (called truss). I have recorded all AIX system calls and signals for test_process_interactive, which hangs, following all children created by fock. The uncompressed file is 82MB or 939KB compressed. The highlights are: test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... 4588380: 64422135: close(3) = 0 4588380: 64422135: statx("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", 0x2FF20870, 128, 010) = 0 4588380: 64422135: kopen("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", O_RDONLY|O_LARGEFILE) = 3 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY 4588380: 64422135: kfcntl(3, F_SETFD, 0x00000001) = 0 4588380: 64422135: fstatx(0x00000003, 0x2FF229E8, 0x00000080, 0x00000008) = 0x00000000 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY 4588380: 64422135: lseek(0x00000003, 0x00000000, 0x00000001) = 0x00000000 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110 4588380: 64422135: klseek(3, 4294967295, -108, 0x00000001) = 0 4588380: 64422135: klseek(3, 0, 0, 0x00000000) = 0 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 0 4588380: 64422135: close(3) = 0 4588380: 64422135: kread(0, " P y t h o n � � � � �".., 1024) = 7 4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000007) = 0x00000007 4588380: 0x00000000: " P y t h o n � � � � �".. 59244954: 33161535: _poll(0x2067A970, 3, -1) = 1 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 59244954: 33161535: __libc_sbrk(0x00000000) = 0x20691220 4588380: 64422135: kwrite(0x00000001, 0x200CD140, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD190, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 59244954: 33161535: kread(9, " P y t h o n � � � � �".., 262144) = 7 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD1E0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 59244954: 33161535: _poll(0x2067A970, 3, 0) = 0 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200D66C0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 59244954: 33161535: _poll(0x2067A970, 3, 0) = 0 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 59244954: 33161535: _poll(0x2067A970, 3, 0) = 04588380: 64422135: kread(0, " � � � � � � � � � � � �".. , 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 59244954: 33161535: _poll(0x2067A970, 3, 0) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD140, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD190, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD1E0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200D66C0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000000) = 0x00000000 4588380: 0x00000000: " � � � � � � � � � � � �".. And it sits in this read/write loop with some intermittent _poll timeouts until I send SIGINT. |
|||
msg200532 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-20 01:35 | |
Separate question for David: if CF's patch were committed, would we still need your patch to disable three subprocess-related tests on AIX? (That can be dealt with after the alpha release though -- in fact I propose not to wait for AIX if we can get the other platforms to stop hanging.) CF's patch does not fix the hangs on AIX and my patch still is needed. And I never was suggesting that correct functionality on AIX is a blocker. |
|||
msg200533 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2013-10-20 01:40 | |
> test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... > 4588380: 64422135: close(3) = 0 > 4588380: 64422135: statx("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", 0x2FF20870, 128, 010) = 0 > 4588380: 64422135: kopen("/home/dje/src/cpython/Lib/test/test_asyncio/echo.py", O_RDONLY|O_LARGEFILE) = 3 > 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY > 4588380: 64422135: kfcntl(3, F_SETFD, 0x00000001) = 0 > 4588380: 64422135: fstatx(0x00000003, 0x2FF229E8, 0x00000080, 0x00000008) = 0x00000000 > 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY > 4588380: 64422135: lseek(0x00000003, 0x00000000, 0x00000001) = 0x00000000 > 4588380: 64422135: kioctl(3, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY > 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110 > 4588380: 64422135: klseek(3, 4294967295, -108, 0x00000001) = 0 > 4588380: 64422135: klseek(3, 0, 0, 0x00000000) = 0 > 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 110 > 4588380: 64422135: kread(3, " i m p o r t o s\n\n i".., 4096) = 0 > 4588380: 64422135: close(3) = 0 > 4588380: 64422135: kread(0, " P y t h o n � � � � �".., 1024) = 7 > 4588380: 64422135: kwrite(0x00000001, 0x200CD0A0, 0x00000007) = 0x00000007 > 4588380: 0x00000000: " P y t h o n � � � � �".. > 59244954: 33161535: _poll(0x2067A970, 3, -1) = 1 > 4588380: 64422135: kread(0, " � � � � � � � � � � � �".., 1024) = 0 > 4588380: 64422135: kwrite(0x00000001, 0x200CD0F0, 0x00000000) = 0x00000000 > 4588380: 0x00000000: " � � � � � � � � � � � �".. It seems echo.py should at least exit on EOF, which would prevent the hangs (right now it's stuck in a busy loop calling os.read(0) which always returns immediately). (same for echo3.py) |
|||
msg200534 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-20 02:45 | |
I added if not buf: break after buf = os.read() to echo.py, echo2.py and echo3.py. They now exit and do not spin, but the test still hangs. |
|||
msg200541 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-20 03:56 | |
I wonder if subprocess pipes have a version of the race condition I fixed for sockets yesterday: http://code.google.com/p/tulip/source/detail?r=2d392599fa067038b5fc9eeaf051011307aeadee The pipe code seems derived from the socket code long ago. |
|||
msg200543 - (view) | Author: David Bolen (db3l) * | Date: 2013-10-20 04:20 | |
I've been trying to test this on the Ubuntu 8.04 buildbot, but so far, using the latest hg default head, haven't gotten test_asyncio to hang even before the patch. It also looks like the last 4 regular buildbot builds were fine too. I even tried rolling back to one of the revisions that did fail a buildbot build, but it went ok as well. I'm guessing it's just a timing or race condition but without some way to reproduce reliably, I can't say much about the patch. |
|||
msg200544 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-20 04:30 | |
Let's just release the alpha with this as it stands; I will look at the race condition later, and also at CF's patch (which may be a good idea regardless of what it fixes). I did add the breaks to echo and echo3 since the busy-waiting just makes tracing this that much harder. |
|||
msg200547 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-20 06:37 | |
I'm trying to let go of the AIX hang. Here's a brain dump of what I've figured out so far. * There were a lot of red herrings in the early discussion. This hang doesn't seem to have anything to do with nonblocking connect() or sockets, nor even signals. * Summary of what the test (test_subprocess_interactive) tries to do: it starts an echo subprocess, writes a string to it, reads the string back, writes another string to it, reads that back, and then closes the transport. * The test hangs after seeing the first string echoed back but not the second, and in between somehow the stdin pipe is broken. * If I read David's truss log correctly, the following things have happened: - the parent wrote 'Python ' to the pipe for the subprocess's stdin (this is not shown in the extract but it must have happened because we see the string arrive in the subprocess) - the echo.py subprocess started and began to read from stdin - the subprocess read 'Python ' from its stdin - the subprocess wrote 'Python ' back to its stdout - poll() in the parent woke up - the parent allocated some memory and read 'Python ' from the pipe for the subprocess's stdout At this point apparently the pipe for the subprocess stdin got closed so the subprocess received an EOF (over and over due to the missing test+break). We also know that the parent now hangs in the last run_until_complete() call, which means that it has at least attempted to write 'The Winner' -- but there is no evidence of this in the truss extract so it is possible that that string is still in the transport's write buffer. It is also possible that David simply missed it in the endless stream of ineffective calls due to the looping bug. I'm actually curious why it seems that poll() keeps returning 0 in the parent -- shouldn't it have an infinite timeout, since there's nothing left to do? Another theory is one or more *connection_lost() methods on the protocol are actually being called but the test stubbornly keeps waiting until proto.got_data[1] becomes set. I'd be very interested in the truss output with the fix to echo.py in place (which is now in the repo). |
|||
msg200551 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-20 08:26 | |
> David Edelsohn added the comment: > > AIX has an equivalent to strace (called truss). I have recorded all AIX system calls and signals for test_process_interactive, which hangs, following all children created by fock. The uncompressed file is 82MB or 939KB compressed. The highlights are: Apparently, the stdout pipe was closed by the parent process: it would be nice to see where/how (it should be FD 7 according to strace on my Linux box). |
|||
msg200552 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-20 08:42 | |
> Apparently, the stdout pipe was closed by the parent process Could it be that selecting for *read* on the *write* end of a pipe is always ready? In _UnixWritePipeTransport there's a read handler that immediately closes the pipe as soon as it called. I vaguely remember a discussion on python-tulip that this might be Linux-specific behavior. (The reason is that otherwise you can't find out whether the other end was closed unless you attempt to write to the pipe.) To test this theory, it should be sufficient to comment out line 280 self._loop.add_reader(self._fileno, self._read_ready) from unix_events.py. This will make test_subprocess_kill() hang but it should not affect test_subprocess_interactive. (So it is not a fix, just a way to confirm the theory.) |
|||
msg200553 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-20 08:52 | |
> Guido van Rossum added the comment: > >> Apparently, the stdout pipe was closed by the parent process > > Could it be that selecting for *read* on the *write* end of a pipe is > always ready? That's exactly what I was thinking when I read the code below: that's definitely a possibility on AIX. > In _UnixWritePipeTransport there's a read handler that > immediately closes the pipe as soon as it called. I vaguely remember a > discussion on python-tulip that this might be Linux-specific behavior. (The > reason is that otherwise you can't find out whether the other end was > closed unless you attempt to write to the pipe.) Normally, if the pipe is closed, it should be reported by a POLLHUP/write ready. Then you only consider it closed when write fails with EPIPE. |
|||
msg200582 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-20 13:44 | |
> To test this theory, it should be sufficient to comment out > self._loop.add_reader(self._fileno, self._read_ready) When I comment out this line, test_subprocess_interactive succeeds on AIX. |
|||
msg200589 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-20 14:08 | |
For completeness, the highlights of the new truss trace output after the echo.py change and only tracing the main process to avoid confusion from the interleaved output: test_subprocess_interactive (test.test_asyncio.test_events.PollEventLoopTests) ... socketpair(1, 1, 0, 0x2FF112F0) = 0 kfcntl(5, F_SETFD, 0x00000001) = 0 kfcntl(6, F_SETFD, 0x00000001) = 0 kfcntl(5, F_GETFL, 0x00000000) = 2 kioctl(5, -2147195266, 0x2FF12610, 0x00000000) = 0 kioctl(5, -2147195267, 0x2FF12610, 0x00000000) = 0 kfcntl(5, F_SETFL, 0x00000006) = 0 kfcntl(6, F_GETFL, 0x00000000) = 2 kioctl(6, -2147195266, 0x2FF12610, 0x00000000) = 0 kioctl(6, -2147195267, 0x2FF12610, 0x00000000) = 0 kfcntl(6, F_SETFL, 0x00000006) = 0 _poll(0x201A9B00, 1, 0) = 0 fstatx(6, 0x2FF0F0F0, 128, 010) = 0 sigprocmask(2, 0xF0296960, 0x2FF0F0A0) = 0 _sigaction(20, 0x2FF0F0F8, 0x2FF0F108) = 0 thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x14260153, 0x00000003, 0x00000110, 0x00000000, 0x00000000) = 0x00000000 pipe(0x2FF0E2F0) = 0 kfcntl(7, F_SETFD, 0x00000001) = 0 kfcntl(8, F_SETFD, 0x00000001) = 0 pipe(0x2FF0E2F0) = 0 kfcntl(9, F_SETFD, 0x00000001) = 0 kfcntl(10, F_SETFD, 0x00000001) = 0 pipe(0x2FF0E2F0) = 0 kfcntl(11, F_SETFD, 0x00000001) = 0 kfcntl(12, F_SETFD, 0x00000001) = 0 fstatx(8, 0x2FF0E748, 128, 010) = 0 fstatx(8, 0x2FF0E748, 128, 010) = 0 kioctl(8, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY fstatx(9, 0x2FF0E748, 128, 010) = 0 fstatx(9, 0x2FF0E748, 128, 010) = 0 kioctl(9, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY fstatx(11, 0x2FF0E748, 128, 010) = 0 fstatx(11, 0x2FF0E748, 128, 010) = 0 kioctl(11, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY pipe(0x2FF0E1C0) = 0 kfcntl(13, F_SETFD, 0x00000001) = 0 kfcntl(14, F_SETFD, 0x00000001) = 0 sigprocmask(0, 0xF0296960, 0xF0296968) = 0 kfork() = 51773770 thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0xD0528F00, 0x00000000, 0x14260153, 0x14260153, 0x00000000) = 0x00000000 _getpid() = 65077750 close(14) = 0 close(7) = 0 close(10) = 0 close(12) = 0 kread(13, " � � � � � � � � � � � �".., 50000) = 0 close(13) = 0 fstatx(8, 0x2FF0DB58, 128, 010) = 0 kfcntl(8, F_GETFL, 0x00000000) = 1 kioctl(8, -2147195266, 0x2FF0CDD0, 0x00000000) = 0 kioctl(8, -2147195267, 0x2FF0CDD0, 0x00000000) = 0 kfcntl(8, F_SETFL, 0x00000005) = 0 _poll(0x204DBB90, 2, 0) = 0 _poll(0x204DBB90, 2, 0) = 0 kfcntl(9, F_GETFL, 0x00000000) = 0 kioctl(9, -2147195266, 0x2FF0C360, 0x00000000) = 0 kioctl(9, -2147195267, 0x2FF0C360, 0x00000000) = 0 kfcntl(9, F_SETFL, 0x00000004) = 0 _poll(0x204DBB90, 3, 0) = 0 _poll(0x204DBB90, 3, 0) = 0 kfcntl(11, F_GETFL, 0x00000000) = 0 kioctl(11, -2147195266, 0x2FF0C360, 0x00000000) = 0 kioctl(11, -2147195267, 0x2FF0C360, 0x00000000) = 0 kfcntl(11, F_SETFL, 0x00000004) = 0 _poll(0x2050B8C0, 4, 0) = 0 _poll(0x2050B8C0, 4, 0) = 0 _poll(0x2050B8C0, 4, 0) = 0 _poll(0x2050B8C0, 4, 0) = 0 kwrite(8, " P y t h o n ", 7) = 7 _poll(0x2050B8C0, 4, 0) = 1 _poll(0x2050B8C0, 3, 0) = 0 close(8) = 0 _poll(0x2050B8C0, 3, 0) = 0 _poll(0x2050B8C0, 3, -1) = 1 __libc_sbrk(0x00000000) = 0x20541130 kread(9, " P y t h o n � � � � �".., 262144) = 7 _poll(0x2050B8C0, 3, 0) = 0 _poll(0x2050B8C0, 3, 0) = 0 _poll(0x2050B8C0, 3, 0) = 0 _poll(0x2050B8C0, 3, 0) = 0 _poll(0x2050B8C0, 3, -1) = 2 kread(9, " � � � � � � � � � � � �".., 262144) = 0 kread(11, " � � � � � � � � � � � �".., 262144) = 0 _poll(0x2050B8C0, 1, 0) = 0 close(9) = 0 close(11) = 0 Received signal #20, SIGCHLD [caught] _getpid() = 65077750 kwrite(6, "14", 1) = 1 ksetcontext_sigreturn(0x2FF11D70, 0x200868EC, 0x00000008, 0xD0528CA4, 0x00000000, 0x14260153, 0x2FF22FF8, 0x00000000) _poll(0x2050B8C0, 1, 0) = 1 _erecv(5, 0x3075C820, 1, 0, 0x00000000) = 1 _poll(0x2050B8C0, 1, 0) = 0 kwaitpid(0x2FF12134, 0, 5, 0x00000000, 0x00000000) = 51773770 _poll(0x2050B8C0, 1, 0) = 0 _poll(0x2050B8C0, 1, -1) (sleeping...) |
|||
msg200597 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-20 16:55 | |
> > Could it be that selecting for *read* on the *write* end of a pipe is > > always ready? > > That's exactly what I was thinking when I read the code below: that's > definitely a possibility on AIX. David confirmed that it is the _read_ready() that closes the pipe prematurely. (Thanks David!) > > In _UnixWritePipeTransport there's a read handler that > > immediately closes the pipe as soon as it called. I vaguely remember a > > discussion on python-tulip that this might be Linux-specific behavior. (The > > reason is that otherwise you can't find out whether the other end was > > closed unless you attempt to write to the pipe.) > > Normally, if the pipe is closed, it should be reported by a POLLHUP/write ready. > Then you only consider it closed when write fails with EPIPE. That's no good though unless the app wants to write -- the write end also reports write ready when it is empty (or nearly so), so this won't help an app that has done all the writing it wants to (at least for a while) but still wants to be notified when the other end has closed it. Now, there are other indications, such as the pipe in the other direction giving an EOF, or the process exiting, but sometimes neither of those happens, and then we're stuck. I guess we'll have to write platform-dependent code and make this an optional feature. (Essentially, on platforms like AIX, for a write-pipe, connection_lost() won't be called unless you try to write some more bytes to it.) I do believe that so far this problem only occurs on AIX so I am tempted to make it an explicit test for AIX -- if it's AIX, don't register the _read_ready handler. We'll also have to skip or adjust one or two tests that will fail if this feature is missing. |
|||
msg200598 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-20 17:36 | |
> I guess we'll have to write platform-dependent code and make this an > optional feature. (Essentially, on platforms like AIX, for a > write-pipe, connection_lost() won't be called unless you try to write > some more bytes to it.) > > I do believe that so far this problem only occurs on AIX so I am > tempted to make it an explicit test for AIX -- if it's AIX, don't > register the _read_ready handler. We'll also have to skip or adjust > one or two tests that will fail if this feature is missing. Hmm... Not calling connection_lost() when the read-end of the pipe is closed is OK, since there's no portable way to detect this on e.g. AIX, but AFAICT the current process-termination logic is buggy: When SIGCHLD is received, _sig_chld() is executed: def _sig_chld(self): [...] transp = self._subprocesses.get(pid) if transp is not None: transp._process_exited(returncode) Then, here's _process_exited(): def _process_exited(self, returncode): assert returncode is not None, returncode assert self._returncode is None, self._returncode self._returncode = returncode self._loop._subprocess_closed(self) self._call(self._protocol.process_exited) self._try_finish() And here's _try_finish(): def _try_finish(self): assert not self._finished if self._returncode is None: return if all(p is not None and p.disconnected for p in self._pipes.values()): self._finished = True self._loop.call_soon(self._call_connection_lost, None) Thus, _UnixSubprocessTransport protocol's connection_lost is only called if the all() expression is true: and it's true only if all the subprocess pipes have been disconnected (or if we didn't setup any pipe). Unfortunately, this might very well never happen: imagine that the subprocess forks a process: this grand-child process inherits the child process's pipes, so when the child process exits, we won't receive any notification, since this grand-child process still has open FDs pointing to the original child's stdin/stdout/stderr. The following simple test will hang until the background 'sleep' exits: """ diff -r 47618b00405b Lib/test/test_asyncio/test_events.py --- a/Lib/test/test_asyncio/test_events.py Sat Oct 19 10:45:48 2013 +0300 +++ b/Lib/test/test_asyncio/test_events.py Sun Oct 20 19:32:37 2013 +0200 @@ -1059,6 +1059,23 @@ @unittest.skipIf(sys.platform == 'win32', "Don't support subprocess for Windows yet") + def test_subprocess_inherit_fds(self): + proto = None + + @tasks.coroutine + def connect(): + nonlocal proto + transp, proto = yield from self.loop.subprocess_shell( + functools.partial(MySubprocessProtocol, self.loop), + 'sleep 60 &') + self.assertIsInstance(proto, MySubprocessProtocol) + + self.loop.run_until_complete(connect()) + self.loop.run_until_complete(proto.completed) + self.assertEqual(0, proto.returncode) + + @unittest.skipIf(sys.platform == 'win32', + "Don't support subprocess for Windows yet") def test_subprocess_close_after_finish(self): proto = None transp = None """ If waitpid() returns a process's PID, then the process is done, there's no reason to further wait for pipe's disconnection: they can be used as a hint that the process terminated, but there's definitely not required... |
|||
msg200599 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2013-10-20 17:37 | |
> I guess we'll have to write platform-dependent code and make this an > optional feature. (Essentially, on platforms like AIX, for a > write-pipe, connection_lost() won't be called unless you try to write > some more bytes to it.) If we are not capturing stdout/stderr then we could "leak" the write end of a pipe to the child. When the read end becomes readable we can call the process protocol's connection_lost(). Or we could just call connection_lost() when reaping the pid. |
|||
msg200615 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-20 19:01 | |
No, there's a use case for reading after the child exited, if there is a grandchild still writing. --Guido van Rossum (sent from Android phone) On Oct 20, 2013 10:37 AM, "Richard Oudkerk" <report@bugs.python.org> wrote: > > Richard Oudkerk added the comment: > > > I guess we'll have to write platform-dependent code and make this an > > optional feature. (Essentially, on platforms like AIX, for a > > write-pipe, connection_lost() won't be called unless you try to write > > some more bytes to it.) > > If we are not capturing stdout/stderr then we could "leak" the write end > of a pipe to the child. When the read end becomes readable we can call the > process protocol's connection_lost(). > > Or we could just call connection_lost() when reaping the pid. > > ---------- > nosy: +sbt > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue19293> > _______________________________________ > |
|||
msg200666 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2013-10-21 00:19 | |
[Off-topic: Heck, another one of those upside-down tracebacks. WHY???] faulthandler is really a very low-level module. I wrote it to analyze deadlocks, memory corruptions and other sad cases. faulthandler is designed to provide as much information as possible even if Python internals are corrupted. So it only reads data structures, without processing them too much. For example, frames are stored from the most recent to the oldest. faulthandler also truncates long traceback to 100 frames (write "..." if there are more), to stop the loop if there is a cycle in the chained list of frames. The documentation also mentions other limitations: http://docs.python.org/dev/library/faulthandler.html |
|||
msg200671 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2013-10-21 00:28 | |
> Hm... Maybe the wakeup_fd machinery is broken on some systems? Hey! I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX? |
|||
msg200672 - (view) | Author: Tim Peters (tim.peters) * ![]() |
Date: 2013-10-21 00:33 | |
Just wondering: why is the timeout-to-failure set to a whole hour? Based on all the messages I've seen this weekend, we could save the buildbots countless years of time by reducing it to - say - 59 minutes ;-) |
|||
msg200689 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-21 03:06 | |
> I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX? @haypo: test_signal skips some tests on AIX. See Issue #18238 . |
|||
msg200713 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2013-10-21 08:09 | |
2013/10/21 David Edelsohn <report@bugs.python.org>: >> I added many tests when I added this feature to Python: see test_signal.py. By the way, it uses select.select() :-) Does test_signal pass successfully on AIX? > > @haypo: test_signal skips some tests on AIX. See Issue #18238 . Ah yes, the test on select+signal is skipped, but the basic feature (writing a signal number in a pipe using wakeup fd) works on AIX. |
|||
msg200720 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2013-10-21 08:37 | |
I'm completely lost: how many issues is this thread conflating? For example, David, I think you said that my patch regarding signal handling did fix some tests. Is that still the case? If yes, could you open a specific issue for the remaining failing tests on AIX? AFAICT, the original hang is due to AIX always reporting the write-end of a pipe as read-ready: we should only keep discussing this problem here. The above FreeBSD and Ubuntu failures seem completely unrelated, they should get there own issues. |
|||
msg200745 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2013-10-21 10:23 | |
Would it make sense to use socketpair() instead of pipe() on AIX? We could check for the "bug" directly rather than checking specifically for AIX. |
|||
msg200791 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-21 14:50 | |
OK, I've created issue 19334 to track test_asyncio hangs on other platforms. Let's please keep this issue restricted to AIX. I am aware of the following issue: - select for read on the write end of a pipe returns immediately Any other AIX issues also please open separate issues (and assign to me to get my immediate attention). I wonder if we can at least dynamically detect that this does not work in the _read_ready() handler or even in the setup? ISTM that reading from the write end should immediately fail with EBADF. (It does on my system, but I don't want to generalize. :-) |
|||
msg200794 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-21 14:55 | |
@Richard: socketpair() seems to work fine on AIX (we use it internally). Maybe we can always use that for subprocess management? But will the subprocess module support this? |
|||
msg200806 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2013-10-21 16:19 | |
The following uses socketpair() instead of pipe() for stdin, and works for me on Linux: diff -r 7d94e4a68b91 asyncio/unix_events.py --- a/asyncio/unix_events.py Sun Oct 20 20:25:04 2013 -0700 +++ b/asyncio/unix_events.py Mon Oct 21 17:15:19 2013 +0100 @@ -272,8 +272,6 @@ self._loop = loop self._pipe = pipe self._fileno = pipe.fileno() - if not stat.S_ISFIFO(os.fstat(self._fileno).st_mode): - raise ValueError("Pipe transport is for pipes only.") _set_nonblocking(self._fileno) self._protocol = protocol self._buffer = [] @@ -442,9 +440,16 @@ self._finished = False self._returncode = None + if stdin == subprocess.PIPE: + stdin_w, stdin_r = socket.socketpair() + else: + stdin_w = stdin_r = None self._proc = subprocess.Popen( - args, shell=shell, stdin=stdin, stdout=stdout, stderr=stderr, + args, shell=shell, stdin=stdin_r, stdout=stdout, stderr=stderr, universal_newlines=False, bufsize=bufsize, **kwargs) + if stdin_r is not None: + stdin_r.close() + self._proc.stdin = open(stdin_w.detach(), 'rb', buffering=bufsize) self._extra['subprocess'] = self._proc def close(self): |
|||
msg200808 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-21 16:28 | |
Brilliant. It works for me too on OS X. David, can you check this on AIX? |
|||
msg200811 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-21 16:34 | |
Attached is a version of Richard's patch that I like better (sockpair.diff). - instead of dropping the ValueError, check for ISFIFO and ISSOCK - pass the read end of the socket pair to Popen() |
|||
msg200814 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-21 17:08 | |
New, improved version of sockpair.diff. |
|||
msg200860 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-22 02:47 | |
The testsuite is hanging in test_asyncio again and test_subprocess_close_client_stream needs to be skipped. |
|||
msg200861 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-22 02:48 | |
Other than the skipped tests, test_asyncio now passes on AIX except for: FAIL: test_write_pipe (test.test_asyncio.test_events.PollEventLoopTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/dje/src/cpython/Lib/test/test_asyncio/test_events.py", line 875, in test_write_pipe self.assertEqual(b'2345', data) AssertionError: b'2345' != b'' |
|||
msg200867 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-22 03:31 | |
Thanks; I've applied and pushed your fix for the hang. I think it's time to commit the sockpair.diff change; the test_write_pipe() failure looks related -- as you recall the problem doesn't really cause a hang, it causes a premature closing of a write pipe, the hangs were just due to the test structure (unfortunate, really). I'll let you know when. |
|||
msg200868 - (view) | Author: Roundup Robot (python-dev) ![]() |
Date: 2013-10-22 03:37 | |
New changeset 2a0bda8d283d by Guido van Rossum in branch 'default': Switch subprocess stdin to a socketpair, attempting to fix issue #19293 (AIX hang). http://hg.python.org/cpython/rev/2a0bda8d283d |
|||
msg200869 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-22 03:38 | |
Let's see how this fares. If it passes on the AIX buildbot, maybe see which of the tests we skipped on AIX can be re-enabled again? |
|||
msg200872 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-22 03:55 | |
With the latest changeset applied to default, the test_subprocess tests no longer hang, but test_write_pipe_disconnect_on_close now hangs. The attached patch changes test_event.py to not skip test_subprocess_* but now skips test_write_pipe_disconnect_on_close. |
|||
msg200873 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-22 03:58 | |
I think I know what that is -- the test should use a socketpair now. I'll see how quickly I can give you a patch for that. The other one is now committed. |
|||
msg200874 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-22 04:09 | |
David, try this fix (aixfix.diff). |
|||
msg200876 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-22 04:23 | |
test_asyncio no longer hangs on AIX with the aixfix.diff patch. Thanks! |
|||
msg200879 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-22 04:27 | |
Third try's a charm. Corrected aix fix. |
|||
msg200880 - (view) | Author: Roundup Robot (python-dev) ![]() |
Date: 2013-10-22 04:29 | |
New changeset c2e018c54689 by Guido van Rossum in branch 'default': Fix asyncio issue #19293 (hangs on AIX). http://hg.python.org/cpython/rev/c2e018c54689 |
|||
msg200881 - (view) | Author: Guido van Rossum (gvanrossum) * ![]() |
Date: 2013-10-22 04:30 | |
"And there was much rejoicing." |
|||
msg200882 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2013-10-22 04:31 | |
You have to try harder to break it again. aixfix2.diff still works. ;-) |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:52 | admin | set | github: 63492 |
2013-10-22 04:31:54 | David.Edelsohn | set | messages: + msg200882 |
2013-10-22 04:30:21 | gvanrossum | set | status: open -> closed resolution: fixed messages: + msg200881 stage: resolved |
2013-10-22 04:29:19 | python-dev | set | messages: + msg200880 |
2013-10-22 04:27:17 | gvanrossum | set | files: - aixfix.diff |
2013-10-22 04:27:01 | gvanrossum | set | files:
+ aixfix2.diff messages: + msg200879 |
2013-10-22 04:25:47 | gvanrossum | set | messages: - msg200878 |
2013-10-22 04:24:21 | gvanrossum | set | messages: + msg200878 |
2013-10-22 04:23:40 | gvanrossum | set | messages: - msg200877 |
2013-10-22 04:23:02 | gvanrossum | set | messages: + msg200877 |
2013-10-22 04:23:01 | David.Edelsohn | set | messages: + msg200876 |
2013-10-22 04:09:16 | gvanrossum | set | files:
+ aixfix.diff messages: + msg200874 |
2013-10-22 03:58:42 | gvanrossum | set | messages: + msg200873 |
2013-10-22 03:55:12 | David.Edelsohn | set | files:
+ issue19293.patch3 messages: + msg200872 |
2013-10-22 03:38:53 | gvanrossum | set | messages: + msg200869 |
2013-10-22 03:37:23 | python-dev | set | messages: + msg200868 |
2013-10-22 03:31:45 | gvanrossum | set | messages: + msg200867 |
2013-10-22 02:48:00 | David.Edelsohn | set | messages: + msg200861 |
2013-10-22 02:47:20 | David.Edelsohn | set | files:
+ issue19293.patch2 messages: + msg200860 |
2013-10-21 17:08:17 | gvanrossum | set | files: - sockpair.diff |
2013-10-21 17:08:04 | gvanrossum | set | files:
+ sockpair.diff messages: + msg200814 |
2013-10-21 16:34:47 | gvanrossum | set | files:
+ sockpair.diff messages: + msg200811 |
2013-10-21 16:28:23 | gvanrossum | set | messages: + msg200808 |
2013-10-21 16:19:05 | sbt | set | messages: + msg200806 |
2013-10-21 15:09:12 | flox | set | nosy:
+ flox |
2013-10-21 14:55:34 | gvanrossum | set | messages: + msg200794 |
2013-10-21 14:50:52 | gvanrossum | set | messages:
+ msg200791 title: test_asyncio hanging for 1 hour -> test_asyncio hanging for 1 hour (AIX version, hangs in test_subprocess_interactive) |
2013-10-21 10:23:03 | sbt | set | messages: + msg200745 |
2013-10-21 08:37:01 | neologix | set | messages: + msg200720 |
2013-10-21 08:09:05 | vstinner | set | messages: + msg200713 |
2013-10-21 03:06:05 | David.Edelsohn | set | messages: + msg200689 |
2013-10-21 00:33:29 | tim.peters | set | nosy:
+ tim.peters messages: + msg200672 |
2013-10-21 00:28:37 | vstinner | set | messages: + msg200671 |
2013-10-21 00:19:32 | vstinner | set | nosy:
+ vstinner messages: + msg200666 |
2013-10-20 19:01:42 | gvanrossum | set | messages: + msg200615 |
2013-10-20 17:37:10 | sbt | set | nosy:
+ sbt messages: + msg200599 |
2013-10-20 17:36:37 | neologix | set | messages: + msg200598 |
2013-10-20 16:55:54 | gvanrossum | set | messages: + msg200597 |
2013-10-20 14:08:36 | David.Edelsohn | set | messages: + msg200589 |
2013-10-20 13:44:20 | David.Edelsohn | set | messages: + msg200582 |
2013-10-20 08:52:08 | neologix | set | messages: + msg200553 |
2013-10-20 08:42:09 | gvanrossum | set | messages: + msg200552 |
2013-10-20 08:26:43 | neologix | set | messages: + msg200551 |
2013-10-20 06:37:40 | gvanrossum | set | messages: + msg200547 |
2013-10-20 04:30:40 | gvanrossum | set | messages: + msg200544 |
2013-10-20 04:20:58 | db3l | set | messages: + msg200543 |
2013-10-20 03:56:17 | gvanrossum | set | messages: + msg200541 |
2013-10-20 02:45:40 | David.Edelsohn | set | messages: + msg200534 |
2013-10-20 01:40:52 | pitrou | set | messages: + msg200533 |
2013-10-20 01:35:59 | David.Edelsohn | set | messages: + msg200532 |
2013-10-20 01:31:12 | David.Edelsohn | set | messages: + msg200531 |
2013-10-20 01:29:55 | pitrou | set | nosy:
+ ncoghlan messages: + msg200530 |
2013-10-19 23:55:54 | gvanrossum | set | messages: + msg200525 |
2013-10-19 23:43:55 | pitrou | set | nosy:
+ db3l messages: + msg200521 |
2013-10-19 23:42:31 | gvanrossum | set | messages: + msg200520 |
2013-10-19 23:34:07 | neologix | set | messages: + msg200518 |
2013-10-19 23:26:56 | gvanrossum | set | messages: + msg200516 |
2013-10-19 23:14:20 | larry | set | nosy:
+ larry messages: + msg200514 |
2013-10-19 23:13:51 | pitrou | set | messages: + msg200513 |
2013-10-19 22:50:22 | gvanrossum | set | messages: + msg200510 |
2013-10-19 22:09:28 | neologix | set | messages: + msg200505 |
2013-10-19 20:35:54 | David.Edelsohn | set | messages: + msg200494 |
2013-10-19 19:53:57 | neologix | set | files:
+ asyncio_signal.diff messages: + msg200487 |
2013-10-19 18:44:36 | pitrou | set | nosy:
+ pitrou messages: + msg200475 |
2013-10-19 18:24:17 | neologix | set | messages: + msg200470 |
2013-10-19 18:17:26 | David.Edelsohn | set | messages: + msg200466 |
2013-10-19 18:11:35 | gvanrossum | set | messages: + msg200463 |
2013-10-19 18:09:35 | David.Edelsohn | set | messages: + msg200461 |
2013-10-19 18:03:28 | neologix | set | messages: + msg200456 |
2013-10-19 17:47:40 | David.Edelsohn | set | messages: + msg200452 |
2013-10-19 17:36:21 | David.Edelsohn | set | messages: + msg200449 |
2013-10-19 16:41:06 | gvanrossum | set | assignee: gvanrossum |
2013-10-19 16:40:25 | gvanrossum | set | messages: + msg200440 |
2013-10-19 16:31:06 | skrah | set | title: test_asyncio failures on AIX -> test_asyncio hanging for 1 hour |
2013-10-19 16:29:42 | skrah | link | issue19285 superseder |
2013-10-19 16:26:33 | skrah | set | nosy:
+ skrah messages: + msg200438 |
2013-10-19 16:25:38 | neologix | set | nosy:
+ neologix messages: + msg200437 |
2013-10-19 16:18:25 | gvanrossum | set | messages: + msg200435 |
2013-10-19 16:10:46 | python-dev | set | nosy:
+ python-dev messages: + msg200433 |
2013-10-19 15:40:45 | David.Edelsohn | set | files:
+ issue19293.patch keywords: + patch messages: + msg200426 components: + Tests |
2013-10-19 05:12:28 | gvanrossum | set | messages: + msg200374 |
2013-10-19 05:07:22 | gvanrossum | set | nosy:
+ gvanrossum messages: + msg200372 |
2013-10-19 04:20:34 | David.Edelsohn | create |